├── README.md ├── figure ├── rect.svg ├── positions.svg ├── plot.svg ├── checkboard.svg ├── plot2.svg ├── dotplot.svg ├── twocoords.svg ├── ridgeline.svg └── parallel.svg ├── .github └── FUNDING.yml ├── matrixtricks.md ├── bugs.md ├── baseplotting.md └── fundamentals.md /README.md: -------------------------------------------------------------------------------- 1 | # R notes # 2 | 3 | A personal collection of notes and insights about R. 4 | 5 | All files listed here are subject to updates and changes. 6 | 7 | 1. [R fundamentals](fundamentals.md) 8 | 2. [R base plotting without wrappers](baseplotting.md) 9 | 3. [R bugs and inconsistencies](bugs.md) 10 | 4. [R matrix tricks](matrixtricks.md) 11 | 12 | -------------------------------------------------------------------------------- /figure/rect.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # These are supported funding model platforms 2 | 3 | github: [karoliskoncevicius] 4 | patreon: # Replace with a single Patreon username 5 | open_collective: # Replace with a single Open Collective username 6 | ko_fi: # Replace with a single Ko-fi username 7 | tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel 8 | community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry 9 | liberapay: # Replace with a single Liberapay username 10 | issuehunt: # Replace with a single IssueHunt username 11 | otechie: # Replace with a single Otechie username 12 | custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2'] 13 | -------------------------------------------------------------------------------- /figure/positions.svg: -------------------------------------------------------------------------------- 1 | 
-------------------------------------------------------------------------------- /matrixtricks.md: -------------------------------------------------------------------------------- 1 | # Matrix Tricks # 2 | 3 | Useful commands for working with matrices. 4 | 5 | 6 | ### Creating a matrix ### 7 | 8 | Multiple ways to quickly create helper matrices. 9 | 10 | ``` 11 | matrix(0, 3, 3) mat.or.vec(3,3) 12 | [,1] [,2] [,3] [,1] [,2] [,3] 13 | [1,] 0 0 0 [1,] 0 0 0 14 | [2,] 0 0 0 [2,] 0 0 0 15 | [3,] 0 0 0 [3,] 0 0 0 16 | 17 | 18 | .row(c(3,3)) .col(c(3,3)) 19 | [,1] [,2] [,3] [,1] [,2] [,3] 20 | [1,] 1 1 1 [1,] 1 2 3 21 | [2,] 2 2 2 [2,] 1 2 3 22 | [3,] 3 3 3 [3,] 1 2 3 23 | 24 | 25 | 26 | rbind(1:3, 3:1, 1:3) cbind(1:3, 3:1, 1:3) 27 | [,1] [,2] [,3] [,1] [,2] [,3] 28 | [1,] 1 2 3 [1,] 1 3 1 29 | [2,] 3 2 1 [2,] 2 2 2 30 | [3,] 1 2 3 [3,] 3 1 3 31 | 32 | 33 | diag(3) outer(1:3, 1:3) 34 | [,1] [,2] [,3] [,1] [,2] [,3] 35 | [1,] 1 0 0 [1,] 1 2 3 36 | [2,] 0 1 0 [2,] 2 4 6 37 | [3,] 0 0 1 [3,] 3 6 9 38 | ``` 39 | 40 | 41 | --- 42 | 43 | 44 | ### Matrix of objects ### 45 | 46 | Matrix can contain various classes. 47 | Below is a matrix of data.frames. 48 | 49 | ``` 50 | mat <- matrix(list(iris, mtcars, USArrests, chickwts), ncol=2) 51 | [,1] [,2] 52 | [1,] data.frame,5 data.frame,4 53 | [2,] data.frame,11 data.frame,2 54 | ``` 55 | 56 | Element selection in such matrices works based on list behaviour. 57 | 58 | ``` 59 | mat[[2,2]] mat[[2,2]] 60 | 61 | ``` 62 | 63 | 64 | --- 65 | 66 | 67 | ### Subtracting a vector from each row/column ### 68 | 69 | Subtract column/row means from each column/row. 70 | 71 | ``` 72 | X - rowMeans(X)[row(X)] X - colMeans(X)[col(X)] 73 | ``` 74 | 75 | Since matrices are just vectors ordered column-by-column row-wise subtraction can be simplified. 76 | The example below will subtract all means from the first column, then repeat this for all the means in the second column, etc. 77 | As a result every row will have its corresponding mean subtracted. 
78 | 79 | ``` 80 | X - rowMeans(X) 81 | ``` 82 | 83 | These methods are general and work with any operations, not just subtraction. 84 | 85 | ``` 86 | (X - rowMeans(X)) / matrixStats::rowSds(X) # scale each row 87 | (X - colMeans(X)[col(X)]) / matrixStats::colSds(X)[col(X)] # scale each column 88 | ``` 89 | 90 | 91 | --- 92 | 93 | 94 | ### Order matrix elements by row/column ### 95 | 96 | Order each row/column of a matrix separately. 97 | 98 | ``` 99 | matrix(X[order(row(X), X)], nrow=nrow(X), byrow=TRUE) # by row 100 | matrix(X[order(col(X), X)], nrow=nrow(X)) # by column 101 | ``` 102 | 103 | Missing values are placed last. 104 | The `na.last` argument of the `order()` function can be used to control this behaviour. 105 | 106 | ``` 107 | matrix(X[order(row(X), X, na.last=FALSE)], nrow=nrow(X), byrow=TRUE) 108 | matrix(X[order(col(X), X, na.last=FALSE)], nrow=nrow(X)) 109 | ``` 110 | 111 | This method is a lot faster than using `apply()`. 112 | 113 | 114 | -------------------------------------------------------------------------------- /bugs.md: -------------------------------------------------------------------------------- 1 | # R bugs and inconsistencies # 2 | 3 | The list only includes bugs that were detected during work. 4 | Issues that are still active are at the top, solved issues are at the bottom. 5 | 6 | 7 | ## Active ## 8 | 9 | ### Matrix column names can themselves have names ### 10 | 11 | When setting column names for a matrix the column names can themselves have names. 12 | 13 | ``` 14 | mat <- data.matrix(iris) 15 | colnames(mat) <- c(A = "a", B = "b", C = "c", D = "d", E = "e") 16 | 17 | colnames(mat) 18 | # A B C D E 19 | # "a" "b" "c" "d" "e" 20 | ``` 21 | 22 | The names are gone after any subset operation on the matrix. 23 | 24 | ``` 25 | colnames(mat[1:10,]) 26 | # [1] "a" "b" "c" "d" "e" 27 | ``` 28 | 29 | Moreover, this is not the case for data.frames. 
30 | 31 | ``` 32 | df <- iris 33 | colnames(df) <- c(A = "a", B = "b", C = "c", D = "d", E = "e") 34 | 35 | colnames(df) 36 | # [1] "a" "b" "c" "d" "e" 37 | ``` 38 | 39 | - [Report on Bugzilla](https://bugs.r-project.org/show_bug.cgi?id=18558) 40 | 41 | 42 | ### Selection of NA rownames from a matrix ### 43 | 44 | R allows selecting elements either by index or by name. 45 | NA values are allowed in such selections and return an element with an unknown value - NA. 46 | 47 | ``` 48 | x <- c(a=1, b=2, c=3) 49 | 50 | x[c(1,NA,3)] x[c("a", NA, "c")] 51 | # a c # a c 52 | # 1 NA 3 # 1 NA 3 53 | ``` 54 | 55 | But the above operations are inconsistent when working with matrices. 56 | Having NA selections when selecting by index works fine but the same operation throws an error when selecting by names. 57 | 58 | ``` 59 | mat <- matrix(c(1:6), ncol=3) 60 | colnames(mat) <- c("a","b","c") 61 | 62 | mat[,c(1,NA,3)] mat[,c("a", NA, "c")] 63 | # a c # 64 | # [1,] 1 NA 5 65 | # [2,] 2 NA 6 66 | ``` 67 | 68 | - [Report on Bugzilla](https://bugs.r-project.org/show_bug.cgi?id=18481) 69 | 70 | 71 | --- 72 | 73 | 74 | ### Calling is.nan() on a data.frame ### 75 | 76 | 77 | Function `is.nan()` doesn't work on data.frames while `is.na()` does. 78 | 79 | ``` 80 | is.na(iris) is.nan(iris) 81 | # # 82 | ``` 83 | 84 | 85 | --- 86 | 87 | 88 | ### Autocompletion inside square brackets ### 89 | 90 | Autocompletion for list names does not work inside of square brackets. 91 | 92 | ``` 93 | iris$Sp iris[iris$Sp] 94 | # works # does not work 95 | ``` 96 | 97 | - [Question on StackOverflow](https://stackoverflow.com/q/30737225/1953718) 98 | - [Issue on Henrik's WishlistForR](https://github.com/HenrikBengtsson/Wishlist-for-R/issues/129) 99 | 100 | 101 | --- 102 | 103 | 104 | ### Inconsistent output format for distribution functions ### 105 | 106 | Distribution functions like `pnorm()` have an inconsistent output format for an empty `<0,0>` dimension matrix. 
107 | 108 | ``` 109 | x <- matrix(numeric(), nrow=0, ncol=0) 110 | 111 | qbeta(x, 1, 1) # numeric(0) 112 | qbinom(x, 1, 1) # numeric(0) 113 | qbirthday(x, 1, 1) # 114 | qcauchy(x) # numeric(0) 115 | qchisq(x, 1) # <0 x 0 matrix> 116 | qexp(x, 1) # <0 x 0 matrix> 117 | qf(x, 1, 1) # numeric(0) 118 | qgamma(x, 1) # numeric(0) 119 | qgeom(x, 1) # <0 x 0 matrix> 120 | qhyper(x, 1, 1, 1) # numeric(0) 121 | qlnorm(x) # numeric(0) 122 | qlogis(x) # numeric(0) 123 | qnbinom(x, 1, 1) # numeric(0) 124 | qnorm(x) # numeric(0) 125 | qpois(x, 1) # <0 x 0 matrix> 126 | qsignrank(x, 1) # <0 x 0 matrix> 127 | qsmirnov(x, c(1,1)) # numeric(0) 128 | qt(x, 1) # <0 x 0 matrix> 129 | qtukey(x, 1, 1) # numeric(0) 130 | qunif(x) # numeric(0) 131 | qweibull(x, 1) # numeric(0) 132 | qwilcox(x, 1, 1) # numeric(0) 133 | ``` 134 | 135 | When the matrix is non empty all the functions (except `qbirthday()`) preserve the original matrix format. 136 | Therefore, in my opinion, returning `<0 x 0 matrix>` is the correct behaviour. 137 | 138 | - [Report on Bugzilla](https://bugs.r-project.org/show_bug.cgi?id=18509) 139 | 140 | 141 | --- 142 | 143 | 144 | ### Calling rbind() on a data.frame with 0 columns drops all rows ### 145 | 146 | When called on a data.frame with 0 columns `rbind()` drops all the rows, while `cbind()` keeps them. 147 | This behaviour is also inconsistent with matrices, where both operations preserve the number of rows. 148 | 149 | ``` 150 | mat <- matrix(nrow=2, ncol=0) df <- as.data.frame(mat) 151 | 152 | dim(mat) dim(df) 153 | # [1] 2 0 # [1] 2 0 154 | 155 | dim(cbind(mat, mat)) dim(cbind(df, df)) 156 | # [1] 2 0 # [1] 2 0 157 | 158 | dim(rbind(mat, mat)) dim(rbind(df, df)) 159 | # [1] 4 0 # [1] 0 0 160 | ``` 161 | 162 | Note: this behaviour is documented. 
163 | 164 | - [Discussion on R-devel](https://stat.ethz.ch/pipermail/r-devel/2019-May/077796.html) 165 | - [Question on StackOverflow](https://stackoverflow.com/q/52233413/1953718) 166 | - [Issue on Henrik's WishlistForR](https://github.com/HenrikBengtsson/Wishlist-for-R/issues/77) 167 | 168 | 169 | --- 170 | 171 | 172 | ### Some tests can silently return NA results ### 173 | 174 | When provided with all constant values some statistical tests silently produce NaN statistics and NA p-values. 175 | 176 | ``` 177 | x <- c(1,1,1,1,1,1) 178 | g <- c("a","a","a","b","b","b") 179 | 180 | bartlett.test(x, g) # 181 | fligner.test(x, g) # 182 | kruskal.test(x, g) # 183 | oneway.test(x ~ g) # 184 | t.test(x ~ g) # 185 | var.test(x ~ g) # 186 | wilcox.test(x ~ g) # 187 | ``` 188 | 189 | Another case to check is when the values are constant within groups but differ between the groups. 190 | 191 | ``` 192 | x <- c(1,1,1,2,2,2) 193 | g <- c("a","a","a","b","b","b") 194 | 195 | bartlett.test(x, g) # 196 | fligner.test(x, g) # 197 | kruskal.test(x, g) # 198 | oneway.test(x ~ g) # 199 | t.test(x ~ g) # 200 | var.test(x ~ g) # 201 | wilcox.test(x ~ g) # 202 | ``` 203 | 204 | Note: the desired behaviour, in my opinion, is for all the cases with silent NA values or errors to produce a warning instead. 205 | 206 | - [Discussion on R-devel](https://stat.ethz.ch/pipermail/r-devel/2020-December/080334.html) 207 | 208 | 209 | --- 210 | 211 | 212 | ### Misleading error in paired Wilcoxon's test ### 213 | 214 | When a paired `wilcox.test()` is called with missing observations in `y` it shows a misleading error about observations missing in `x`. 
215 | 216 | ``` 217 | wilcox.test(c(1,2), c(NA_integer_,NA_integer_), paired=TRUE) 218 | # 219 | ``` 220 | 221 | - [Discussion on R-devel](https://stat.ethz.ch/pipermail/r-devel/2019-December/078774.html) 222 | 223 | 224 | --- 225 | 226 | 227 | ### Inconsistent confidence level range between tests ### 228 | 229 | The acceptable input range for the confidence level parameter `conf.level` is not consistent between tests. 230 | 231 | ``` 232 | x <- rnorm(10) 233 | y <- rnorm(10) 234 | g <- gl(2, 5) 235 | 236 | binom.test(1, 1, 1, conf.level=1) # 237 | cor.test(x, y, conf.level=1) # 238 | fisher.test(g, g, conf.level=1) # 239 | t.test(x, y, conf.level=0) # 240 | mantelhaen.test(g, g, g, conf.level=1) # 241 | var.test(x, y, conf.level=0) # 242 | wilcox.test(x, y, conf.int=0.95, conf.level=1) # 243 | ``` 244 | 245 | 246 | ## Solved ## 247 | 248 | 249 | ### Tolerance issues for ties in Wilcoxon's test ### 250 | 251 | The paired version, `wilcox.test(paired = TRUE)`, has tolerance issues when detecting ties. 252 | 253 | ``` 254 | wilcox.test(c(4,3,2), c(3,2,1), paired=TRUE) 255 | # 256 | 257 | wilcox.test(c(0.4,0.3,0.2), c(0.3,0.2,0.1), paired=TRUE) 258 | # 259 | ``` 260 | 261 | - [Discussion on R-devel](https://stat.ethz.ch/pipermail/r-devel/2019-December/078774.html) 262 | - [Report on Bugzilla](https://bugs.r-project.org/show_bug.cgi?id=16138) 263 | 264 | 265 | --- 266 | 267 | 268 | ### Missing information about Wilcoxon's test variant ### 269 | 270 | 271 | Information showing whether a normal approximation was used is missing from the output of `wilcox.test()`. 
272 | 273 | ``` 274 | x <- rnorm(10) 275 | wilcox.test(x, exact=FALSE, correct=FALSE) 276 | wilcox.test(x, exact=TRUE, correct=FALSE) 277 | # different results, but same data and identical method description 278 | ``` 279 | 280 | - [Discussion on R-devel](https://stat.ethz.ch/pipermail/r-devel/2019-December/078774.html) 281 | 282 | --- 283 | 284 | 285 | ### Wilcoxon's test has inconsistent behaviour with infinite values ### 286 | 287 | 288 | Different versions of `wilcox.test()` treat infinite values differently. 289 | 290 | ``` 291 | wilcox.test(c(1,2,3,4), c(0,9,8,Inf)) 292 | # Inf is removed 293 | 294 | wilcox.test(c(1,2,3,4), c(0,9,8,Inf), paired=TRUE) 295 | # Inf is treated as the highest rank 296 | ``` 297 | 298 | - [Discussion on R-devel](https://stat.ethz.ch/pipermail/r-devel/2019-December/078774.html) 299 | 300 | 301 | --- 302 | 303 | 304 | ### Fligner's test can show significance for constant values ### 305 | 306 | In some cases `fligner.test()` can produce a highly significant result when all groups have constant values. 307 | 308 | ``` 309 | fligner.test(c(1,1,1,2,2,2), c("a","a","a","b","b","b")) 310 | # p-value < 2.2e-16 311 | ``` 312 | 313 | - [Discussion on R-devel](https://stat.ethz.ch/pipermail/r-devel/2019-June/078038.html) 314 | 315 | 316 | --- 317 | 318 | 319 | ### Calling any() on a logical data.frame ### 320 | 321 | Function `any()` does not work on a logical data.frame but works on a numeric data.frame. 322 | 323 | ``` 324 | any(data.frame(A=0L, B=1L)) any(data.frame(A=TRUE, B=FALSE)) 325 | # TRUE # 326 | ``` 327 | 328 | My opinion - `any()` not working on a data.frame might be an intended behaviour. 329 | However, it then should not work on a numeric data.frame either. 330 | Having a logical function, `any()`, work on numeric inputs but not on logical inputs is unexpected. 
331 | 332 | - [Question on StackOverflow](https://stackoverflow.com/q/60251847/1953718) 333 | - [Issue on Henrik's WishlistForR](https://github.com/HenrikBengtsson/Wishlist-for-R/issues/112) 334 | 335 | -------------------------------------------------------------------------------- /figure/plot.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /figure/checkboard.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /figure/plot2.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /baseplotting.md: -------------------------------------------------------------------------------- 1 | # R base plotting without wrappers # 2 | 3 | Base plotting is as old as R itself yet for most users it remains mysterious. 4 | They might be using `plot()` or even know the full list of its parameters but most never understand it fully. 5 | This article attempts to demystify base graphics by providing a friendly introduction for the uninitiated. 6 | 7 | 8 | ## Deconstructing a plot ## 9 | 10 | Quickly after learning R users start producing various figures by calling `plot()`, `hist()`, or `barplot()`. 11 | Then, when faced with a complicated figure, they start stacking those plots on top of one another using various hacks, like `add=TRUE`, `ann=FALSE`, `cex=0`. 12 | For most this marks the end of their base plotting journey and they leave with an impression of it being an ad-hoc bag of tricks that has to be learned and remembered but that otherwise is hard, inconsistent, and unintuitive. 13 | Nowadays even experts who write about base graphics[^jeffleek] or compare it with other systems[^davidrobinson] share the same opinion. 
14 | However, those initial functions everyone was using are only wrappers on top of the smaller functions that do all the work. 15 | And many would be surprised to learn that under the hood base plotting follows the paradigm of having a set of small functions that each do one thing and work well with one another. 16 | 17 | Let's start with the simplest example. 18 | 19 | ``` 20 | plot(0:10, 0:10, xlab = "x-axis", ylab = "y-axis", main = "my plot") 21 | ``` 22 | 23 | ![](figure/plot.svg) 24 | 25 | 26 | The `plot()` function above is really just a wrapper that calls an array of lower level functions. 27 | 28 | ``` 29 | plot.new() 30 | plot.window(xlim = c(0,10), ylim = c(0,10)) 31 | points(0:10, 0:10) 32 | axis(1) 33 | axis(2) 34 | box() 35 | title(xlab = "x-axis") 36 | title(ylab = "y-axis") 37 | title(main = "my plot") 38 | ``` 39 | 40 | Written like this all the elements comprising the plot become clear. 41 | Every new function call draws a single object on top of the plot produced up until that point. 42 | It becomes easy to see which line should be modified in order to change something on the plot. 43 | Just as an example let's modify the above plot by: 1) adding a grid, 2) removing the box around the plot, 3) removing the axis lines, 4) making axis labels bold, 5) turning the annotation labels red, and 6) shifting the title to the left. 44 | 45 | ``` 46 | plot.new() 47 | plot.window(xlim = c(0,10), ylim = c(0,10)) 48 | grid() 49 | points(0:10, 0:10) 50 | axis(1, lwd = 0, font.axis=2) 51 | axis(2, lwd = 0, font.axis=2) 52 | title(xlab = "x-axis", col.lab = "red3") 53 | title(ylab = "y-axis", col.lab = "red3") 54 | title(main = "my plot", col.main = "red3", adj = 0) 55 | ``` 56 | 57 | ![](figure/plot2.svg) 58 | 59 | In each case to achieve the wanted effect only a single line had to be modified. 60 | And the function names are very intuitive. 
61 | A person without any R experience would have no trouble saying which element on the plot is added by which line or changing some of the parameters. 62 | 63 | So, in order to construct a plot various functions are called one by one. 64 | But where do we get all the names for those functions? 65 | Do we need to remember hundreds of them? 66 | Turns out the set of all the things you might need to do on a plot is pretty limited. 67 | 68 | ``` 69 | par() # specifies various plot parameters 70 | plot.new() # starts a new plot 71 | plot.window() # adds a coordinate system to the plot region 72 | 73 | points() # draws points 74 | lines() # draws lines connecting consecutive points 75 | abline() # draws infinite lines throughout the plot 76 | arrows() # draws arrows 77 | segments() # draws line segments between pairs of points 78 | rect() # draws rectangles 79 | polygon() # draws complex polygons 80 | 81 | text() # adds written text within the plot 82 | mtext() # adds text in the margins of a plot 83 | 84 | title() # adds plot and axis annotations 85 | axis() # adds axes 86 | box() # draws a box around a plot 87 | grid() # adds a grid over a coordinate system 88 | legend() # adds a legend 89 | ``` 90 | 91 | The above list covers the majority of the functionality needed to recreate almost any plot. 92 | And for demonstration `example()` can be used to quickly see what each of those functions does, e.g. `example(rect)`. 93 | R also has some other helpful functions like `rug()` and `jitter()` to make certain situations easier but they are not crucial and can be implemented using the ones listed above. 94 | 95 | Function names are quite straightforward but what about their arguments? 96 | Indeed some of the argument names, like `cex`, can seem quite cryptic. 97 | But the argument name is always an abbreviation for a property of the plot[^parameters]. 98 | For example `col` is a shorthand for "color", `lwd` stands for "line-width", and `cex` means "character expansion". 
99 | The good news is that in general the same arguments stand for the same properties across all base R functions. 100 | And for a specific function `help()` can always be used in order to get the list of all arguments and their descriptions. 101 | 102 | To further illustrate the consistency between arguments let's return to the first example. 103 | By now it should be pretty clear, with one exception - the `axis(1)` and `axis(2)` lines. 104 | Where do the numbers `1` and `2` come from? 105 | The numbers specify the positions around the plot and they start from `1` which refers to the bottom of the plot and go clockwise up to `4` which refers to the right side. 106 | The picture below demonstrates the relationship between the numbers and the four sides of the plot. 107 | 108 | ``` 109 | plot.new() 110 | box() 111 | mtext("1", side = 1, col = "red3") 112 | mtext("2", side = 2, col = "red3") 113 | mtext("3", side = 3, col = "red3") 114 | mtext("4", side = 4, col = "red3") 115 | ``` 116 | 117 | ![](figure/positions.svg) 118 | 119 | The same position numbers are used across many different functions. 120 | Whenever a parameter of some function needs to specify a side, chances are it will do so using the numeric notation described above. 121 | Below are a few examples. 122 | 123 | ``` 124 | par(mar = c(0,0,4,4)) # margins of a plot: c(bottom, left, top, right) 125 | par(oma = c(1,1,1,1)) # outer margins of a plot 126 | axis(3) # side where axis will be displayed 127 | text(x, y, "text", pos = 3) # pos selects the side the "text" is displayed at 128 | mtext("text", side = 4) # side specifies the margin "text" will appear in 129 | ``` 130 | 131 | Another important point is vectorization. 132 | Almost all the arguments for base plotting functions are vectorized. 133 | For example, when plotting rectangles the user does not have to add each point of each rectangle one by one within a loop. 
134 | Instead he or she can draw all the related objects with one function call while at the same time specifying different positions and parameters for each. 135 | 136 | ``` 137 | plot.new() 138 | plot.window(xlim = c(0,3), ylim = c(0,3)) 139 | 140 | rect(xleft = c(0,1,2), ybottom = c(0,1,2), xright = c(1,2,3), ytop = c(1,2,3), 141 | border = c("pink","red","darkred"), lwd = 10 142 | ) 143 | ``` 144 | 145 | ![](figure/rect.svg) 146 | 147 | Here is another example producing a checkerboard pattern. 148 | 149 | ``` 150 | plot.new() 151 | plot.window(xlim = c(0,10), ylim = c(0,10)) 152 | 153 | xs <- rep(1:9, each = 9) 154 | ys <- rep(1:9, times = 9) 155 | 156 | rect(xs-0.5, ys-0.5, xs+0.5, ys+0.5, col = c("white","darkgrey")) 157 | ``` 158 | 159 | ![](figure/checkboard.svg) 160 | 161 | 162 | ## Constructing a plot ## 163 | 164 | One of the strengths of base R graphics is its flexibility and potential for customization. 165 | It really shines when the user needs to follow a particular style found in an existing example or a template[^stackoverflow]. 166 | Below are a few illustrations demonstrating how different base functions can work together and reconstruct various types of common figures from scratch. 167 | 168 | 169 | - Annotated barplot of USA population growth over time. 170 | 171 | ``` 172 | x <- time(uspop) 173 | y <- uspop 174 | 175 | plot.new() 176 | plot.window(xlim = range(x), ylim = range(pretty(y))) 177 | 178 | rect(x-4, 0, x+4, y) 179 | text(x, y, y, pos = 3, col = "red3", cex = 0.7) 180 | 181 | mtext(x, 1, at = x, las = 2, cex = 0.7, font = 2) 182 | axis(2, lwd = 0, las = 2, cex.axis = 0.7, font.axis = 2) 183 | title("US population growth", adj = 0, col.main = "red2") 184 | ``` 185 | 186 | ![](figure/barplot.svg) 187 | 188 | In this case four coordinates had to be specified for each rectangle: x and y for the bottom left corner plus x and y for the top right corner. 
189 | In the end, even though this is a more complicated example, we still added all the different pieces of information using single function calls to `rect()`, `text()`, and `mtext()`. 190 | 191 | 192 | - Parallel coordinates plot using the "iris" dataset. 193 | 194 | ``` 195 | palette(c("cornflowerblue", "red3", "orange")) 196 | 197 | plot.new() 198 | plot.window(xlim = c(1,4), ylim = range(iris[,-5])) 199 | grid(nx = NA, ny = NULL) 200 | abline(v = 1:4, col = "grey", lwd = 5, lty = "dotted") 201 | 202 | matlines(t(iris[,-5]), col = iris$Species, lty = 1) 203 | 204 | axis(2, lwd = 0, las = 2) 205 | mtext(variable.names(iris)[-5], 3, at = 1:4, line = 1, col = "darkgrey") 206 | 207 | legend(x = 1, y = 2, legend = unique(iris$Species), col = unique(iris$Species), 208 | lwd = 3, bty = 'n') 209 | ``` 210 | 211 | ![](figure/parallel.svg) 212 | 213 | In this example we used a special function `matlines()` that draws one line for each column in a matrix. 214 | We also did a few other things that are new so far: changed the default numeric colors via `palette()`, and used factor levels for specifying the actual colors within `matlines()` and `legend()`. 215 | Changing the palette allows us to customize the color scheme, while passing factors for the color arguments guarantees that the same colors are consistently assigned to the same factor levels across all the different functions. 216 | 217 | 218 | - Dot plot of death rates in Virginia in 1940. 
219 | 220 | ``` 221 | colors <- hcl.colors(5, "Zissou") 222 | ys <- c(1.25, 2, 1.5, 2.25) 223 | 224 | plot.new() 225 | plot.window(xlim = range(0,VADeaths), ylim = c(1,2.75)) 226 | 227 | abline(h = ys, col = "grey", lty = "dotted", lwd = 3) 228 | points(VADeaths, ys[col(VADeaths)], col = colors, pch = 19, cex = 2.5) 229 | text(0, ys, colnames(VADeaths), adj = c(0.25,-1), col = "lightslategrey") 230 | 231 | axis(1, lwd = 0, font = 2) 232 | title("deaths per 1000 in 1940 Virginia stratified by age, gender, and location") 233 | 234 | legend("top", legend = rownames(VADeaths), col = colors, pch = 19, horiz = TRUE, 235 | bty = "n", title = "age bins") 236 | ``` 237 | 238 | ![](figure/dotplot.svg) 239 | 240 | In this graph the groups were stratified by gender, age, and location. 241 | For each group the y-axis heights were chosen manually and the `col()` function was used to repeat those heights for all numbers within the matrix. 242 | 243 | 244 | - Dual coordinate plot using the "mtcars" dataset. 245 | 246 | ``` 247 | par(mar = c(4,4,4,4)) 248 | plot.new() 249 | 250 | plot.window(xlim = range(mtcars$disp), ylim = range(pretty(mtcars$mpg))) 251 | 252 | points(mtcars$disp, mtcars$mpg, col = "darkorange2", pch = 19, cex = 1.5) 253 | axis(2, col.axis = "darkorange2", lwd = 2, las = 2) 254 | mtext("miles per gallon", 2, col = "darkorange2", font = 2, line = 3) 255 | 256 | plot.window(xlim = range(mtcars$disp), ylim = range(pretty(mtcars$hp))) 257 | 258 | points(mtcars$disp, mtcars$hp, col = "forestgreen", pch = 19, cex = 1.5) 259 | axis(4, col.axis = "forestgreen", lwd = 2, las = 2) 260 | mtext("horse power", 4, col = "forestgreen", font = 2, line = 3) 261 | 262 | box() 263 | axis(1) 264 | mtext("displacement", 1, font = 2, line = 3) 265 | title("displacement VS mpg VS hp", adj = 0, cex.main = 1) 266 | ``` 267 | 268 | ![](figure/twocoords.svg) 269 | 270 | Here we visualized two scatter plots, with different y-axes, on a single figure. 
271 | The trick here is to change the coordinate system in the middle of the plot using `plot.window()`. 272 | But note that plots with double y-axes are frowned upon so do not take this example as a suggestion. 273 | 274 | 275 | - Ridgeline Density Plot showing chicken weight distributions stratified by feed type. 276 | 277 | ``` 278 | dens <- tapply(chickwts$weight, chickwts$feed, density) 279 | 280 | xs <- Map(getElement, dens, "x") 281 | ys <- Map(getElement, dens, "y") 282 | ys <- Map(function(x) (x-min(x)) / max(x-min(x)) * 1.5, ys) 283 | ys <- Map(`+`, ys, length(ys):1) 284 | 285 | plot.new() 286 | plot.window(xlim = range(xs), ylim = c(1,length(ys)+1.5)) 287 | abline(h = length(ys):1, col = "grey") 288 | 289 | Map(polygon, xs, ys, col = hcl.colors(length(ys), "Zissou", alpha = 0.8)) 290 | 291 | axis(1, tck = -0.01) 292 | mtext(names(dens), 2, at = length(ys):1, las = 2, padj = 0) 293 | title("Chicken weights", adj = 0, cex = 0.8) 294 | ``` 295 | 296 | ![](figure/ridgeline.svg) 297 | 298 | In this example the majority of the work goes into preparing the densities: the y values are rescaled to a range of 0 to 1.5 and then a different offset is added for each feed type. 299 | To make the drawn densities overlap nicely we plot them starting from the topmost and going down. 300 | After that `Map()` and `polygon()` do all the work. 
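The rescaling and offset steps described above can be checked in isolation. Below is a minimal sketch that repeats only the data preparation from the ridgeline example; the helper name `rescale()` and the variable `offsets` are introduced here for illustration and are not part of the original code.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, height].
rescale <- function(y, height = 1.5) (y - min(y)) / max(y - min(y)) * height

dens <- tapply(chickwts$weight, chickwts$feed, density)

ys <- Map(getElement, dens, "y")   # density heights for each feed type
ys <- Map(rescale, ys)             # every curve now spans 0 to 1.5
offsets <- length(ys):1            # first feed type gets the highest baseline
ys <- Map(`+`, ys, offsets)        # shift each curve onto its own baseline

# Inspect the lowest and highest y value of every shifted curve.
sapply(ys, range)
```

Each curve starts at its own baseline and rises 1.5 units above it, which is why the `ylim` in the plot goes up to `length(ys) + 1.5`.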
301 | 302 | 303 | - Violin Plot of chicken weights versus feed type 304 | 305 | 306 | ``` 307 | dens <- tapply(chickwts$weight, chickwts$feed, density) 308 | cols <- hcl.colors(length(dens), "Zissou") 309 | 310 | xs <- Map(getElement, dens, "y") 311 | ys <- Map(getElement, dens, "x") 312 | 313 | xs <- Map(c, xs, Map(rev, Map(`*`, -1, xs))) 314 | ys <- Map(c, ys, Map(rev, ys)) 315 | 316 | xs <- Map(function(x) (x-min(x)) / max(x-min(x) * 1.1), xs) 317 | xs <- Map(`+`, xs, 1:length(xs)) 318 | 319 | 320 | plot.new() 321 | plot.window(xlim = range(xs), ylim = range(ys)) 322 | grid(nx = NA, ny = NULL, lwd = 2) 323 | 324 | Map(polygon, xs, ys, col = cols) 325 | 326 | axis(2, las = 1, lwd = 0) 327 | title("Chicken weight by feed type", font = 2) 328 | 329 | legend("top", legend = names(dens), fill = cols, ncol = 3, inset = c(0, 1), 330 | xpd = TRUE, bty = "n") 331 | ``` 332 | 333 | ![](figure/violinplot.svg) 334 | 335 | Now the polygons are double sided, and so we need to mirror and duplicate the `xs` and `ys`. 336 | In the code above 5th and 6th lines do this job. 337 | After that the plotting is almost identical to the previous example. 338 | There is one additional trick[^legendinset] used on the legend where we use "inset" to push it over to the other side. 339 | 340 | 341 | - Correlation matrix plot using variables from the "mtcars" dataset. 
342 | 343 | ``` 344 | cors <- cor(mtcars) 345 | cols <- hcl.colors(200, "RdBu")[round((cors+1)*100)] 346 | 347 | par(mar = c(5,5,0,0)) 348 | plot.new() 349 | plot.window(xlim = c(0,ncol(cors)), ylim = c(0,ncol(cors))) 350 | 351 | rect(row(cors)-1, col(cors)-1, row(cors), col(cors)) 352 | symbols(row(cors)-0.5, col(cors)-0.5, circles = as.numeric(abs(cors))/2, 353 | inches = FALSE, asp = 1, add = TRUE, bg = cols 354 | ) 355 | 356 | mtext(rownames(cors), 1, at=1:ncol(cors)-0.5, las=2) 357 | mtext(colnames(cors), 2, at=1:nrow(cors)-0.5, las=2) 358 | ``` 359 | 360 | ![](figure/corrplot.svg) 361 | 362 | Here the second line assigns a color to each correlation value by mapping correlations from the range -1 to 1 onto color indices from 0 to 200. 363 | Then we used the `rect()` function to get the grid and `symbols()` for adding circles with specified radii. 364 | The resulting figure is similar to the one implemented by the `corrplot` package[^corrplot]. 365 | 366 | 367 | ## Summary ## 368 | 369 | The R base plotting system has several polished and easy to use wrappers that are sometimes convenient but in the long run only confuse and hide things. 370 | As a result most R users are never properly introduced to the real functions behind the base plotting paradigm and are left confused by many of its perceived idiosyncrasies. 371 | However, if inspected properly, base plotting can become powerful, flexible, and intuitive. 372 | Under the hood of all wrappers the heavy lifting is done by a small set of simple functions that work in tandem with one another. 373 | Often a few lines of code are all it takes to produce an elegant and customized figure. 
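As one possible illustration of this point, the sketch below bundles the small functions into a reusable helper. The name `bare_scatter()` and its arguments are hypothetical, chosen only for this example; the function is not part of base R and is just one way such a helper could look.

```r
# Hypothetical helper (not part of base R): a scatter plot assembled
# entirely from the low-level functions described in this article.
bare_scatter <- function(x, y, main = "", col = "red3") {
  plot.new()                                      # start a new plot
  plot.window(xlim = range(x), ylim = range(y))   # set up the coordinate system
  grid()                                          # background grid
  points(x, y, pch = 19, col = col)               # the data points
  axis(1, lwd = 0, font.axis = 2)                 # bottom axis, labels only
  axis(2, lwd = 0, font.axis = 2, las = 2)        # left axis, labels only
  title(main = main, adj = 0)                     # left-aligned title
  invisible(NULL)
}

bare_scatter(mtcars$disp, mtcars$mpg, main = "displacement VS mpg")
```

Because every element is a separate call, changing the style means editing or removing a single line inside the helper.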
374 | 375 | 376 | [^jeffleek]: ["Why I don't use ggplot2"](https://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/) by Jeff Leek 377 | [^davidrobinson]: ["Why I use ggplot2"](http://varianceexplained.org/r/why-I-use-ggplot2/) by David Robinson 378 | [^parameters]: ["Graphics parameter mnemonics"](https://www.stat.auckland.ac.nz/~paul/R/parMemnonics.html) by Paul Murrell 379 | [^stackoverflow]: ["Reproducing the style of a histogram plot in R"](https://stackoverflow.com/q/27934840/1953718) on stackoverflow.com 380 | [^corrplot]: ["corrplot"](https://cran.r-project.org/web/packages/corrplot/index.html) package on CRAN 381 | [^legendinset]: [answer about plotting the legend in the margins with base R](https://stackoverflow.com/a/49501243/1953718) on stackoverflow.com 382 | 383 | -------------------------------------------------------------------------------- /figure/dotplot.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /figure/twocoords.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /fundamentals.md: -------------------------------------------------------------------------------- 1 | # R fundamentals # 2 | 3 | This article focuses on R-specific concepts like vectorisation, recycling, subsetting, matrix and data.frame objects, and some built in types. 4 | The text is laconic, there is no hand-holding, so it is best to consume it slowly. 5 | 6 | 7 | ## Types ## 8 | 9 | R stores data in a number of different object types. 
10 | Below are some objects of different types: 11 | 12 | ```r 13 | TRUE # logical 14 | 1L # integer 15 | 1 # double 16 | 1+0i # complex 17 | "one" # character 18 | ``` 19 | 20 | Some of these types have values with special meaning: 21 | 22 | ```r 23 | NaN # double "not a number" 24 | Inf # double "infinity" 25 | -Inf # double "negative infinity" 26 | NA # logical "missing" value 27 | NA_integer_ # integer "missing" value 28 | NA_real_ # double "missing" value 29 | NA_complex_ # complex "missing" value 30 | NA_character_ # character "missing" value 31 | NULL # special variable without a type 32 | ``` 33 | 34 | In addition R also uses other, more complex types: `raw`, `list`, `environment`, `closure`. 35 | 36 | 37 | ## Operators ## 38 | 39 | For manipulating and combining objects R has a number of standard operators: 40 | 41 | ```r 42 | + # addition 43 | - # subtraction 44 | / # division 45 | * # multiplication 46 | ^ # power 47 | ** # alternative notation of power 48 | ! # negation 49 | & # logical and 50 | | # logical or 51 | == # equals 52 | != # not equals 53 | > # greater than 54 | >= # greater than or equal 55 | < # less than 56 | <= # less than or equal 57 | && # scalar version of logical and 58 | || # scalar version of logical or 59 | ``` 60 | 61 | Types and operators can be combined in a standard infix form: 62 | 63 | ```r 64 | TRUE | FALSE # TRUE 65 | 2 + 2 # 4 66 | 3 * 10 # 30 67 | 3+0i + 0+1i # 3+1i 68 | 3 == 10 # FALSE 69 | 4 < Inf # TRUE 70 | "a" != "b" # TRUE 71 | ``` 72 | 73 | Some operators are used with a single object: 74 | 75 | ```r 76 | !TRUE # FALSE 77 | -10L # -10 78 | ``` 79 | 80 | Some operators can combine different types, casting the result to the most general type: 81 | 82 | ```r 83 | TRUE + 1L # 2 84 | 2L * 2.2 # 4.4 85 | 2 & TRUE # TRUE 86 | 0 & TRUE # FALSE (0 is treated as FALSE, any other number as TRUE) 87 | ``` 88 | 89 | Special type values are returned in special circumstances: 90 | 91 | ```r 92 | 10 / 0 # Inf 93 | -2 / 0 # -Inf 
94 | 0 / 0 # NaN 95 | -1+0i / 0 # NaN+NaNi 96 | ``` 97 | 98 | If a value is missing, the operator determines whether the result can still be known: 99 | 100 | ```r 101 | 1^NA # 1 102 | 2^NA # NA 103 | NA^0 # 1 104 | 105 | TRUE | NA # TRUE 106 | TRUE & NA # NA 107 | FALSE | NA # NA 108 | FALSE & NA # FALSE 109 | ``` 110 | 111 | Scalar versions of logical AND and OR operators are useful for preventing unnecessary right-hand-side computations. 112 | 113 | ```r 114 | FALSE & stop("stop") # error - the right-hand side is evaluated 115 | FALSE && stop("stop") # FALSE - the right-hand side is never evaluated 116 | TRUE | stop("stop") # error - the right-hand side is evaluated 117 | TRUE || stop("stop") # TRUE - the right-hand side is never evaluated 118 | ``` 119 | 120 | When multiple operators are used within one expression their order is determined by precedence. 121 | 122 | ```r 123 | 2 + 2 / 2 # result = 3, because division has higher precedence 124 | ``` 125 | 126 | Since the available symbols on the keyboard are not enough to cover all necessary operators, other, less frequently used operators are defined by surrounding them with a percent symbol `%`: 127 | 128 | ```r 129 | %/% # integer division 130 | %% # remainder after integer division (modulo) 131 | 132 | 27 %/% 24 # result = 1 133 | 27 %% 24 # result = 3 134 | ``` 135 | 136 | 137 | ## Variables ## 138 | 139 | Variable names must start with a dot or a letter and contain only letters, numbers, dots, and underscores. 140 | They cannot overwrite existing special values like `NA` and `TRUE`. 141 | Three assignment operators can be used to achieve the same thing: 142 | 143 | ```r 144 | x <- 2 145 | x = 2 146 | 2 -> x 147 | ``` 148 | 149 | In addition, R allows defining "non-syntactic" identifiers.
150 | Standard variable names must follow defined rules but backtick quotation forces any string to be evaluated as an identifier: 151 | 152 | ```r 153 | x <- 1 154 | 2 <- 1 # error 155 | `2` <- 1 # bypasses normal evaluation and treats "2" as variable name 156 | 157 | x # standard way of referring to variable named "x" 158 | `x` # refers to the same variable 159 | 2 # object 2 of type double 160 | `2` # refers to non-syntactic variable "2" 161 | ``` 162 | 163 | 164 | ## Function calls ## 165 | 166 | Functions are called by typing the function name and parentheses with a list of passed parameters. 167 | Parameters can be specified by either their position or their name. 168 | 169 | ```r 170 | sqrt(3) 171 | sqrt(x = 3) 172 | ``` 173 | 174 | All the operators demonstrated previously are also functions and can be called in the same manner: 175 | 176 | ```r 177 | `+`(2, 3) 178 | `|`(x = TRUE, y = FALSE) 179 | `!`(TRUE) 180 | ``` 181 | 182 | Assignment itself is also a function call: 183 | 184 | ```r 185 | `<-`("newvar", 2) # creates a variable "newvar" 186 | `=`("newvar", 2) # same 187 | ``` 188 | 189 | A few functions that would have been useful so far: 190 | 191 | ```r 192 | ls() # lists all assigned variables 193 | typeof() # returns the type of an object 194 | ``` 195 | 196 | 197 | ## Getting help ## 198 | 199 | Functions implemented in R have documentation pages which are searchable via another function: `help()`: 200 | 201 | ```r 202 | help(sum) 203 | help(`<-`) 204 | help(help) 205 | ?sum # shortcut for calling help() 206 | ``` 207 | 208 | Documentation pages can also exist on their own, without being attached to function names: 209 | 210 | ```r 211 | help(Syntax) # to read about precedence levels 212 | help(Reserved) # R's list of reserved symbols 213 | help(NumericConstants) # parser rules for treating numeric constants 214 | help(Quotes) # parser rules for treating quoted strings 215 | ``` 216 | 217 | In addition there is a function for searching the 
documentation pages: 218 | 219 | ```r 220 | help.search("concatenate") 221 | ??concatenate # shortcut for calling help.search() 222 | ``` 223 | 224 | 225 | ## Vectors ## 226 | 227 | Almost every object in R is vectorized - it can contain multiple elements and has a length. 228 | 229 | ```r 230 | length(NaN) 231 | length(1) 232 | length("what") # 1 - a string is a single element, not a vector of characters 233 | ``` 234 | 235 | A vector with multiple elements is created with a "combine" function `c()`: 236 | 237 | ```r 238 | c(-1L, 0L, 1L) 239 | c(1+1i, -1+3i) 240 | c(TRUE, FALSE, TRUE) 241 | c("a", "b", "any string") 242 | ``` 243 | 244 | `c()` can be used for combining multi-element vectors too: 245 | 246 | ```r 247 | vec1 <- c(1, 2) 248 | vec2 <- c(2, 3) 249 | c(vec1, vec2) 250 | c(c(1,2), c(2,3)) # same 251 | c(c(1,2,2,3)) # also same 252 | c(1,2,2,3) # same as well 253 | ``` 254 | 255 | An increasing/decreasing sequence of integers can be created using a shortcut: 256 | 257 | ```r 258 | 1:2 259 | 10:1 260 | -100:100 261 | ``` 262 | 263 | When multiple different types are combined the resulting vector takes the most general type: 264 | 265 | ```r 266 | c(TRUE, 1L) # integer 267 | c(TRUE, 1) # double 268 | c(1L, 1+1i) # complex 269 | c(TRUE, "A") # character 270 | c(NA_integer_, NA_real_) # double 271 | ``` 272 | 273 | Most operators, when applied to vectors, apply the operation to each element separately: 274 | 275 | ```r 276 | x <- c(1, 2, 3) 277 | y <- c(0, 10, 1) 278 | 279 | x + 1 # 2 3 4 280 | x + y # 1 12 4 281 | x > 2 # FALSE FALSE TRUE 282 | x > y # TRUE FALSE TRUE 283 | ``` 284 | 285 | Scalar AND and OR operators are an exception and only work on vectors of length 1.
286 | 287 | ```r 288 | x <- c(TRUE, FALSE) 289 | y <- c(FALSE, TRUE) 290 | 291 | x & y # FALSE FALSE 292 | x && y # error since R 4.3.0 (older versions used only the first elements) 293 | x | y # TRUE TRUE 294 | x || y # error since R 4.3.0 (older versions used only the first elements) 295 | ``` 296 | 297 | Lots of functions work with vectors too, and provide a reasonable result: 298 | 299 | ```r 300 | x <- c(1, 2, 3, 4, 5) 301 | y <- c(2, 1, 3, 1, 5) 302 | 303 | mean(x) 304 | sum(y) 305 | log2(x) 306 | cor(x, y) 307 | paste(x, y) 308 | ``` 309 | 310 | Vectors can be empty (have 0 elements): 311 | 312 | ```r 313 | logical() 314 | integer() 315 | double() 316 | character() 317 | complex() 318 | 319 | length(complex()) 320 | length(c(numeric(), numeric(), numeric())) 321 | ``` 322 | 323 | Finally, elements within a vector can have names: 324 | 325 | ```r 326 | x <- c(a=1, b=2, c=3, d=4) 327 | names(x) 328 | ``` 329 | 330 | There is no requirement for the names of a vector to be unique: 331 | 332 | ```r 333 | x <- c(a=1, b=2, b=3, d=4) 334 | names(x) 335 | ``` 336 | 337 | 338 | ## Recycling ## 339 | 340 | R can apply binary operators on vectors of different lengths.
341 | In this case recycling takes place - the shorter vector is extended to match the length of the longer one by recycling its elements starting from the first one: 342 | 343 | ```r 344 | x <- c(1, 1, 1, 1) 345 | y <- c(0, 1) 346 | 347 | # original after recycling result 348 | x + y # x = 1 1 1 1 x = 1 1 1 1 1 2 1 2 349 | # y = 0 1 y = 0 1 0 1 350 | ``` 351 | 352 | Recycling helps with writing common element-wise vector expressions without resorting to loops: 353 | 354 | ```r 355 | # subtract 1 from each element in x 356 | x <- c(1, 2, 3, 4) 357 | x - 1 358 | 359 | # check which elements of x are greater than 2 360 | x <- c(1, 2, 3, 4) 361 | x > 2 362 | 363 | # turn every other element of x (1st, 3rd, ...) to 0 364 | x <- c(1, 2, 3, 4) 365 | y <- c(0, 1) 366 | x * y 367 | ``` 368 | 369 | When the length of the longer vector is not a multiple of the length of the shorter one, recycling still takes place but a warning is displayed: 370 | 371 | ```r 372 | c(1, 2, 3) - c(0, 1) # result: 1 1 3 373 | 374 | # Warning message: 375 | # In c(1, 2, 3) - c(0, 1) : 376 | # longer object length is not a multiple of shorter object length 377 | ``` 378 | 379 | 380 | ## Subsetting ## 381 | 382 | A vector can be treated as an array of elements, hence we might want to subset it by selecting only a fraction of those elements. 383 | In R this is performed using the `[` operator. 384 | 385 | ```r 386 | x <- c(10, 20, 30, 40) 387 | x[1] 388 | ``` 389 | 390 | Elements can be selected by specifying the desired positions (numbering in R starts from 1).
391 | 392 | ```r 393 | x <- c(10, 20, 30, 40) 394 | 395 | x[4] 396 | x[1:2] 397 | x[0] # selects "nothing" - returns an empty vector 398 | x[c(1,1)] # same element can be selected multiple times 399 | x[5] # NA (element with unknown value) 400 | ``` 401 | 402 | Elements can be selected by specifying the positions that should be dropped: 403 | 404 | ```r 405 | x <- c(10, 20, 30, 40) 406 | 407 | x[-1] 408 | x[c(-1,-3)] 409 | x[c(-1,-1)] # same element is "removed twice" 410 | x[-5] # doesn't change x in any way 411 | ``` 412 | 413 | Elements can be selected by specifying `TRUE` if the element should be returned, and `FALSE` otherwise: 414 | 415 | ```r 416 | x <- c(10, 20, 30, 40) 417 | 418 | x[c(TRUE, FALSE, FALSE, FALSE)] 419 | x[c(TRUE, TRUE, FALSE, TRUE)] 420 | x[c(FALSE, TRUE)] # c(FALSE, TRUE) recycled - returns every second element 421 | 422 | # one interesting example 423 | x[x > 20] # 1) x > 20 compares each element of x with 20 (20 is recycled) 424 | # 2) x > 20 creates indexing vector of TRUE/FALSE values for each element in x 425 | # 3) the obtained vector is then used for subsetting x 426 | # 4) as a result only x elements that are greater than 20 are returned 427 | ``` 428 | 429 | Elements can be selected by their name. 430 | 431 | ```r 432 | x <- c(a=10, b=20, c=30, d=40) 433 | 434 | x["a"] 435 | x[c("a", "b")] 436 | x[c("e")] # NA - element with unknown value 437 | x[-c("a")] # error - this syntax cannot be used for dropping elements 438 | 439 | # when names are not unique - first matching element is returned 440 | x <- c(a=10, b=20, b=30, d=40) 441 | 442 | x["b"] # 20 443 | ``` 444 | 445 | Empty selection selects all elements: 446 | 447 | ```r 448 | x <- c(10, 20, 30, 40) 449 | 450 | x[] 451 | ``` 452 | 453 | 454 | ## Replacement ## 455 | 456 | Elements in a vector can be replaced using the indexing syntax with an assignment.
457 | 458 | ```r 459 | x <- c(10, 20, 30, 40) 460 | x[1] <- 0 # replaces first element with 0 461 | ``` 462 | 463 | Replacing by index: 464 | 465 | ```r 466 | x <- c(10, 20, 30, 40) 467 | 468 | x[4] <- c(51) 469 | x[c(1,3)] <- c(12, 32) 470 | x[0] <- 0 # doesn't do anything 471 | x[5] <- 50 # appends 50 at the 5th position of the vector 472 | x[10] <- 100 # appends 100 at the 10th position, fills positions 6:9 with NA 473 | ``` 474 | 475 | Replacing by excluding positions: 476 | 477 | ```r 478 | x <- c(10, 20, 30, 40) 479 | 480 | x[-c(1,2,3)] <- 41 481 | x[-1] <- c(22,32,43) 482 | x[-5] <- 100 # will replace every element 483 | ``` 484 | 485 | Replacing by logical indexing: 486 | 487 | ```r 488 | x <- c(10, 20, 30, 40) 489 | 490 | x[c(TRUE, FALSE, TRUE, FALSE)] <- c(11,31) 491 | x[c(TRUE, TRUE, TRUE, FALSE)] <- c(11,31) # c(11, 31) will be recycled (with a warning) 492 | x[c(FALSE, TRUE)] <- c(22,42) # c(FALSE, TRUE) will be recycled 493 | x[x < 10] <- NA 494 | ``` 495 | 496 | Replacing by name: 497 | 498 | ```r 499 | x <- c(a=10, b=20, c=30, d=40) 500 | 501 | x["b"] <- 21 502 | x[c("a","d")] <- c(11,41) 503 | x["z"] <- 2 # will append a new element named "z" 504 | 505 | # when names are not unique - first matching element is replaced 506 | x <- c(a=10, b=20, b=30, d=40) 507 | 508 | x["b"] <- 21 # replaces b=20 element 509 | ``` 510 | 511 | An empty assignment is a shortcut for replacing all elements: 512 | 513 | ```r 514 | x <- c(10, 20, 30, 40) 515 | 516 | x <- 0 # replaces x with 0 517 | x[] <- 0 # replaces all elements of x with 0 518 | ``` 519 | 520 | After replacement, if needed, the vector is converted to the most general type: 521 | 522 | ```r 523 | x <- c(1, 2, 3, 4) # x is double 524 | x[4] <- "e" # x is now character 525 | ``` 526 | 527 | ## List type ## 528 | 529 | In a vector all elements have the same type. 530 | Lists can store multiple elements of different types.
531 | 532 | ```r 533 | list(1L, 0, "a", TRUE) # values are not converted to character type 534 | ``` 535 | 536 | Since a list can store vectors - elements in a list can have any length: 537 | 538 | ```r 539 | l <- list(c("a","b"), c(TRUE, FALSE), 1:100) 540 | 541 | length(l) # length is the number of elements in a list: 3 542 | ``` 543 | 544 | A list can store other lists within itself: 545 | 546 | ```r 547 | list(list(1, 3+0i), list("a")) 548 | list(list(1, 3+0i), "a") # one list can have both lists and simple vectors 549 | list(list(1, list("a"))) # lists can be nested more than 2 levels deep 550 | list(a=TRUE, b=1:100) # list elements can have names 551 | ``` 552 | 553 | A list can be empty and can store empty elements: 554 | 555 | ```r 556 | l1 <- list() 557 | l2 <- list(numeric(), list(), complex()) 558 | 559 | length(l1) # 0 560 | length(l2) # 3 561 | ``` 562 | 563 | 564 | The same subsetting rules described before also apply to lists: 565 | 566 | ```r 567 | l <- list(a=TRUE, b=c("a","b"), c=1:100) 568 | 569 | l[1] # list with one element 570 | l[c(1,2)] 571 | l[-1] 572 | l[c(TRUE, FALSE, TRUE)] 573 | l[c("a","c")] 574 | l[] 575 | ``` 576 | 577 | The special double bracket `[[` operator is used for extracting elements from a list.
578 | But it only allows indexing by positive numbers and names: 579 | 580 | ```r 581 | l <- list(a=TRUE, b=c("a","b"), c=1:100) 582 | 583 | l[[1]] # first element of a list 584 | l[[c(1,2)]] # error - a vector index is applied recursively, as l[[1]][[2]], which does not exist 585 | l[[-c(1,2)]] # error - negative indices cannot be used for list elements 586 | l[[c(TRUE,FALSE,TRUE)]] # error - logical indices cannot be used 587 | l[[c("a")]] 588 | l[[c("a","b")]] # error - same recursive lookup, l[["a"]][["b"]] does not exist 589 | ``` 590 | 591 | Elements in nested lists can be selected by stacking double brackets: 592 | 593 | ```r 594 | l <- list(list(TRUE), list(list(1, 2:3))) 595 | 596 | l[[1]] # selects first element - a list 597 | l[[1]][[1]] # selects first element from the first element (list) 598 | l[[2]][[1]][[1]] # deeper nested selection (hard to describe, sorry!) 599 | l[[2]][[1]][[2]][2] # select an element of a deeply nested vector 600 | ``` 601 | 602 | Elements in a list are replaced using procedures equivalent to subsetting them: 603 | 604 | ```r 605 | l <- list(a=TRUE, b=c("a","b"), c=1:100) 606 | 607 | l[[1]] <- FALSE # replace the first list element with FALSE 608 | l[1] <- list(FALSE) # same as above 609 | l[1] <- FALSE # also same as above 610 | l[[1]] <- list(FALSE) # ! 
replaces first element with a list 611 | l[2] <- NULL # removes the second element from the list 612 | ``` 613 | 614 | The same replacement procedures work on nested lists: 615 | 616 | ```r 617 | l <- list(list(TRUE), list(list(1, 2:3))) 618 | 619 | l[[1]][[1]] <- FALSE 620 | l[[2]][[1]][[1]] <- 0 621 | l[[2]][[1]][[2]][1] <- -1 # replace element of a deeply nested vector 622 | l[[2]][[1]] <- list() # replace a whole nested list 623 | ``` 624 | 625 | Selecting a single element from a list by name is a frequent procedure, therefore it has a shortcut operator: 626 | 627 | ```r 628 | l <- list(a=TRUE, b=c("a","b"), c=list(c1=1, c2=2)) 629 | 630 | l$a # same as l[["a"]] 631 | l$c # same as l[["c"]] 632 | l$c$c1 # same as l[["c"]][["c1"]] 633 | ``` 634 | 635 | The same shortcut can be used for replacement: 636 | 637 | ```r 638 | l <- list(a=TRUE, b=c("a","b"), c=list(c1=1, c2=2)) 639 | 640 | l$a <- FALSE # replace element "a" 641 | l$c$c1 <- 0 # replace "c1" element of the list "c" 642 | l$c <- NA # replace whole "c" list with an object NA 643 | ``` 644 | 645 | 646 | ## Matrices ## 647 | 648 | Matrices are vectors with an additional attribute of dimension.
649 | Below are multiple equivalent ways of creating matrices: 650 | 651 | ```r 652 | x <- c(1, 2, 3, 4) 653 | 654 | matrix(x, nrow=2, ncol=2) 655 | dim(x) <- c(2,2) 656 | attr(x, "dim") <- c(2,2) 657 | ``` 658 | 659 | The dimensions of a matrix can be obtained via functions: 660 | 661 | ```r 662 | X <- matrix(1:10, nrow=2, ncol=5) 663 | 664 | dim(X) 665 | nrow(X) 666 | ncol(X) 667 | ``` 668 | 669 | Since matrices are constructed from vectors - any vector type, including a list, can be turned into a matrix: 670 | 671 | ```r 672 | matrix(c(TRUE, FALSE, TRUE, FALSE), nrow=2, ncol=2) 673 | matrix(1:4, nrow=2, ncol=2) 674 | matrix(c("a","b","c","d"), nrow=2, ncol=2) 675 | matrix(list(TRUE, 2, "a", 3+0i), nrow=2, ncol=2) 676 | ``` 677 | 678 | A matrix, like a vector, can be empty: 679 | 680 | ```r 681 | matrix(logical(), nrow=0, ncol=0) 682 | matrix(double(), nrow=0, ncol=0) 683 | matrix(list(), nrow=0, ncol=0) 684 | ``` 685 | 686 | Under the hood the matrix is a vector, "folded" into a matrix form by first filling up the first column, then the second column, and so on: 687 | 688 | ```r 689 | x <- 1:20 690 | 691 | matrix(x, ncol=4) 692 | 693 | # [,1] [,2] [,3] [,4] 694 | # [1,] 1 6 11 16 695 | # [2,] 2 7 12 17 696 | # [3,] 3 8 13 18 697 | # [4,] 4 9 14 19 698 | # [5,] 5 10 15 20 699 | ``` 700 | 701 | And all operators that work on vectors treat matrices as vectors.
702 | Here is what happens when a vector is added to a matrix: 703 | 704 | ```r 705 | X <- matrix(1:8, nrow=4) 706 | v <- 1:2 707 | 708 | X + v # 1) X is flattened out to a vector form 1 2 3 4 5 6 7 8 709 | # 2) v is recycled to match length of X 1 2 1 2 1 2 1 2 710 | # 3) element-wise addition takes place 2 4 4 6 6 8 8 10 711 | # 4) the result is folded back into a matrix, column by column: 712 | # [,1] [,2] 713 | # [1,] 2 6 714 | # [2,] 4 8 715 | # [3,] 4 8 716 | # [4,] 6 10 717 | ``` 718 | 719 | This allows specifying some common operations easily: 720 | 721 | ```r 722 | X <- matrix(1:20, nrow=4) 723 | Y <- matrix(runif(20), nrow=4) 724 | 725 | X + 1 # add a number to each element of a matrix 726 | X > 2 # compare each element of a matrix with a number 727 | X / Y # divide elements of one matrix by elements of another 728 | X < Y # compare two matrices element-wise 729 | X - rowMeans(X) # subtract the mean of each row of a matrix 730 | ``` 731 | 732 | Just like operators, functions that work on vectors treat matrices as vectors: 733 | 734 | ```r 735 | X <- matrix(1:20, ncol=4) 736 | 737 | length(X) # 20 738 | sum(X) # sum of all elements 739 | sqrt(X) # square root of each element within the matrix 740 | ``` 741 | 742 | Many specialized functions and operators are available for working with matrices. 
Some of them: 743 | 744 | ```r 745 | X <- matrix(rnorm(20), ncol=4) 746 | Y <- matrix(rnorm(20), nrow=4) 747 | 748 | t(X) # transpose of a matrix 749 | diag(X) # obtain matrix diagonal 750 | X %*% Y # matrix multiplication 751 | X %x% Y # kronecker product of two matrices 752 | ``` 753 | 754 | Elements within a matrix can be subsetted and replaced as if they were in a vector: 755 | 756 | ```r 757 | X <- matrix(1:20, ncol=4) 758 | 759 | X[1] 760 | X[c(2,3,5)] 761 | X[-2] 762 | X[c(TRUE,FALSE)] 763 | X[X >= 10] 764 | 765 | X[1] <- 0 766 | X[c(2,3,5)] <- c(-2,-3,-5) 767 | X[X >= 10] <- 10 768 | ``` 769 | 770 | In addition, a matrix allows performing the same operations on rows and columns: 771 | 772 | ```r 773 | X <- matrix(1:20, ncol=4) 774 | X[1:2, 1:2] 775 | ``` 776 | 777 | Selecting rows and columns by an index: 778 | 779 | ```r 780 | X <- matrix(1:20, ncol=4) 781 | 782 | X[1:2,] # first two rows 783 | X[,3:4] # 3rd and 4th column 784 | X[1:2, 3:4] # combination of the above 785 | X[1,] # selecting a single row returns a simple vector (not a matrix) 786 | X[,2] # same with columns 787 | ``` 788 | 789 | Removing rows and columns by an index: 790 | 791 | ```r 792 | X <- matrix(1:20, ncol=4) 793 | 794 | X[-1,] # drop first row 795 | X[,-c(2,3)] # drop second and third columns 796 | X[-1, -c(2,3)] # combination of the above 797 | X[-1, c(2,3)] # removal of rows combined with selection of columns 798 | X[,-c(1:3)] # when single row/column is left - a vector is returned 799 | ``` 800 | 801 | Selecting rows and columns by logical indices: 802 | 803 | ```r 804 | X <- matrix(1:20, ncol=4) 805 | 806 | X[c(TRUE,FALSE,TRUE,FALSE,FALSE),] # select 1st and 3rd rows 807 | X[,c(TRUE,FALSE,TRUE,FALSE)] # select 1st and 3rd columns 808 | X[c(TRUE,FALSE,TRUE,FALSE,FALSE), c(TRUE,FALSE)] # combination (with recycling) 809 | X[,c(TRUE,FALSE,FALSE,FALSE)] # returns a vector 810 | 811 | # a more practical example 812 | X[rowMeans(X) > 10,] # all rows with mean above 10 813 | ``` 814 | 815 | 
Rows and columns of a matrix can also have their own names that can be used for selection: 816 | 817 | ```r 818 | X <- matrix(1:20, ncol=4) 819 | rownames(X) <- paste0("r", 1:5) 820 | colnames(X) <- paste0("c", 1:4) 821 | 822 | X[c("r1","r2"),] # rows named "r1" and "r2" 823 | X[,c("c1","c3")] # columns named "c1" and "c3" 824 | X[c("r1","r2"), c("c1","c3")] # combination of the above 825 | X[,"c4"] # returns a vector 826 | ``` 827 | 828 | In addition, a matrix also has a special subsetting format - by row-column pairs: 829 | 830 | ```r 831 | X <- matrix(1:20, ncol=4) 832 | i <- rbind(c(1,2), # element in row 1 column 2 833 | c(2,2), # element in row 2 column 2 834 | c(4,1) # element in row 4 column 1 835 | ) 836 | 837 | X[i] # selects the elements in all the specified positions 838 | ``` 839 | 840 | And the same with names: 841 | 842 | ```r 843 | X <- matrix(1:20, ncol=4) 844 | rownames(X) <- paste0("r", 1:5) 845 | colnames(X) <- paste0("c", 1:4) 846 | i <- rbind(c("r1","c2"), 847 | c("r2","c2"), 848 | c("r4","c1") 849 | ) 850 | 851 | X[i] # same as before 852 | ``` 853 | 854 | Elements within the matrix can be replaced using all the subsetting examples outlined above. 855 | But to save time and space these will not be demonstrated further. 856 | 857 | 858 | ## Dataframes ## 859 | 860 | Dataframes in R are implemented as a class on top of lists, with the restriction that every element of the list has the same length. 861 | Such an implementation makes it possible to construct tables, where each list element is interpreted as a separate column. 862 | As a result, unlike a matrix, `data.frame` class can contain columns of different types.
863 | 864 | ```r 865 | l <- list(id=1:5, name=c("a","b","c","d","e"), state=c(TRUE,TRUE,FALSE,TRUE,TRUE)) 866 | as.data.frame(l) 867 | 868 | # id name state 869 | # 1 1 a TRUE 870 | # 2 2 b TRUE 871 | # 3 3 c FALSE 872 | # 4 4 d TRUE 873 | # 5 5 e TRUE 874 | 875 | # same as above: 876 | df <- data.frame(id=1:5, name=c("a","b","c","d","e"), state=c(TRUE,TRUE,FALSE,TRUE,TRUE)) 877 | ``` 878 | 879 | This also permits to have nested lists in the columns of a data frame: 880 | 881 | ```r 882 | df <- data.frame(id=1:2, name=c("Albert","Bob")) 883 | df$objects <- list(c("key","pin"), c("ball")) 884 | 885 | # id name objects 886 | # 1 1 Albert key, pin 887 | # 2 2 Bob ball 888 | ``` 889 | 890 | Since data frames are lists - they are subsetted using list subset operator: 891 | 892 | ```r 893 | df <- data.frame(id = 1:3, name = c("Albert","Bob","Cindy")) 894 | 895 | df[1] # first list element (first column) 896 | df[1:2] # first two list elements (first and second columns) 897 | df[-1] # removes first list element (first column) 898 | df[[2]][2:3] # second and third vector elements from the second column 899 | df$id # list element (column) by name 900 | df$name[1:2] # first two vector elements of "name" column 901 | ``` 902 | 903 | Equivalent operators can be used for replacement. 
904 | Here only a few more practical examples are provided: 905 | 906 | ```r 907 | df <- data.frame(id = 1:3, name = c("Albert","Bob","Cindy")) 908 | 909 | df$name[df$id=="2"] <- "Bobby" 910 | df$surname <- c("Thompson", NA, "Friedman") 911 | df$surname[df$name=="Bobby"] <- "Smith" 912 | ``` 913 | 914 | Since data frame is arranged as a matrix it also allows matrix subset operations, with equivalent forms used for replacement: 915 | 916 | ```r 917 | df <- data.frame(id = 1:3, name = c("Albert","Bob","Cindy")) 918 | 919 | df[,1] # first column (like with a matrix, returns a simple vector) 920 | df[1,] # first row of a data frame (returns a data frame) 921 | df[df$id==2, ] # all entries for id = 2 922 | df[df$id %in% 2:3, ] # all entries for ids 2 and 3 923 | ``` 924 | 925 | Just like a matrix the data frame can have row names, with column names often provided by default. 926 | And those names can be used to subset rows and columns in a matrix style: 927 | 928 | ```r 929 | df <- data.frame(id = 1:3, name = c("Albert","Bob","Cindy")) 930 | rownames(df) <- c("a", "b","c") 931 | 932 | # id name 933 | # a 1 Albert 934 | # b 2 Bob 935 | # c 3 Cindy 936 | 937 | df[c("a","b"),] # rows named "a" and "b" 938 | df[,"name"] # column named "name" 939 | df[c("a","b"), "name"] # combination of the two 940 | ``` 941 | 942 | Special row-column pair selection works as well: 943 | 944 | ```r 945 | df <- data.frame(id = 1:3, name = c("Albert","Bob","Cindy")) 946 | i <- rbind(c(1,2), # 1st row, 2nd column 947 | c(3,1) # 3rd row, 1st column 948 | ) 949 | 950 | df[i] 951 | ``` 952 | 953 | Replacing the whole entry (one row) is an often needed procedure. 954 | Since such entries are subsets of the data frame they are data frames themselves and need to be replaced with an equivalent data frame (or a list). 
955 | Hence the replacement has to use equal element names: 956 | 957 | ```r 958 | df <- data.frame(id = 1:3, name = c("Albert","Bob","Cindy")) 959 | 960 | df[df$id==2, ] <- list(id=2, name="Bruce") # new entry for id = 2 961 | df[df$id==2, ] <- data.frame(id=2, name="Bruce") # same 962 | 963 | # id name 964 | # 1 1 Albert 965 | # 2 2 Bruce 966 | # 3 3 Cindy 967 | ``` 968 | -------------------------------------------------------------------------------- /figure/ridgeline.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /figure/parallel.svg: -------------------------------------------------------------------------------- 1 | --------------------------------------------------------------------------------