├── README.md
└── x

/README.md:
--------------------------------------------------------------------------------
# R-Style-Guide -- Code RED

This guide is not about making your code "pretty"! Our goal here is to
write code that is **readable**, **extensible** and **debuggable** (RED).

# Why It's So Important

*Your goal should be to avoid future grief,* both for yourself and for
others who might work on your code.

If you are writing code that you or others will be using in the future,
it is absolutely essential to write "good" code, meaning:

* It is easy to understand, both by others and *by yourself*, six months
  later. This facilitates debugging and adding new features.

* It is robust to errors, in that it does error checking and messaging.
  Especially important is avoiding "silent" errors, i.e. ones that do
  not cause execution to cease but cause possibly major damage in the
  output, undetected.

In other words, your goal should be

**The RED Principle:**

* Readability.

* Extensibility.

* Debuggability.

Note that the last of these will be especially important in this
document. You will typically spend much more time debugging your code
than writing it, and this will continue throughout the life of your
code. So, *anti-bugging* is key.

# What This Guide Does NOT Do

There are no suggestions here on object naming, number of spaces to
indent, placement of braces and so on. Yes, meaningful names for variables
etc. are very important, as is indenting, but we have no recommendations
on the details. We stick to more specific things that can really make a
difference in the quality of your code from a RED point of view.

# Motivating Example

Recently I was running some vital production code written by others.
The code reads input files and processes them in complex ways.

In many cases for this code, the input files have problems. These can
occur in myriad ways, causing the software to choke. In the case of one
particular input file, the code did choke, so I ran **debug()**,
finding that the problem occurred on the line (I've changed things
slightly, changing some names and actually shortening it)

``` r
cn <- paste(g[which(!is.na(str_locate(clientLines,"^[]*cn")[,"start"]))[1]:(abs_-1L)],collapse="")
```

My major obstacle to solving the problem with the input file was
figuring out what this code does! Mind you, the code itself was
correct; the input file was the problem, somehow. But it was quite
difficult to work with here:

* Way too much nesting of function calls:
  **str_locate()** within
  **is.na()** within
  **"["()** within
  **paste()**.

* No code to check whether **cn** is NA (which it was in this instance).

* No preceding or following comments explaining what **cn** and
  **clientLines** were supposed to contain.

# Writing Functions

## Nested function calls

Nesting should rarely if ever go beyond two levels. Something like

``` r
d <- f(g(h(x)))
```

should be broken down, saving intermediate results, e.g.

``` r
th <- h(x)
tg <- g(th)
d <- f(tg)
```

There are three advantages to this multi-stage, multiline approach:

* It's much easier to read, allowing one to "catch one's breath" at each
  stage and ponder what is happening.

* It allows adding comments at some of the key stages.

* It is already set up for debugging, so you can easily set a
  breakpoint, say at the second stage, then step through
  the code from there.

What about Magrittr pipes?

* You could do something similar with Magrittr pipes if you like using
  them. Make sure to put each stage of the pipe on a separate line, to
  enhance the RED-ness of your code, especially for enabling comments at
  the key stages.

* But that still would not address the debuggability issue. One would need
  to physically modify the code to set a breakpoint, which is distracting
  and may break one's train of thought. Avoiding this is the whole point
  of having a debugging tool in the first place: the alternative, adding
  and deleting **print()** statements, takes time and is distracting.

* I personally am not a fan of Magrittr pipes, as I just don't see the
  need for them. The same "left-to-right" (or top-to-bottom) thinking
  that pipes are supposed to facilitate is easily attained with the "each
  stage on a separate line" pattern displayed above. And the impact on
  speed/memory is negligible.

## Size of functions (in lines)

In order to improve clarity of code, computer science students are
taught to code in a *top-down* or *modular* manner. Roughly, the idea
is something like this:

``` r
f <- function(someArguments)
{
   # "glue" lines
   out1 <- g(someArguments)
   # more "glue" lines
   out2 <- h(someArguments)
   # more "glue" lines
   out3 <- i(someArguments)
   # ...
}
```

In other words, the body of a function consists largely of calls to
other functions. And those functions themselves will be of that form as
well.

The advantage, which applies just as well to R users who are not
primarily programmers as it does to CS people, is that
in this manner, *we limit the length of a function*.
Just as a
supermarket may open a new checkstand whenever lines in the existing
ones exceed, say, three people, if the number of lines in a function you
write exceeds, say, a screenful, you should shorten it by moving some of
the lines into separate functions.

Why do this?

* It's difficult to read/debug/enhance a long function.

* This style in essence presents an outline of what the function does
  (especially with good function names).

* When you are writing the code, you write this "outline" first, then fill
  in the details by writing the functions **g()**, **h()** and so on.

* This makes your code easier to *write* for *you*.

* For *others*, it makes your code easier to *read*.

* For both you and others, it makes your code easier to *debug*.


## Anonymous functions

Code like

``` r
g(x,y,function(t) t^2)
```

should be used very sparingly. Once again, the problem is debugging.
Since the function has no name (technically, is not bound to a name),
one cannot instruct one's debugging tool to pause in that function.

## Use of global variables

Personally I have always thought the stern admonitions against global
variables are overblown. Indeed, you may be surprised to see that some
of your favorite CRAN packages, e.g. **ggplot2**, do make use of some
globals.

Careful use of globals can make your code easier to write and debug.
You should, however, place your globals in an R **environment**, so that
they are clearly set apart from your other variables.

# Commenting Your Code

In our motivating example above, much of my debugging time would have
been unnecessary had the authors of the code included a comment stating
what the line is supposed to do, e.g.
something as simple as

``` r
# find line number for client information
```

Let's discuss this further.

## Central role in code development

In any programming course for Computer Science students, commenting is
absolutely central. If a student turns in a programming assignment with
few or no comments, it will get a [poor
grade](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-189-a-gentle-introduction-to-programming-using-python-january-iap-2011/lectures/MIT6_189IAP11_comment.pdf).
If comments are needed for clarity and readability for CS students, who
are presumably strong programmers, then R users who are not expert
programmers need comments even more.
A [style
guide](https://www.cs.utah.edu/~germain/PPS/Topics/commenting.html)
at a top university computer science department puts it well:

> Commenting involves placing Human Readable Descriptions inside of
> computer programs detailing what the Code is doing. Proper use of
> commenting can make code maintenance much easier, as well as helping
> make finding bugs faster. Further, commenting is very important when
> writing functions that other people will use. Remember, well
> documented code is as important as correctly working code.

(Also see specific tips on commenting, later in that document.)

## Comments spanning an entire file or a chunk of one

One tends to think of a comment as pertaining to the one or two lines of
code that follow it. But one should also write comments with wider
scope.

* At the top of each source file, insert comments giving the reader an
  overview of the contents. This will typically be an overview of the roles
  of each major function, how the functions interact with each other, what
  the main data structures are, and so on.

  I strongly recommend that you write these comments at the top of a file
  BEFORE you start coding (and of course modify them as you write the
  code). This will really help you focus during the coding process.

* Comments that prepare the reader for the following chunk of code, say
  6-12 lines, can also greatly enhance the RED-ness of your code, e.g.

  ``` r
  # Our strategy will be to first create a matrix of index numbers of the
  # k closest neighbors of each of the given points, then construct the
  # corresponding graph.
  ```

  Armed with this overview, the reader will be much better prepared to
  tackle the code that follows.

## Can code be self-documenting?

To some degree, the number of comments can be reduced via use of
descriptive variable and function names, e.g.

``` r
OverAge65Rows <- getOlder(.....)
```

But comments can be much more descriptive, e.g.

``` r
# at this point, the data frame w will consist of the original rows for
# people over age 65 and who are renters; the data continue to be sorted
# by increasing ZIP Code
```

And as noted earlier, comments with scope spanning the entire file or
significant chunks of it can be quite helpful, something that mere choice
of object names can't achieve.

## Use of the roxygen Package

Use of this package can save you time in writing the online help pages
for your package. By writing some comments in **roxygen** format, you
simultaneously are writing the help pages! Pretty cool.

But personally, I do not use **roxygen**. Program
comments need to be much more detailed than what
goes into help pages. I recommend not using **roxygen**.
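
For readers who have not seen the format, here is a minimal sketch of
what **roxygen**-style comments look like. The function **squareAll()**
is hypothetical, invented purely for illustration; the tags shown are
standard **roxygen2** ones. Note how brief the help-page text is
compared with the kind of explanatory comments discussed above.

``` r
#' Square each element of a numeric vector
#'
#' @param x A numeric vector.
#' @return A numeric vector of the same length as \code{x}.
#' @examples
#' squareAll(c(1, 2, 3))
#' @export
squareAll <- function(x)
{
   # ordinary comments still carry the implementation detail;
   # squareAll() is a hypothetical function, used only for illustration
   x^2
}
```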

# Use of External Packages in General

One of the truly great things about the R language is CRAN and
Bioconductor, with packages for everything under the sun. I use CRAN
packages a lot.

On the other hand, you should make sure an external package is really
necessary for your code. Relying on a lot of packages can:

* Make your code hard to maintain. Every time a package that you use
  changes, you run the risk that this breaks your code. And remember,
  those packages may in turn depend on still more packages, and so on,
  further increasing your exposure.

* Introduce subtle bugs. You never can be completely sure that those
  packages do exactly what you want in all circumstances. Another
  package's edge case may be a central use case for your package, and
  the resulting bugs can be hard to track down.

In many, probably most cases, the advantages of relying on a package
will outweigh the above concerns. But one should approach this very
carefully.

# Error Checking

Wherever your code, for instance, extracts a subset from a vector or
data frame, don't assume the result will be non-NULL! You should have
code to check for this whenever you are not completely sure a non-NULL
result will occur.

Call **stop()** for serious errors, **warning()** for "iffy" cases.

It may be quite useful to use **tryCatch()** in many of these cases.
Instead of your code blowing up, it can give the user a chance to fix
her input error, or if the code does blow up, it may print out some
helpful information.

One often sees admonitions against writing something like

``` r
for (i in 1:length(x)) x[i] <- x[i]^2
```

The problem is that **x** might have length 0, in which case the loop
would iterate along 1:0, i.e. over the values 1 and 0, almost certainly
not the desired outcome.
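
To see the problem concretely, here is a small sketch. The variable
names **idx** and **nIter** are mine, introduced only for illustration;
the empty vector stands in for, say, the result of a subsetting
operation that matched nothing.

``` r
x <- numeric(0)        # an empty vector

idx <- 1:length(x)     # evaluates 1:0, which counts DOWN from 1 to 0
print(idx)

nIter <- 0             # count how many times the loop body runs
for (i in 1:length(x)) {
   nIter <- nIter + 1
}
print(nIter)           # the body ran twice, not zero times
```

So a loop that "should" have done nothing executes twice, with the
index values 1 and 0.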

Those warning against this say one should use **seq_along()** instead

``` r
for (i in seq_along(x)) x[i] <- x[i]^2
```

If **x** is empty, the loop will not execute any iterations, and is thus
claimed to be harmless. But actually this could be quite dangerous, as
it may mask a problem that wreaks havoc later in the code. A safer
solution is something like

``` r
lx <- length(x)
if (lx > 0) {
   for (i in 1:lx) x[i] <- x[i]^2
} else warning('lx = 0')
```


# Functional Programming

Recently there has been a lot of interest in the R world in *functional
programming* (FP). But of course, a desire to be fashionable should not
take priority over RED principles.

Typically FP will enable one to replace an entire loop with a single
line of code. This can be beneficial to the RED-ness of your code. For
instance, it can aid in making code "top-down" as described earlier.
I often use **apply()** and **lapply()** in my own code (and
sometimes **Map()**, **Reduce()** and **Filter()**). I don't use
**purrr**, not having a need for it, but it is certainly a powerful
package.

On the other hand, it must be kept in mind that **FP may increase the
complexity of your code**, which may run counter to our RED goals.

One should be especially aware of the possibility that an FP version of
your code may be difficult to debug (very un-RED!). I like the comments
in [this blog post](https://www.weirdfishes.blog/blog/practical-purrr/#debugging-using-safely):

> One annoying thing about using map (or apply) in place of loops is
> that it can make debugging much harder to deal with. With a loop, it’s
> easy to see where exactly an error occurred and your loop failed (e.g.
> look at the index of the loop when the error occurred).
> With map, it
> can be much harder to figure out where the problem is, especially if
> you have a very large list that you’re mapping over.

The author then recommends the **safely()** function in the case of
**purrr** code. For base-R functions such as **lapply()** one can use
**tryCatch()** as explained above. But these approaches will only make
things somewhat easier, and in the end a loop may be easier to debug.

Note carefully that in the computer science world, FP is considered an
advanced, abstract concept. An interesting discussion of the topic is
in [Chakravarty and
Keller](https://www-ps.informatik.uni-kiel.de/~mh/reports/fdpe02/papers/paper15.ps.gz).
They believe that FP, in the standard form in which it appears in
introductory programming classes, is unsuitable even for CS majors.
R users, who generally have less programming background, may find FP
even harder to code.

So, in many cases, using a loop rather than FP may be RED-der. Don't
feel that you "must" avoid loops. Again, if you browse through your
favorite CRAN packages, you'll see lots of loops.

# Classes

R is blessed with a number of different class structures, notably S3, S4
and R6. My own view of R6 is that it is far too abstract for writing RED
code. Simpler is often better! I recommend sticking mainly to S3, or
if you prefer a bit more tightness, S4.

Also, resist the temptation to make a class out of everything. Once
again, the principle should be whether the class would improve the
RED-ness of your code. Adding structure means adding complexity, which
may result in weaker RED-ness. My own code does often have S3 classes,
but it is not top-heavy with them.
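
To give a sense of how little machinery S3 requires, here is a minimal
sketch. The class name **account**, the constructor **newAccount()** and
its fields are all hypothetical, made up for illustration only.

``` r
# constructor for a hypothetical S3 class "account"
newAccount <- function(owner, balance = 0)
{
   # error checking, in the spirit of the Error Checking section
   if (!is.numeric(balance) || length(balance) != 1)
      stop("balance must be a single number")
   acct <- list(owner = owner, balance = balance)
   class(acct) <- "account"
   acct
}

# print method; R dispatches to it when print() is called on an
# object of class "account"
print.account <- function(x, ...)
{
   cat("Account of", x$owner, "with balance", x$balance, "\n")
   invisible(x)
}

a <- newAccount("Ann", 100)
print(a)
```

An S3 object here is just a list with a class attribute, and a method
is just an ordinary named function, so one can set a breakpoint in
**print.account()** like in any other function.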

# Shorter Code May Not Be Better Code

One often sees discussions on the Web in which people post various
solutions to a given problem, and in which the tacit goal is to make the
code as compact as possible. This is very different from the RED goals,
and in my experience, is generally counter to them.

# Other R Style Guides

The above recommendations stem from my experience as a software developer and
teacher of programming. But in the end, style is a matter of taste.
The reader should look at some of the others:

* [Hadley Wickham](https://style.tidyverse.org/); see also
  [Google](https://google.github.io/styleguide/Rguide.html)

* [Jean Fan](https://jef.works/R-style-guide/)

* [more to be added as I encounter them]

# Debugging Tools

Make sure to use one! R's internal browser is not bad, and RStudio, ESS
and StatET have visual ones. Don't use **print()** lines to debug your
code!

--------------------------------------------------------------------------------
/x:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------