├── Images
    ├── Rlogo.png
    └── python-logo.png
└── README.md


/Images/Rlogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/matloff/R-vs.-Python-for-Data-Science/a89734f4c65267b4f2f538a77be7194e2ca8c6c0/Images/Rlogo.png
--------------------------------------------------------------------------------
/Images/python-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/matloff/R-vs.-Python-for-Data-Science/a89734f4c65267b4f2f538a77be7194e2ca8c6c0/Images/python-logo.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# R vs. Python for Data Science

## Norm Matloff, Prof. of Computer Science, UC Davis; [my bio](http://heather.cs.ucdavis.edu/matloff.html)

Hello! This Web page is aimed at shedding some light on the perennial
R-vs.-Python debates in the Data Science community. This is largely
(though not exclusively) a debate between the Statistics (R) and
Computer Science (Python) fields. Since I have a foot in both camps
(I was a founding member of both the Statistics and Computer Science
Departments at UC Davis), I hope to offer a useful perspective on the
topic.

I have a potential bias: I've written four R-related books; I've given
keynote talks at useR! and other R conferences; I have served as
Editor-in-Chief of the *R Journal*; etc. But I am also an enthusiastic
Python coder, have been for many years, and am the author of a [popular
Python
tutorial](https://www.cs.ucdavis.edu/~matloff/matloff/public_html/Python/PythonIntro.html).
I hope this analysis will be considered fair and helpful.

Again, **please note:** The emphasis here is on Data Science, not
*Computer* Science.
Note too that other than references to specific
packages, "R" here means base R, not the Tidyverse, of which I have
been [a critic](https://github.com/matloff/TidyverseSkeptic).

## Learning curve

*Huge win for R.*

This is of particular interest to me, as an educator. I've taught a
number of subjects -- math, stat, CS and even English As a Second
Language -- and have given intense thought to the learning process for
many, many years.

To even get started in Data Science with Python, one must learn a lot of
material not in base Python, e.g., NumPy, Pandas and matplotlib. These
libraries require a fair amount of computer systems sophistication.

Python libraries can be tricky to configure, even for the systems-savvy,
while most R packages run right out of the box.

## Data Science emphasis

**Huge win for R.**

In my book, *The Art of R Programming*, I wrote "R is written *by*
statisticians, *for* statisticians," a line I've been pleased to see
quoted by others. One could update that to read "R is written *by*
data scientists, *for* data scientists," and it is of crucial importance
in our discussion here.

Matrix types, data frames, missing-value handling, basic graphics,
date-and-time processing, linear models, basic statistics, contingency
tables and so on are built into base R. The novice can be doing simple
data analyses with these tools within minutes.

Generally an R data science function will be richer in coverage than its
Python counterpart. For instance, R's histogram plot function,
**hist()**, offers many advanced options, which is not the case for
Python.

All this is the result of the fact that, indeed, "R is written *by* data
scientists, *for* data scientists."
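
To make the point about missing-value handling concrete from the Python side, here is a minimal sketch using only the standard library. The data and the helper name `mean_skipping_missing` are hypothetical, made up for illustration. It shows that base Python has no built-in notion of a missing value, so the filtering that R's `mean(x, na.rm = TRUE)` performs must be written by hand (or delegated to an add-on library such as Pandas).

```python
import statistics

# Hypothetical sample data; None stands in for a missing value.
heights = [172.0, None, 181.5, 165.0, None]


def mean_skipping_missing(values):
    """Mean of the non-missing entries, filtering out None by hand."""
    present = [v for v in values if v is not None]
    return statistics.mean(present)


# The standard-library mean does not know how to skip missing entries:
# with None in the data it raises TypeError rather than ignoring them.
try:
    statistics.mean(heights)
    naive_call_worked = True
except TypeError:
    naive_call_worked = False

clean_mean = mean_skipping_missing(heights)
```

The design choice here is deliberately naive: `None` is only one of several conventions for missing data in Python, which is itself part of the contrast with R's single built-in `NA`.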

## Available libraries for Data Science

*Slight edge to R.*

[CRAN](https://cran.r-project.org/) has over 14,000 packages.
[PyPI](https://pypi.org/) has over 183,000 (both numbers are growing),
but it seems thin on Data Science.

For example, I once needed code to do fast calculation of
nearest-neighbors of a given data point. (NOT code using that to do
classification.) I was able to immediately find not one but two packages
in CRAN to do this. By contrast, recently I tried to find
nearest-neighbor code for Python and, at least with my cursory search in
PyPI, came up empty-handed; there was just one implementation that
described itself as simple and straightforward, nothing fast.

The following (again, cursory) searches in PyPI turned up nothing: EM
algorithm; log-linear model; Poisson regression; instrumental variables;
spatial data; familywise error rate; etc.

This is not to say no Python libraries exist for these things; I am
simply saying that they are not easily found in PyPI, whereas it is easy
to find them in CRAN, and indeed, such libraries are more likely to be
in CRAN but not PyPI. Once again, this reflects the difference in
orientation, Data Science for R versus Computer Science for Python.

And the fact that R has a canonical package structure is a big
advantage. When installing a new package, one knows exactly what to
expect. Similarly, R's *generic functions* are an enormous plus for R.
When I'm using a new package, I know that I can probably use
**print()**, **plot()**, **summary()**, and so on, while I am exploring,
without checking the documentation. These form a "universal language"
for packages.
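
For readers coming from Python, the idea of a generic function can be sketched with the standard library's `functools.singledispatch`: one function name whose behavior depends on the class of its argument. This is only a rough analogue of R's S3 dispatch, not an equivalent. The classes `LinearFit` and `Histogram` and the function `summary` below are hypothetical, invented for illustration.

```python
from functools import singledispatch


class LinearFit:
    """Hypothetical result object from a model-fitting routine."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept


class Histogram:
    """Hypothetical result object holding bin counts."""

    def __init__(self, counts):
        self.counts = counts


@singledispatch
def summary(obj):
    """Fallback, used when no class-specific method has been registered."""
    return f"<{type(obj).__name__} object>"


@summary.register(LinearFit)
def _(obj):
    # Method chosen when summary() is called on a LinearFit.
    return f"LinearFit: y = {obj.intercept} + {obj.slope} * x"


@summary.register(Histogram)
def _(obj):
    # Method chosen when summary() is called on a Histogram.
    return f"Histogram: {len(obj.counts)} bins, {sum(obj.counts)} observations"
```

The caller writes `summary(x)` without knowing which class `x` belongs to. The difference in practice is one of convention: in R, package authors routinely supply such methods for **print()**, **plot()** and **summary()**, which is what makes them a shared vocabulary.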

## Visualization tools

*Win for R*

Unlike Python, base R itself has sophisticated graphics utilities built
in, and there are two outstanding graphics packages available,
**ggplot2** and **lattice**. The former is so widely used that many
probably perceive it as being part of base R.

But it goes far beyond that. As noted, a major built-in function in R
is **plot()**. It is *polymorphic*, meaning that its role is different
for each use case it has been written for. This is a fancy term whose
practical meaning is that the objects returned by R functions are
typically paired with a visualization, which we can invoke simply by
calling the generic **plot()**.

## Machine learning

*Slight (or more) edge to Python*.

As noted, the R-vs.-Python debate is largely a Statistics-vs.-CS debate,
and since most research in neural networks has come from CS, available
software for NNs is mostly in Python. To many in CS, machine learning
means neural networks (NNs).

RStudio/Posit has done some excellent work in developing a Keras
implementation, and there are R interfaces to PyTorch and so on, but so
far R is limited in this realm. Again, if one's view is that
Data Science = NNs, then Python is the language of choice.

On the other hand, random forest research originated in the
Statistics community, and most research in this field has been conducted
there. In this realm I'd submit that R has the superior software. The
**grf** package, for instance, allows linear interpolation within tree
leaves, crucial for removing bias near the edges of the data. R also
has excellent packages for gradient boosting, another field originally
invented in Statistics.

## Statistical sophistication

*Big win for R*.


As noted, I use the slogan, "R is written *by* statisticians, *for*
statisticians." It's important!

To be frank, I find the machine learning people, who mostly advocate
Python, often have a poor understanding of, and in some cases even a
disdain for, the statistical issues in ML. And, sadly, I often see
ignorance. I was shocked recently, for instance, to see one of the most
prominent ML people state in his otherwise superb book that
standardizing the data to mean-0, variance-1 means one is assuming the
data are Gaussian -- absolutely false and misleading.

## Parallel computation

*Let's call it a tie.*

Neither the base version of R nor that of Python has good support for
multicore computation. Threads in Python are nice for I/O, but multicore
computation using them is difficult, due to the infamous Global
Interpreter Lock. Python's **multiprocessing** package is much better
than before, but still clunky. R's **parallel** package does allow
shared memory on Macs or Linux, but not on Windows platforms.

(See my **Rdsm** package if you wish to use shared memory
at the R level.)

External libraries supporting cluster computation are OK in both languages.

Currently Python has better interfaces to GPUs, but again, only for NNs.

## C/C++ interface and performance enhancement

*Slight win for R.*

Though there are tools like SWIG etc. for interfacing Python to C/C++,
as far as I know there is nothing remotely as powerful as R's **Rcpp** for
this at present. The **Pybind11** package is being developed.

In addition, R's new ALTREP idea has great potential for enhancing
performance and usability.

On the other hand, the Cython and PyPy variants of Python can in some
cases obviate the need for explicit C/C++ interface in the first place;
indeed some would say Cython IS a C/C++ interface.

## Object orientation, metaprogramming

*Slight win for R*.

For instance, though functions are objects in both languages, R takes
that further than does Python. Whenever I work in Python, I'm annoyed
by the fact that I cannot directly print a function to the terminal or
edit it, which I do a lot in R. (This goes back to the *polymorphic*
nature of R's **print()** function etc.)

Python has just one OOP paradigm. In R, you have your choice of several
(S3, S4, R6 etc.), though some may debate whether this is a good thing.

R's metaprogramming features (code that produces code) are wonderful,
arguably as powerful as, or more powerful than, those of Python. In
both languages, these are only for experts, but the point is that R is
competitive with Python even in this highly CS-ish aspect.

## Linked data structures

*Win for Python.*

**This is not a big issue in Data Science**, but it does come up in some
contexts.

Classical computer science data structures, e.g. binary trees, are easy
to implement in Python. This can be done in R in various ways, e.g.
with the **datastructures** package, which wraps the widely-used **Boost**
C++ library, but it is not base-R.

## Online help

*Big win for R.*

To begin with, R's basic **help()** function is much more informative
than Python's. It's nicely supplemented by **example()**. And most
important, the custom of writing vignettes in R packages makes R a
hands-down winner in this aspect.
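
As a point of comparison for **example()**, the nearest standard-library counterpart on the Python side is probably the **doctest** module, which runs the examples embedded in a function's docstring, i.e. the same text that Python's **help()** displays. The sketch below is illustrative only and is not a claim of parity with R's help pages or vignettes. The function `standardize` and its data are hypothetical.

```python
import doctest
import inspect
import statistics


def standardize(xs):
    """Rescale a list of numbers to mean 0 and (sample) variance 1.

    >>> standardize([1.0, 2.0, 3.0])
    [-1.0, 0.0, 1.0]
    """
    m = statistics.mean(xs)
    s = statistics.stdev(xs)
    return [(x - m) / s for x in xs]


# The cleaned-up docstring is the text that help(standardize) would show.
doc = inspect.getdoc(standardize)

# Parse the docstring and run its embedded example, loosely analogous to
# calling example() on a function in R.
test = doctest.DocTestParser().get_doctest(
    doc, {"standardize": standardize}, "standardize", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
```

Here `results.failed` counts examples whose actual output differed from the docstring. Whether a given Python package ships such examples is up to its author, whereas in R the `Examples` section is a standard part of a help page.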

# Essential for data scientists to know both R and Python

Hopefully I've made a strong case above for using R in data science, in
analytics, visualization and so on. It is [widely used in business and
industry](https://github.com/ThinkR-open/companies-using-r).
In a very
significant move, Python **pandas** creator Wes McKinney recently
joined RStudio/Posit as a principal architect (see **reticulate**
below).

On the other hand, Python is my preferred tool in some applications. An
example is [OMSI](https://github.com/matloff/omsi), an online
examination tool that my students and I developed. Python's built-in
threading made our work much easier in that project.

I thus very strongly recommend that those considering a data science
career not only *learn* both languages, but also *use* them, thus
developing expertise.

## Hybrid R/Python applications

For similar reasons, some user apps may be best developed as a mixture
of R and Python. Here is the current status:

RStudio/Posit is to be commended for developing the **reticulate**
package, to serve as a bridge between Python and R. The package enables
calling Python from R code. For the opposite direction, calling R from
Python, I recommend **RPy2**, which is the approach we take in our **dsld**
package.

The **reticulate** package is an outstanding effort, and works well for
pure computation. But as far as I can tell, it does not solve the
knotty problems that arise in Python, e.g. virtual environments and the
like. The **RPy2** library has similar issues.

At present, computer systems expertise -- skill with environment
variables, search paths and general coding ability -- is required for
developing mixed R/Python apps.
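
As a small illustration of the kind of systems checks involved, the sketch below uses only the Python standard library to see whether R is reachable from the current search path and which related environment variables are set. Note that this is neither **reticulate** nor **RPy2**. It is a plain subprocess call to the `Rscript` command-line tool, used here only because it needs no extra installation. The function names are hypothetical.

```python
import os
import shutil
import subprocess


def find_rscript():
    """Return the full path to Rscript if it is on the search path, else None."""
    return shutil.which("Rscript")


def r_environment_report():
    """Collect settings that commonly matter when mixing R and Python."""
    return {
        "rscript_path": find_rscript(),
        # R_HOME, if set, points at the R installation directory.
        "R_HOME": os.environ.get("R_HOME"),
        # VIRTUAL_ENV is set when a Python virtual environment is active.
        "VIRTUAL_ENV": os.environ.get("VIRTUAL_ENV"),
    }


def run_r_expression(expr):
    """Evaluate one R expression via Rscript; return stdout, or None if R is absent."""
    rscript = find_rscript()
    if rscript is None:
        return None
    done = subprocess.run(
        [rscript, "-e", expr], capture_output=True, text=True, check=True
    )
    return done.stdout
```

For example, `run_r_expression("cat(1 + 1)")` returns R's printed result as a string when R is installed, and `None` otherwise. A report with `rscript_path` equal to `None` is usually the first thing to look at when a mixed R/Python app fails to start.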

## Learning R and Python

I have a [quick tutorial on R for
non-programmers](http://github.com/matloff/fasteR), an evolving project. I
also have a book, *The Art of R Programming*, NSP, 2011.

I have a [tutorial on
Python](http://heather.cs.ucdavis.edu/FastLanePython.pdf), for
those with a strong programming background.

# Thanks

This document has benefited from various reader comments, notably from
Dirk Eddelbuettel, as well as Paul Hewson, Bob Muenchen and Inaki Ucar.

*Updated December 17, 2023*

--------------------------------------------------------------------------------