├── Images
│   ├── Rlogo.png
│   └── python-logo.png
└── README.md
/Images/Rlogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/matloff/R-vs.-Python-for-Data-Science/a89734f4c65267b4f2f538a77be7194e2ca8c6c0/Images/Rlogo.png
--------------------------------------------------------------------------------
/Images/python-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/matloff/R-vs.-Python-for-Data-Science/a89734f4c65267b4f2f538a77be7194e2ca8c6c0/Images/python-logo.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # R vs. Python for Data Science
2 |
3 |
4 |
5 |
6 |
7 | ## Norm Matloff, Prof. of Computer Science, UC Davis; [my bio](http://heather.cs.ucdavis.edu/matloff.html)
8 |
9 |
10 | Hello! This Web page is aimed at shedding some light on the perennial
11 | R-vs.-Python debates in the Data Science community. This is largely
12 | (though not exclusively) a debate between the Statistics (R) and
13 | Computer Science (Python) fields. Since I have a foot in both camps
14 | (I was a founding member of both the Statistics and Computer Science
15 | Departments at UC Davis), I hope to shed some useful light on the topic.
16 |
17 | I have potential bias: I've written four R-related books; I've given
18 | keynote talks at useR! and other R conferences; I have served as
19 | Editor-in-Chief of the *R Journal*; etc. But I am also an enthusiastic
20 | Python coder, have been for many years, and am the author of a [popular
21 | Python
22 | tutorial](https://www.cs.ucdavis.edu/~matloff/matloff/public_html/Python/PythonIntro.html).
23 | I hope this analysis will be considered fair and helpful.
24 |
25 | Again, **please note:** The emphasis here is on Data Science, not
26 | *Computer* Science. Note too that other than references to specific
27 | packages, "R" here means base R, not the Tidyverse, of which I have
28 | been [a critic](https://github.com/matloff/TidyverseSkeptic).
29 |
30 | ## Learning curve
31 |
32 | *Huge win for R.*
33 |
34 | This is of particular interest to me, as an educator. I've taught a
35 | number of subjects -- math, stat, CS and even English As a Second
36 | Language -- and have given intense thought to the learning process for
37 | many, many years.
38 |
39 | To even get started in Data Science with Python, one must learn a lot of
40 | material not in base Python, e.g., NumPy, Pandas and matplotlib. These
41 | libraries require a fair amount of computer systems sophistication.
42 |
43 | Python libraries can be tricky to configure, even for the systems-savvy,
44 | while most R packages run right out of the box.
45 |
46 | ## Data Science emphasis
47 |
48 | **Huge win for R.**
49 |
50 | In my book, *The Art of R Programming*, I wrote "R is written *by*
51 | statisticians, *for* statisticians," a line I've been pleased to see
52 | quoted by others. One could update that to read "R is written *by*
53 | data scientists, *for* data scientists," and it is of crucial importance
54 | in our discussion here.
55 |
56 | Matrix types, data frames, missing-value handling, basic graphics,
57 | date-and-time processing, linear models, basic statistics, contingency
58 | tables and so on are built into base R. The novice can be doing simple
59 | data analyses with these tools within minutes.
60 |
61 | Generally an R data science function will be richer in coverage than its
62 | Python counterpart. For instance, R's histogram plot function,
63 | **hist()**, offers many advanced options, not the case for Python.
64 |
65 | All this is the result of the fact that, indeed, "R is written *by* data
66 | scientists, *for* data scientists."
67 |
68 | ## Available libraries for Data Science
69 |
70 | *Slight edge to R.*
71 |
72 | [CRAN](https://cran.r-project.org/) has over 14,000 packages.
73 | [PyPI](https://pypi.org/) has over 183,000 (both numbers are growing),
74 | but it seems thin on Data Science.
75 |
76 | For example, I once needed code to do fast calculation of
77 | nearest-neighbors of a given data point. (NOT code using that to do
78 | classification.) I was able to immediately find not one but two packages
79 | in CRAN to do this. By contrast, recently I tried to find
80 | nearest-neighbor code for Python and at least with my cursory search in
81 | PyPI, came up empty-handed; there was just one implementation that
82 | described itself as simple and straightforward, nothing fast.
83 |
84 | The following (again, cursory) searches in PyPI turned up nothing: EM
85 | algorithm; log-linear model; Poisson regression; instrumental variables;
86 | spatial data; familywise error rate; etc.
87 |
88 | This is not to say no Python libraries exist for these things; I am
89 | simply saying that they are not easily found in PyPI, whereas it is easy
90 | to find them in CRAN, and indeed, such libraries are more likely to be
91 | in CRAN but not PyPI. Once again, this reflects the difference in
92 | orientation, Data Science for R versus Computer Science for Python.
93 |
94 | And the fact that R has a canonical package structure is a big
95 | advantage. When installing a new package, one knows exactly what to
96 | expect. Similarly, R's *generic functions* are an enormous plus for R.
97 | When I'm using a new package, I know that I can probably use
98 | **print()**, **plot()**, **summary()**, and so on, while I am exploring,
99 | without checking the documentation. These form a "universal language"
100 | for packages.
101 |
102 | ## Visualization tools
103 |
104 | *Win for R.*
105 |
106 | Unlike Python, base R itself has sophisticated graphics utilities built
107 | in, and there are two outstanding graphics packages available,
108 | **ggplot2** and **lattice**. The former is so widely used that many
109 | probably perceive it as being part of base R.
110 |
111 | But it goes far beyond that. As noted, a major built-in function in R
112 | is **plot()**. It is *polymorphic*, meaning that its role is different
113 | for each use case it has been written for. This is a fancy term whose
114 | practical meaning is that the objects returned by R functions are
115 | typically paired with a visualization, which we can invoke simply by
116 | calling the generic **plot()**.
117 |
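For readers coming from the Python side, the idea of a generic function can be sketched with the standard library's `functools.singledispatch`. This is only an analogy, not a description of how R's S3 dispatch works internally, and the names `describe` and `LinearFit` are hypothetical, chosen purely for illustration:

```python
from functools import singledispatch


class LinearFit:
    """Hypothetical result object, standing in for what a model-fitting function returns."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept


@singledispatch
def describe(obj):
    # Fallback used when no type-specific method has been registered.
    return f"object of type {type(obj).__name__}"


@describe.register
def _(obj: LinearFit):
    # Method chosen automatically when the argument is a LinearFit.
    return f"linear fit: y = {obj.intercept} + {obj.slope} * x"


@describe.register
def _(obj: list):
    # Method chosen automatically when the argument is a list.
    return f"list of {len(obj)} values, min {min(obj)}, max {max(obj)}"


print(describe(LinearFit(2.0, 1.0)))
print(describe([3, 1, 2]))
print(describe(3.5))
```

The caller writes the same `describe(...)` call each time, and the argument's type selects the behavior. That is the sense in which one **plot()** call serves many kinds of R objects.
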
118 | ## Machine learning
119 |
120 | *Slight (or more) edge to Python*.
121 |
122 | As noted, the R-vs.-Python debate is largely a Statistics-vs.-CS debate,
123 | and since most research in neural networks has come from CS, available
124 | software for NNs is mostly in Python. To many in CS, machine learning
125 | means neural networks (NNs).
126 |
127 | RStudio/Posit has done some excellent work in developing a Keras
128 | implementation, and there are R interfaces to PyTorch and so on, but so
129 | far R is limited in this realm. Again, if one's view is that
130 | Data Science = NNs, then Python is the language of choice.
131 |
132 | On the other hand, random forest research originated in the
133 | Statistics community, and most research in this field has been conducted
134 | there. In this realm I'd submit that R has the superior software. The
135 | **grf** package, for instance, allows linear interpolation within tree
136 | leaves, crucial for removing bias near the edges of the data. R also
137 | has excellent packages for gradient boosting, another field originally
138 | invented in Statistics.
139 |
140 | ## Statistical sophistication
141 |
142 | *Big win for R*.
143 |
144 | As noted, I use the slogan, "R is written *by* statisticians, *for*
145 | statisticians." It's important!
146 |
147 | To be frank, I find the machine learning people, who mostly advocate
148 | Python, often have a poor understanding of, and in some cases even a
149 | disdain for, the statistical issues in ML. And, sadly, I often see
150 | ignorance. I was shocked recently, for instance, to see one of the most
151 | prominent ML people state in his otherwise superb book that
152 | standardizing the data to mean-0, variance-1 means one is assuming the
153 | data are Gaussian -- absolutely false and misleading.
154 |
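A small sketch of why that claim is false: standardizing is just a shift and a rescale, so it cannot change the shape of a distribution. The example below standardizes strongly right-skewed exponential data using only the Python standard library. The function names `standardize` and `skewness` are illustrative, not taken from any package, and the seed and sample size are arbitrary:

```python
import random
import statistics


def standardize(xs):
    """Rescale xs to mean 0 and (population) variance 1."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def skewness(xs):
    """Population skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    return statistics.fmean(z ** 3 for z in standardize(xs))


random.seed(0)  # arbitrary seed, for reproducibility only
raw = [random.expovariate(1.0) for _ in range(10_000)]  # strongly right-skewed
z = standardize(raw)

# Mean and variance of the standardized data: approximately 0 and 1.
print(round(statistics.fmean(z), 6), round(statistics.pvariance(z), 6))
# Skewness before and after standardizing: the two values agree.
print(round(skewness(raw), 2), round(skewness(z), 2))
```

The standardized values have mean 0 and variance 1, yet they are exactly as skewed as before. Nothing Gaussian was assumed or produced.
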
155 | ## Parallel computation
156 |
157 | *Let's call it a tie.*
158 |
159 | Neither the base version of R nor Python has good support for multicore
160 | computation. Threads in Python are nice for I/O, but multicore
161 | computation using them is difficult, due to the infamous Global
162 | Interpreter Lock. Python's **multiprocessing** package is much better
163 | than before, but still clunky. R's **parallel** package does allow
164 | shared memory for Macs or Linux, but not on Windows platforms.
165 |
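To illustrate the point about threads and I/O, here is a minimal sketch using the standard library's `concurrent.futures`. The function `fake_download` is a hypothetical stand-in for a real I/O-bound call, and its `time.sleep` simply simulates waiting on a network or disk:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(item_id, delay=0.05):
    """Stand-in for an I/O-bound call; time.sleep releases the GIL, as blocking I/O does."""
    time.sleep(delay)
    return item_id * 10


def fetch_all(item_ids, workers=8):
    # The waits overlap across threads rather than running back to back.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(fake_download, item_ids))


print(fetch_all(range(8)))
```

If `fake_download` were instead a CPU-bound, pure-Python computation, the threads in a standard CPython build would take turns holding the Global Interpreter Lock, and little or no speedup should be expected.
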
166 | (See my **Rdsm** package if you wish to use shared memory
167 | at the R level.)
168 |
169 | External libraries supporting cluster computation are OK in both languages.
170 |
171 | Currently Python has better interfaces to GPUs, but again, only for NNs.
172 |
173 | ## C/C++ interface and performance enhancement
174 |
175 | *Slight win for R.*
176 |
177 | Though there are tools like SWIG etc. for interfacing Python to C/C++,
178 | as far as I know there is nothing remotely as powerful as R's **Rcpp** for
179 | this at present. The **pybind11** package is being developed.
180 |
181 | In addition, R's new ALTREP idea has great potential for enhancing
182 | performance and usability.
183 |
184 | On the other hand, the Cython and PyPy variants of Python can in some
185 | cases obviate the need for an explicit C/C++ interface in the first place;
186 | indeed some would say Cython IS a C/C++ interface.
187 |
188 | ## Object orientation, metaprogramming
189 |
190 | *Slight win for R*.
191 |
192 | For instance, though functions are objects in both languages, R takes
193 | that further than does Python. Whenever I work in Python, I'm annoyed
194 | by the fact that I cannot directly print a function to the terminal or
195 | edit it, which I do a lot in R. (This goes back to the *polymorphic*
196 | nature of R's **print()** function etc.)
197 |
198 | Python has just one OOP paradigm. In R, you have your choice of several
199 | (S3, S4, R6 etc.), though some may debate whether this is a good thing.
200 |
201 | R's metaprogramming features (code that produces code) are wonderful,
202 | arguably as powerful as, or more powerful than, those of Python. In
203 | both languages, these are only for experts, but the point is that R is
204 | competitive with Python even in this highly CS-ish aspect.
205 |
206 | ## Linked data structures
207 |
208 | *Win for Python.*
209 |
210 | **This is not a big issue in Data Science**, but it does come up in some
211 | contexts.
212 |
213 | Classical computer science data structures, e.g. binary trees, are easy
214 | to implement in Python. This can be done in R in various ways, e.g.
215 | with the **datastructures** package, which wraps the widely-used **Boost**
216 | C++ library, but it is not base-R.
217 |
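As a rough illustration of how little code a linked structure takes in Python, here is a minimal, unbalanced binary search tree. The names `Node`, `insert` and `in_order` are chosen for this sketch and do not come from any library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Insert key into the tree rooted at root and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def in_order(root):
    """Yield the keys in sorted order via an in-order traversal."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)


root = None
for k in [5, 2, 8, 1, 9, 3]:
    root = insert(root, k)

print(list(in_order(root)))  # [1, 2, 3, 5, 8, 9]
```

Each node holds ordinary references to its children, so no special pointer machinery or external package is needed.
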
218 | ## Online help
219 |
220 | *Big win for R.*
221 |
222 | To begin with, R's basic **help()** function is much more informative
223 | than Python's. It's nicely supplemented by **example()**. And most
224 | important, the custom of writing vignettes in R packages makes R a
225 | hands-down winner in this aspect.
226 |
227 | # Essential for data scientists to know both R and Python
228 |
229 | Hopefully I've made a strong case above for using R in data science, in
230 | analytics, visualization and so on. It is [widely used in business and
231 | industry](https://github.com/ThinkR-open/companies-using-r).
232 | In a very
233 | significant move, Python **pandas** creator Wes McKinney recently
234 | joined RStudio/Posit as a principal architect (see **reticulate**,
235 | below).
236 |
237 | On the other hand, Python is my preferred tool in some applications. An
238 | example is [OMSI](https://github.com/matloff/omsi), an online
239 | examination tool that my students and I developed. Python's built-in
240 | threading made our work much easier in that project.
241 |
242 | I thus very strongly recommend that those considering a data science
243 | career not only *learn* both languages, but also *use* them, thus
244 | developing expertise.
245 |
246 | ## Hybrid R/Python applications
247 |
248 | For similar reasons, some user apps may be best developed as a mixture
249 | of R and Python. Here is the current status:
250 |
251 | RStudio/Posit is to be commended for developing the **reticulate**
252 | package, to serve as a bridge between Python and R. The package enables
253 | calling Python from R code. For the opposite direction, calling R from
254 | Python, I recommend **RPy2**, which is the approach we take in our **dsld**
255 | package.
256 |
257 | The **reticulate** package is an outstanding effort, and works well for
258 | pure computation. But as far as I can tell, it does not solve the
259 | knotty problems that arise in Python, e.g. virtual environments and the
260 | like. The **RPy2** library has similar issues.
261 |
262 | At present, computer systems expertise -- skill with environment
263 | variables, search paths and general coding ability -- is required for
264 | developing mixed R/Python apps.
265 |
266 | ## Learning R and Python
267 |
268 | I have a [quick tutorial on R for
269 | non-programmers](http://github.com/matloff/fasteR), an evolving project. I
270 | also have a book, *The Art of R Programming*, NSP, 2011.
271 |
272 | I have a [tutorial on
273 | Python](http://heather.cs.ucdavis.edu/FastLanePython.pdf), for
274 | those with a strong programming background.
275 |
276 | # Thanks
277 |
278 | This document has benefited from various reader comments, notably from
279 | Dirk Eddelbuettel, as well as Paul Hewson, Bob Muenchen and Inaki Ucar.
280 |
281 | *Updated December 17, 2023*
282 |
283 |
--------------------------------------------------------------------------------