├── .gitignore ├── LICENSE ├── README.md ├── Summer2019_Session1.ipynb ├── baseRintro.ipynb ├── data └── weather-kumpula.csv └── extra ├── Summer2019_Session1_solved.ipynb ├── Summer2019_Session1_solved.pdf ├── baseRintro.pdf ├── baseRintro_solved.ipynb └── baseRintro_solved.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | *.Rproj 2 | .Rproj.user 3 | .Rhistory 4 | .RData 5 | .Ruserdata 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 CSC - IT Center for Science 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # R-for-beginners 2 | An introduction to base R for beginners suitable for self-learning. 3 | 4 | To read the contents of the IPython notebook, check the pdf file under 'extra'. 5 | 6 | To start studying this interactively: 7 | - go to https://noppe.csc.fi 8 | - login with Haka or Virtu account (*if you don't have a Haka/Virtu account you need a special noppe.csc.fi account created for you, contact noppe@csc.fi*) 9 | - find the application called "Introduction to R", click on round "Start session" button on the right. 10 | - Wait for the session to load. Then it will open Jupyter in your browser. 11 | - In Jupyter, double click on __R-for-beginners__ and then on __baseRintro.ipynb__. This will start the notebook in a new Jupyter tab. 12 | - Follow the instructions in the notebook. Happy learning! 13 | -------------------------------------------------------------------------------- /Summer2019_Session1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Analytics Summer School\n", 8 | "2019 Edition (Jesse Harrison, Anni Pyysing)" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# 1. Hands-on Session: Basics of Data Analysis" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "R is one of the most widely used programming languages for data analysis. It is available for free under the [GNU General public license](https://www.r-project.org/COPYING). In this session, we will go through a number of practical tasks to equip you with a basic grasp of how R works. This will also come in handy in the subsequent sessions. 
Don't worry, no prior programming experience is expected!\n", 23 | "\n", 24 | "###### R interfaces: details for the interested\n", 25 | "\n", 26 | "R code can be written using essentially any software. However, some methods are more practical than others. The most common method is to use RStudio (see [rstudio.com](https://www.rstudio.com/)). If you want to start using R on your own computer later on, you can start by [downloading R](https://www.r-project.org/) and RStudio on your computer. Both are free. Today we will not download anything - instead, we will work with R code using Jupyter." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## Let's begin!" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "What you are currently using is Jupyter, a web-based (\"cloud\") interface that can run R. While not as powerful as a full-fledged R installation, it enables us to focus on some of the basics of reading, creating and editing R code.\n", 41 | "\n", 42 | "Jupyter notebooks consist of cells. For example, this text block is a cell that you can edit by double-clicking it (try this!). Now that you have entered the *edit mode*, you can *run* the cell by clicking the Run above, or by using `Ctrl-Enter`.\n", 43 | "\n", 44 | "You can edit everything in your notebook (for example, delete stuff). Also, you can add cells in the \"Insert\" menu, run a cell in the \"Cell\" menu, and save and download this notebook in the \"File\" menu. __Saving your work__ is possible by using checkpoints (go to the File-menu and click Save and Checkpoint)." 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Code cells and simple calculations\n", 52 | "\n", 53 | "Jupyter notebooks have two important types of cells: __Markdown__ and __Code__ cells. This cell is a markdown cell and it contains text. 
The cell below that has the marking `In[ ]` next to it is the first code cell. R code can be executed in code cells. \n", 54 | "\n", 55 | "You can write, for example, 1+1 in the cell below and run the calculation. Test it! You can erase the calculation, modify it, and Run it again. Also try some additional calculations using the list of arithmetic operators below. One important observation: R uses the full stop (`.`) as the decimal mark, not the comma (`,`). \n", 56 | "\n", 57 | "To get used to how code cells work, also try typing different calculations on separate rows. You will notice that R treats each row individually (with the calculation results appearing as a list under the code cell).\n", 58 | "\n", 59 | "\n", 60 | "| R code | Explanation | Example |\n", 61 | "| -------|:---------: | -----: |\n", 62 | "| `+` | addition | `1+1` |\n", 63 | "| `-` | subtraction | `10-5` |\n", 64 | "| `*` | multiplication | `10*2` |\n", 65 | "| `/` | division | `10/2` |\n", 66 | "| `^` | exponentiation | `10^2` |\n", 67 | "| `.` | decimal point | `3.14159` |" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## Objects (and removing them)\n", 82 | "\n", 83 | "You can store information in *objects*. This is useful when you want to use a value later (once created, an object remains in R's memory). The standard way to assign a value to an object is `object <- value`. Here we deal mostly with numeric values, but lines of text (such as \"Text\") can also be used. In addition to individual numbers or text labels, you can assign the results of different calculations to objects. 
Some examples are given below:\n", 84 | "```r\n", 85 | "price <- 140\n", 86 | "carPrice <- 1\n", 87 | "piRounded <- 3.14159\n", 88 | "name <- \"Text\"\n", 89 | "cookiePrice <- 2\n", 90 | "complexCalculation <- 5 + 5\n", 91 | "```\n", 92 | "To get a hang of creating objects in R, try making up a few of your own using the code cell below: one containing a single number, another containing a line of text, and one using a calculation. Note that to actually see what is inside an object, you will need to write the name of the object separately after performing the assignment. Otherwise it may appear as if nothing happens when you run the code. For example:\n", 93 | "\n", 94 | "```r\n", 95 | "price <- 140\n", 96 | "price\n", 97 | "```\n", 98 | "\n", 99 | "Sometimes you might also want to remove an object after creating it. This can be done using `rm()` (with the name of the object inserted inside the brackets). After creating the three objects, let's also try removing them! \n", 100 | "\n", 101 | "Note: there is also a command for completely clearing up your working environment in R. This can be done as follows: `rm(list = ls())`. Use it with some caution because R doesn't come with an undo button!\n" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "## Calculations using objects\n", 116 | "\n", 117 | "Once data have been assigned to objects, they are convenient to use for calculations:\n", 118 | "\n", 119 | "```r\n", 120 | "discount <- 20\n", 121 | "finalPrice <- price - discount\n", 122 | "finalSale <- (price - discount) * 0.5\n", 123 | "```\n", 124 | "\n", 125 | "Now you know how very basic math operations can be done in R, and it is time to put these skills to use! 
The code cell below has already some code written into it, and you can just modify it as needed.\n", 126 | "\n", 127 | "As an exercise, calculate what is the final price when the discount is __a)__ 40, or __b)__ 15% of the starting price." 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "price <- 140\n", 137 | "discount <- 20\n", 138 | "finalPrice <- price - discount\n", 139 | "finalPrice" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## Vectors and functions in R\n", 147 | "\n", 148 | "### What vectors are\n", 149 | "\n", 150 | "While the examples above dealt with individual numbers, this is not how R is typically used. Usually you have plenty of data (many items and many prices, for example). The way to store a sequence of numbers (or other types of data) is to use a vector. When we are dealing with numeric data, the vector is called a *numeric vector*. When dealing with text data, the vector is called a *character vector*. \n", 151 | "\n", 152 | "We'll learn more about vectors soon, but for now let's go ahead and create a vector called *testVector*.\n", 153 | "See what comes out by running the cell!" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "testVector <- c(10, 30, 1, 4)\n", 163 | "testVector\n", 164 | "\n", 165 | "sort(testVector)" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "Ok, now we have created a numeric vector and sorted it. You can try to figure out how to make a character vector by yourself. If you can not figure it out, do not worry, we will learn it soon." 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "### What functions are\n", 180 | "\n", 181 | "A surprise! 
We have already used functions in this session, including `sort()` and `c()`. The first is a function that orders the values in a vector and the latter is a function that combines its arguments to form a vector. In the case of `testVector <- c(10,30,-20,-1)`, the arguments are 10, 30, -20, and -1. The command for removing objects, `rm()`, is also a function.\n", 182 | "\n", 183 | "Some useful functions include convenient ways to calculate things, such as `mean()`, which calculates the mean of a numeric vector. The function call `mean(testVector)` will give you 4.75 as the output, since it treats the vector `c(10,30,-20,-1)` as a whole. Another example of a useful function is `median(testVector)`. Let's try both:" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "testVector <- c(10, 30, -20, -1)\n", 193 | "\n", 194 | "mean(testVector)\n", 195 | "median(testVector)\n", 196 | "\n", 197 | "mean(testVector) - median(testVector)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "__Some useful functions__ are listed next. You don't have to memorize any functions, but it is useful to know some that are available. If you are unsure how a function works, you can always open the help (see below), or you can search the internet or [R documentation](https://www.r-project.org/other-docs.html) for many more functions.\n", 205 | "\n", 206 | "* Most mathematical functions also work *element by element* and return multiple values
\n", 207 | " `exp()`, `log()`\n", 208 | "* Some functions treat the vector as a whole and return a single value, like the length of a vector
\n", 209 | " `length()`, `sum()`, `mean()`, `median()`, `sd()`, `var()`, `min()`, `max()`
\n", 210 | "* Other functions tell several things at once
\n", 211 | " `summary()`, `str()`
\n", 212 | "* It is also possible to ask for help and open R documentation.
\n", 213 | " `help()`, or `?function` (e.g. `?mean`) will both work
\n", 214 | "* A default R installation can also be used to create simple graphics.
\n", 215 | " `plot()`, `barplot()`
\n", 216 | "\n", 217 | "## Annotating your code\n", 218 | "\n", 219 | "Annotating your code is done by using the `#` character in R. This is especially necessary when writing longer projects. Comments are used to explain what you are doing in your code. The `#` will turn the following text into \"non-code\" and it will be ignored by the interpreter that turns the R-code into action." 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "# This is a comment\n", 229 | "\n", 230 | "help(summary)\n", 231 | "summary(testVector)\n", 232 | "\n", 233 | "# See what happens if you erase the # character below\n", 234 | "# mean(testVector)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "## Next some exercises! \n", 242 | "\n", 243 | "Use `help()`, search the internet, or just guess the correct commands.\n", 244 | "Writing something \"wrong\" does not break anything; it will just give an error message. Find the mistake, fix the mistake, and run the cell again!\n", 245 | "\n", 246 | "Remember: if you store the value of some calculation in an object, you can read its content by simply writing its name.\n", 247 | "\n", 248 | "#### Exercise 1: Learning to use external resources\n", 249 | "Find out the variance of the following numbers: 1, 4, 9.5, 10, and -2.
\n", 250 | "*Hint!* There is a function for it :)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "# Exercise 1 answer below\n" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "#### Exercise 2: Finding out how a function works\n", 267 | "Learn what the function `seq()` does and what arguments it needs to work. Test it. Do not be afraid to get an error message!" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "# Exercise 2 answer below\n" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "#### Exercise 3: Thinking on your own\n", 284 | "Create a vector `t` containing the numbers from 0 to 10 with 0.1 interval, and plot the `sin()` of the vector `t` at the points contained in the `t` vector." 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": {}, 291 | "outputs": [], 292 | "source": [ 293 | "# Exercise 3 answer below\n" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "#### Extra exercise for the fast ones\n", 301 | "Find out how to write a title into the plot and change the scatter plot from exercise 3 into a line plot.
\n", 302 | "*Hint!* The `plot()` function can take more arguments than two. Do not worry about finishing this exercise if you run out of time." 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [ 311 | "# Answer for the extra exercise\n" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "## Data types \n", 319 | "\n", 320 | "### Numeric vectors vs. character vectors\n", 321 | "\n", 322 | "Now that you can write simple R code and search the R documentation or the Internet for information about functions, it is time to learn something new! As you might remember, there are *numeric* and *character* vectors, which are used to store numbers and text. A character vector can obviously contain numbers as well, but in text format. Character vectors can be created using quotes `\" \"` or single quotes `' ' `.\n", 323 | "\n", 324 | "For example, what is the difference between the following commands?\n", 325 | "```r\n", 326 | "numVector <- c(1, 2, 3, 4)\n", 327 | "chVector <- c(\"1\", \"2\", \"3\", \"4\")\n", 328 | "```\n", 329 | "Try running the cell below and inspect the message." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "# Demonstrating the difference between character and numeric vectors\n", 339 | "\n", 340 | "numVector <- c(1, 2, 3, 4)\n", 341 | "chVector <- c(\"1\", \"2\", \"3\", \"4\")\n", 342 | "\n", 343 | "sum(numVector)\n", 344 | "sum(chVector)" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "### Logical vectors and logical operators \n", 352 | "\n", 353 | "In addition to numeric and character vectors, there are *logical* vectors. Logical vectors have only two possible values, `TRUE` and `FALSE`. 
They are answers to questions such as \"Is the price above 50?\" or \"Is the price exactly 10?\". \n", 354 | "\n", 355 | "A logical vector can be created similarly to other vectors, for example: \n", 356 | "```r\n", 357 | "answers <- c(TRUE, FALSE, TRUE, TRUE)\n", 358 | "```\n", 359 | "\n", 360 | "Logical vectors (notice that a single value is a vector with the length of one) are often created as the output of functions. For example, the function `is.numeric()` tests if its argument is numeric and returns a logical value." 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "# Run this cell to see what happens\n", 370 | "is.numeric(\"some character string\")\n", 371 | "is.numeric(50385)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "Logical vectors are not only created as function outputs, but also as the result of *logical operations* that are written in an intuitive way, very similarly to the arithmetic operators.\n", 379 | "\n", 380 | "The *logical operators* in R are listed below with some use examples:\n", 381 | "\n", 382 | "| R code | Explanation | Example | Output |\n", 383 | "| ------- |:---------: | -----: | -----: |\n", 384 | "| `<` | less than | `10 < 15` | `TRUE` |\n", 385 | "| `>` | greater than | `3 > 3` | `FALSE` |\n", 386 | "| `<=` | less or equal | `3 <= 3` | `TRUE` |\n", 387 | "| `>=` | greater or equal | `5 >= 6` | `FALSE` |\n", 388 | "| `==` | equal | `7 == 7` | `TRUE` |\n", 389 | "| `!=` | not equal | `2 != 2` | `FALSE` |\n", 390 | "\n", 391 | "It is also possible to combine different rules by using logical operators, which are:\n", 392 | "\n", 393 | "- and `&`\n", 394 | "- or `|`\n", 395 | "- not `!`\n", 396 | "\n", 397 | "Try to play with logical operators a bit in the code cell below. After trying out what's already written inside the cells, try some examples of your own." 
398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "# Test some logical operators to see how they work\n", 407 | "age <- 15\n", 408 | "age < 18\n", 409 | "\n", 410 | "# Combining rules\n", 411 | "age < 18 & age > 16" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "Now for an exercise to flex your brain with logical operators.\n", 419 | "\n", 420 | "#### Exercise: Book auction\n", 421 | "\n", 422 | "Let's say that you run a small company that sells boxes of books in an auction. The amount of books in each box varies, and the price of each box is set in the auction.\n", 423 | "\n", 424 | "The object `prices` has the prices of the boxes that were sold each day, and `booksInBoxes` has the amounts of books that were in each box.\n", 425 | "\n", 426 | "Create your own `prices` and `booksInBoxes` objects. Using what we've learnt so far, can you use a logical operator(s) to find out if the total sales was above our weekly target, 500?\n" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "# Answer here\n" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "### Choosing values\n", 443 | "Sometimes we want to choose a part of the values from a larger dataset for further analysis.\n", 444 | "We can do this using square brackets `[ ]`.
\n", 445 | "Observe the output of the cell below by running it. Feel free to modify the code and test how it works." 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "prices[1] # choosing the first value\n", 455 | "prices[2:5] # choosing the 2nd, 3rd, 4th and 5th value\n", 456 | "prices[prices > 40] # choosing values based on logical operations\n", 457 | "prices[c(2,4,5)] # choosing the 2nd, 4th and 5th value" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "### The workspace\n", 465 | "\n", 466 | "The workspace holds the objects that you, the user, have defined during the session. For example, the object `prices` should be in your workspace. Here are some useful commands to get you started with the workspace (in addition to the `rm()` function):\n", 467 | "* `getwd()` prints the current working directory location\n", 468 | "* `setwd()` can be used to set a custom working directory (you will need to specify the path to the directory)\n", 469 | "* `ls()` lists all the objects in the workspace\n", 470 | "\n", 471 | "Now might also be a good time to create a checkpoint if you want to save the state of your work before moving on." 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "metadata": {}, 478 | "outputs": [], 479 | "source": [ 480 | "# Find out what objects you have in your workspace\n", 481 | "ls()" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "### Data frames - the most important data structure in R \n", 489 | "We moved on from having objects that were single values to objects that were single vectors. However, it would be useful if the vectors that are related to each other could also be represented by the same object.\n", 490 | "Now we will proceed to storing our data in even bigger entities, __data frames__. 
Data frames can include our previously very separate vectors in the same object.\n", 491 | "\n", 492 | "Some of the functions will work with entire data frames, like `summary()`, but for some functions, like `mean()`, you will need to specify which data inside the data frame you want to use in the function.\n", 493 | "\n", 494 | "Some important functions and commands that are useful for working with data frames include:\n", 495 | "* `data.frame()` creates a data frame of the arguments that it is given\n", 496 | "* the dollar sign `$` is used to take a column from a data frame\n", 497 | "* square brackets `[ ]` can be used to take a part of the data frame\n", 498 | "* `subset()` returns a subset of a data frame\n", 499 | "\n", 500 | "In the following examples we will learn how to create, modify, and use data frames in R.\n", 501 | "But first, let's create a couple of vectors and combine them into one data frame for storage. " 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "# Here we create some data to put into a data frame\n", 511 | "amount <- c(2, 3, 5, 1)\n", 512 | "book <- c(\"Comics\", \"Novels\", \"Manuals\", \"Comics\") \n", 513 | "prices <- c(7.50, 20.00, 66.0, 500)\n", 514 | "\n", 515 | "# Combining the vectors into a data frame\n", 516 | "bookTypes <- data.frame(amount, book, prices)\n", 517 | "bookTypes" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "### Factors\n", 525 | "\n", 526 | "There is still one data type to discuss, __factor__. Factors can be described as categories, such as `Comics`, `Novels`, and `Manuals`. You can check the class of `bookTypes$book` by using the function `class()`.\n", 527 | "In R versions before 4.0.0, the class changes from character to factor because the function `data.frame()` turns character vectors into factors by default; from R 4.0.0 onwards the column stays a character vector unless you set `stringsAsFactors = TRUE`. 
The conversion can be avoided by the option `stringsAsFactors = FALSE`. Factors are necessary in statistical modeling, where different categories are often compared.\n", 528 | "\n", 529 | "Let's create a new data frame, but this time, we'll put everything inside it immediately. Also, notice how we have split this longer section of code onto multiple rows. We learnt before that R will treat each line of code individually and this is true when each line contains a *separate* section of code (such as assignments for separate data frames). In the current case, arranging the code on multiple rows is possible since all the commands relate to the same task (creating a data frame)." 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "# Creating a data frame called 'bookReview'\n", 539 | "# (Note how the code is split into different rows for easier reading)\n", 540 | "\n", 541 | "bookReview <- data.frame(storage = c(4, 1, 3, 2), \n", 542 | " titles = c(\"Good book\", \"Excellent book\", \"Bad book\", \"Horrible book\"),\n", 543 | " cost = c(12.1, 20.0, 3.33, -10 ),\n", 544 | " stringsAsFactors=FALSE)\n", 545 | "bookReview" 546 | ] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "metadata": {}, 551 | "source": [ 552 | "Now if we want to get only the titles, we must specify from which data frame it is found.\n", 553 | "This is done by using the `$` sign. You can also try to take the column without specifying the data frame, and see what happens." 
554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": null, 559 | "metadata": { 560 | "scrolled": true 561 | }, 562 | "outputs": [], 563 | "source": [ 564 | "bookReview$titles # take a column from the data frame\n", 565 | "titles # take the variable 'titles'" 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": {}, 571 | "source": [ 572 | "### Creating and modifying columns\n", 573 | "Now we will create (and modify) some new columns into the data frame using the `$` sign. See what happens to our `bookReview` and try to figure out how the column `storage.value` is created!
\n", 574 | "\n", 575 | "_Hint!_ The next exercise will be easier if you focus on this part." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "# Creating some new columns\n", 585 | "bookReview$type <- \"book\"\n", 586 | "bookReview$storage.value <- bookReview$storage * bookReview$cost\n", 587 | "bookReview\n", 588 | "\n", 589 | "# Modifying our new columns\n", 590 | "bookReview$type[4] <- \"trash\"\n", 591 | "bookReview" 592 | ] 593 | }, 594 | { 595 | "cell_type": "markdown", 596 | "metadata": {}, 597 | "source": [ 598 | "#### Exercise: Modifying the data frame\n", 599 | "\n", 600 | "We have received a shipment of new books and their information should be stored into our data frame.\n", 601 | "The shipment included two books of each type. Don't forget to update the storage value!\n", 602 | "\n", 603 | "Luckily for us, the cost of the books stays the same, so we don't need to worry about that." 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": {}, 610 | "outputs": [], 611 | "source": [ 612 | "# Answer here\n" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "### Choosing values from a data frame\n", 620 | "\n", 621 | "As you might have already learned, there are many ways to achieve the same goal in R. Both `subset()` and `[ ]` can be used to choose columns and rows. \"Unselecting\" columns is done by the minus sign (`-`).\n", 622 | "Uncomment the method you want to test. Can you figure out how each method works?" 
623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": null, 628 | "metadata": { 629 | "scrolled": true 630 | }, 631 | "outputs": [], 632 | "source": [ 633 | "# Create a column with a sequence of numbers\n", 634 | "bookReview$delete.column <- seq(1, 100, length = 4)\n", 635 | "bookReview\n", 636 | "\n", 637 | "# multiple methods to delete the 6th column in the data frame:\n", 638 | "\n", 639 | "# bookReview <- bookReview[1:4,-6]\n", 640 | "# bookReview <- subset(bookReview, select = -6)\n", 641 | "# bookReview[\"delete.column\"] <- NULL\n", 642 | "\n", 643 | "bookReview" 644 | ] 645 | }, 646 | { 647 | "cell_type": "markdown", 648 | "metadata": {}, 649 | "source": [ 650 | "We can also set rules for the observations that we want to take from the data frame." 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": null, 656 | "metadata": {}, 657 | "outputs": [], 658 | "source": [ 659 | "subset(bookReview, storage > 1 & cost > 5)\n", 660 | "subset(bookReview, type == \"book\")" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": {}, 666 | "source": [ 667 | "#### Exercise: Finding observations from the data frame\n", 668 | "Find out the average cost of the books that are not horrible. *Hint*: logical operators are your friend!" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": null, 674 | "metadata": {}, 675 | "outputs": [], 676 | "source": [ 677 | "# Answer here" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "Next, let's try to use some functions on our data frame. Run the cell below and observe the output!" 
685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 693 | "summary(bookReview) # see how this works on a data frame\n", 694 | "mean(bookReview) # see how this does not work on a data frame" 695 | ] 696 | }, 697 | { 698 | "cell_type": "markdown", 699 | "metadata": {}, 700 | "source": [ 701 | "### Error messages (and missing values) \n", 702 | "\n", 703 | "Learning to read error messages is a good skill to have, because it will help you locate and fix errors in your code. Did you read the warning message that appeared when we tried to take the `mean()` of the entire data frame? The message says that the argument we gave the function is not in a suitable format. However, the function still returned something: `NA`.\n", 704 | "\n", 705 | "`NA` is short for \"not available\", which means that the answer is a missing value. In real life, datasets are full of missing values. You should remember that missing values can disturb some calculations, for example if you want to take the `mean()` of a vector that has `NA`s. Luckily, you can tell the function `mean()` to skip `NA`s in the calculation by giving the function an extra argument, `na.rm = TRUE`.\n", 706 | "\n", 707 | "There is also another value, `NULL`, that could be confused with `NA`, but the difference between `NULL` and `NA` is a bit out of the scope of this session. Check out the R documentation if you are interested!" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "## Getting data into R\n", 715 | "\n", 716 | "So far we have dealt with very small data sets that we created ourselves, but in the real world you would use data from external sources. R has many functions for data import, such as `read.csv()` to read .csv files and `read.table()` to read .txt files. Notice that your data might not be exactly in the format that the function expects! 
For example, if your data uses a decimal separator other than the default `.`, you can include an extra argument to specify it. This would be written as `read.csv(\"filename.csv\", dec = \",\")`. Another alternative is to use `read.csv2()`, which expects a comma as the decimal separator and a semicolon (`;`) as the field separator.\n", 717 | "\n", 718 | "To load a .csv file into R, you would have two options:\n", 719 | "\n", 720 | "`read.csv(\"file.csv\")` # when the file is in your working directory\n", 721 | "`read.csv(\"pathtofile/file.csv\")` # when the file is outside your working directory\n", 722 | "\n", 723 | "R also comes with some built-in datasets that you can practice with. If you are interested, available datasets are listed by the `data()` function. The same function also loads the dataset when the name of the dataset is given as an argument. The *PlantGrowth* dataset is a good one to start with." 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": null, 729 | "metadata": {}, 730 | "outputs": [], 731 | "source": [ 732 | "# An example of exploring a built-in dataset\n", 733 | "\n", 734 | "# first clear the workspace\n", 735 | "rm(list = ls())\n", 736 | "\n", 737 | "# load the PlantGrowth data set\n", 738 | "data(PlantGrowth)\n", 739 | "\n", 740 | "# explore the data\n", 741 | "head(PlantGrowth)\n", 742 | "str(PlantGrowth)\n", 743 | "?PlantGrowth" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "### Packages\n", 751 | "\n", 752 | "Packages are the strength of R! You can extend R by installing new packages. There are several packages already installed with R, but installing new ones will probably be necessary if you are working with more than the very basic stuff. You only have to install a package once, but you need to load it every time you start a new session. 
Here are some useful functions to get you started!\n", 753 | "\n", 754 | "* `installed.packages()` will list all the packages that are already installed\n", 755 | "* `install.packages(\"packageName\")` will install the package from CRAN\n", 756 | "* `library()` will list all the packages that are available in the library and ready to use\n", 757 | "* `library(\"packageName\")` will load the package to be used during the session\n", 758 | "\n", 759 | "The number of packages available for R is enormous. See this [article at rstudio.com](https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages) to find a nice list of useful packages.\n", 760 | "All available packages at CRAN can be found at [cran.r-project.org](https://cran.r-project.org) under Packages, and a useful listing of packages by topic is under Task Views.\n", 761 | "\n", 762 | "Now you have the basic tools to move on to the next section of this Data Analytics Summer School. If you want to go through these materials later on, go ahead and download this notebook as a PDF!\n", 763 | "\n", 764 | "\n" 765 | ] 766 | }, 767 | { 768 | "cell_type": "markdown", 769 | "metadata": {}, 770 | "source": [ 771 | "## Extra examples and exercises\n", 772 | "\n", 773 | "Here are some extra examples and exercises for those who have some coding experience, have perhaps used R before, or are just very fast.\n", 774 | "\n", 775 | "### PlantGrowth data set\n", 776 | "\n", 777 | "Let's continue with the *PlantGrowth* data set. This data set, while quite a small one, is still useful for our purposes. It contains information on plant yields (measured as dry weight) obtained under either control conditions or two experimental treatments.\n", 778 | "\n", 779 | "Something we haven't yet tried (but will explore later on in more depth) is using R to visualize your data. In addition to offering many functions for data manipulation and statistics, R is also great for creating plots. 
Here, we can try looking at the control vs. experimental groups using one of the default functions in R called `boxplot()`. It produces quite a lot of information for just a single line of code: \n" 780 | ] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "execution_count": null, 785 | "metadata": {}, 786 | "outputs": [], 787 | "source": [ 788 | "boxplot(weight ~ group, data = PlantGrowth, col = \"blue\")" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "The box-and-whisker plot shows the group medians, the lower (25%) and upper (75%) quartiles, and the minimum and maximum values within each group (excluding any points drawn separately as outliers). What conclusions do you think we could draw from the results?\n", 796 | "\n", 797 | "One could also think about plotting the data slightly differently. For example, we might want to see all the individual data points rather than a summary. One way to do this is to type in `stripchart(weight ~ group, data = PlantGrowth)` (try it using the code cell below!). The resulting plot could be easier to interpret, however. Can you find a way to rotate it so that the groups are shown on the x axis and weight on the y axis? To find the answer, have a look at the R help file for this function.\n" 798 | ] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "execution_count": null, 803 | "metadata": {}, 804 | "outputs": [], 805 | "source": [ 806 | "# Try plotting things here..." 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "### A few words on formulas\n", 814 | "\n", 815 | "You've probably noticed how the code in these examples is somewhat different to what we were working with before. The use of the tilde symbol (`~`) in R is specific to _formulas_. 
The general way in which formulas in R are structured can be thought of as:\n", 816 | "\n", 817 | "`function(y ~ x)` # first y, then x!\n", 818 | "\n", 819 | "Another situation where you will encounter formulas is when you're doing statistical modelling with R. For example, in the future if wishing to fit a linear model to a data set, you could use the linear model function `lm()` as follows:\n", 820 | "\n", 821 | "`lm(variable_y ~ variable_x, data = dataframe)`\n", 822 | "\n", 823 | "As diving into the world of statistics is beyond the scope of this hands-on session, you may wish to look up the linear model function in your own time: `?lm`" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": {}, 829 | "source": [ 830 | "### Free exploration\n", 831 | "Feel free to explore another built-in dataset in R, _ToothGrowth_. " 832 | ] 833 | }, 834 | { 835 | "cell_type": "code", 836 | "execution_count": null, 837 | "metadata": {}, 838 | "outputs": [], 839 | "source": [ 840 | "# Free exploration\n", 841 | "\n", 842 | "?ToothGrowth\n", 843 | "data(ToothGrowth)\n", 844 | "head(ToothGrowth)" 845 | ] 846 | } 847 | ], 848 | "metadata": { 849 | "celltoolbar": "Raw Cell Format", 850 | "kernelspec": { 851 | "display_name": "R", 852 | "language": "R", 853 | "name": "ir" 854 | }, 855 | "language_info": { 856 | "codemirror_mode": "r", 857 | "file_extension": ".r", 858 | "mimetype": "text/x-r-source", 859 | "name": "R", 860 | "pygments_lexer": "r", 861 | "version": "3.5.1" 862 | } 863 | }, 864 | "nbformat": 4, 865 | "nbformat_minor": 1 866 | } 867 | -------------------------------------------------------------------------------- /baseRintro.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Base R\n", 8 | "Jesse Harrison and Seija Sirkiä" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | 
"source": [ 15 | "Welcome to this introduction to R. The objectives of this notebook are to equip you with a basic understanding of what R is, how R code works and how to manage code of your own. Another purpose is to convince you that learning R can completely transform how you work with and share your data! \n", 16 | "\n", 17 | "The intended audience of this notebook includes those with no experience with R. You may also have very limited or no experience with programming in general. Not to worry - R is quite simple and intuitive once you learn the basics, and it only becomes more so the further you proceed." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "# 1. The basics" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## 1.1 What are R and Jupyter?" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "R is a programming language for handling, analysing and visualising data. It is freely downloadable and works on all major platforms, including Windows, macOS and Linux. At its core, R code is made of text-based commands that can be run from a command prompt and without a graphical user interface. To make things easier, R can also be run via an open-source interface such as RStudio (see [rstudio.com](http://www.rstudio.com)).\n", 39 | "\n", 40 | "What you are currently using is Jupyter, a web-based (\"cloud\") interface that can run R. While not as powerful as RStudio, it enables us to focus on some of the basics of reading, creating and editing R code.\n", 41 | "\n", 42 | "We do not need to learn much about Jupyter to use it. 
We will be working with code cells such as the one below:\n" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "df <- data.frame(x = seq(0,1, length = 10)) # Right now this may look difficult, but we'll come back to it!\n", 52 | "df$y <- 2 * df$x + rnorm(10, sd = 0.3)\n", 53 | "lm(y ~ x, df)\n", 54 | "rm(df)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Code cells have a distinct appearance and include the text \"In [ ]\" (on the left) to indicate that this is a field for inputting code. Next, we can try running the code. Click on the cell above to make it active and then press Control + Enter on your keyboard. You can also run the code by clicking on the \"Run\" button on the Jupyter tool bar above.\n", 62 | "\n", 63 | "When you run the code, some output will appear below the cell and the text on the left should show \"In [1]\" (or another number). The output includes information on a \"Call\" and \"Coefficients\". If you can see this, you're ready to move on! \n", 64 | "\n", 65 | "## 1.2 Checkpoints in Jupyter\n", 66 | "\n", 67 | "Jupyter notebooks are fully editable, which means that you can edit both the text and code fields. If you delete something and want to restore a previous version of the notebook, this can be done using *checkpoints*. You can easily create a checkpoint by using the save button on the Jupyter tool bar or via the File menu. __Now would be a good time to create your first checkpoint.__ \n", 68 | "\n", 69 | "Accidentally closing the browser tab of the notebook is also not a problem. Just reopen it from the Jupyter home tab. If you've closed that too (or e.g. if you wish to switch browsers or computers) you can start over from [notebooks.csc.fi](https://notebooks.csc.fi). 
Your notebook will remain intact for as long as the virtual machine session is running.\n", 70 | "\n", 71 | "To see how long your session will keep running, you can go back to the [notebooks.csc.fi](https://notebooks.csc.fi) dashboard. __NOTE: once the session expires, everything you wrote will be deleted, including all checkpoints.__ \n", 72 | "\n", 73 | "To keep a copy of your work, you can download the notebook in several formats (via the File menu). The following formats may be useful:\n", 74 | "\n", 75 | "- Notebook (File -> Download): use this in case you want to return to the actual complete notebook on another Jupyter instance. Notebooks on [notebooks.csc.fi](https://notebooks.csc.fi) only stay up for a limited time and if you want to work for longer than that, this is what you need. Restart the R for Beginners machine and, before clicking on the .ipynb file to open the notebook, upload the version you downloaded earlier to replace the notebook (the .ipynb file).\n", 76 | "\n", 77 | "- R code (File -> Save and Export Notebook as -> Executable Script): this will download only the parts involving R code (e.g. for importing in RStudio).\n", 78 | "\n", 79 | "- PDF (File -> Save and Export Notebook as -> PDF): this will create a PDF document that you can read but not edit or run. It's best to do this without the outputs (choose 'Restart & Clear Output' from the Kernel menu)." 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## 1.3 Simple calculations in R\n", 87 | "\n", 88 | "By running the code above, you already performed several tasks using R. To become more familiar with how this works, let's try some of our own. An example we can try is performing a numeric calculation. For example, try typing `1 + 1` in the field below and running it. Typing code without spaces (e.g. `1+1`) also works, although long sections of code without spaces can become difficult to read. 
\n", 89 | "\n", 90 | "As you can see, with tasks like these R performs similarly to a calculator. You can also use symbols including parentheses and `^` (for exponentiation). For decimal points, R accepts the full stop symbol (`.`) instead of a comma (`,`). Try out some calculations of your own! You can also try several at once by adding them on separate lines." 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "# 2. Assigning and managing objects" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## 2.1 Creating objects\n", 112 | "\n", 113 | "One of the key features of R is that your data can be assigned to a named *object* using the operator `<-` (less-than symbol followed by a dash). Try running the code below, which assigns the result of a calculation to an object we have decided to call `answerA`:" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "answerA <- 1 + 1" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "While previously the output appeared below the code cell, here it is not immediately visible. We can access the output by running `answerA` on its own:" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "answerA" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "The object is now stored in your R workspace. This is extremely useful because it makes it much easier to organise, manipulate and analyse your data. As part of writing R code, we can create as many objects as needed. 
For example, try creating a second object for a calculation of your own, called `answerB`, and printing the results below the code cell." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "answerB <- \n", 155 | "answerB" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "We can also modify the contents of an object. For example, we could multiply the contents of `answerA` by two (`answerA * 2`). If we decided to assign this back to the same object (`answerA <- answerA * 2`), we would overwrite the original contents. In case we wanted to avoid this, the results can also be assigned to a new object:" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "answerC <- answerA * 2\n", 172 | "answerC" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "We now have three objects in our workspace. When working with real data sets, the number can of course be much higher. To keep track of all the objects in your workspace, you can use the following command:" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "ls()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "A great thing about working with objects in R is that you can easily manipulate your data while keeping the original values intact. This is often much more difficult to achieve with other environments for data analysis. 
In other words, learning R can help you __keep your raw data raw.__" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "## 2.2 Removing objects" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "Sometimes we may wish to remove an object from the workspace. This can be done using `rm()` (for example, typing `rm(answerA)` would remove `answerA`). There is also a command for completely clearing your workspace: `rm(list = ls())`. This deletes __everything__ in your current workspace. Be sure to use these commands with caution - there is no 'Undo' button in R.\n", 210 | "\n", 211 | "If you accidentally remove something, you will need to run your code again. In Jupyter, you can also start over with an empty workspace by choosing 'Restart & Clear Output' from the Kernel menu. For now, however, let's keep the objects you created in the previous section. You may still need them!" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "# 3. Code annotation" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "This is very easy to do, but also one of the most important parts to learn. It is possible to annotate your code using the `#` character. R will ignore everything on a line after `#` (without this, it would attempt to run your text as a command, resulting in an error message). You can also use `#` to disable parts of your code, for example if you want to try a calculation while omitting some specific step preceding it.\n", 226 | "\n", 227 | "__Always annotate your code! This will enable you to keep track of things that you've done and it also helps you become more familiar with different commands in R. Not only this, but it will be easier for your colleagues to interpret your code. 
The principle is the same as keeping a thorough laboratory notebook: well-annotated code will make your work more reproducible and easier to communicate.__" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "# 4. Recap\n", 235 | "\n", 236 | "Take a moment to try out different expressions in the cell below until you get the hang of it. For example, you can use the objects you already created as initial values or as intermediates. You can also try adding annotations to your code and removing (as well as re-creating) your objects. Name the new objects as you wish, but keep the following guidelines in mind:\n", 237 | "\n", 238 | "* The name must begin with a letter.\n", 239 | "* It can contain numbers, full stops and underscores, and is case-sensitive.\n", 240 | "* While it is possible to include spaces in object names, this requires extra effort and is not recommended.\n", 241 | " + In general, it is good practice to keep your code as simple and easy to read as possible.\n", 242 | "* Letters with accents and umlauts are best avoided. \n", 243 | " + They may not work on all systems, which can become problematic when sharing your code with others." 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "# 5. Numeric vectors\n", 258 | "\n", 259 | "You have now learned the basic logic of how R works with simple expressions and individual numbers. The next step toward doing some real data analysis in R involves getting to grips with *vectors*. There are three main types of vectors in R: numeric, character-based and logical. We will start with numeric vectors. 
\n", 260 | "\n", 261 | "* Numeric vectors are ordered sequences of numbers (the order is data-dependent and the sequence can be of any length)\n", 262 | " + An example: 3, 67, 2, 500, 0.5, 10\n", 263 | "\n", 264 | "You can do calculations with vectors just like with individual numbers. The calculations are separately applied to each value. Actually, single numbers __are__ vectors with a length of 1!\n", 265 | "\n", 266 | "* Most mathematical functions also work on the same principle when applied to vectors\n", 267 | " + `exp()`, `log()`, `abs()` and so on\n", 268 | "* However, some functions treat the vector as a whole\n", 269 | " + `length()`, `sum()`, `mean()`, `median()`, `sd()`, `var()`, `min()`, `max()` \n", 270 | "* `summary()` tells us several things about a vector at once\n", 271 | "\n", 272 | "## 5.1 Initial exercises\n", 273 | " \n", 274 | "Let's consider a data set where we have numbers of steps and the estimated number of calories burned per day (from an activity bracelet worn for one week). First, let's create vectors of the data using `c()` and assign them to two separate objects. Then let's perform some calculations:" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "steps <- c(106, 0, 12775, 8287, 9222, 7080, 6055) \n", 284 | "kcal <- c(1356, 1341, 2109, 1882, 1970, 1938, 1851)\n", 285 | "\n", 286 | "# average hourly no. of steps each day\n", 287 | "steps / 24\n", 288 | "\n", 289 | "# energy burned per step each day\n", 290 | "kcal / steps" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "Note that the results are given as vectors. We also see a strange value called `Inf` (more on this later). 
For now, let's assign the average hourly number of steps to a new object and calculate the overall sum:" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "hourlysteps <- steps / 24\n", 307 | "tot_hourlysteps <- sum(hourlysteps) # One example of how we could calculate the sum\n", 308 | "tot_hourlysteps" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "It is also possible to calculate the same sum using just a single line of code. There are often many ways to achieve the same result in R. Choosing which way to write your code may depend on the situation or personal preference:" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "sum(steps / 24) # Another way to achieve the same thing" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "Next, try both of these ways to calculate the average hourly energy consumption over the week (that is, the total energy consumed over the week divided by the number of hours in a week). Now might also be a good time to create another checkpoint in case you haven't already!" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "## 5.2 Accessing specific values within a vector" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "As we have seen, the data within a numeric vector are always arranged in a specific order. We can refer to this ordering using *indices*. For example, the first value within a vector has an index of 1, the second an index of 2 (and so on). 
Sometimes it is of interest to extract or access only certain values within a vector. This can be done using square brackets and the index:" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [ 361 | "steps[1]" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "It is also easy to specify a range of elements using this notation:" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": null, 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "steps[1:5]\n", 378 | "kcal[6:7]" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "Again, the results are given as vectors. Using the object `steps`, try calculating the average numbers of steps separately for weekdays and the weekend. Hint: have a look at the list of functions given at the beginning of Section 5!" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "What if you wanted to extract values corresponding to Monday, Wednesday and Friday? Remember that a single index value is also a vector (with a length of 1). By extension, we can use a vector with a length of *n* to extract these values:" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "steps[c(1,3,5)]" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "Note that the following gives an error (because `1,3,5` is not a vector while `c(1,3,5)` is). 
If you were wondering why the error message talks about dimensions, this is because the code below treats `steps` as a 3D array (as opposed to a one-dimensional array, that is, a vector)." 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [ 424 | "steps[1,3,5]" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "# 6. Logical vectors\n", 432 | "\n", 433 | "Logical (also known as Boolean) vectors have two possible values: `TRUE` and `FALSE`. They come up as results from logical operations, such as comparing one value to another. Here we are told, element by element, whether the number of steps taken was above 1000:" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "steps > 1000" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "The possible comparisons are: \n", 450 | "\n", 451 | "- less than `<`\n", 452 | "- greater than `>`\n", 453 | "- less than or equal to `<=`\n", 454 | "- greater than or equal to `>=`\n", 455 | "- equal to `==`\n", 456 | "- not equal to `!=`\n", 457 | "\n", 458 | "There are also logical operators:\n", 459 | "- and `&`\n", 460 | "- or `|`\n", 461 | "- not `!`\n", 462 | "\n", 463 | "There is also a convenience operator for checking whether a value is equal to any member of a given set: `x %in% 1:3`.\n", 464 | "\n", 465 | "__NOTE:__ If you have never played with logical values before, do not worry. For basic usage, the situations where they are needed are quite limited. We will cover two common scenarios in the following sections." 
466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "## 6.1 Initial exercises\n", 473 | "\n", 474 | "One scenario where logical vectors are useful is when you want to subset your data (that is, extract part of a vector). For example, we can calculate the mean energy use on days where the number of steps exceeded 1000, presumably when the bracelet was worn all day long:" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [ 483 | "allday_steps <- steps > 1000 # select days where the number of steps exceeds 1000\n", 484 | "allday_steps\n", 485 | "\n", 486 | "allday_kcal <- kcal[allday_steps] # extract kcal values for those days\n", 487 | "allday_kcal\n", 488 | "\n", 489 | "mean(allday_kcal) # calculate the mean" 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": {}, 495 | "source": [ 496 | "Previously we tried to calculate the mean energy consumption per step, but we did not account for energy consumption during rest (this is called the basal metabolic rate or BMR). Indeed, one of the days had zero steps recorded while some energy was still consumed. For the purposes of this exercise, let's call this level of energy use the BMR. \n", 497 | "\n", 498 | "Let's attempt to recalculate the mean energy consumption per step using only the part of the energy use that exceeded the BMR. We can do this step by step:" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": null, 504 | "metadata": {}, 505 | "outputs": [], 506 | "source": [ 507 | "# NOTE: To complete this exercise, replace the points marked with '??' with the correct code.\n", 508 | "\n", 509 | "# First we need to find the BMR value in our data. 
\n", 510 | "# Hint: It's the element that coincides with the 'TRUE' value of the logical vector \"steps is equal to 0\":\n", 511 | "\n", 512 | "steps0 <- ??\n", 513 | "BMR <- kcal[??]\n", 514 | "BMR \n", 515 | "\n", 516 | "# You should see the BMR value come out as 1341" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": null, 522 | "metadata": {}, 523 | "outputs": [], 524 | "source": [ 525 | "# Next we can subtract the BMR value from the kcal values\n", 526 | "\n", 527 | "aboveBMR <- ??\n", 528 | "aboveBMR\n", 529 | "\n", 530 | "# You should now see smaller kcal values, with 0 as the second element" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": {}, 537 | "outputs": [], 538 | "source": [ 539 | "# Then we can recalculate the energy consumption per step:\n", 540 | "\n", 541 | "kcalperstep <- ??\n", 542 | "kcalperstep\n", 543 | "\n", 544 | "# You should see values between 0.060 and 0.085 for the full days, \n", 545 | "# something a little higher for the first day, and \"NaN\" for the second" 546 | ] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "metadata": {}, 551 | "source": [ 552 | "This result seems to make more sense! The `NaN` value will be explained in the next section.\n", 553 | "\n", 554 | "While the same calculation could have been performed by directly typing in the BMR value (1341), the approach we took is more useful because it illustrates the principle behind extracting specific values from a data set. Real data often need some form of subsetting (or other pre-treatment steps) and this is a good skill to learn! " 555 | ] 556 | }, 557 | { 558 | "cell_type": "markdown", 559 | "metadata": {}, 560 | "source": [ 561 | "## 6.2 Missing values\n", 562 | "\n", 563 | "### 6.2.1 Data with 'NA' values\n", 564 | "\n", 565 | "The first two values of the `steps` vector, 106 and 0, were probably due to the bracelet not being worn for the whole day or at all. 
In other words, we don't know the true step count because we are missing data. The R language provides an explicit way of stating this, the `NA` symbol (\"Not Available\").\n", 566 | "\n", 567 | "We could create an object with the step counts for Monday and Tuesday listed as `NA`. Let's also try calculating the overall sum of steps for this vector:" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": null, 573 | "metadata": {}, 574 | "outputs": [], 575 | "source": [ 576 | "steps_na <- c(NA, NA, 12775, 8287, 9222, 7080, 6055)\n", 577 | "sum(steps_na)" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "When a vector contains values denoted as `NA`, the sum of this vector is also reported as NA. This makes sense because we cannot know the result! However, what we can do is to calculate the sum of the remaining values while removing the missing data using a single additional _parameter_. This is another scenario where you commonly encounter logical values: " 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": null, 590 | "metadata": {}, 591 | "outputs": [], 592 | "source": [ 593 | "sum(steps_na, na.rm = TRUE)" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "The default behaviour of `sum()` and many other functions is to include missing values in the calculation. However, we have just changed how `sum()` behaves in this particular instance by adding the parameter `na.rm = TRUE`. We will learn more about functions and parameters later in this notebook.\n", 601 | "\n", 602 | "Sometimes we do not directly know which values are missing in a data set. A convenient way to find out is by using the *test* function `is.na()`. One might also be tempted to try a logical comparison using `==`. 
Let's try both (and think for a moment why the second result does not tell us much!):" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": null, 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "is.na(steps_na)\n", 612 | "steps_na == NA" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "In the previous example there was a related symbol, `NaN`, which stands for 'Not a Number'. This resulted from the calculation 0/0, which has no well-defined mathematical meaning. For our purposes, the difference between `NA` and `NaN` can safely be ignored. We also saw a value called `Inf` (or 'Infinity'). This resulted from calculating 1341/0." 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "### 6.2.2 Converting values to 'NA'" 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": {}, 632 | "source": [ 633 | "Sometimes one needs to deal with data sets where missing values are indicated by a number such as 0. This is problematic because such values will influence the outcome of calculations and other analyses. While it is best to avoid this type of notation from the outset, it is possible to rename such values as `NA` directly in R:" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": null, 639 | "metadata": {}, 640 | "outputs": [], 641 | "source": [ 642 | "steps[steps == 0] <- NA\n", 643 | "steps" 644 | ] 645 | }, 646 | { 647 | "cell_type": "markdown", 648 | "metadata": {}, 649 | "source": [ 650 | "This converted Tuesday's measurement to `NA`, but we also know the first value in the steps vector is unreliable. How would you convert the values from both Monday and Tuesday? 
One way would be to use indices just like before:" 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": null, 656 | "metadata": {}, 657 | "outputs": [], 658 | "source": [ 659 | "steps[c(1, 2)] <- NA\n", 660 | "steps" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": {}, 666 | "source": [ 667 | "# 7. Character vectors and factors\n", 668 | "\n", 669 | "Vectors come in one more flavour: character. The elements of character vectors are snippets of text, such as here:" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": {}, 676 | "outputs": [], 677 | "source": [ 678 | "treatg_char <- c(\"Trt1\", \"Trt2\", \"Ctrl\", \"Trt1\", \"Ctrl\", \"Trt2\") # Two treatment groups and one control" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "Note that each element needs to be surrounded by quotation marks. Both the `'` and `\"` symbols can be used. \n", 686 | "\n", 687 | "Data like this often represent group membership. While some of the elements could belong to specific groups (such as different treatments or an experimental control), character vectors do not take these interrelations into account because each element is treated as entirely independent. However, there is a specific type of object in R that can be used to group the data: *factors*. The defining feature of a factor is that it can have several *levels*. 
For example, in our case we have six elements belonging to three levels: Trt1, Trt2 and Ctrl.\n", 688 | "\n", 689 | "To clarify things further, let's describe the character vector using the `summary()` function:" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": {}, 696 | "outputs": [], 697 | "source": [ 698 | "summary(treatg_char)" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "Then let's change the character vector into a factor:" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": null, 711 | "metadata": {}, 712 | "outputs": [], 713 | "source": [ 714 | "treatg <- factor(treatg_char)\n", 715 | "summary(treatg)" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "In the first case, nothing is reported about the contents beyond the length and type. The second summary looks more useful: we can see that the factor has three levels, each with two observations. Here are two other functions that give us the names and number of levels:" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": null, 728 | "metadata": {}, 729 | "outputs": [], 730 | "source": [ 731 | "levels(treatg)\n", 732 | "nlevels(treatg)" 733 | ] 734 | }, 735 | { 736 | "cell_type": "markdown", 737 | "metadata": {}, 738 | "source": [ 739 | "Now that you are familiar with vectors and factors, you are ready for the most important data type: data frames.\n", 740 | "\n", 741 | "# 8. Data frames\n", 742 | "\n", 743 | "## 8.1 Creating and accessing a data frame\n", 744 | "\n", 745 | "So far, the exercises in this notebook have dealt with individual vectors that weren't actually connected to each other. We treated the `kcal` and `steps` vectors as if their elements corresponded to the same days of the week, but the data were assigned to separate objects. 
In the following the same data are assembled into a *data frame*. Data frames are where all data are usually kept and manipulated. They can be considered **the most important data structure in R**. \n", 746 | "\n", 747 | "For clarity, let's empty the workspace first:" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": null, 753 | "metadata": {}, 754 | "outputs": [], 755 | "source": [ 756 | "rm(list = ls()) # You may remember this from Section 2.2! " 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "metadata": {}, 762 | "source": [ 763 | "Now we are ready to create a data frame:" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": null, 769 | "metadata": {}, 770 | "outputs": [], 771 | "source": [ 772 | "bracelet <- data.frame(day = c(\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\", \"Saturday\", \"Sunday\"),\n", 773 | " steps = c(106, 0, 12775, 8287, 9222, 7080, 6055),\n", 774 | " kcal = c(1356, 1341, 2109, 1882, 1970, 1938, 1851))\n", 775 | "bracelet\n", 776 | "summary(bracelet)\n", 777 | "ls()" 778 | ] 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "metadata": {}, 783 | "source": [ 784 | "You can see that the whole data set is now shown as a table with rows and columns. This is the proper way to think about data frames: they are tables where columns are variables (in the statistical sense) and rows are observations. The columns of a data frame can still be treated like vectors or factors. The only difference is that you have to refer to the data frame where the data are stored, using the character `$`:" 785 | ] 786 | }, 787 | { 788 | "cell_type": "code", 789 | "execution_count": null, 790 | "metadata": {}, 791 | "outputs": [], 792 | "source": [ 793 | "bracelet$steps # typing just 'steps' would now give an error message!" 
794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": {}, 799 | "source": [ 800 | "## 8.2 Initial exercise\n", 801 | "\n", 802 | "Armed with this knowledge, try and see if you can redo some of the previous calculations related to `steps` and `kcal`, using the code cell below. Let's focus on calculating the mean energy consumption per step while accounting for the BMR (as originally described in Section 6.1).\n", 803 | "\n", 804 | "__NOTE__: This could be challenging and you may run into some pitfalls. If something goes wrong, take your time and try to understand what happened. Remember that you can use `ls()` to see the current objects and that you can restart from the previous code cells if you get confused." 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "execution_count": null, 810 | "metadata": {}, 811 | "outputs": [], 812 | "source": [] 813 | }, 814 | { 815 | "cell_type": "markdown", 816 | "metadata": {}, 817 | "source": [ 818 | "## 8.3 Working with the iris data set\n", 819 | "\n", 820 | "A key step in learning how to analyse data with R would naturally involve importing your own data to work with. We will soon cover this, but first we need to cover a few more topics including functions, working directories and CSV files. To do this, let's still make use of some of the example data that come as part of a default R installation. \n", 821 | "\n", 822 | "The example data sets in R can be imported into the workspace using a function called `data()`. 
First, let's look at a classic example, the iris data set:" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": null, 828 | "metadata": {}, 829 | "outputs": [], 830 | "source": [ 831 | "data(iris)\n", 832 | "summary(iris)\n", 833 | "head(iris)" 834 | ] 835 | }, 836 | { 837 | "cell_type": "markdown", 838 | "metadata": {}, 839 | "source": [ 840 | "The summary shows us that there are five variables: four numeric ones and one factor with three levels (each with 50 observations). The `head(iris)` call is used to look at the first few rows of the data frame, instead of looking at all 150 of them. Next, we can look at the documentation page of this data set." 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": null, 846 | "metadata": {}, 847 | "outputs": [], 848 | "source": [ 849 | "?iris" 850 | ] 851 | }, 852 | { 853 | "cell_type": "markdown", 854 | "metadata": {}, 855 | "source": [ 856 | "It is said that, when doing data analysis, at least 80% of the time is spent on cleaning, rearranging and otherwise preparing the data. R is great for that, particularly when combined with a good user interface (RStudio) and some dedicated *packages* (more on those in Section 12). Most of that is beyond this basic introduction, but a few simple tricks will get us started. The first one is how to add new variables into the data frame. One way to do this is by simple assignment. 
Let's pretend that the sepal leaves of the iris flowers are rectangular, and add their area to the data:" 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "execution_count": null, 862 | "metadata": {}, 863 | "outputs": [], 864 | "source": [ 865 | "iris$Sepal.Area <- iris$Sepal.Width * iris$Sepal.Length\n", 866 | "summary(iris)" 867 | ] 868 | }, 869 | { 870 | "cell_type": "markdown", 871 | "metadata": {}, 872 | "source": [ 873 | "Now try the same with petal leaves:" 874 | ] 875 | }, 876 | { 877 | "cell_type": "code", 878 | "execution_count": null, 879 | "metadata": {}, 880 | "outputs": [], 881 | "source": [] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": { 886 | "collapsed": true 887 | }, 888 | "source": [ 889 | "We can also try taking subsets of the data frame. Recall how we used brackets and indices for extracting parts of vectors. The same kind of syntax works for data frames, except that there are now two sets of indices, one for rows and another for columns. Consider this:" 890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": null, 895 | "metadata": {}, 896 | "outputs": [], 897 | "source": [ 898 | "corner <- iris[1:5, 1:4]\n", 899 | "corner" 900 | ] 901 | }, 902 | { 903 | "cell_type": "markdown", 904 | "metadata": {}, 905 | "source": [ 906 | "This created a new data frame, which consists of the rows 1 to 5 and columns 1 to 4 of the iris data set. The indices before the comma refer to rows and the indices after it refer to columns. Leaving the other index out completely means \"all of them\". 
This means that, in addition to the `$` notation from before, you can also extract the data from an entire column like this:" 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": null, 912 | "metadata": {}, 913 | "outputs": [], 914 | "source": [ 915 | "iris[,4]" 916 | ] 917 | }, 918 | { 919 | "cell_type": "markdown", 920 | "metadata": {}, 921 | "source": [ 922 | "Earlier we also learned about using logical vectors to extract parts of vectors that satisfy a condition. This works for data frames too, in an analogous manner. However, there is also a function called `subset()` that is more convenient to use. Here are two examples:" 923 | ] 924 | }, 925 | { 926 | "cell_type": "code", 927 | "execution_count": null, 928 | "metadata": {}, 929 | "outputs": [], 930 | "source": [ 931 | "smallones <- subset(iris, Sepal.Length < 5)\n", 932 | "setosas <- subset(iris, Species == \"setosa\")" 933 | ] 934 | }, 935 | { 936 | "cell_type": "markdown", 937 | "metadata": {}, 938 | "source": [ 939 | "This function works so that you first supply the data frame you want to subset and you then specify the logical rule for the rows you want to keep. Note that inside this function call you only need to mention the data frame once, and then all its columns can be referred to without the `$` symbol.\n", 940 | "\n", 941 | "Try now to calculate the average sepal length for each species separately. You should know (at least) two ways to do this. First, you can exploit the fact that the species observations are nicely in order, 50 observations at a time. Second, because things are usually not this nicely arranged, you should try splitting the data frame into three single-species parts with `subset()`. This solution was actually started in the previous code cell already.\n", 942 | "\n", 943 | "(Note that while these approaches suit us currently, there are other ways to work with the data that could be used 'in the real world'. 
We will not cover those here, but for example one could use the `aggregate()` function in base R or `summarise()` available via the package `dplyr`. Packages are discussed later in more detail. For illustrative purposes, the solution using `aggregate()` is given in the code cell below.)" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": {}, 950 | "outputs": [], 951 | "source": [ 952 | "# YOUR CODE HERE\n", 953 | "\n", 954 | "# The solution using 'aggregate()'\n", 955 | "# aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)" 956 | ] 957 | }, 958 | { 959 | "cell_type": "markdown", 960 | "metadata": {}, 961 | "source": [ 962 | "The correct answers are 5.006, 5.936 and 6.588 for setosa, versicolor and virginica, respectively.\n", 963 | "\n", 964 | "# 9. More about functions and their documentation\n", 965 | "\n", 966 | "Everything in R is accomplished by using functions. In fact, once you know the basics, learning to do data analysis using R often revolves around developing a working understanding of different functions and how they might be applied in different situations. Luckily there is an active and helpful R community online (one example is [stackoverflow.com](http://stackoverflow.com/questions/tagged/r)). Another useful resource comes in the form of documents - each R function comes with its own documentation with details on how exactly it can be used.\n", 967 | "\n", 968 | "You can bring up the documentation of a function by typing the question mark `?` in front of the function name (without the parentheses), like before with the iris data set. 
Try this now with the function `mean`:" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": null, 974 | "metadata": {}, 975 | "outputs": [], 976 | "source": [ 977 | "?mean" 978 | ] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "metadata": {}, 983 | "source": [ 984 | "The documentation pages have a common structure, although not all have every section present. First there is **Description**, an outline of what the function does. **Usage** and **Arguments** list all the arguments the function accepts (including their type, e.g. numeric or logical) and their default values. There is often a section on **Details**, a longer explanation of how the function works and is intended for use. **Value** explains what kind of output the function will produce. At the end, there is usually a list of related functions under **See also** and finally some practical **Examples**.\n", 985 | "\n", 986 | "Try now to figure out how to use a completely new function based only on its documentation. Use `sample()` to produce\n", 987 | "\n", 988 | "- a Finnish lottery ticket: 7 randomly drawn numbers out of 40\n", 989 | "- a vector of 0s and 1s, representing 10 coin tosses \n", 990 | " - you can also try using a factor with the levels 'heads' and 'tails', if you feel like it!" 991 | ] 992 | }, 993 | { 994 | "cell_type": "code", 995 | "execution_count": null, 996 | "metadata": {}, 997 | "outputs": [], 998 | "source": [] 999 | }, 1000 | { 1001 | "cell_type": "markdown", 1002 | "metadata": { 1003 | "collapsed": true 1004 | }, 1005 | "source": [ 1006 | "# 10. Importing data from CSV files\n", 1007 | "\n", 1008 | "Data can come in many formats, but we will cover one of the most typical: CSV files. A CSV ('Comma-separated values') file is a text file containing data in a tabular format. The first row is usually a header row containing column (variable) names, with the rest of the rows containing the actual data. Columns are separated by commas. 
Software packages such as Excel, SPSS, SAS and many others are able to export data in this format. Here's an example of what such data look like (these are the first few rows of a CSV file called weather-kumpula.csv we will soon use):\n", 1009 | "\n", 1010 | " ts,year,month,day,dp,rmm,wdir,ws,t,rh,p\n", 1011 | " 2014-01-01,2014,1,1,0.7526754690757457,1.640000000000001,158.0,4.61702571230021,3.05357887421823,84.87491313412092,1011.9901320361296\n", 1012 | " 2014-01-02,2014,1,2,-2.2381944444444573,0.6000000000000004,118.0,3.627777777777778,0.4078472222222211,81.09583333333333,1012.2958333333349\n", 1013 | " 2014-01-03,2014,1,3,-1.4593055555555527,0.5000000000000002,141.0,3.3281944444444385,1.0606944444444493,82.28611111111111,1008.8503472222206\n", 1014 | " 2014-01-04,2014,1,4,0.3417361111111121,3.6999999999999966,219.0,4.28805555555555,1.7476388888888887,90.35138888888889,1001.7586111111136\n", 1015 | "\n", 1016 | "The function for importing data in this format is called `read.csv()`. \n", 1017 | "\n", 1018 | "Were we to check its documentation, we would see that `read.csv()` is an alias for a function called `read.table()`. This function can read many other tabular data formats than just CSV files, although we don't need to do this now. One thing to note, however: Finnish locale settings and some others use commas as the decimal (rather than column) separator. Because of this, there also exists a non-standard version of the CSV format that uses the semicolon (`;`) as the column separator. For these situations, the function `read.csv2()` will work.\n", 1019 | "\n", 1020 | "All you need to tell `read.csv()` is where to find the file. It is done by giving the name of the file, as a character string, like this:\n", 1021 | "\n", 1022 | "read.csv(\"filename.csv\")\n", 1023 | "\n", 1024 | "This only works if the file is in your current R session *working directory*. 
To see what the current working directory is, use this command:" 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "code", 1029 | "execution_count": null, 1030 | "metadata": {}, 1031 | "outputs": [], 1032 | "source": [ 1033 | "getwd() # This will likely say: '/home/jovyan/R-for-beginners'" 1034 | ] 1035 | }, 1036 | { 1037 | "cell_type": "markdown", 1038 | "metadata": {}, 1039 | "source": [ 1040 | "To set a new working directory or to access a file outside it, you can:\n", 1041 | "\n", 1042 | "- change the current working directory to something else using the function `setwd()`\n", 1043 | "- give the whole path to the file\n", 1044 | " - note to Windows users: instead of the `\\` character, you have to use `/` or `\\\\`\n", 1045 | "- use interactive file choosing: `read.csv(file.choose())`\n", 1046 | " - note: this does not work on this notebook!\n", 1047 | "- use the import wizard in RStudio (although the above options also work in RStudio)\n", 1048 | "\n", 1049 | "In the case of this notebook, we can import the weather data and assign it to an object using the following code:" 1050 | ] 1051 | }, 1052 | { 1053 | "cell_type": "code", 1054 | "execution_count": null, 1055 | "metadata": {}, 1056 | "outputs": [], 1057 | "source": [ 1058 | "weather <- read.csv(\"data/weather-kumpula.csv\") # The data are imported as a data frame" 1059 | ] 1060 | }, 1061 | { 1062 | "cell_type": "markdown", 1063 | "metadata": {}, 1064 | "source": [ 1065 | "Now you should be able to both import data *and* understand what's going on when you do that. If you have any data of your own, now would be a great time to try and import it. You just need to upload it to the notebook first (from the Home page). If not, let's get familiar with the weather data.\n", 1066 | "\n", 1067 | "These data consist of daily weather measurements for one year at Kumpula, Helsinki. 
The names of the variables are:" 1068 | ] 1069 | }, 1070 | { 1071 | "cell_type": "code", 1072 | "execution_count": null, 1073 | "metadata": {}, 1074 | "outputs": [], 1075 | "source": [ 1076 | "names(weather)" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "markdown", 1081 | "metadata": {}, 1082 | "source": [ 1083 | "and they stand for: \n", 1084 | "\n", 1085 | "- timestamp\n", 1086 | "- year\n", 1087 | "- month\n", 1088 | "- day\n", 1089 | "- dewpoint (°C)\n", 1090 | "- rainfall (mm)\n", 1091 | "- wind direction (°)\n", 1092 | "- wind speed\n", 1093 | "- temperature (°C)\n", 1094 | "- relative humidity\n", 1095 | "- air pressure (millibars)\n", 1096 | "\n", 1097 | "You can now use the knowledge you have to get a better handle on this data set. For example:\n", 1098 | "\n", 1099 | "- Use the function `dim()` on the whole data frame. What does this function tell you?\n", 1100 | "- Use `summary()` on the whole data frame to get an overall feel for the values. \n", 1101 | " - For example, what is the range of temperatures observed over the year?\n", 1102 | "- Take a look at a function called `complete.cases()`. Use it, possibly together with `subset()` and/or `dim()`, to see on how many days there are missing values.\n", 1103 | " - There are different ways to do this, but it is possible to make use of the logical operators listed in Section 6.\n", 1104 | " - (Answer: three)\n", 1105 | "- How much did it rain in September, October and November in total? \n", 1106 | " - Remember the `%in%` operator mentioned back in Section 6! \n", 1107 | " - (Answer: 195.48 mm)" 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": null, 1113 | "metadata": {}, 1114 | "outputs": [], 1115 | "source": [] 1116 | }, 1117 | { 1118 | "cell_type": "markdown", 1119 | "metadata": {}, 1120 | "source": [ 1121 | "# 11. 
Data analysis using formulas\n", 1122 | "\n", 1123 | "You are now almost ready to start working with real data: you have learned about vectors, factors and data frames, are able to import data into R and understand how the language works. There is one more concept that we'll cover in this notebook and which is useful to learn: *formulas*. Here's an example:" 1124 | ] 1125 | }, 1126 | { 1127 | "cell_type": "code", 1128 | "execution_count": null, 1129 | "metadata": {}, 1130 | "outputs": [], 1131 | "source": [ 1132 | "# let's make sure you have your iris data set with you still:\n", 1133 | "data(iris)\n", 1134 | "# create a scatter plot of sepal length vs sepal width:\n", 1135 | "plot(Sepal.Length ~ Sepal.Width, data = iris)" 1136 | ] 1137 | }, 1138 | { 1139 | "cell_type": "markdown", 1140 | "metadata": {}, 1141 | "source": [ 1142 | "`plot()` is a general-purpose plotting function. What we have typed inside the brackets is a *formula* specifying the type of plot we want, and a data frame including the required variables. What characterises formulas in R is the `~` character in the middle. Its purpose is best revealed if you think of the `~` character as something similar to the word \"by\". In this case, we plotted `Sepal.Length` \"by\" `Sepal.Width`. When both variables are numeric vectors, `plot()` creates a scatter plot.\n", 1143 | "\n", 1144 | "Here's another example:" 1145 | ] 1146 | }, 1147 | { 1148 | "cell_type": "code", 1149 | "execution_count": null, 1150 | "metadata": {}, 1151 | "outputs": [], 1152 | "source": [ 1153 | "lm(Sepal.Length ~ Sepal.Width, data = iris)" 1154 | ] 1155 | }, 1156 | { 1157 | "cell_type": "markdown", 1158 | "metadata": {}, 1159 | "source": [ 1160 | "Same formula but a different function: this time we used `lm()` to fit a linear model corresponding to the scatter plot above. The left-hand side of the formula contains the response (dependent) variable whereas on the right we have the explanatory variable. 
The output of `lm()` is minimal, showing only the coefficient estimates of the fitted model. Use `summary()` on the result to see something more enlightening: " 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "code", 1165 | "execution_count": null, 1166 | "metadata": {}, 1167 | "outputs": [], 1168 | "source": [ 1169 | "model <- lm(Sepal.Length ~ Sepal.Width, data = iris)\n", 1170 | "summary(model)" 1171 | ] 1172 | }, 1173 | { 1174 | "cell_type": "markdown", 1175 | "metadata": {}, 1176 | "source": [ 1177 | "(Note: while the output of `lm()` seemed to be just a bit of text, it is actually a *list* object. Lists are a data type we have skipped in this notebook, but you can visit the documentation of `lm()` to get a better idea of what the result list actually is. Also note that data frames are in fact a special case of lists. There is also an object type called a *matrix* that we have not covered in this notebook. A matrix is similar to a vector with the exception of being two-dimensional. Unlike a data frame, a matrix usually only houses numeric information.)\n", 1178 | "\n", 1179 | "Here is one more example of where you might use formulas. We can see if there's a difference in sepal length between two of the species using a t-test:" 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "code", 1184 | "execution_count": null, 1185 | "metadata": {}, 1186 | "outputs": [], 1187 | "source": [ 1188 | "t.test(Sepal.Length ~ Species, data = iris, subset = Species != \"versicolor\")" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "markdown", 1193 | "metadata": {}, 1194 | "source": [ 1195 | "Further to specifying the formula and the data frame, we have also included an argument for subsetting the data frame. Functions that take a formula as an argument usually also accept a subset argument. 
It works exactly like the `subset()` function we encountered earlier, but skips creating a separate subsetted data frame.\n", 1196 | "\n", 1197 | "Formulas establish the difference between factors and vectors rather clearly. Consider this example data set:" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "code", 1202 | "execution_count": null, 1203 | "metadata": {}, 1204 | "outputs": [], 1205 | "source": [ 1206 | "data(ToothGrowth)\n", 1207 | "summary(ToothGrowth)" 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "markdown", 1212 | "metadata": {}, 1213 | "source": [ 1214 | "This data set contains the results from an experiment concerning the effect of vitamin C on the tooth length of guinea pigs. There are two supplement methods and three dose levels, making up a total of six groups with 10 animals in each. The `supp` variable is a factor while `len` and `dose` are numeric. However, it is a valid question to ask whether the doses should be treated as numeric values or as group labels. The difference should become apparent if we convert `dose` into a factor. 
We can try fitting models and plotting the data for both scenarios:" 1215 | ] 1216 | }, 1217 | { 1218 | "cell_type": "code", 1219 | "execution_count": null, 1220 | "metadata": {}, 1221 | "outputs": [], 1222 | "source": [ 1223 | "# Version with 'dose' as numeric\n", 1224 | "\n", 1225 | "summary(lm(len ~ dose, data = ToothGrowth))\n", 1226 | "plot(len ~ dose, data = ToothGrowth)" 1227 | ] 1228 | }, 1229 | { 1230 | "cell_type": "code", 1231 | "execution_count": null, 1232 | "metadata": {}, 1233 | "outputs": [], 1234 | "source": [ 1235 | "# Version with 'dose' as a factor\n", 1236 | "\n", 1237 | "ToothGrowth$dose.factor <- factor(ToothGrowth$dose)\n", 1238 | "\n", 1239 | "summary(lm(len ~ dose.factor, data = ToothGrowth))\n", 1240 | "plot(len ~ dose.factor, data = ToothGrowth)" 1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "markdown", 1245 | "metadata": {}, 1246 | "source": [ 1247 | "One final case where formulas could be useful: earlier, when calculating groupwise means with the iris dataset, it was mentioned that these could also be calculated using the function `aggregate()`. That function is also compatible with formulas. See if you can figure out how to use `aggregate()` in combination with a formula to calculate the mean tooth length for each dose (you can use either the numeric or factor version of `dose`): " 1248 | ] 1249 | }, 1250 | { 1251 | "cell_type": "code", 1252 | "execution_count": null, 1253 | "metadata": {}, 1254 | "outputs": [], 1255 | "source": [] 1256 | }, 1257 | { 1258 | "cell_type": "markdown", 1259 | "metadata": {}, 1260 | "source": [ 1261 | "Finally, feel free to try any of these plots and other methods using the weather data. Here are a few questions\n", 1262 | "for inspiration:\n", 1263 | "\n", 1264 | "- Make box plots of monthly temperatures. Which month seems to show the most fluctuation?\n", 1265 | "- Is there any relationship between air pressure and wind speed?\n", 1266 | "- How about between air pressure and temperature? 
Does the answer differ between summer and winter?" 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "code", 1271 | "execution_count": null, 1272 | "metadata": {}, 1273 | "outputs": [], 1274 | "source": [] 1275 | }, 1276 | { 1277 | "cell_type": "markdown", 1278 | "metadata": {}, 1279 | "source": [ 1280 | "# 12. Packages\n", 1281 | "\n", 1282 | "A big part of R's success in the world of data analysis is that the scientific community is able to endlessly expand the collection of available methods through *packages*. Packages are user-developed open-source collections of functions and their associated documentation. So far we have been able to get by with functions that come with a default installation and a fresh R session. Only the most basic packages are automatically enabled - these include what we call 'base' R functions (i.e. what we have focused on in this notebook).\n", 1283 | "\n", 1284 | "To enable packages, one needs to use the `library()` command, for example:\n" 1285 | ] 1286 | }, 1287 | { 1288 | "cell_type": "code", 1289 | "execution_count": null, 1290 | "metadata": {}, 1291 | "outputs": [], 1292 | "source": [ 1293 | "library(stats)" 1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "markdown", 1298 | "metadata": {}, 1299 | "source": [ 1300 | "Here, `stats` is one of the packages that come as part of a default R installation. It also contains the function `lm()` we used in Section 11. More (roughly 14 000 at the time of writing this) are available on CRAN, the main repository of R packages. These need to be installed first using for example the function `install.packages()` (or one of the menu items when using a graphical user interface), and then taken into use with the `library()` command. \n", 1301 | "\n", 1302 | "The abundance of packages is overwhelming. In practice, however, we end up using only a limited selection that matches our needs. 
To avoid getting lost, you can use the traditional methods of asking colleagues, and searching the Internet for advice and practical examples. For another good resource, you can have a look at Task Views on the [CRAN home page](http://cran.r-project.org/). These are curated lists of packages related to a certain field of study, with short explanations of which functions perform which tasks.\n", 1303 | "\n", 1304 | "## What next?\n", 1305 | "\n", 1306 | "You have reached the end of this notebook. Hopefully you have learned enough of the basics to start learning more on your own, or be able to tell what is going on if you are faced with a colleague's R script. You can actually test this: scroll back to the very beginning of this notebook, to the first code cell. You should be able to tell a function from an assignment etc. You will see some unfamiliar function calls but you should know how to bring up their documentation to see what they do. Another thing you can try: search for a method or task familiar to you and add \"with R\" to the search, see what you find... and see if you can understand the R code part of the content.\n", 1307 | "\n", 1308 | "Of course, consider installing both R and RStudio on your own computer. Note that they are two different things: R is provided by CRAN and comes with its own (very basic) interface. RStudio is an alternative user interface provided by the RStudio company. The open-source version of RStudio can be downloaded for free.\n", 1309 | "\n", 1310 | "Using RStudio efficiently is its own separate learning objective and is intertwined with learning how to use a collection of packages called __tidyverse__, also provided by the RStudio company. See the [tidyverse website](http://www.tidyverse.org/) for more information and the book 'R for Data Science' linked there. These packages have an underlying philosophy that slightly differs from base R, so prepare to relearn a few things. 
Watch videos, read through tutorials, take a course online or offline. It'll be worth it. Good luck!" 1311 | ] 1312 | } 1313 | ], 1314 | "metadata": { 1315 | "celltoolbar": "Raw Cell Format", 1316 | "kernelspec": { 1317 | "display_name": "R", 1318 | "language": "R", 1319 | "name": "ir" 1320 | }, 1321 | "language_info": { 1322 | "codemirror_mode": "r", 1323 | "file_extension": ".r", 1324 | "mimetype": "text/x-r-source", 1325 | "name": "R", 1326 | "pygments_lexer": "r", 1327 | "version": "3.5.1" 1328 | } 1329 | }, 1330 | "nbformat": 4, 1331 | "nbformat_minor": 1 1332 | } 1333 | -------------------------------------------------------------------------------- /data/weather-kumpula.csv: -------------------------------------------------------------------------------- 1 | ts,year,month,day,dp,rmm,wdir,ws,t,rh,p 2 | 2014-01-01,2014,1,1,0.7526754690757457,1.640000000000001,158.0,4.61702571230021,3.05357887421823,84.87491313412092,1011.9901320361296 3 | 2014-01-02,2014,1,2,-2.2381944444444573,0.6000000000000004,118.0,3.627777777777778,0.4078472222222211,81.09583333333333,1012.2958333333349 4 | 2014-01-03,2014,1,3,-1.4593055555555527,0.5000000000000002,141.0,3.3281944444444385,1.0606944444444493,82.28611111111111,1008.8503472222206 5 | 2014-01-04,2014,1,4,0.3417361111111121,3.6999999999999966,219.0,4.28805555555555,1.7476388888888887,90.35138888888889,1001.7586111111136 6 | 2014-01-05,2014,1,5,2.8411111111111027,0.8000000000000004,90.0,2.4786111111111104,4.128194444444437,91.43263888888889,1003.1302083333369 7 | 2014-01-06,2014,1,6,0.509930555555556,0.9000000000000006,146.0,3.740624999999999,1.6868055555555543,91.88472222222222,1003.9509027777823 8 | 2014-01-07,2014,1,7,1.8627777777777876,10.769999999999996,259.0,3.294305555555547,2.9284722222222253,92.84305555555555,998.4985416666666 9 | 2014-01-08,2014,1,8,3.7473611111110805,2.439999999999994,214.0,3.3897222222222183,4.758680555555566,93.27777777777777,994.2368055555502 10 | 
2014-01-09,2014,1,9,2.7474305555555634,1.990000000000001,90.0,2.444791666666665,4.400625000000003,89.27083333333333,991.9824305555547 11 | 2014-01-10,2014,1,10,0.1130555555555591,11.67999999999998,349.0,4.752708333333331,1.356388888888889,91.09861111111111,981.3111111111124 12 | 2014-01-11,2014,1,11,-5.206458333333346,1.3900000000000006,45.0,3.000000000000004,-3.6413888888889003,84.51875,992.1324305555495 13 | 2014-01-12,2014,1,12,-9.559027777777755,0.33000000000000007,343.0,3.3293055555555573,-7.801458333333309,79.54513888888889,995.7530555555536 14 | 2014-01-13,2014,1,13,-12.85145833333333,0.13999999999999999,332.0,3.800902777777774,-10.851388888888858,75.23402777777778,1004.7345833333306 15 | 2014-01-14,2014,1,14,-15.186666666666664,0.12999999999999998,6.0,2.0574305555555537,-14.036180555555525,78.43958333333333,1011.2465277777737 16 | 2014-01-15,2014,1,15,-12.415763888888824,0.6600000000000004,68.0,2.359722222222223,-11.364097222222163,81.41875,1013.3979166666668 17 | 2014-01-16,2014,1,16,-13.0351388888889,0.23000000000000007,23.0,3.3299305555555594,-11.645694444444418,78.81319444444445,1015.2312500000024 18 | 2014-01-17,2014,1,17,-14.43479166666661,0.23000000000000007,17.0,2.604166666666665,-13.28097222222222,79.06527777777778,1016.4515277777756 19 | 2014-01-18,2014,1,18,-14.829444444444466,0.35000000000000014,6.0,2.6052083333333376,-13.467152777777766,77.44722222222222,1021.0986805555558 20 | 2014-01-19,2014,1,19,-14.010763888888874,0.23000000000000007,17.0,2.103333333333336,-12.878263888888927,79.54930555555555,1028.8365972222334 21 | 2014-01-20,2014,1,20,-13.180416666666666,0.6300000000000003,360.0,2.132152777777779,-11.83298611111109,78.89722222222223,1028.7361111111154 22 | 2014-01-21,2014,1,21,-11.696458333333323,0.34000000000000014,349.0,1.656875,-10.76756944444443,82.89722222222223,1024.0152083333364 23 | 2014-01-22,2014,1,22,-15.40701388888891,0.2800000000000001,360.0,2.8827777777777834,-14.390069444444457,79.19097222222223,1021.9112499999967 24 | 
2014-01-23,2014,1,23,-16.017986111111156,0.37000000000000016,51.0,2.099791666666668,-15.308611111111144,80.61458333333333,1026.129027777776 25 | 2014-01-24,2014,1,24,-15.923680555555567,0.4300000000000002,309.0,1.6525000000000019,-14.884027777777815,78.55625,1028.6785416666717 26 | 2014-01-25,2014,1,25,-9.696597222222202,0.4000000000000002,225.0,2.3929861111111097,-7.312013888888878,75.66944444444445,1027.3471527777776 27 | 2014-01-26,2014,1,26,-9.348055555555545,0.21000000000000005,84.0,1.9796527777777777,-7.719027777777778,80.40763888888888,1023.6088194444471 28 | 2014-01-27,2014,1,27,-9.361041666666685,0.4000000000000002,96.0,5.00159722222222,-8.029305555555522,82.27361111111111,1020.5417361111092 29 | 2014-01-28,2014,1,28,-9.204513888888885,0.5600000000000003,135.0,6.017499999999992,-7.690416666666642,81.31180555555555,1026.9793749999985 30 | 2014-01-29,2014,1,29,-12.11347222222225,0.35000000000000014,135.0,5.620972222222219,-8.277291666666649,65.79930555555555,1033.5209027777737 31 | 2014-01-30,2014,1,30,-13.0102777777778,0.4400000000000002,152.0,4.539930555555559,-10.281944444444418,71.15138888888889,1034.3155555555531 32 | 2014-01-31,2014,1,31,-12.310552061495466,2.8200000000000007,174.0,5.682809224318653,-9.193501048217996,69.55415793151643,1028.1070580013975 33 | 2014-02-01,2014,2,1,-7.777592205984722,9.959999999999955,124.0,5.513500347947103,-6.364648573416828,83.25330549756437,1014.0991649269325 34 | 2014-02-02,2014,2,2,-7.5808123249299575,4.369999999999991,219.0,3.957142857142847,-3.2987394957982987,67.73319327731092,1013.0142156862726 35 | 2014-02-03,2014,2,3,-1.5145138888888905,4.2199999999999935,231.0,3.5013194444444418,-0.9972916666666743,94.95,1015.7093055555579 36 | 2014-02-04,2014,2,4,-1.1425694444444372,1.440000000000001,118.0,2.5068055555555553,-0.477013888888892,94.28125,1017.3627777777748 37 | 2014-02-05,2014,2,5,-4.618888888888904,1.660000000000001,96.0,2.623819444444446,-3.2181249999999983,86.23333333333333,1008.3406250000032 38 | 
2014-02-06,2014,2,6,-6.220000000000004,3.499999999999998,84.0,1.751736111111114,-4.989513888888882,85.82777777777778,999.7083333333347 39 | 2014-02-07,2014,2,7,-2.343094721619675,1.0100000000000007,158.0,1.981417208966017,-1.7689081706435437,93.7823571945047,997.1431670282025 40 | 2014-02-08,2014,2,8,0.15647143890094023,7.2399999999999824,214.0,5.174548083875634,1.1874186550976076,92.85177151120752,989.9902386117142 41 | 2014-02-09,2014,2,9,-0.42523499638467277,0.8200000000000005,158.0,4.654157628344174,1.7938539407086105,85.17932031814895,995.0934201012282 42 | 2014-02-10,2014,2,10,0.2090972222222215,2.3300000000000014,113.0,2.372569444444449,0.9460416666666616,94.82777777777778,998.0920833333255 43 | 2014-02-11,2014,2,11,-0.6383333333333365,4.839999999999985,90.0,2.2356250000000006,-0.09770833333333398,95.62777777777778,1000.4303472222167 44 | 2014-02-12,2014,2,12,-0.3106290672451178,3.7799999999999874,214.0,2.8804772234273273,0.3780911062906712,94.97252349963847,996.8007953723787 45 | 2014-02-13,2014,2,13,-0.47606652205350686,2.65999999999999,146.0,3.0297180043383904,0.8244396240057831,90.81561822125813,998.4788141720875 46 | 2014-02-14,2014,2,14,-1.895227765726686,0.5800000000000003,113.0,2.730079537237891,0.33412870571222253,83.64497469269703,1000.922993492412 47 | 2014-02-15,2014,2,15,-2.2013738250180763,0.6600000000000004,152.0,2.346637744034709,-0.10072306579898735,84.17208966015907,1003.1545914678222 48 | 2014-02-16,2014,2,16,-0.05647143890093878,3.2899999999999947,203.0,4.537744034707153,0.9064352856109892,93.00939985538685,992.0965292841602 49 | 2014-02-17,2014,2,17,0.5301518438177857,2.4999999999999942,281.0,3.0981923355025294,1.6381778741865618,92.4945770065076,986.6694143166997 50 | 2014-02-18,2014,2,18,-0.5454085321764278,0.9300000000000006,293.0,3.55437454808388,1.304627621113529,87.29211858279103,994.8289226319617 51 | 
2014-02-19,2014,2,19,-1.601952277657266,0.5500000000000003,309.0,1.5370209689081686,0.632754880694142,83.9060014461316,1000.3556760665207 52 | 2014-02-20,2014,2,20,-3.8271872740419166,0.6800000000000004,62.0,2.453434562545195,-2.1522053506869074,85.1706435285611,1002.4838756326812 53 | 2014-02-21,2014,2,21,-4.823861171366591,0.9500000000000004,84.0,3.530585683297177,-2.4559652928416527,80.1062906724512,1005.9491684743266 54 | 2014-02-22,2014,2,22,0.4550976138828603,4.959999999999982,214.0,4.624005784526396,1.4934201012292105,92.80477223427332,999.8073752711475 55 | 2014-02-23,2014,2,23,0.5838033261026753,4.099999999999993,225.0,5.645336225596518,1.9861894432393348,90.48879248011569,1006.7564714389006 56 | 2014-02-24,2014,2,24,0.12704266088214242,4.1899999999999915,225.0,6.512147505422989,3.288647866955896,79.93275488069415,1013.0845263919008 57 | 2014-02-25,2014,2,25,-0.8857556037599369,3.2899999999999943,186.0,4.203036876355749,1.6833694866232816,82.66738973246565,1019.6940708604516 58 | 2014-02-26,2014,2,26,-2.406724511930585,1.8700000000000014,169.0,2.8984815618221265,0.0013738250180775839,82.03326102675344,1021.3319595083174 59 | 2014-02-27,2014,2,27,-1.3640636297903146,3.839999999999996,146.0,2.066522053506871,0.5286334056399132,86.26681127982647,1016.8091829356492 60 | 2014-02-28,2014,2,28,-1.2892986261749824,3.829999999999995,79.0,1.8605206073752731,0.3344902386117153,87.94577006507592,1012.2981200289241 61 | 2014-03-01,2014,3,1,-2.1121475054229926,2.7899999999999943,84.0,2.7056399132321034,-0.32942877801880033,86.24005784526392,1009.0216919739747 62 | 2014-03-02,2014,3,2,-0.9498192335502476,3.939999999999993,158.0,2.2316702819956618,0.1951554591467827,91.33333333333333,1006.188503253794 63 | 2014-03-03,2014,3,3,-0.23644251626898014,10.009999999999982,219.0,3.8099060014461403,0.4276211135213302,95.27548806941432,1003.993203181487 64 | 
2014-03-04,2014,3,4,-0.11836587129428666,5.179999999999993,96.0,1.4609544468546627,0.8677512653651487,93.24511930585683,1011.1805495300104 65 | 2014-03-05,2014,3,5,-1.3140274765003597,4.549999999999994,68.0,2.527621113521329,0.9139551699204645,84.48662328271872,1018.3848879247971 66 | 2014-03-06,2014,3,6,-1.4469609261939198,2.4399999999999977,214.0,2.690520984081042,0.30065123010130257,87.15267727930535,1019.8706946454442 67 | 2014-03-07,2014,3,7,-0.6605206073752722,2.6599999999999935,203.0,6.139551699204636,1.143673174258866,87.62111352133044,1017.3121475054231 68 | 2014-03-08,2014,3,8,-1.7470715835141035,6.399999999999992,276.0,6.60462762111353,3.9455531453362296,68.0173535791757,1008.8753434562545 69 | 2014-03-09,2014,3,9,-0.5281995661605274,2.409999999999998,242.0,6.412509038322491,3.36268980477222,75.46854663774404,1013.841503976869 70 | 2014-03-10,2014,3,10,1.3327785817655566,3.579999999999996,287.0,4.054703328509406,4.383357452966713,81.02387843704776,1009.5011577424073 71 | 2014-03-11,2014,3,11,-1.8043415340086808,1.4300000000000008,304.0,4.8578871201157705,4.806295224312597,62.59768451519537,1016.6501447178022 72 | 2014-03-12,2014,3,12,-1.5127259580621915,2.9899999999999975,315.0,4.657989877078808,5.739696312364416,59.54302241503977,1015.6681851048444 73 | 2014-03-13,2014,3,13,-0.8099060014461292,2.1599999999999984,236.0,3.883514099783092,5.425018076644969,64.61388286334056,1007.2801156905276 74 | 2014-03-14,2014,3,14,-4.423516642547047,1.4700000000000009,231.0,6.301374819102756,4.755137481910284,51.36975397973951,996.8008683068016 75 | 2014-03-15,2014,3,15,-1.3587852494577066,12.749999999999973,304.0,4.261026753434572,0.577946493130876,86.34417932031815,972.5409978308008 76 | 2014-03-16,2014,3,16,-6.197686189443234,0.8600000000000005,321.0,4.94707158351409,-1.5629067245119332,67.02819956616052,976.1605206073742 77 | 2014-03-17,2014,3,17,-9.834707158351437,1.5500000000000007,309.0,2.2791033984092532,-4.144685466377427,58.84526391901663,987.0134490238639 78 
| 2014-03-18,2014,3,18,-5.6137382501807656,2.209999999999999,298.0,2.582140274765008,-2.049602313810551,73.04555314533623,993.8686189443238 79 | 2014-03-19,2014,3,19,-9.107519884309465,1.3000000000000007,56.0,3.148373101952272,-1.6751265365148238,54.971800433839476,993.2931308749108 80 | 2014-03-20,2014,3,20,-7.399566160520627,7.849999999999994,208.0,3.241576283441797,-3.5020245842371676,70.6760665220535,999.0910339840912 81 | 2014-03-21,2014,3,21,1.0753434562545219,2.8499999999999943,214.0,5.808098336948669,4.904699927693412,77.07302964569776,985.8986261749848 82 | 2014-03-22,2014,3,22,0.14721619667389726,1.1100000000000008,236.0,3.871655820679684,3.2861171366594295,80.22704266088213,993.825813449024 83 | 2014-03-23,2014,3,23,0.6529284164858996,1.450000000000001,242.0,2.735719450469991,2.433622559652937,88.07881417208966,994.7665220535087 84 | 2014-03-24,2014,3,24,0.10708604483007947,2.2599999999999985,287.0,2.303398409255243,3.8839479392624643,77.45553145336225,1002.3379609544493 85 | 2014-03-25,2014,3,25,-2.524801156905283,1.0200000000000007,62.0,3.920896601590747,4.73405639913232,59.11641359363702,1010.8483731019479 86 | 2014-03-26,2014,3,26,-7.760954446854661,0.6500000000000004,39.0,5.873535791757052,3.2477223427331916,42.172089660159074,1024.6032537960944 87 | 2014-03-27,2014,3,27,-6.016847433116401,0.8000000000000004,354.0,2.0097613882863303,4.552205350686921,47.48300795372379,1021.3710773680403 88 | 2014-03-28,2014,3,28,-5.6824295010845765,0.9100000000000006,354.0,2.2950108459869814,4.093998553868394,46.770065075921906,1024.7621113521416 89 | 2014-03-29,2014,3,29,-4.323138105567618,1.0600000000000007,293.0,3.226753434562541,3.7662328271872694,54.166305133767175,1020.4383947939245 90 | 2014-03-30,2014,3,30,-4.31292517006804,0.9400000000000006,354.0,3.400907029478457,4.687452758881332,52.475434618291764,1009.5324263038576 91 | 
2014-03-31,2014,3,31,-9.402947845805018,1.9900000000000013,287.0,3.0558578987150415,1.1141345427059726,42.31897203325775,1006.8025699168458 92 | 2014-04-01,2014,4,1,-6.3052154195011525,2.2400000000000007,270.0,2.826379440665151,1.0721844293272875,55.095238095238095,1007.3748299319708 93 | 2014-04-02,2014,4,2,-6.472562358276635,2.2799999999999994,281.0,3.834164777021925,0.6560090702947841,56.10204081632653,1007.5414210128483 94 | 2014-04-03,2014,4,3,-3.5836734693877585,2.030000000000001,287.0,2.903401360544222,2.905517762660618,62.289493575207864,1000.0012093726369 95 | 2014-04-04,2014,4,4,-9.900907029478457,1.8200000000000014,276.0,2.925925925925926,2.1999244142101295,38.871504157218446,1009.9219198790581 96 | 2014-04-05,2014,4,5,-4.517762660619793,2.5999999999999974,242.0,5.108541194255486,3.3284958427815594,55.03930461073318,1010.0499622071085 97 | 2014-04-06,2014,4,6,1.4159486016628868,6.469999999999989,326.0,2.4550264550264576,3.623658352229777,85.58730158730158,1002.5798185941054 98 | 2014-04-07,2014,4,7,1.3105064247921372,3.659999999999993,68.0,2.1613756613756596,2.6094482237339394,91.25321239606954,1005.1278911564605 99 | 2014-04-08,2014,4,8,-3.7068783068783038,2.749999999999995,73.0,2.791761148904003,2.4099773242630422,63.82010582010582,1006.4889644746859 100 | 2014-04-09,2014,4,9,-8.1644746787604,2.3100000000000005,84.0,5.738397581254721,1.6588813303099046,46.74829931972789,1009.9311413454271 101 | 2014-04-10,2014,4,10,-6.248072562358292,3.5499999999999963,191.0,4.923053665910807,1.475207860922148,53.43386243386244,1018.8216931216949 102 | 2014-04-11,2014,4,11,-0.6945578231292513,6.379999999999992,203.0,4.730385487528342,3.475812547241132,74.13832199546485,1011.8291761148881 103 | 2014-04-12,2014,4,12,1.9891156462584922,3.7899999999999907,203.0,3.8913076341647734,3.2107331821617593,91.91836734693878,1005.3519274376439 104 | 
2014-04-13,2014,4,13,2.2924414210128594,6.35999999999999,169.0,5.8309901738473195,3.924187452758855,89.25699168556311,994.9401360544157 105 | 2014-04-14,2014,4,14,2.184429327286473,4.589999999999989,264.0,3.8817082388510973,5.023129251700684,82.38548752834467,989.0088435374195 106 | 2014-04-15,2014,4,15,-0.24735649546827806,3.1099999999999905,338.0,3.557703927492442,5.340105740181273,68.35649546827794,1001.0560422960725 107 | 2014-04-16,2014,4,16,-7.33371126228272,1.1300000000000008,236.0,3.6204081632653047,5.037188208616782,39.73015873015873,1016.0853363567633 108 | 2014-04-17,2014,4,17,-0.8619335347432021,1.3900000000000008,231.0,5.655135951661629,5.235196374622361,65.48942598187311,1010.2648791540797 109 | 2014-04-18,2014,4,18,-1.7519637462235655,0.08,276.0,3.9259063444108633,8.639425981873122,48.653323262839876,1008.2438821752269 110 | 2014-04-19,2014,4,19,-0.6768707482993217,0.02,276.0,2.7369614512471685,10.152834467120165,48.02947845804989,1018.9966742252434 111 | 2014-04-20,2014,4,20,0.8561602418745293,0.11999999999999997,45.0,1.720710506424795,10.9501133786848,50.60997732426304,1019.0283446712004 112 | 2014-04-21,2014,4,21,1.585574018126886,0.10999999999999999,214.0,1.8126888217522652,10.602492447129912,54.455438066465256,1015.6947885196355 113 | 2014-04-22,2014,4,22,0.07190332326284118,0.10999999999999999,360.0,2.804456193353475,13.640709969788533,41.610271903323266,1011.6092900302166 114 | 2014-04-23,2014,4,23,-9.58995468277946,0.13,45.0,4.080664652567974,7.085347432024148,28.797583081571,1022.4072507552909 115 | 2014-04-24,2014,4,24,-8.762915407854996,0.3800000000000001,259.0,2.6323262839879145,6.383157099697871,30.893504531722055,1027.0137462235618 116 | 2014-04-25,2014,4,25,-4.709667673716014,0.6200000000000004,281.0,3.05777945619335,8.939425981873098,36.60498489425982,1018.5661631419935 117 | 2014-04-26,2014,4,26,-3.329780801209366,0.6000000000000003,264.0,2.5850340136054455,10.414588057445163,37.57974300831444,1012.9695389266809 118 | 
2014-04-27,2014,4,27,-3.2721088435374095,0.9800000000000005,349.0,2.0486016628873727,12.308390022675738,34.02267573696145,1008.1646258503384 119 | 2014-04-28,2014,4,28,-2.3875377643504536,1.0400000000000005,214.0,3.144561933534743,13.054078549848938,34.16918429003021,1005.3852719033223 120 | 2014-04-29,2014,4,29,-0.48323262839879116,0.6700000000000004,293.0,3.6259063444108834,10.120770392749211,48.144259818731115,1000.6043051359602 121 | 2014-04-30,2014,4,30,-3.7161631419939622,1.4500000000000008,281.0,2.8587613293051333,7.3016616314199405,46.21525679758308,1000.8965256797469 122 | 2014-05-01,2014,5,1,-0.5927492447129924,3.1999999999999944,321.0,3.249471299093657,4.2613293051359555,70.64652567975831,999.0603474320262 123 | 2014-05-02,2014,5,2,-2.659863945578231,8.879999999999994,259.0,3.069992441421013,3.403703703703708,64.7437641723356,1007.630839002273 124 | 2014-05-03,2014,5,3,-1.4541194255479932,2.7299999999999955,264.0,3.427286470143619,4.3761904761904695,65.74527588813304,1009.3109599395243 125 | 2014-05-04,2014,5,4,0.34504913076341676,5.319999999999993,332.0,2.343915343915341,3.9771730914588006,77.69614512471655,1004.9297808012112 126 | 2014-05-05,2014,5,5,-1.0900302114803624,2.2999999999999994,338.0,3.793731117824777,3.9063444108761343,71.33308157099698,1006.3614048338418 127 | 2014-05-06,2014,5,6,-5.5209969788519615,1.460000000000001,214.0,2.9675226586102745,4.729456193353475,47.58761329305136,1010.840558912396 128 | 2014-05-07,2014,5,7,-3.9104768786127195,1.740000000000001,79.0,3.0364884393063623,5.858887283236984,50.320809248554916,1010.0609104046173 129 | 2014-05-08,2014,5,8,2.420824295010843,6.6399999999999775,113.0,5.611713665943603,5.077368040491645,84.41865509761388,1005.1508315256696 130 | 2014-05-09,2014,5,9,7.493853940708592,18.000000000000007,214.0,2.9044830079537256,7.701952277657258,98.22559652928416,1003.0421547360841 131 | 
2014-05-10,2014,5,10,5.915184381778735,1.0900000000000007,281.0,2.0893709327548833,8.008026030368764,87.1120751988431,1000.0871294287819 132 | 2014-05-11,2014,5,11,6.460954446854657,4.689999999999993,101.0,2.923427331887202,8.987635574837261,84.66015907447577,993.9305856832997 133 | 2014-05-12,2014,5,12,6.76404833836862,1.5000000000000009,236.0,2.943655589123865,9.125604229607259,85.52567975830816,996.5829305135969 134 | 2014-05-13,2014,5,13,7.359138972809657,4.839999999999991,293.0,2.8115558912386707,9.776510574018081,85.22129909365559,995.1174471299125 135 | 2014-05-14,2014,5,14,0.17913832199546645,0.30000000000000016,96.0,4.299470899470896,10.526077097505683,51.504157218442934,1005.950188964475 136 | 2014-05-15,2014,5,15,-4.4945046999277,0.19,203.0,3.144613159797546,8.19211858279102,40.78886478669559,1023.7388286333987 137 | 2014-05-16,2014,5,16,-1.49472161966739,0.3000000000000001,208.0,2.873246565437454,8.419522776572647,49.89587852494577,1022.7613882863369 138 | 2014-05-17,2014,5,17,1.5430947216196647,0.26,62.0,1.9156905278380327,11.581634128705705,51.5647143890094,1018.9165582067869 139 | 2014-05-18,2014,5,18,2.884454085321761,0.26000000000000006,107.0,2.8853217642805498,16.617787418655137,42.282718727404195,1015.24338394794 140 | 2014-05-19,2014,5,19,11.38829479768789,0.5700000000000003,214.0,3.329190751445082,19.58836705202314,59.524566473988436,1012.2521676300644 141 | 2014-05-20,2014,5,20,12.477890173410412,1.0100000000000005,163.0,2.089739884393064,18.072326589595352,71.09898843930635,1011.8491329479803 142 | 2014-05-21,2014,5,21,11.634417932031807,4.289999999999995,264.0,1.5488069414316676,15.437237888647859,78.42371655820679,1010.46391901663 143 | 2014-05-22,2014,5,22,8.34302241503979,0.7500000000000003,214.0,2.8127982646420793,17.67360809833692,57.346348517715114,1013.7945770065073 144 | 2014-05-23,2014,5,23,10.868257411424452,0.7000000000000003,146.0,1.9250903832248736,18.90592913955172,60.691973969631235,1015.3226319595037 145 | 
2014-05-24,2014,5,24,12.044468546637757,0.5000000000000001,96.0,2.1405639913232117,22.079031091829382,54.02603036876356,1012.030513376726 146 | 2014-05-25,2014,5,25,12.673174258857559,0.7600000000000005,236.0,2.3218365871294284,21.067172812725918,59.2530730296457,1009.2145336225622 147 | 2014-05-26,2014,5,26,10.04934971098264,0.8300000000000005,287.0,3.8908236994219636,18.345231213872818,60.533236994219656,1009.4911849711011 148 | 2014-05-27,2014,5,27,4.431164135936362,5.009999999999994,6.0,3.9775849602313764,7.402386117136706,81.8293564714389,1013.4475777295739 149 | 2014-05-28,2014,5,28,3.6571428571428513,7.509999999999991,360.0,4.806268221574353,6.392638483965013,82.91472303206997,1012.2820699708457 150 | 2014-05-29,2014,5,29,7.026247288503244,6.049999999999979,360.0,3.939696312364431,8.671655820679666,89.55386840202459,1005.1749096167844 151 | 2014-05-30,2014,5,30,8.012436731742527,2.1000000000000005,152.0,2.088069414316706,10.816485900216884,83.87201735357918,1004.7462762111356 152 | 2014-05-31,2014,5,31,7.35979754157627,2.899999999999997,219.0,2.8611713665943572,10.848011569052757,79.45697758496023,1007.7946493130856 153 | 2014-06-01,2014,6,1,6.594070860448253,2.1500000000000004,343.0,2.1509038322487304,12.970137382501807,66.59652928416486,1011.507953723786 154 | 2014-06-02,2014,6,2,8.575632682574069,2.7299999999999947,51.0,3.6221258134490175,14.474114244396223,68.20462762111352,1014.6214027476533 155 | 2014-06-03,2014,6,3,12.03326102675345,5.129999999999988,96.0,3.418872017353578,14.430730296457002,85.90672451193059,1012.0966015907516 156 | 2014-06-04,2014,6,4,13.04699927693423,1.3000000000000007,124.0,2.2621113521330414,18.589660159074427,72.29718004338395,1012.8825018076692 157 | 2014-06-05,2014,6,5,9.362111352133033,0.9200000000000005,79.0,2.9715835140997795,19.98156182212578,51.409978308026034,1011.2708604483042 158 | 2014-06-06,2014,6,6,12.847433116413606,2.9899999999999967,152.0,1.6519884309472177,18.53571945046999,70.32610267534345,1011.304121475056 
159 | 2014-06-07,2014,6,7,10.689587852494576,2.5099999999999976,248.0,2.4140997830802577,15.707953723788867,72.83947939262472,1010.3221981200292 160 | 2014-06-08,2014,6,8,9.190238611713681,3.6499999999999932,214.0,2.102096890817066,14.917932031814905,69.06218365871294,1009.2058568329775 161 | 2014-06-09,2014,6,9,11.008749096167763,1.8400000000000012,309.0,2.431525668835866,16.94316702819957,68.99132321041215,1008.0652928416498 162 | 2014-06-10,2014,6,10,12.838539407086074,5.409999999999996,23.0,2.2990600144613214,15.918510484454055,82.4468546637744,1010.7151843817807 163 | 2014-06-11,2014,6,11,9.987924801156856,2.120000000000001,219.0,2.5323933477946516,15.9580621836587,68.74331164135937,1014.9247288503285 164 | 2014-06-12,2014,6,12,11.110918293564708,17.889999999999983,51.0,2.241648590021689,13.098120028922635,87.99132321041215,1005.0506869125146 165 | 2014-06-13,2014,6,13,11.267100506146049,10.829999999999991,338.0,2.5788141720896642,13.056905278380277,89.0824295010846,996.0117136659535 166 | 2014-06-14,2014,6,14,7.476500361532888,1.2100000000000006,6.0,5.218076644974694,12.430079537237862,72.67751265365148,1002.6017353579136 167 | 2014-06-15,2014,6,15,4.505350686912513,0.5300000000000002,276.0,2.881995661605201,13.202892263195935,56.76211135213305,1008.3281995661556 168 | 2014-06-16,2014,6,16,4.99450469992769,3.539999999999993,304.0,3.7497469269703547,11.561677512653619,65.19739696312364,1002.7045553145331 169 | 2014-06-17,2014,6,17,-1.9516268980477258,1.3700000000000008,264.0,4.8084598698481535,7.71048445408532,52.14750542299349,1002.3342010122924 170 | 2014-06-18,2014,6,18,5.090527838033258,2.149999999999999,281.0,3.8828633405639885,11.247288503253795,67.44830079537238,998.5558929862583 171 | 2014-06-19,2014,6,19,6.228778018799724,2.010000000000001,309.0,3.2276211135213266,12.421836587129421,67.51482284887925,994.830513376714 172 | 
2014-06-20,2014,6,20,5.793347794649312,7.289999999999991,304.0,2.4674620390455533,11.24020245842369,70.86261749819234,993.0475777295655 173 | 2014-06-21,2014,6,21,2.708170643528566,5.839999999999993,264.0,3.2716558206796837,8.82740419378161,66.14605929139552,996.4812725958004 174 | 2014-06-22,2014,6,22,3.5525668835863957,7.749999999999989,349.0,2.7307302964569793,8.856182212581349,70.4287780187997,999.3099060014488 175 | 2014-06-23,2014,6,23,5.049457700650756,1.4900000000000009,23.0,2.1998553868402033,9.645263919016639,74.22415039768619,999.5950108459866 176 | 2014-06-24,2014,6,24,7.340130151843816,5.759999999999997,28.0,1.7727404193781633,11.64916847433112,75.57556037599421,1002.7688358640638 177 | 2014-06-25,2014,6,25,4.610339840925522,0.6800000000000004,315.0,2.7631236442516283,12.151916124367293,61.01301518438178,1004.7045553145247 178 | 2014-06-26,2014,6,26,6.037816341287055,2.509999999999995,73.0,1.8097613882863344,11.987707881417249,67.44323933477946,1008.366232827178 179 | 2014-06-27,2014,6,27,5.52603036876355,1.1200000000000006,231.0,1.9383224873463507,12.68908170643531,62.88864786695589,1011.7185104844546 180 | 2014-06-28,2014,6,28,5.455242227042657,0.4800000000000002,113.0,2.3599421547360846,14.574114244396261,56.57772957339118,1012.2063629790335 181 | 2014-06-29,2014,6,29,10.109399855386844,13.349999999999977,17.0,4.075126536514819,12.63434562545204,85.20390455531454,1004.9251626898055 182 | 2014-06-30,2014,6,30,10.974403470715808,2.9399999999999897,321.0,2.5895878524945792,13.207736804049166,86.75921908893709,999.2373101952257 183 | 2014-07-01,2014,7,1,9.359508315256653,3.109999999999993,315.0,1.981561822125811,12.982791033984128,79.1294287780188,1004.0124367317468 184 | 2014-07-02,2014,7,2,9.24642082429504,1.0000000000000004,293.0,3.420028922631954,14.271294287780181,73.15835140997831,1003.0897324656504 185 | 2014-07-03,2014,7,3,10.046637744034756,0.6600000000000005,197.0,4.996023138105568,14.397975415762799,75.83297180043384,1004.8140997830775 186 | 
2014-07-04,2014,7,4,11.007664497469289,1.3900000000000003,264.0,3.9080260303687715,16.467100506146043,71.80260303687636,1004.5708604483043 187 | 2014-07-05,2014,7,5,8.913015184381758,0.3600000000000002,107.0,1.9154013015184366,16.684526391901645,62.67751265365148,1009.1447577729612 188 | 2014-07-06,2014,7,6,7.568112798264622,0.27,68.0,3.1727404193781603,18.38604483007954,51.646420824295014,1012.5725958062221 189 | 2014-07-07,2014,7,7,11.540708604483022,0.31000000000000005,101.0,3.641287057122196,20.210701373825003,58.478669558929866,1016.1485177151131 190 | 2014-07-08,2014,7,8,11.805784526391923,0.1,51.0,2.850542299349243,21.163557483730997,55.77584960231381,1016.1978308026023 191 | 2014-07-09,2014,7,9,10.887129428778003,0.25000000000000006,113.0,1.650108459869848,23.549240780911088,46.219088937093275,1013.0092552422261 192 | 2014-07-10,2014,7,10,10.840057845263889,6.269999999999995,84.0,4.843817787418655,16.222776572668124,70.86261749819234,1015.0577729573414 193 | 2014-07-11,2014,7,11,7.327693420101226,0.8600000000000004,73.0,3.271655820679683,15.306290672451182,60.11496746203905,1017.8294287780135 194 | 2014-07-12,2014,7,12,9.555459146782349,0.7500000000000004,34.0,3.796746203904555,17.23716558206799,61.71872740419378,1014.0154013015168 195 | 2014-07-13,2014,7,13,14.230441070137433,0.6300000000000003,96.0,3.8450469992769367,20.627910339840952,67.09978308026031,1007.1804772234315 196 | 2014-07-14,2014,7,14,14.373246565437489,2.0300000000000007,107.0,2.249891540130152,19.075921908893672,75.1995661605206,1006.873608098339 197 | 2014-07-15,2014,7,15,14.021330441070068,2.9999999999999942,248.0,3.2856832971800483,18.23521330441067,76.81489515545914,1008.1829356471395 198 | 2014-07-16,2014,7,16,13.542010122921154,3.359999999999996,236.0,3.3736804049168434,16.917064352856098,81.39913232104121,1010.2378886478634 199 | 2014-07-17,2014,7,17,13.033622559652963,1.3200000000000007,236.0,2.0610267534345645,18.484598698481587,72.23716558206797,1013.1420824294983 200 | 
2014-07-18,2014,7,18,14.38091106290674,1.6200000000000012,259.0,1.680766449746927,19.837382501807657,72.93637020968909,1013.8454085321715 201 | 2014-07-19,2014,7,19,11.531308749096157,0.8600000000000004,90.0,1.8006507592190895,21.190817064352792,56.184381778741866,1013.8029645697757 202 | 2014-07-20,2014,7,20,12.800578452639177,1.1700000000000006,360.0,1.8049891540130147,21.376644974692702,59.95806218365871,1012.554374548085 203 | 2014-07-21,2014,7,21,11.050325379609548,2.040000000000001,338.0,2.3143167028199576,21.485394070860405,53.6941431670282,1013.3949385394067 204 | 2014-07-22,2014,7,22,10.64193781634126,1.7900000000000011,349.0,2.442805495300072,22.662183658712923,48.28054953000723,1017.1540130151849 205 | 2014-07-23,2014,7,23,11.823355025307293,2.1000000000000005,309.0,2.495444685466377,25.056760665220555,44.96167751265365,1017.1310918293651 206 | 2014-07-24,2014,7,24,14.321981200289217,2.19,107.0,2.229718004338394,23.43087490961676,57.07592190889371,1015.6107013738301 207 | 2014-07-25,2014,7,25,14.525813449023845,1.3900000000000008,253.0,1.9162689804772206,24.080404916847378,57.59436008676789,1015.8451916124307 208 | 2014-07-26,2014,7,26,14.685177151120735,1.570000000000001,287.0,2.2077368040491727,25.591323210412146,52.995661605206074,1014.5009399855422 209 | 2014-07-27,2014,7,27,14.466232827187293,1.610000000000001,107.0,1.7006507592190896,25.284381778741857,52.56182212581345,1013.7950108459856 210 | 2014-07-28,2014,7,28,16.473897324656537,2.23,219.0,2.4004338394793927,24.975632682574034,61.51048445408532,1009.1653651482304 211 | 2014-07-29,2014,7,29,17.51373825018074,3.2699999999999956,68.0,2.450831525668837,23.648590021691973,69.26536514822848,1004.4346348517703 212 | 2014-07-30,2014,7,30,17.929562594268482,3.12,152.0,1.2158371040723985,20.819457013574656,84.5444947209653,1004.2196078431346 213 | 2014-07-31,2014,7,31,,,,,,, 214 | 
2014-08-01,2014,8,1,13.152101910828055,0.9400000000000006,231.0,4.8831847133757975,22.07630573248405,57.6140127388535,1007.4619108280264 215 | 2014-08-02,2014,8,2,14.28105567606652,1.9300000000000015,146.0,2.572017353579176,21.800216919739704,65.02386117136659,1013.6906001446132 216 | 2014-08-03,2014,8,3,15.62979031091825,1.7800000000000011,107.0,4.11344902386117,22.14540853217642,68.30874909616774,1016.0543745480786 217 | 2014-08-04,2014,8,4,14.966738973246597,1.560000000000001,141.0,3.9646420824295006,23.904049168474298,58.475054229934926,1014.946782357189 218 | 2014-08-05,2014,8,5,15.600723065798993,1.320000000000001,113.0,3.8539407086044806,23.91807664497461,61.211858279103396,1012.8858279103459 219 | 2014-08-06,2014,8,6,15.772089660159073,1.440000000000001,129.0,3.99125090383225,23.819595083152578,61.73246565437455,1011.7819233550315 220 | 2014-08-07,2014,8,7,15.843094721619662,10.61999999999999,281.0,2.283297180043385,23.296456977584967,64.24801156905278,1009.0381778741813 221 | 2014-08-08,2014,8,8,14.666883586406412,1.8000000000000012,248.0,3.024801156905276,21.34041937816341,66.77874186550976,1007.3676066521975 222 | 2014-08-09,2014,8,9,13.669920462762136,1.4200000000000008,264.0,2.3088937093275503,20.98770788141721,63.7527114967462,1005.8887201735394 223 | 2014-08-10,2014,8,10,14.500216919739728,2.6699999999999973,163.0,2.5984815618221253,21.416413593637028,66.29862617498192,1006.3985538684001 224 | 2014-08-11,2014,8,11,16.81359363702097,4.089999999999999,219.0,2.9636297903109163,21.94822848879243,73.3058568329718,1002.8890817064404 225 | 2014-08-12,2014,8,12,13.514316702819949,8.90999999999998,214.0,3.5791033984092544,20.68214027476501,66.34417932031815,998.6686912509014 226 | 2014-08-13,2014,8,13,11.244323933477942,6.039999999999989,197.0,3.6143890093998485,17.789370932754867,66.41287057122199,999.9054229934939 227 | 2014-08-14,2014,8,14,12.255531453362252,8.279999999999982,321.0,2.6481561822125785,17.394721619667344,72.2002892263196,999.1882140274764 
228 | 2014-08-15,2014,8,15,12.082357194504699,4.949999999999994,208.0,2.3276934201012307,17.47838033261028,71.66666666666667,997.7898770788153 229 | 2014-08-16,2014,8,16,12.293058568329725,4.209999999999992,309.0,1.9378163412870582,17.94851771511205,70.83007953723789,996.6845986984794 230 | 2014-08-17,2014,8,17,9.36840202458422,3.069999999999994,236.0,2.565148228488791,16.852783803326012,63.11713665943601,995.6255242226996 231 | 2014-08-18,2014,8,18,12.309544468546623,9.469999999999976,174.0,4.594287780188004,16.25422993492403,77.99349240780911,994.4741865509718 232 | 2014-08-19,2014,8,19,11.514533622559659,21.570000000000032,208.0,4.889370932754878,14.932176428054897,80.30151843817788,993.9274765003654 233 | 2014-08-20,2014,8,20,10.384164859002176,29.810000000000002,203.0,4.925958062183653,13.891467823571933,79.77584960231381,996.9176428054905 234 | 2014-08-21,2014,8,21,10.536370209689123,1.6400000000000006,203.0,4.476717281272594,14.772089660159114,76.82501807664498,999.1086767895788 235 | 2014-08-22,2014,8,22,9.479971077368077,0.7200000000000004,253.0,3.657339117859724,14.28286334056404,73.25162689804772,1001.1279103398393 236 | 2014-08-23,2014,8,23,9.481995661605213,14.469999999999995,107.0,2.858423716558204,13.08937093275491,79.6290672451193,1003.526391901661 237 | 2014-08-24,2014,8,24,9.997107736804073,0.5100000000000002,96.0,2.8350686912509007,14.482212581344871,75.75198843094722,1003.6538684020232 238 | 2014-08-25,2014,8,25,11.242443962400614,7.639999999999993,343.0,2.313159797541574,13.438322487346289,86.73825018076646,998.9007953723809 239 | 2014-08-26,2014,8,26,11.490021691973954,12.679999999999987,304.0,2.4140274765003613,14.33304410701376,83.66522053506868,988.2781634128692 240 | 2014-08-27,2014,8,27,11.477512653651456,14.379999999999997,293.0,2.5516268980477217,14.090600144613141,84.75777295733911,990.3383947939291 241 | 
2014-08-28,2014,8,28,11.21258134490238,2.479999999999999,326.0,2.6305133767172815,14.597252349963844,80.66666666666667,998.0780187997092 242 | 2014-08-29,2014,8,29,9.61182015953588,3.549999999999999,354.0,2.8254532269760673,12.978027556200132,80.42712110224801,1004.694126178389 243 | 2014-08-30,2014,8,30,5.831742588575554,0.060000000000000005,51.0,2.1695589298626183,12.384381778741872,66.50253073029646,1011.0173535791776 244 | 2014-08-31,2014,8,31,6.009255242227032,0.01,45.0,2.6693420101229175,12.251626898047697,66.21258134490239,1016.5943600867731 245 | 2014-09-01,2014,9,1,7.276066522053531,0.10999999999999999,349.0,2.110484454085321,11.827693420101216,74.0759219088937,1018.7464931308803 246 | 2014-09-02,2014,9,2,7.796312364425166,0.16,270.0,1.7259580621836603,13.057917570498875,71.44179320318149,1019.4466377440443 247 | 2014-09-03,2014,9,3,11.869775849602291,0.08,259.0,3.2468546637743954,16.099783080260334,76.38394793926247,1021.1474331164154 248 | 2014-09-04,2014,9,4,12.32480115690529,0.17,264.0,2.935430224150399,15.47093275488067,82.21475054229936,1021.4612436731787 249 | 2014-09-05,2014,9,5,10.67368040491679,0.16,287.0,3.7906001446131476,16.691540130151797,68.32682574114244,1016.627042660884 250 | 2014-09-06,2014,9,6,10.514461315979728,0.15,287.0,1.7625451916124366,15.135068691250948,74.68112798264642,1013.7083152566911 251 | 2014-09-07,2014,9,7,9.811569052783808,0.18000000000000002,118.0,1.6361532899493856,14.97859725234993,73.0643528561099,1014.2586406363041 252 | 2014-09-08,2014,9,8,12.40759219088934,0.12999999999999998,174.0,2.3070860448300765,15.974692697035406,80.14605929139552,1013.3237165582078 253 | 2014-09-09,2014,9,9,13.002747650036145,11.069999999999999,315.0,2.3201735357917577,16.297324656543818,81.17859725234996,1008.2400578452645 254 | 2014-09-10,2014,9,10,12.389443239334787,0.4000000000000002,28.0,1.5960954446854672,14.899638467100553,85.45336225596529,1011.0167751265323 255 | 
2014-09-11,2014,9,11,12.747360809833687,0.23000000000000007,315.0,1.5814895155459132,14.954953000723107,87.35502530730297,1016.7212581345019 256 | 2014-09-12,2014,9,12,12.297180043383909,0.17,11.0,1.6253073029645706,15.490527838033291,82.03904555314534,1018.1315979754146 257 | 2014-09-13,2014,9,13,8.997541576283444,0.12999999999999998,6.0,2.0752711496746197,14.295372378886498,71.92190889370933,1022.4691973969669 258 | 2014-09-14,2014,9,14,8.622270426608802,0.08,315.0,1.5951554591467842,13.629428778018765,72.705712219812,1024.7137382501844 259 | 2014-09-15,2014,9,15,7.885972523499643,0.13999999999999999,17.0,1.6890093998553866,14.120318148951572,68.09616775126537,1024.449530007236 260 | 2014-09-16,2014,9,16,6.997035430224191,0.03,96.0,1.4719450469992765,13.04577006507588,67.23644251626898,1024.2998553868354 261 | 2014-09-17,2014,9,17,8.749382716049386,0.03,39.0,0.7337448559670783,11.801646090534991,81.64197530864197,1023.8395061728368 262 | 2014-09-18,2014,9,18,8.461917998610174,0.04,214.0,1.4972897845726207,14.024808895065974,69.99722029186935,1020.0150104239109 263 | 2014-09-19,2014,9,19,10.698611111111088,0.05,174.0,1.513472222222223,15.050208333333341,76.30625,1015.9753472222212 264 | 2014-09-20,2014,9,20,11.27895878524947,0.04,203.0,2.085683297180043,15.249530007230664,77.73174258857556,1010.7271149674539 265 | 2014-09-21,2014,9,21,11.20216919739695,0.12999999999999998,107.0,3.17353579175705,14.824005784526394,79.23716558206797,1002.687997107735 266 | 2014-09-22,2014,9,22,8.609182935647155,18.080000000000023,360.0,3.1512653651482276,10.431742588575561,88.66015907447577,996.8324656543787 267 | 2014-09-23,2014,9,23,-0.9803326102675335,0.8400000000000005,304.0,5.825596529284173,3.454808387563265,72.87924801156905,998.3402747650072 268 | 2014-09-24,2014,9,24,-0.10809833694865857,8.969999999999997,180.0,4.2403470715835105,5.230947216196676,69.77368040491685,1000.897541576283 269 | 
2014-09-25,2014,9,25,8.582791033984076,9.23999999999998,124.0,4.029862617498192,11.383658712942877,83.23138105567607,998.07483731019 270 | 2014-09-26,2014,9,26,8.677657266811282,4.73,236.0,4.512942877801883,11.631959508315303,82.32827187274042,1001.4859002169184 271 | 2014-09-27,2014,9,27,5.842950108459879,2.8299999999999965,203.0,4.750469992769353,11.62241503976859,69.21113521330442,997.5496746203883 272 | 2014-09-28,2014,9,28,8.01887201735363,0.17,276.0,4.135430224150401,12.639479392624748,73.89009399855387,1003.7044107013648 273 | 2014-09-29,2014,9,29,5.831959508315265,0.09999999999999999,293.0,3.547071583514096,12.13412870571218,67.3058568329718,1005.3083152566915 274 | 2014-09-30,2014,9,30,2.30625000000001,0.07,354.0,4.183055555555561,8.728055555555532,64.81597222222223,1014.9063194444418 275 | 2014-10-01,2014,10,1,1.9230555555555549,0.18000000000000002,231.0,1.8154861111111122,7.101597222222199,70.98125,1025.9577083333252 276 | 2014-10-02,2014,10,2,5.219791666666657,0.3100000000000001,287.0,2.7840277777777773,9.808055555555535,73.31944444444444,1024.46125 277 | 2014-10-03,2014,10,3,8.293680555555536,0.30000000000000004,219.0,1.6922222222222212,11.492500000000035,80.90347222222222,1023.3570138888842 278 | 2014-10-04,2014,10,4,9.370069444444445,1.810000000000001,96.0,2.0505555555555555,10.966458333333327,90.05277777777778,1023.132499999995 279 | 2014-10-05,2014,10,5,2.9764583333333317,0.2800000000000001,129.0,3.7055555555555504,7.36243055555557,74.08611111111111,1028.0081250000017 280 | 2014-10-06,2014,10,6,1.1417361111111117,0.24000000000000005,129.0,4.970069444444444,7.330902777777659,65.01319444444445,1032.2793750000108 281 | 2014-10-07,2014,10,7,1.2966666666666693,0.17,158.0,5.982916666666672,8.76083333333339,59.44166666666667,1026.5363888888871 282 | 2014-10-08,2014,10,8,4.8279861111111,6.2099999999999955,208.0,4.995000000000005,8.396944444444536,78.82291666666667,1010.453541666669 283 | 
2014-10-09,2014,10,9,8.668680555555579,0.18000000000000002,129.0,3.1016666666666644,10.35493055555553,89.47430555555556,1004.6247916666728 284 | 2014-10-10,2014,10,10,10.899166666666678,0.4900000000000002,242.0,3.162291666666664,11.874166666666701,93.93055555555556,1002.0927777777799 285 | 2014-10-11,2014,10,11,8.305763888888873,0.23000000000000007,264.0,2.967152777777777,11.233888888888874,82.56875,1007.3647222222243 286 | 2014-10-12,2014,10,12,6.644305555555544,0.4200000000000002,332.0,1.5545833333333359,7.704097222222174,93.17013888888889,1009.2017361111128 287 | 2014-10-13,2014,10,13,6.271666666666692,2.9699999999999993,79.0,1.8076388888888901,7.521041666666594,91.92361111111111,1007.7183333333388 288 | 2014-10-14,2014,10,14,7.109444444444432,0.6900000000000004,203.0,1.9265277777777776,8.482986111111076,91.24722222222222,1005.9852083333333 289 | 2014-10-15,2014,10,15,0.6266666666666696,0.29000000000000015,62.0,3.4712500000000004,3.75638888888889,80.01527777777778,1009.163402777778 290 | 2014-10-16,2014,10,16,-2.9529861111111066,0.20000000000000004,62.0,3.8306944444444464,2.2545138888888885,67.19444444444444,1009.3356249999948 291 | 2014-10-17,2014,10,17,-4.464166666666699,0.26000000000000006,326.0,1.8437499999999991,1.0443055555555563,64.79583333333333,1011.4043055555559 292 | 2014-10-18,2014,10,18,-4.5424999999999995,0.3200000000000001,197.0,1.4656944444444453,1.0659722222222239,64.12430555555555,1018.2209722222238 293 | 2014-10-19,2014,10,19,5.385694444444463,16.009999999999973,270.0,3.90333333333333,7.173749999999969,88.86597222222223,1000.6229166666637 294 | 2014-10-20,2014,10,20,5.437847222222231,0.43000000000000016,62.0,2.4784722222222237,6.987152777777737,90.03402777777778,989.0395138888879 295 | 2014-10-21,2014,10,21,-3.2974305555555463,0.4200000000000002,56.0,3.5086805555555545,0.6627777777777768,72.86666666666666,1003.1163888888899 296 | 
2014-10-22,2014,10,22,-8.299444444444449,0.38000000000000017,51.0,4.616805555555554,-2.346180555555563,58.922222222222224,1016.3781944444418 297 | 2014-10-23,2014,10,23,-6.939097222222205,0.6600000000000004,124.0,4.690555555555557,-2.2015972222221976,65.5875,1027.2763888888871 298 | 2014-10-24,2014,10,24,-6.920277777777782,0.5200000000000002,214.0,5.978402777777776,0.1963888888888872,56.56319444444444,1026.3215277777772 299 | 2014-10-25,2014,10,25,-1.260208333333333,0.34000000000000014,236.0,5.652013888888881,3.3275694444444506,71.06041666666667,1017.8127083333345 300 | 2014-10-26,2014,10,26,4.178541666666661,1.400000000000001,214.0,6.123749999999997,6.474999999999978,85.57152777777777,1012.1859722222277 301 | 2014-10-27,2014,10,27,8.069097222222231,1.1300000000000008,225.0,6.71534722222222,9.997916666666546,87.88055555555556,1005.2255555555598 302 | 2014-10-28,2014,10,28,8.452569444444388,1.3800000000000008,219.0,5.946111111111121,10.472708333333257,87.50555555555556,1005.4192361111106 303 | 2014-10-29,2014,10,29,5.023958333333328,1.0400000000000007,248.0,6.924375000000009,9.014583333333345,76.44236111111111,1002.5388888888915 304 | 2014-10-30,2014,10,30,2.6024305555555567,0.2700000000000001,264.0,2.877013888888887,6.187638888888901,78.49930555555555,1010.5399305555491 305 | 2014-10-31,2014,10,31,0.5927083333333338,0.4100000000000002,326.0,3.04013888888889,4.386111111111112,77.09722222222223,1015.3755555555506 306 | 2014-11-01,2014,11,1,-3.051458333333338,0.3000000000000001,163.0,2.732222222222222,1.6996527777777768,70.01805555555555,1019.668541666664 307 | 2014-11-02,2014,11,2,4.034166666666655,9.249999999999961,231.0,5.8986111111111175,5.7365277777777575,88.87777777777778,1007.124652777779 308 | 2014-11-03,2014,11,3,7.732847222222257,6.319999999999989,214.0,6.0450000000000035,8.245486111111115,96.7388888888889,995.3088194444434 309 | 
2014-11-04,2014,11,4,7.531319444444495,3.1199999999999934,174.0,5.3240972222222105,9.159722222222204,89.82847222222222,991.3978472222249 310 | 2014-11-05,2014,11,5,0.01659722222222371,0.7000000000000005,62.0,4.019444444444442,3.953541666666636,74.6,1000.258402777779 311 | 2014-11-06,2014,11,6,-0.9057638888888957,15.27999999999999,124.0,4.42,0.8269444444444412,87.53125,1008.3648611111159 312 | 2014-11-07,2014,11,7,1.0474305555555528,3.599999999999982,6.0,2.8004166666666648,2.1328472222222326,92.67708333333333,1001.884791666663 313 | 2014-11-08,2014,11,8,-0.6663194444444432,1.0900000000000007,293.0,2.740694444444443,1.306805555555557,86.42986111111111,1005.4300694444456 314 | 2014-11-09,2014,11,9,4.303888888888872,1.440000000000001,203.0,3.0006944444444423,5.743888888888918,90.4798611111111,1013.0588888888908 315 | 2014-11-10,2014,11,10,3.626180555555548,1.2400000000000009,169.0,3.717152777777775,5.374722222222228,88.60902777777778,1013.1176388888882 316 | 2014-11-11,2014,11,11,6.720208333333326,2.8999999999999915,219.0,4.198125000000006,7.7012499999999795,93.64236111111111,1007.5827083333375 317 | 2014-11-12,2014,11,12,3.9397222222222354,2.5199999999999987,360.0,1.8148611111111106,6.780694444444487,82.30347222222223,1016.003541666665 318 | 2014-11-13,2014,11,13,0.9808333333333367,3.5399999999999943,62.0,3.0709027777777793,3.1159722222222443,85.96875,1022.5429861111103 319 | 2014-11-14,2014,11,14,-0.742916666666662,3.4099999999999944,17.0,2.924722222222221,1.7786111111111147,82.9798611111111,1025.739305555553 320 | 2014-11-15,2014,11,15,-0.9076388888888856,4.159999999999993,73.0,3.3458333333333314,1.0438194444444466,86.2701388888889,1024.9084722222149 321 | 2014-11-16,2014,11,16,0.17951388888888686,3.7299999999999947,129.0,2.668194444444445,2.880833333333316,82.74236111111111,1023.3609722222235 322 | 2014-11-17,2014,11,17,-1.4822222222222439,2.879999999999995,107.0,3.8184722222222183,2.1513888888888726,75.9576388888889,1023.6126388888841 323 | 
2014-11-18,2014,11,18,-0.5023611111111103,1.8200000000000014,51.0,2.497291666666671,1.9779166666666546,83.37291666666667,1025.8314583333347 324 | 2014-11-19,2014,11,19,-1.886805555555541,1.8400000000000014,79.0,3.2674305555555563,0.9844444444444472,80.10833333333333,1027.8902777777746 325 | 2014-11-20,2014,11,20,-2.8788888888888975,3.0799999999999974,51.0,2.5171527777777825,-0.987777777777784,84.66527777777777,1029.1649305555527 326 | 2014-11-21,2014,11,21,-2.869861111111139,8.629999999999987,326.0,2.5224305555555526,-1.508055555555552,88.0125,1022.1430555555538 327 | 2014-11-22,2014,11,22,-2.972222222222205,0.24000000000000007,158.0,1.3816666666666642,-1.5420138888888915,87.49097222222223,1018.7880555555596 328 | 2014-11-23,2014,11,23,-1.0833333333333255,0.6500000000000004,163.0,1.287013888888888,0.3049999999999997,89.21597222222222,1025.955972222222 329 | 2014-11-24,2014,11,24,-0.7165972222222218,2.2800000000000002,169.0,4.453888888888884,1.1613194444444426,86.69097222222223,1021.8502083333301 330 | 2014-11-25,2014,11,25,0.7094444444444422,3.4199999999999973,146.0,4.535069444444442,2.201666666666652,89.77986111111112,1017.656111111115 331 | 2014-11-26,2014,11,26,2.5935416666666593,2.249999999999999,231.0,3.181944444444445,3.565972222222216,93.44305555555556,1019.597638888897 332 | 2014-11-27,2014,11,27,1.297986111111114,2.1800000000000015,231.0,2.248194444444445,3.0599305555555505,88.28611111111111,1020.5852777777745 333 | 2014-11-28,2014,11,28,0.31652777777777774,2.529999999999998,242.0,2.2569444444444473,2.3031250000000125,86.82638888888889,1026.4057638888864 334 | 2014-11-29,2014,11,29,-0.6068749999999995,2.1399999999999992,124.0,2.0579861111111075,0.6509722222222175,90.93333333333334,1028.7434722222138 335 | 2014-11-30,2014,11,30,-4.344791666666685,0.2900000000000001,276.0,1.9373611111111173,-1.7093055555555476,78.93194444444444,1027.285555555556 336 | 
2014-12-01,2014,12,1,-5.224861111111126,0.3900000000000002,214.0,1.5809722222222222,-3.152013888888879,81.40347222222222,1025.8839583333306 337 | 2014-12-02,2014,12,2,-2.498958333333331,0.9100000000000005,236.0,4.294861111111113,0.028819444444447524,81.16527777777777,1017.0970138888867 338 | 2014-12-03,2014,12,3,-0.6606249999999978,1.9100000000000015,231.0,5.058750000000004,3.1463194444444533,75.78888888888889,1011.0161805555537 339 | 2014-12-04,2014,12,4,1.2163888888888876,2.639999999999997,287.0,4.009097222222213,3.8825694444444308,82.89722222222223,1011.9537499999938 340 | 2014-12-05,2014,12,5,1.2993055555555584,2.140000000000001,208.0,3.390902777777775,2.39055555555556,92.52083333333333,1016.3568749999997 341 | 2014-12-06,2014,12,6,0.8255555555555583,6.589999999999985,270.0,4.5122916666666635,2.115694444444445,91.2611111111111,1005.9534027777769 342 | 2014-12-07,2014,12,7,1.135625000000002,1.6100000000000012,191.0,4.438125000000006,3.8727777777778045,82.34652777777778,1004.6529166666636 343 | 2014-12-08,2014,12,8,1.4829861111111151,13.099999999999971,214.0,5.595972222222218,2.7988194444444403,91.19513888888889,1000.3276388888921 344 | 2014-12-09,2014,12,9,1.6967361111111132,1.8400000000000014,225.0,3.589930555555559,3.752847222222215,86.79305555555555,1007.8330555555593 345 | 2014-12-10,2014,12,10,-0.07291666666666645,4.409999999999985,214.0,7.15347222222222,3.1496527777777676,79.16388888888889,1004.8761111111136 346 | 2014-12-11,2014,12,11,0.7175694444444525,3.2899999999999916,174.0,6.625208333333328,2.502638888888884,87.95277777777778,996.0193055555567 347 | 2014-12-12,2014,12,12,-0.08854166666666785,5.11999999999999,141.0,5.130347222222225,1.9087500000000193,86.47638888888889,990.2571527777784 348 | 2014-12-13,2014,12,13,-0.43333333333333307,12.169999999999987,248.0,4.836805555555558,1.2884722222222154,88.04513888888889,977.367222222222 349 | 
2014-12-14,2014,12,14,-0.516666666666666,0.18000000000000002,219.0,2.7937500000000015,2.062777777777782,82.73263888888889,996.7413888888833 350 | 2014-12-15,2014,12,15,0.41965277777777954,,158.0,6.582638888888894,2.6641666666666697,85.13958333333333,1000.913333333336 351 | 2014-12-16,2014,12,16,0.18590277777777683,,146.0,4.9833333333333405,3.157916666666662,81.08819444444444,994.0872916666696 352 | 2014-12-17,2014,12,17,-0.0995138888888907,6.329999999999994,309.0,3.3100000000000067,1.8676388888888933,86.39236111111111,989.1411111111195 353 | 2014-12-18,2014,12,18,-1.1642361111111226,3.5699999999999967,242.0,3.468819444444448,1.1683333333333337,83.3875,990.5302777777822 354 | 2014-12-19,2014,12,19,2.1677083333333282,8.13999999999998,248.0,4.041944444444457,3.8442361111111047,89.24791666666667,979.7505555555532 355 | 2014-12-20,2014,12,20,-1.6064583333333302,0.4400000000000002,248.0,3.536597222222224,1.9394444444444408,76.37361111111112,978.47125 356 | 2014-12-21,2014,12,21,-3.468958333333322,2.159999999999997,270.0,4.445277777777784,0.07256944444444459,74.79097222222222,987.518958333331 357 | 2014-12-22,2014,12,22,-3.952488687782811,7.709999999999997,45.0,1.7244343891402707,-1.5330316742081438,80.44796380090497,989.0441930618435 358 | 2014-12-23,2014,12,23,-7.540763888888893,0.38000000000000017,39.0,2.398263888888892,-6.4021527777777845,85.16111111111111,985.2581944444399 359 | 2014-12-24,2014,12,24,-6.440486111111084,5.549999999999989,6.0,3.7456250000000035,-5.440208333333329,87.06805555555556,990.7993055555577 360 | 2014-12-25,2014,12,25,-11.482986111111119,0.38000000000000017,309.0,4.5831944444444535,-10.216388888888837,80.86458333333333,1001.4137500000008 361 | 2014-12-26,2014,12,26,-5.065486111111125,1.6600000000000008,242.0,3.6352777777777847,-3.830625000000017,86.72638888888889,1004.9795138888788 362 | 2014-12-27,2014,12,27,-6.606041666666678,0.3200000000000001,354.0,2.836666666666674,-5.37444444444445,85.40069444444444,1004.832638888889 363 | 
2014-12-28,2014,12,28,-10.724722222222246,0.10999999999999999,17.0,2.3302083333333377,-9.214444444444425,79.96944444444445,1013.1005555555525 364 | 2014-12-29,2014,12,29,-15.290416666666644,0.25000000000000006,152.0,1.9175000000000002,-14.476875000000021,80.5423611111111,1019.0821527777734 365 | 2014-12-30,2014,12,30,-1.8870138888888808,12.75999999999997,248.0,4.900972222222217,-0.9349999999999983,91.60069444444444,1009.4786805555586 366 | 2014-12-31,2014,12,31,2.0347916666666688,1.8900000000000012,287.0,4.367569444444439,3.701736111111103,89.00208333333333,1002.5904861111133 367 | -------------------------------------------------------------------------------- /extra/Summer2019_Session1_solved.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Analytics Summer School\n", 8 | "2019 Edition (Jesse Harrison, Anni Pyysing)" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# 1. Hands-on Session: Basics of Data Analysis" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "R is one of the most widely used programming languages for data analysis. It is available for free under the [GNU General public license](https://www.r-project.org/COPYING). In this session, we will go through a number of practical tasks to equip you with a basic grasp of how R works. This will also come in handy in the subsequent sessions. Don't worry, no prior programming experience is expected!\n", 23 | "\n", 24 | "###### R interfaces: details for the interested\n", 25 | "\n", 26 | "R code can be written using essentially any software. However, some methods are more practical than others. The most common method is to use RStudio (see [rstudio.com](https://www.rstudio.com/)). 
If you want to start using R on your own computer later on, you can begin by [downloading R](https://www.r-project.org/) and RStudio. Both are free. Today we will not download anything - instead, we will work with R code using Jupyter." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## Let's begin!" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "What you are currently using is Jupyter, a web-based (\"cloud\") interface that can run R. While not as powerful as a full-fledged R installation, it enables us to focus on some of the basics of reading, creating and editing R code.\n", 41 | "\n", 42 | "Jupyter notebooks consist of cells. For example, this text block is a cell that you can edit by double-clicking it (try this!). Now that you have entered the *edit mode*, you can *run* the cell by clicking the Run button above, or by using `Ctrl-Enter`.\n", 43 | "\n", 44 | "You can edit everything in your notebook (for example, delete stuff). Also, you can add cells in the \"Insert\" menu, run a cell in the \"Cell\" menu, and save and download this notebook in the \"File\" menu. __Saving your work__ is possible by using checkpoints (go to the File menu and click Save and Checkpoint)." 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Code cells and simple calculations\n", 52 | "\n", 53 | "Jupyter notebooks have two important types of cells: __Markdown__ and __Code__ cells. This cell is a markdown cell and it contains text. The cell below that has the marking `In[ ]` next to it is the first code cell. R code can be executed in code cells. \n", 54 | "\n", 55 | "You can write, for example, 1+1 in the cell below and run the calculation. Test it! You can erase the calculation, modify it, and Run it again. Also try some additional calculations using the list of arithmetic operators below. 
One important observation: R uses the full stop symbol (`.`), not a comma (`,`), as the decimal separator. \n", 56 | "\n", 57 | "To get used to how code cells work, also try typing different calculations on separate rows. You will notice that R treats each row individually (with the calculation results appearing as a list under the code cell).\n", 58 | "\n", 59 | "\n", 60 | "| R code | Explanation | Example |\n", 61 | "| -------|:---------: | -----: |\n", 62 | "| `+` | addition | `1+1` |\n", 63 | "| `-` | subtraction | `10-5` |\n", 64 | "| `*` | multiplication | `10*2` |\n", 65 | "| `/` | division | `10/2` |\n", 66 | "| `^` | exponentiation | `10^2` |\n", 67 | "| `.` | decimal point | `3.14159` |" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "5 + 10\n", 77 | "14 * 7\n", 78 | "10 ^ 2" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## Objects (and removing them)\n", 86 | "\n", 87 | "You can store information in *objects*. This is useful when you want to use an object later (once created, it remains in R's memory). The standard way to assign a value to an object is `object <- value`. Here we deal mostly with numeric values, but lines of text (such as \"Text\") can also be used. In addition to individual numbers or text labels, you can assign the results of different calculations to objects. Some examples are given below:\n", 88 | "```r\n", 89 | "price <- 140\n", 90 | "carPrice <- 1\n", 91 | "piRounded <- 3.14159\n", 92 | "name <- \"Text\"\n", 93 | "cookiePrice <- 2\n", 94 | "complexCalculation <- 5 + 5\n", 95 | "```\n", 96 | "To get the hang of creating objects in R, try making up a few of your own using the code cell below: one containing a single number, another containing a line of text, and one using a calculation. 
Note that to actually see what is inside an object, you will need to write the name of the object separately after performing the assignment. Otherwise it may appear as if nothing happens when you run the code. For example:\n", 97 | "\n", 98 | "```r\n", 99 | "price <- 140\n", 100 | "price\n", 101 | "```\n", 102 | "\n", 103 | "Sometimes you might also want to remove an object after creating it. This can be done using `rm()` (with the name of the object inserted inside the brackets). After creating the three objects, let's also try removing them! \n", 104 | "\n", 105 | "Note: there is also a command for completely clearing your working environment in R. This can be done as follows: `rm(list = ls())`. Use it with some caution because R doesn't come with an undo button!\n" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "obj1 <- 5 \n", 115 | "obj2 <- \"pancakes\"\n", 116 | "obj3 <- 5 / 4\n", 117 | "\n", 118 | "obj1\n", 119 | "obj2\n", 120 | "obj3\n", 121 | "\n", 122 | "rm(obj1)\n", 123 | "rm(obj2)\n", 124 | "rm(obj3)\n", 125 | "\n", 126 | "# rm(obj1, obj2, obj3)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "## Calculations using objects\n", 134 | "\n", 135 | "Once data have been assigned to objects, they are convenient to use for calculations:\n", 136 | "\n", 137 | "```r\n", 138 | "discount <- 20\n", 139 | "finalPrice <- price - discount\n", 140 | "finalSale <- (price - discount) * 0.5\n", 141 | "```\n", 142 | "\n", 143 | "Now you know how very basic math operations can be done in R, and it is time to put these skills to use! The code cell below already has some code written into it, and you can just modify it as needed.\n", 144 | "\n", 145 | "As an exercise, calculate the final price when the discount is __a)__ 40, or __b)__ 15% of the starting price." 
146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "discount2 <- 40\n", 155 | "finalPrice2 <- price - discount2\n", 156 | "finalPrice2\n", 157 | "\n", 158 | "discount3 <- 0.15 * price\n", 159 | "finalPrice3 <- price - discount3\n", 160 | "finalPrice3\n", 161 | "\n", 162 | "finalPrice4 <- price - (0.15 * price)\n", 163 | "finalPrice4" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## Vectors and functions in R\n", 171 | "\n", 172 | "### What vectors are\n", 173 | "\n", 174 | "While the examples above dealt with individual numbers, this is not how R is typically used. Usually you have plenty of data (many items and many prices, for example). The way to store a sequence of numbers (or other types of data) is to use a vector. When we are dealing with numeric data, the vector is called a *numeric vector*. When dealing with text data, the vector is called a *character vector*. \n", 175 | "\n", 176 | "We'll learn more about vectors soon, but for now let's go ahead and create a vector called *testVector*.\n", 177 | "See what comes out by running the cell!" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "testVector <- c(10, 30, 1, 4)\n", 187 | "testVector\n", 188 | "\n", 189 | "sort(testVector)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "Ok, now we have created a numeric vector and sorted it. You can try to figure out how to make a character vector by yourself. If you cannot figure it out, do not worry, we will learn it soon." 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### What functions are\n", 204 | "\n", 205 | "A surprise! We have already used functions in this session, including `sort()` and `c()`. 
The first is a function that orders the values in a vector and the latter is a function that combines its arguments to form a vector. In the case of `testVector <- c(10,30,-20,-1)`, the arguments are 10, 30, -20, and -1. The command for removing objects, `rm()`, is also a function.\n", 206 | "\n", 207 | "Some useful functions include convenient ways to calculate things, such as `mean()`, which calculates the mean of a numeric vector. The function call `mean(testVector)` will give you 4.75 as the output, since it treats the vector `c(10,30,-20,-1)` as a whole. Another example of a useful function is `median(testVector)`. Let's try both:" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "testVector <- c(10, 30, -20, -1)\n", 217 | "\n", 218 | "mean(testVector)\n", 219 | "median(testVector)\n", 220 | "\n", 221 | "mean(testVector) - median(testVector)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "__Some useful functions__ are listed next. You don't have to memorize any functions, but it is useful to know some that are available. If you are unsure how a function works, you can always open the help (see below), or you can search the internet or [R documentation](https://www.r-project.org/other-docs.html) for many more functions.\n", 229 | "\n", 230 | "* Most mathematical functions also work *element by element* and return multiple values
\n", 231 | " `exp()`, `log()`\n", 232 | "* Some functions treat the vector as a whole and return a single value, like the length of a vector
\n", 233 | " `length()`, `sum()`, `mean()`, `median()`, `sd()`, `var()`, `min()`, `max()`
\n", 234 | "* Other functions tell several things at once
\n", 235 | " `summary()`, `str()`
\n", 236 | "* It is also possible to ask for help and open R documentation.
\n", 237 | " `help()`, or `?function` (e.g. `?mean`) will both work
\n", 238 | "* A default R installation can also be used to create simple graphics.
\n", 239 | " `plot()`, `barplot()`
\n", 240 | "\n", 241 | "## Annotating your code\n", 242 | "\n", 243 | "Annotating your code is done by using the `#` character in R. This is especially important when writing longer projects. Comments are used to explain what you are doing in your code. Everything after a `#` on the same line is treated as \"non-code\" and is ignored by the interpreter that turns the R code into action." 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "# This is a comment\n", 253 | "\n", 254 | "help(summary)\n", 255 | "summary(testVector)\n", 256 | "\n", 257 | "# See what happens if you erase the # character below\n", 258 | "# mean(testVector)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "## Next some exercises! \n", 266 | "\n", 267 | "Use `help()`, search the internet, or just guess the correct commands.\n", 268 | "Writing something \"wrong\" does not break anything; it will just give an error message. Find the mistake, fix the mistake, and run the cell again!\n", 269 | "\n", 270 | "Remember: if you store the value of some calculation in an object, you can read its content by simply writing its name.\n", 271 | "\n", 272 | "#### Exercise 1: Learning to use external resources\n", 273 | "Find out the variance of the following numbers: 1, 4, 9.5, 10, and -2.
\n", 274 | "*Hint!* There is a function for it :)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "# Exercise 1 answer below\n", 284 | "\n", 285 | "variance <- var(c(1, 4, 9.5, 10, -2))\n", 286 | "variance" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "#### Exercise 2: Finding out how a function works\n", 294 | "Learn what the function `seq()` does and what arguments it needs to work. Test it. Do not be afraid to get an error message!" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "# Exercise 2 answer below\n", 304 | "\n", 305 | "mysterySeq1 <- seq(1, 10, 1)\n", 306 | "mysterySeq1\n", 307 | "\n", 308 | "# needs 'from', 'to' and 'by' arguments.\n", 309 | "# above: from 1 to 10 \"by\" 1 (the \"by\" argument specifies the interval)\n", 310 | "\n", 311 | "mysterySeq2 <- seq(1, 10, 2) # \"by\" 2\n", 312 | "mysterySeq2" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "#### Exercise 3: Thinking on your own\n", 320 | "Create a vector `t` containing the numbers from 0 to 10 with 0.1 interval, and plot the `sin()` of the vector `t` at the points contained in the `t` vector." 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "# Exercise 3 answer below\n", 330 | "\n", 331 | "t <- seq(0, 10, 0.1)\n", 332 | "y <- sin(t)\n", 333 | "\n", 334 | "plot(t, y)\n", 335 | "plot(y, t)" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "#### Extra exercise for the fast ones\n", 343 | "Find out how to write a title into the plot and change the scatter plot from exercise 3 into a line plot.
\n", 344 | "*Hint!* The `plot()` function can take more arguments than two. Do not worry about finishing this exercise if you run out of time." 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "# Answer for the extra exercise\n", 354 | "\n", 355 | "# for more info, see help(plot)!\n", 356 | "\n", 357 | "plot(t, y, type = \"l\", main = \"a smart title\")" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "## Data types \n", 365 | "\n", 366 | "### Numeric vectors vs. character vectors\n", 367 | "\n", 368 | "Now that you can write simple R code and search the R documentation or the Internet for information about functions, it is time to learn something new! As you might remember, there are *numeric* and *character* vectors, which are used to store numbers and text. A character vector can obviously contain numbers as well, but in text format. Character vectors can be created using quotes `\" \"` or single quotes `' ' `.\n", 369 | "\n", 370 | "For example, what is the difference between the following commands?\n", 371 | "```r\n", 372 | "numVector <- c(1, 2, 3, 4)\n", 373 | "chVector <- c(\"1\", \"2\", \"3\", \"4\")\n", 374 | "```\n", 375 | "Try running the cell below and inspect the message." 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "# Demonstrating the difference between character and numeric vectors\n", 385 | "\n", 386 | "numVector <- c(1, 2, 3, 4)\n", 387 | "chVector <- c(\"1\", \"2\", \"3\", \"4\")\n", 388 | "\n", 389 | "sum(numVector)\n", 390 | "sum(chVector)" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "### Logical vectors and logical operators \n", 398 | "\n", 399 | "In addition to numeric and character vectors, there are *logical* vectors. 
Logical vectors have only two possible values, `TRUE` and `FALSE`. They are answers to questions such as \"Is the price above 50?\" or \"Is the price exactly 10?\". \n", 400 | "\n", 401 | "A logical vector can be created similarly to other vectors, for example: \n", 402 | "```r\n", 403 | "answers <- c(TRUE, FALSE, TRUE, TRUE)\n", 404 | "```\n", 405 | "\n", 406 | "Logical vectors (notice that a single value is a vector with the length of one) are often created as the output of functions. For example, the function `is.numeric()` tests if its argument is numeric and returns a logical value." 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": null, 412 | "metadata": {}, 413 | "outputs": [], 414 | "source": [ 415 | "# Run this cell to see what happens\n", 416 | "is.numeric(\"some character string\")\n", 417 | "is.numeric(50385)" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "Logical vectors are not only created as function outputs, but also as the result of *logical operations* that are written in an intuitive way, very similarly to the arithmetic operators.\n", 425 | "\n", 426 | "The *logical operators* in R are listed below with some use examples:\n", 427 | "\n", 428 | "| R code | Explanation | Example | Output |\n", 429 | "| ------- |:---------: | -----: | -----: |\n", 430 | "| `<` | less than | `10 < 15` | `TRUE` |\n", 431 | "| `>` | greater than | `3 > 3` | `FALSE` |\n", 432 | "| `<=` | less or equal | `3 <= 3` | `TRUE` |\n", 433 | "| `>=` | greater or equal | `5 >= 6` | `FALSE` |\n", 434 | "| `==` | equal | `7 == 7` | `TRUE` |\n", 435 | "| `!=` | not equal | `2 != 2` | `FALSE` |\n", 436 | "\n", 437 | "It is also possible to combine different rules by using logical operators, which are:\n", 438 | "\n", 439 | "- and `&`\n", 440 | "- or `|`\n", 441 | "- not `!`\n", 442 | "\n", 443 | "Try to play with logical operators a bit in the code cell below. 
After trying out what's already written inside the cells, try some examples of your own." 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "# Test some logical operators to see how they work\n", 453 | "age <- 15\n", 454 | "\n", 455 | "# One rule\n", 456 | "age < 18\n", 457 | "\n", 458 | "# Combining rules\n", 459 | "age < 18 & age > 16\n", 460 | "\n", 461 | "# Own examples\n", 462 | "\n", 463 | "oldage <- 99\n", 464 | "oldage <= 100\n", 465 | "\n", 466 | "oldage != 100" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "Now for an exercise to flex your brain with logical operators.\n", 474 | "\n", 475 | "#### Exercise: Book auction\n", 476 | "\n", 477 | "Let's say that you run a small company that sells boxes of books in an auction. The number of books in each box varies, and the price of each box is set in the auction.\n", 478 | "\n", 479 | "The object `prices` has the prices of the boxes that were sold each day, and `booksInBoxes` has the number of books that were in each box.\n", 480 | "\n", 481 | "Create your own `prices` and `booksInBoxes` objects. 
Using what we've learnt so far, can you use logical operators to find out if the total sales were above our weekly target, 500?\n" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "# Answer here\n", 491 | "\n", 492 | "prices <- c(100, 49, 10, 16, 88, 2, 51)\n", 493 | "booksInBoxes <- c(30, 1, 12, 5, 8, 1, 14)\n", 494 | "target <- 500\n", 495 | "\n", 496 | "totSales <- sum(prices)\n", 497 | "totSales\n", 498 | "\n", 499 | "totSales > target" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "### Choosing values\n", 507 | "Sometimes we want to choose a part of the values from a larger dataset for further analysis.\n", 508 | "We can do this using square brackets `[ ]`.
\n", 509 | "Observe the output of the cell below by running it. Feel free to modify the code and test how it works." 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": null, 515 | "metadata": {}, 516 | "outputs": [], 517 | "source": [ 518 | "prices[1] # choosing the first value\n", 519 | "prices[2:5] # choosing the 2nd, 3rd, 4th and 5th value\n", 520 | "prices[prices > 40] # choosing values based on logical operations\n", 521 | "prices[c(2,4,5)] # choosing the 2nd, 4th and 5th value" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "### The workspace\n", 529 | "\n", 530 | "The workspace holds the objects that the user (you!) has defined during the session. For example, the object `prices` should be in your workspace. Here are some useful commands to get you started with the workspace (in addition to the `rm()` function):\n", 531 | "* `getwd()` prints the current working directory location\n", 532 | "* `setwd()` can be used to set a custom working directory (you will need to specify the full path)\n", 533 | "* `ls()` lists all the objects in the workspace\n", 534 | "\n", 535 | "Now might also be a good time to create a checkpoint if you want to save the state of your work before moving on." 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "# Find out what objects you have in your workspace\n", 545 | "ls()" 546 | ] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "metadata": {}, 551 | "source": [ 552 | "### Data frames - the most important data structure in R \n", 553 | "We moved on from having objects that were single values to objects that were single vectors. However, it would be useful if the vectors that are related to each other could also be represented by the same object.\n", 554 | "Now we will proceed to storing our data in even bigger entities, __data frames__. 
Data frames can include our previously very separate vectors in the same object.\n", 555 | "\n", 556 | "Some of the functions will work with entire data frames, like `summary()`, but for some functions, like `mean()`, you will need to specify which data inside the data frame you want to use in the function.\n", 557 | "\n", 558 | "Some important functions and commands that are useful for working with data frames include:\n", 559 | "* `data.frame()` creates a data frame of the arguments that it is given\n", 560 | "* the dollar sign `$` is used to take a column from a data frame\n", 561 | "* square brackets `[ ]` can be used to take a part of the dataframe\n", 562 | "* `subset()` returns a subset of a data frame\n", 563 | "\n", 564 | "In the following examples we will learn how to create, modify, and use data frames in R.\n", 565 | "But first, let's create a couple of vectors and combine them into one data frame for storage. " 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": null, 571 | "metadata": {}, 572 | "outputs": [], 573 | "source": [ 574 | "# Here we create some data to put into a data frame\n", 575 | "\n", 576 | "amount <- c(2, 3, 5, 1)\n", 577 | "book <- c(\"Comics\", \"Novels\", \"Manuals\", \"Comics\") \n", 578 | "prices <- c(7.50, 20.00, 66.0, 500)\n", 579 | "\n", 580 | "# Combining the vectors into a dataframe\n", 581 | "\n", 582 | "bookTypes <- data.frame(amount, book, prices)\n", 583 | "bookTypes" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "metadata": {}, 589 | "source": [ 590 | "### Factors\n", 591 | "\n", 592 | "There is still one data type to discuss, __factor__. Factors can be described as categories, such as `Comics`, `Novels`, and `Manuals`. You can check that the class of `bookTypes$book` is actually factor by using the function `class()`.\n", 593 | "The class changed from character to factor because the function `data.frame()` turns character vectors into factors by default. 
This default applies to R versions before 4.0.0 (from 4.0.0 onwards, character vectors are kept as they are). The conversion can be avoided by the option `stringsAsFactors = FALSE`. Factors are necessary in statistical modeling, where different categories are often compared.\n", 594 | "\n", 595 | "Let's create a new data frame, but this time, we'll put everything inside it immediately. Also, notice how we have split this longer section of code onto multiple rows. We learnt before that R will treat each line of code individually and this is true when each line contains a *separate*, complete section of code (such as assignments for separate data frames). In the current case, arranging the code on multiple rows is possible because the call to `data.frame()` is not complete until its closing parenthesis, so R keeps reading the following rows as part of the same command." 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": {}, 602 | "outputs": [], 603 | "source": [ 604 | "# Creating a data frame called 'bookReview'\n", 605 | "# (Note how the code is split into different rows for easier reading)\n", 606 | "\n", 607 | "bookReview <- data.frame(storage = c(4, 1, 3, 2), \n", 608 | " titles = c(\"Good book\", \"Excellent book\", \"Bad book\", \"Horrible book\"),\n", 609 | " cost = c(12.1, 20.0, 3.33, -10 ),\n", 610 | " stringsAsFactors=FALSE)\n", 611 | "bookReview" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "Now if we want to get only the titles, we must specify which data frame they are in.\n", 619 | "This is done by using the `$` sign. You can also try to take the column without specifying the data frame, and see what happens." 
620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": null, 625 | "metadata": { 626 | "scrolled": true 627 | }, 628 | "outputs": [], 629 | "source": [ 630 | "bookReview$titles # take a column from the data frame\n", 631 | "titles # take the variable 'titles'" 632 | ] 633 | }, 634 | { 635 | "cell_type": "markdown", 636 | "metadata": {}, 637 | "source": [ 638 | "### Creating and modifying columns\n", 639 | "Now we will create (and modify) some new columns in the data frame using the `$` sign. See what happens to our `bookReview` and try to figure out how the column `storage.value` is created!
\n", 640 | "\n", 641 | "_Hint!_ The next exercise will be easier if you focus on this part." 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "metadata": {}, 648 | "outputs": [], 649 | "source": [ 650 | "# Creating some new columns\n", 651 | "bookReview$type <- \"book\"\n", 652 | "bookReview$storage.value <- bookReview$storage * bookReview$cost\n", 653 | "bookReview\n", 654 | "\n", 655 | "# Modifying our new columns\n", 656 | "bookReview$type[4] <- \"trash\"\n", 657 | "bookReview" 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "metadata": {}, 663 | "source": [ 664 | "#### Exercise: Modifying the data frame\n", 665 | "\n", 666 | "We have received a shipment of new books and their information should be stored into our data frame.\n", 667 | "The shipment included two books of each type. Don't forget to update the storage value!\n", 668 | "\n", 669 | "Luckily for us, the cost of the books stays the same, so we don't need to worry about that." 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": {}, 676 | "outputs": [], 677 | "source": [ 678 | "# Answer here\n", 679 | "\n", 680 | "bookReview # before changes\n", 681 | "\n", 682 | "bookReview$storage <- bookReview$storage + 2 # updating number in store\n", 683 | "bookReview\n", 684 | "\n", 685 | "bookReview$storage.value <- bookReview$cost * bookReview$storage # updating storage value\n", 686 | "bookReview" 687 | ] 688 | }, 689 | { 690 | "cell_type": "markdown", 691 | "metadata": {}, 692 | "source": [ 693 | "### Choosing values from a data frame\n", 694 | "\n", 695 | "As you might have already learned, there are many ways to achieve the same goal in R. Both `subset()` and `[ ]` can be used to choose columns and rows. \"Unselecting\" columns is done by the minus sign (`-`).\n", 696 | "Uncomment the method you want to test. Can you figure out how each method works?" 
697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": { 703 | "scrolled": true 704 | }, 705 | "outputs": [], 706 | "source": [ 707 | "# Create a column with a sequence of numbers\n", 708 | "bookReview$delete.column <- seq(1, 100, length = 4)\n", 709 | "bookReview\n", 710 | "\n", 711 | "# multiple methods to delete the 6th column in the data frame:\n", 712 | "\n", 713 | "# bookReview <- bookReview[1:4,-6]\n", 714 | "# bookReview <- subset(bookReview, select = -6)\n", 715 | "# bookReview[\"delete.column\"] <- NULL\n", 716 | "\n", 717 | "bookReview" 718 | ] 719 | }, 720 | { 721 | "cell_type": "markdown", 722 | "metadata": {}, 723 | "source": [ 724 | "We can also set rules for the observations that we want to take from the data frame." 725 | ] 726 | }, 727 | { 728 | "cell_type": "code", 729 | "execution_count": null, 730 | "metadata": {}, 731 | "outputs": [], 732 | "source": [ 733 | "subset(bookReview, storage > 1 & cost > 5)\n", 734 | "subset(bookReview, type == \"book\")" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": {}, 740 | "source": [ 741 | "#### Exercise: Finding observations from the data frame\n", 742 | "Find out the average cost of the books that are not horrible. *Hint*: logical operators are your friend!" 743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "metadata": {}, 749 | "outputs": [], 750 | "source": [ 751 | "goodReview <- subset(bookReview, titles != \"Horrible book\")\n", 752 | "goodReview\n", 753 | "\n", 754 | "mean(goodReview$cost)" 755 | ] 756 | }, 757 | { 758 | "cell_type": "markdown", 759 | "metadata": {}, 760 | "source": [ 761 | "Next, let's try to use some functions on our data frame. Run the cell below and observe the output!" 
762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": null, 767 | "metadata": {}, 768 | "outputs": [], 769 | "source": [ 770 | "summary(bookReview) # see how this works on a data frame\n", 771 | "mean(bookReview) # see how this does not work on a data frame" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "### Error messages (and missing values) \n", 779 | "\n", 780 | "Learning to read error and warning messages is a good skill to have, because it will help you locate and fix errors in your code. Did you read the warning message that appeared when we tried to take the `mean()` of the entire data frame? The message says that the argument we gave the function is not in a suitable format. However, the function did still return something, namely `NA`.\n", 781 | "\n", 782 | "`NA` is short for \"not available\", which means that the answer is a missing value. In real life, datasets are full of missing values. You should remember that missing values can disturb some calculations, for example when you take the `mean()` of a vector that contains `NA`s. Luckily, you can tell the function `mean()` to skip `NA`s in the calculation by giving the function an extra argument, `na.rm = TRUE`.\n", 783 | "\n", 784 | "There is also another value, `NULL`, that could be confused with `NA`, but the difference between `NULL` and `NA` is a bit out of the scope of this session. Check out the R documentation if you are interested!" 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "## Getting data into R\n", 792 | "\n", 793 | "So far we have dealt with very small data sets that we created ourselves, but in the real world you would use data from external sources. R has many functions for data import, such as `read.csv()` to read .csv files and `read.table()` to read .txt files. Notice that your data might not be exactly in the format that the function expects! 
For example, if your data uses a decimal mark other than the default `.`, you can specify it with the extra argument `dec`. This would be written as `read.csv(\"filename.csv\", dec = \",\")`. Another alternative is to use `read.csv2()`, which expects the comma (`,`) as the decimal mark and the semicolon (`;`) as the field separator.\n", 794 | "\n", 795 | "To load a .csv file into R, you would have two options:\n", 796 | "\n", 797 | "`read.csv(\"file.csv\")` # when the file is in your working directory\n", 798 | "`read.csv(\"pathtofile/file.csv\")` # when the file is outside your working directory\n", 799 | "\n", 800 | "R also comes with some built-in datasets that you can practice with. If you are interested, available datasets are listed by the `data()` function. The same function also loads the dataset when the name of the dataset is given as an argument. The *PlantGrowth* dataset is a good one to start with." 801 | ] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "execution_count": null, 806 | "metadata": {}, 807 | "outputs": [], 808 | "source": [ 809 | "# An example of exploring a built-in dataset\n", 810 | "\n", 811 | "# first clear the workspace\n", 812 | "rm(list = ls())\n", 813 | "\n", 814 | "# load the PlantGrowth data set\n", 815 | "data(PlantGrowth)\n", 816 | "\n", 817 | "# explore the data\n", 818 | "head(PlantGrowth)\n", 819 | "str(PlantGrowth)\n", 820 | "?PlantGrowth" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": {}, 826 | "source": [ 827 | "### Packages\n", 828 | "\n", 829 | "Packages are the strength of R! You can extend R by installing new packages. There are several packages already installed with R, but installing new ones will probably be necessary if you are working with more than the very basic stuff. You only have to install a package once, but you need to load it with `library()` every time you start a new session. 
Here are some useful functions to get you started!\n", 830 | "\n", 831 | "* `installed.packages()` will list all the packages that are already installed\n", 832 | "* `install.packages(\"packageName\")` will install the package from CRAN\n", 833 | "* `library()` will list all the packages that are available in the library and ready to use\n", 834 | "* `library(\"packageName\")` will load the package to be used during the session\n", 835 | "\n", 836 | "The number of packages available for R is enormous. See this [article at rstudio.com](https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages) to find a nice list of useful packages.\n", 837 | "All available packages at CRAN can be found at [cran.r-project.org](https://cran.r-project.org) under Packages, and a useful listing of packages by topic is under Task Views.\n", 838 | "\n", 839 | "Now you have the basic tools to move on to the next section of this Data Analytics Summer School. If you want to go through these materials later on, go ahead and download this notebook as PDF!\n", 840 | "\n", 841 | "\n" 842 | ] 843 | }, 844 | { 845 | "cell_type": "markdown", 846 | "metadata": {}, 847 | "source": [ 848 | "## Extra examples and exercises\n", 849 | "\n", 850 | "Here are some extra examples and exercises for those who have some coding experience, have perhaps used R before, or are just very fast.\n", 851 | "\n", 852 | "### PlantGrowth data set\n", 853 | "\n", 854 | "Let's continue with the *PlantGrowth* data set. This data set, while quite a small one, is still useful for our purposes. It contains information on plant yields (measured as dry weight) obtained under control conditions or one of two experimental treatments.\n", 855 | "\n", 856 | "Something we haven't yet tried (but will explore later on in more depth) is using R to visualize your data. In addition to offering many functions for data manipulation and statistics, R is also great for creating plots. 
Here, we can try looking at the control vs. experimental groups using one of the default functions in R called `boxplot()`. It produces quite a lot of information for just a single line of code: \n" 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "execution_count": null, 862 | "metadata": {}, 863 | "outputs": [], 864 | "source": [ 865 | "boxplot(weight ~ group, data = PlantGrowth, col = \"blue\")" 866 | ] 867 | }, 868 | { 869 | "cell_type": "markdown", 870 | "metadata": {}, 871 | "source": [ 872 | "The box and whiskers plot shows the group medians, lower (25%) and upper (75%) quartiles and also the smallest and largest values within each group (by default, points lying more than 1.5 times the interquartile range beyond the box are drawn separately as outliers). What conclusions do you think we could draw from the results?\n", 873 | "\n", 874 | "One could also think about plotting the data slightly differently. For example, we might want to see all the individual data points rather than a summary. One way to do this is to type in `stripchart(weight ~ group, data = PlantGrowth)` (try it using the code cell below!). The resulting plot could be easier to interpret, though. Can you find a way to rotate it so that the groups are shown on the x axis and weight on the y axis? To find the answer, have a look at the R help file for this function.\n" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": null, 880 | "metadata": {}, 881 | "outputs": [], 882 | "source": [ 883 | "stripchart(weight ~ group, data = PlantGrowth)\n", 884 | "\n", 885 | "stripchart(weight ~ group, data = PlantGrowth, vertical = TRUE)" 886 | ] 887 | }, 888 | { 889 | "cell_type": "markdown", 890 | "metadata": {}, 891 | "source": [ 892 | "### A few words on formulas\n", 893 | "\n", 894 | "You've probably noticed how the code in these examples is somewhat different to what we were working with before. The use of the tilde symbol (`~`) in R is specific to _formulas_. 
The general way in which formulas in R are structured can be thought of as:\n", 895 | "\n", 896 | "`function(y ~ x)` # first y, then x!\n", 897 | "\n", 898 | "Another situation where you will encounter formulas is when you're doing statistical modelling with R. For example, if you later wish to fit a linear model to a data set, you could use the linear model function `lm()` as follows:\n", 899 | "\n", 900 | "`lm(variable_y ~ variable_x, data = dataframe)`\n", 901 | "\n", 902 | "As diving into the world of statistics is beyond the scope of this hands-on session, you may wish to look up the linear model function in your own time: `?lm`" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "### Free exploration\n", 910 | "Feel free to explore another built-in dataset in R, _ToothGrowth_. " 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": null, 916 | "metadata": {}, 917 | "outputs": [], 918 | "source": [ 919 | "# Free exploration\n", 920 | "\n", 921 | "?ToothGrowth\n", 922 | "data(ToothGrowth)\n", 923 | "head(ToothGrowth)" 924 | ] 925 | } 926 | ], 927 | "metadata": { 928 | "celltoolbar": "Raw Cell Format", 929 | "kernelspec": { 930 | "display_name": "R", 931 | "language": "R", 932 | "name": "ir" 933 | }, 934 | "language_info": { 935 | "codemirror_mode": "r", 936 | "file_extension": ".r", 937 | "mimetype": "text/x-r-source", 938 | "name": "R", 939 | "pygments_lexer": "r", 940 | "version": "3.5.1" 941 | } 942 | }, 943 | "nbformat": 4, 944 | "nbformat_minor": 1 945 | } 946 | -------------------------------------------------------------------------------- /extra/Summer2019_Session1_solved.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/csc-training/R-for-beginners/021b684de5c4a6ee44ed2af2d28df4046f71817a/extra/Summer2019_Session1_solved.pdf -------------------------------------------------------------------------------- 
/extra/baseRintro.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/csc-training/R-for-beginners/021b684de5c4a6ee44ed2af2d28df4046f71817a/extra/baseRintro.pdf -------------------------------------------------------------------------------- /extra/baseRintro_solved.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/csc-training/R-for-beginners/021b684de5c4a6ee44ed2af2d28df4046f71817a/extra/baseRintro_solved.pdf --------------------------------------------------------------------------------