├── README.md
├── cities.csv
├── citiesNA.csv
├── rankcity.md
├── rankeachsubset.md
├── sortcolumns.md
├── sortna.md
└── sortsubset.md

/README.md:
--------------------------------------------------------------------------------
# Assignment 3: tutorial
## Series of functions that show concepts useful to complete Assignment 3

Assignment 3 requires managing a big database, ranking hospitals by different factors and applying some complex criteria.
To make the principles easier to understand, I created smaller data frames and functions.
There are plenty of ways to solve Assignment 3, and below you find just my suggestions. I'm sure there are better ways to do it.

To run the functions below, you need to download and read these two small csv files:
[cities.csv](cities.csv)
[citiesNA.csv](citiesNA.csv)

The examples assume that text columns are read as factors, which was the default of read.csv() before R 4.0.0.
On R 4.0.0 or later, pass stringsAsFactors = TRUE to read.csv() to get the same behaviour.

1) How to order a data frame by the values in its columns

[sort_by_column() and sort_by_columns()](https://github.com/DanieleP/PA3-tutorial/blob/master/sortcolumns.md)

2) How to manage NA when ordering data frames

[sort_by_column_NA()](https://github.com/DanieleP/PA3-tutorial/blob/master/sortna.md)

3) How to order only a specific subset of a data frame

[sort_country()](https://github.com/DanieleP/PA3-tutorial/blob/master/sortsubset.md)

4) How to return a specific ranking after ordering

[find_city_rank() and find_last_city()](https://github.com/DanieleP/PA3-tutorial/blob/master/rankcity.md)

5) How to return a data frame of specific rankings for each subset

[rank_by_country()](https://github.com/DanieleP/PA3-tutorial/blob/master/rankeachsubset.md)
--------------------------------------------------------------------------------
/cities.csv:
--------------------------------------------------------------------------------
cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,1214,8406
LA,USA,1302,3884
London,UK,1737,9789
Manchester,UK,116,255
--------------------------------------------------------------------------------
/citiesNA.csv:
--------------------------------------------------------------------------------
cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,Unknown,8406
LA,USA,1302,3884
London,UK,1737,Unknown
Manchester,UK,116,255
--------------------------------------------------------------------------------
/rankcity.md:
--------------------------------------------------------------------------------
# How to return a specific ranking after ordering

Working on cities.csv, we want to know which is the second biggest city in our database.
We are not interested in an ordered data frame; we just want to know the name of the city.

    ## The argument decreasing = TRUE inverts the direction of the order: numbers from biggest to smallest and
    ## characters from Z to A. This is helpful when we consider rank #1 the biggest city.
    ## as.character() returns a character vector with the name of the city. If we just returned orderdata[rank,1]
    ## we would get a factor instead.
    find_city_rank <- function(data,column,rank){
        orderdata <- data[order(decreasing = TRUE,data[,column]),]
        return(as.character(orderdata[rank,1]))
    }

Examples:

    > find_city_rank (data,3,1)
    [1] "Shanghai"
    > find_city_rank (data,3,2)
    [1] "London"
    > find_city_rank (data,4,2)
    [1] "Beijing"

Now let's consider the case in which we don't know the length of the csv file and we just want to get the
last city in the ranking.

    ## nrow() returns the number of rows of a data frame. We use this function to determine the index of the last item.
    find_last_city <- function(data,column){
        orderdata <- data[order(decreasing = TRUE,data[,column]),]
        return(as.character(orderdata[nrow(orderdata),1]))
    }

Examples:

    > find_last_city (data,3)
    [1] "Manchester"
    > find_last_city (data,1)
    [1] "Beijing"

Note that in the last example Beijing is last in the alphabetical ranking. This is because decreasing = TRUE
sorts characters from Z to A.
--------------------------------------------------------------------------------
/rankeachsubset.md:
--------------------------------------------------------------------------------
# How to return a data frame of specific rankings for each subset

Considering cities.csv, in this case we want to know the second city by size in each country, without knowing the
size of our database or the number of countries. We want a data frame with these cities as the final output.

    rank_by_country <- function(data,column,rank){
        ## We save the levels of column 2, the countries' names, in the countries vector
        countries <- levels(data[,2])
        ## We generate an empty vector that we will fill later, row by row, to generate our final output
        output <- vector()
        ## For loop to get the right data on each city. length(countries) is the number of different countries in our
        ## database. In our case we have 3 countries: China, UK, USA.
        for (i in 1:length(countries)) {
            ## countrydata subsets data by the considered country
            countrydata <- data [grep(countries[i],data$countries),]
            orderdata <- countrydata[order(decreasing = TRUE, countrydata[,column]),]
            ## append() adds elements at the end of a vector. We want to add the name of the city [rank,1],
            ## the areakm2 [rank,3] and the populationk [rank,4]. We don't add the name of the countries, because it
            ## will be the label of the rows.
            output <- append (output, as.character(orderdata[rank,1]))
            for (l in 3:4){
                output <- append (output, as.character(orderdata[rank,l]))
            }
        }
        ## Just because it's simpler to generate a matrix rather than a data frame, I generate it first and convert it
        ## to a data frame immediately after.
        output <- as.data.frame(matrix(output,length(countries),3, byrow = TRUE))
        ## Names of the columns will be "cities", "areakm2" and "populationk". Names of the rows are the countries.
        colnames(output) <- c("cities","areakm2","populationk")
        rownames(output) <- countries
        return(output)
    }

Examples:

    > rank_by_country(data,3,1)
            cities areakm2 populationk
    China Shanghai    2643       21766
    UK      London    1737        9789
    USA         LA    1302        3884
    > rank_by_country(data,3,2)
              cities areakm2 populationk
    China    Beijing    1368       21500
    UK    Manchester     116         255
    USA          NYC    1214        8406
--------------------------------------------------------------------------------
/sortcolumns.md:
--------------------------------------------------------------------------------
# How to order a data frame by the values in its columns

Let's first read the cities.csv file

    > data <- read.csv("cities.csv")
    > data
          cities countries areakm2 populationk
    1   Shanghai     China    2643       21766
    2    Beijing     China    1368       21500
    3        NYC       USA    1214        8406
    4         LA       USA    1302        3884
    5     London        UK    1737        9789
    6 Manchester        UK     116         255
    > class(data)
    [1] "data.frame"

As we can see, data is a data.frame with 6 rows and 4 columns, both character and numeric.
Below is the function to order the data frame by column:

    ## orderdata: output data.frame with the ordered rows
    ## order(): sorts the values in increasing order by default. In case of numbers
    ## from the smallest to the biggest, in case of characters from A to Z. This
    ## function returns a vector of indexes with the ordered rows.
    ## > order(data[,1])
    ## [1] 2 4 5 6 3 1
    ## data[order(),] subsets the data frame using the indexes above
    sort_by_column <- function (data, column){
        orderdata <- data[order(data[,column]),]
        return(orderdata)
    }

Examples:

    > sort_by_column (data,1)
          cities countries areakm2 populationk
    2    Beijing     China    1368       21500
    4         LA       USA    1302        3884
    5     London        UK    1737        9789
    6 Manchester        UK     116         255
    3        NYC       USA    1214        8406
    1   Shanghai     China    2643       21766
    > sort_by_column (data,3)
          cities countries areakm2 populationk
    6 Manchester        UK     116         255
    3        NYC       USA    1214        8406
    4         LA       USA    1302        3884
    2    Beijing     China    1368       21500
    5     London        UK    1737        9789
    1   Shanghai     China    2643       21766

In case of ties, we can pass a second argument to order() to act as a second sorting criterion

    sort_by_columns <- function (data, col1, col2){
        orderdata <- data[order(data[,col1],data[,col2]),]
        return(orderdata)
    }

Examples:

    > sort_by_columns (data,2,3)
          cities countries areakm2 populationk
    2    Beijing     China    1368       21500
    1   Shanghai     China    2643       21766
    6 Manchester        UK     116         255
    5     London        UK    1737        9789
    3        NYC       USA    1214        8406
    4         LA       USA    1302        3884
    > sort_by_columns (data,2,1)
          cities countries areakm2 populationk
    2    Beijing     China    1368       21500
    1   Shanghai     China    2643       21766
    5     London        UK    1737        9789
    6 Manchester        UK     116         255
    4         LA       USA    1302        3884
    3        NYC       USA    1214        8406
--------------------------------------------------------------------------------
/sortna.md:
--------------------------------------------------------------------------------
# How to manage NA when ordering data frames

Let's read the citiesNA.csv file

    > data <- read.csv("citiesNA.csv")
    > data
          cities countries areakm2 populationk
    1   Shanghai     China    2643       21766
    2    Beijing     China    1368       21500
    3        NYC       USA Unknown        8406
    4         LA       USA    1302        3884
    5     London        UK    1737     Unknown
    6 Manchester        UK     116         255
    > class(data)
    [1] "data.frame"

In this case we have some character data in the areakm2 and populationk columns that we want to treat as NA.
Below is the function to order the data frame by column, keeping only complete rows with no NA values.
This means that we will exclude NYC and London from our data frame.

    ## Subsetting data by column, we get a factor:
    ## > class(data[,2])
    ## [1] "factor"
    ## One way to extract a vector from the factor is by subsetting its levels by the factor itself.
    ## levels(data[,2]) returns a vector of the levels:
    ## [1] "China" "UK"    "USA"
    ## levels(data[,2])[data[,2]] returns a character vector with the content of data[,2]
    ## [1] "China" "China" "USA"   "USA"   "UK"    "UK"
    ## data[,2] would return a factor, which for our purposes is harder to handle
    ## [1] China China USA   USA   UK    UK
    ## Levels: China UK USA
    ## suppressWarnings() stops the warning alerts from R. When we coerce a mix of numeric and character values
    ## into a numeric vector, text automatically becomes NA, but it's a forced coercion and R sends a warning.
    ## This is the case of our columns 3 and 4, where "Unknown" becomes NA.
    ## complete.cases() returns a logical vector that is TRUE for the rows that don't have any NA. By subsetting
    ## the data frame with it we get a data frame with only complete cases.
    sort_by_column_NA <- function(data,column){
        for (i in 3:4){
            data[,i] <- suppressWarnings(as.numeric(levels(data[,i])[data[,i]]))
        }
        orderdata <- data[order(data[,column]),]
        orderdata <- orderdata[complete.cases(orderdata),]
        return(orderdata)
    }

Examples:

    > sort_by_column_NA(data,3)
          cities countries areakm2 populationk
    6 Manchester        UK     116         255
    4         LA       USA    1302        3884
    2    Beijing     China    1368       21500
    1   Shanghai     China    2643       21766
    > sort_by_column_NA(data,1)
          cities countries areakm2 populationk
    2    Beijing     China    1368       21500
    4         LA       USA    1302        3884
    6 Manchester        UK     116         255
    1   Shanghai     China    2643       21766
--------------------------------------------------------------------------------
/sortsubset.md:
--------------------------------------------------------------------------------
# How to order only a specific subset of a data frame

Let's read cities.csv.

If our purpose is to see which is the largest city in China between Shanghai and Beijing, we might not be interested
in the other cities' areas.
To do so we subset our initial data frame, and then we order it by the criterion we prefer.
It doesn't make much sense with 2 items, but when we are analysing thousands of items it's far more useful.

    ## The grep() function finds the character string (e.g. "China") in the
    ## data$countries factor, and returns a vector of indexes.
    ## > data$countries
    ## [1] China China USA   USA   UK    UK
    ## Levels: China UK USA
    ## > grep("China",data$countries)
    ## [1] 1 2
    ## We then subset the main data frame, data, by these indexes
    ## > data [grep("China",data$countries),]
    ##     cities countries areakm2 populationk
    ## 1 Shanghai     China    2643       21766
    ## 2  Beijing     China    1368       21500
    sort_country <- function (data, country, column){
        countrydata <- data [grep(country,data$countries),]
        orderdata <- countrydata[order(countrydata[,column]),]
        return (orderdata)
    }

Examples:

    > sort_country(data, "USA", 4)
      cities countries areakm2 populationk
    4     LA       USA    1302        3884
    3    NYC       USA    1214        8406
    > sort_country(data, "UK", 1)
          cities countries areakm2 populationk
    5     London        UK    1737        9789
    6 Manchester        UK     116         255
--------------------------------------------------------------------------------
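One caveat about the grep() approach used in sort_country(): grep() matches its pattern anywhere inside each value, so a country whose name is contained in another country's name would select extra rows. Below is a minimal sketch of an exact-match variant, assuming the same column layout as cities.csv. The function name sort_country_exact and the inline data frame are made up for illustration and are not part of the tutorial files.

```r
# Small stand-in for cities.csv so the sketch runs on its own
# (assumption: same columns, text columns stored as factors).
data <- data.frame(
  cities      = c("Shanghai", "Beijing", "NYC", "LA", "London", "Manchester"),
  countries   = c("China", "China", "USA", "USA", "UK", "UK"),
  areakm2     = c(2643, 1368, 1214, 1302, 1737, 116),
  populationk = c(21766, 21500, 8406, 3884, 9789, 255),
  stringsAsFactors = TRUE
)

# Hypothetical variant of sort_country(): == compares whole values,
# whereas grep() also matches substrings of the country names.
sort_country_exact <- function(data, country, column) {
  countrydata <- data[data$countries == country, ]
  countrydata[order(countrydata[, column]), ]
}

result <- sort_country_exact(data, "USA", 4)
print(result)
```

With this variant a partial name such as "U" selects no rows, while grep("U", data$countries) would pick up both the USA and the UK rows.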