├── README.md
├── cities.csv
├── citiesNA.csv
├── rankcity.md
├── rankeachsubset.md
├── sortcolumns.md
├── sortna.md
└── sortsubset.md


/README.md:
--------------------------------------------------------------------------------
 1 | # Assignment 3: tutorial
 2 | ## Series of functions that show concepts useful to complete Assignment 3
 3 | 
 4 | Assignment 3 requires managing a large database, ranking hospitals by different factors, and applying some complex criteria.
 5 | To make the principles easier to understand, I created smaller data frames and functions.
 6 | There are plenty of ways to solve Assignment 3, and what follows are just my suggestions. I'm sure there are better ways to do it.
 7 | 
 8 | To run the functions below, you need to download and read these two small csv files:
 9 | [cities.csv](cities.csv)
10 | [citiesNA.csv](citiesNA.csv)
11 | 
12 | 1) How to order a data frame by the values in its columns
13 | 
14 | [sort_by_column() and sort_by_columns()](https://github.com/DanieleP/PA3-tutorial/blob/master/sortcolumns.md)
15 | 
16 | 2) How to manage NA when ordering data frames
17 | 
18 | [sort_by_column_NA()](https://github.com/DanieleP/PA3-tutorial/blob/master/sortna.md)
19 | 
20 | 3) How to order only a specific subset of a data frame
21 | 
22 | [sort_country()](https://github.com/DanieleP/PA3-tutorial/blob/master/sortsubset.md)
23 | 
24 | 4) How to return a specific ranking after ordering
25 | 
26 | [find_city_rank() and find_last_city()](https://github.com/DanieleP/PA3-tutorial/blob/master/rankcity.md)
27 | 
28 | 5) How to return a data frame of specific rankings for each subset
29 | 
30 | [rank_by_country()](https://github.com/DanieleP/PA3-tutorial/blob/master/rankeachsubset.md)
31 | 
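The functions in this tutorial call `levels()` on the text columns, so they assume that `read.csv()` returned factors. That was the default before R 4.0.0; from R 4.0.0 onwards text columns stay character unless `stringsAsFactors = TRUE` is passed. Below is a minimal sketch of reading the data so that the examples behave as shown. The inline `csv_text` string is only a stand-in for cities.csv, used here to keep the snippet self-contained.

```r
# Stand-in for cities.csv so the snippet runs without the file on disk.
csv_text <- "cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,1214,8406
LA,USA,1302,3884
London,UK,1737,9789
Manchester,UK,116,255"

# stringsAsFactors = TRUE reproduces the pre-4.0.0 default that the
# tutorial functions rely on (levels() returns NULL for character vectors).
data <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(data$countries)   # "factor"
levels(data$countries)  # "China" "UK" "USA"
```

With the real file on disk, the equivalent call would be `read.csv("cities.csv", stringsAsFactors = TRUE)`.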


--------------------------------------------------------------------------------
/cities.csv:
--------------------------------------------------------------------------------
1 | cities,countries,areakm2,populationkShanghai,China,2643,21766Beijing,China,1368,21500NYC,USA,1214,8406LA,USA,1302,3884London,UK,1737,9789Manchester,UK,116,255


--------------------------------------------------------------------------------
/citiesNA.csv:
--------------------------------------------------------------------------------
1 | cities,countries,areakm2,populationkShanghai,China,2643,21766Beijing,China,1368,21500NYC,USA,Unknown,8406LA,USA,1302,3884London,UK,1737,UnknownManchester,UK,116,255


--------------------------------------------------------------------------------
/rankcity.md:
--------------------------------------------------------------------------------
 1 | # How to return a specific ranking after ordering
 2 | 
 3 | Working on cities.csv, we want to know which is the second biggest city in our database.
 4 | We are not interested in an ordered data frame; we just want the name of the city.
 5 | 
 6 |     ## The argument decreasing = TRUE reverses the sort order: numbers go from biggest to smallest and
 7 |     ## characters from Z to A. This is helpful when we consider rank #1 to be the biggest city.
 8 |     ## as.character() returns the name of the city as a character vector. If we just returned
 9 |     ## orderdata[rank,1], we would get a factor instead.
10 |     find_city_rank <- function(data,column,rank){
11 |         orderdata <- data[order(decreasing = TRUE,data[,column]),]
12 |         return(as.character(orderdata[rank,1]))
13 |     }
14 |     
15 | Examples:
16 | 
17 |     > find_city_rank (data,3,1)
18 |     [1] "Shanghai"
19 |     > find_city_rank (data,3,2)
20 |     [1] "London"
21 |     > find_city_rank (data,4,2)
22 |     [1] "Beijing"
23 |     
24 | Now let's consider the case in which we don't know the length of the csv file and we just want to get the
25 | last city in the ranking.
26 | 
27 |     ## nrow() returns the number of rows of a data frame. We use this function to determine the index of the last item.
28 |     find_last_city <- function(data,column){
29 |         orderdata <- data[order(decreasing = TRUE,data[,column]),]
30 |         return(as.character(orderdata[nrow(orderdata),1]))
31 |     }
32 |     
33 | Examples:
34 | 
35 |     > find_last_city (data,3)
36 |     [1] "Manchester"
37 |     > find_last_city (data,1)
38 |     [1] "Beijing"
39 |   
40 | Note that in the last example Beijing is last in the alphabetical ranking. This is because decreasing = TRUE sorts characters from Z to A.
41 | 
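One edge case worth checking is a rank larger than the number of rows. Indexing a data frame with a row number that does not exist yields NA, so `find_city_rank()` returns NA rather than failing. Below is a small sketch of that behaviour, assuming the data are read with `stringsAsFactors = TRUE`. The inline `csv_text` string is a stand-in for cities.csv.

```r
# Stand-in for cities.csv so the snippet runs without the file on disk.
csv_text <- "cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,1214,8406
LA,USA,1302,3884
London,UK,1737,9789
Manchester,UK,116,255"
data <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Same logic as the function above.
find_city_rank <- function(data, column, rank) {
  orderdata <- data[order(data[, column], decreasing = TRUE), ]
  as.character(orderdata[rank, 1])
}

in_range  <- find_city_rank(data, 3, 2)   # second largest area
out_range <- find_city_rank(data, 3, 10)  # only 6 rows exist, so this is NA
is.na(out_range)
```

If a missing rank should be an error instead, the function could compare `rank` with `nrow(data)` before subsetting.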


--------------------------------------------------------------------------------
/rankeachsubset.md:
--------------------------------------------------------------------------------
 1 | # How to return a data frame of specific rankings for each subset
 2 | 
 3 | Considering cities.csv, in this case we want to know which is the second city by size in each country, without knowing the
 4 | size of our database or the number of countries. We want a data frame with these cities as the final output.
 5 | 
 6 |     rank_by_country <- function(data,column,rank){
 7 |         ## We save the levels of column 2, the countries' names, in the countries vector
 8 |         countries <- levels(data[,2])
 9 |         ## We generate an empty vector that we will fill later, row by row, to generate our final output
10 |         output <- vector()
11 |         ## For loop to get the right data on each city. length(countries) is the number of different countries in our
12 |         ## database. In our case we have 3 countries: China, UK, USA.
13 |         for (i in 1:length(countries)) {
14 |             ## countrydata subsets data by the considered country
15 |             countrydata <- data [grep(countries[i],data$countries),]
16 |             orderdata <- countrydata[order(decreasing = TRUE, countrydata[,column]),]
17 |             ## append() adds elements at the end of a vector. We want to add the name of the city [rank,1],
18 |             ## the areakm2 [rank,3] and the populationk [rank,4]. We don't add the name of the country, because it
19 |             ## will be the label of the row.
20 |             output <- append (output, as.character(orderdata[rank,1]))
21 |             for (l in 3:4){
22 |                 output <- append (output, as.character(orderdata[rank,l]))
23 |             }
24 |         }
25 |         ## Because it's simpler to generate a matrix than a data frame, I generate the matrix first and convert it
26 |         ## to a data frame immediately after.
27 |         output <- as.data.frame(matrix(output,length(countries),3, byrow = TRUE))
28 |         ## The column names will be "cities", "areakm2" and "populationk". The row names are the countries.
29 |         colnames(output) <- c("cities","areakm2","populationk")
30 |         rownames(output) <- countries
31 |         return(output)
32 |     }
33 |     
34 | Examples:
35 | 
36 |     > rank_by_country(data,3,1)
37 |             cities areakm2 populationk
38 |     China Shanghai    2643       21766
39 |     UK      London    1737        9789
40 |     USA         LA    1302        3884
41 |     > rank_by_country(data,3,2)
42 |               cities areakm2 populationk
43 |     China    Beijing    1368       21500
44 |     UK    Manchester     116         255
45 |     USA          NYC    1214        8406
46 | 
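The same result can also be reached without building a vector by hand. Below is a sketch of an alternative that uses `split()` and `lapply()`. The name `rank_by_country_split` is hypothetical, and the inline `csv_text` string stands in for cities.csv. One difference from the function above: the numeric columns stay numeric, because the values never pass through a character matrix.

```r
# Stand-in for cities.csv so the snippet runs without the file on disk.
csv_text <- "cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,1214,8406
LA,USA,1302,3884
London,UK,1737,9789
Manchester,UK,116,255"
data <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Hypothetical alternative to rank_by_country().
rank_by_country_split <- function(data, column, rank) {
  # split() returns a named list with one data frame per country.
  pieces <- split(data, data$countries)
  # For each country, order by the chosen column and keep the requested row.
  picked <- lapply(pieces, function(d) {
    d <- d[order(d[, column], decreasing = TRUE), ]
    d[rank, c("cities", "areakm2", "populationk")]
  })
  # Stack the one-row data frames and label the rows with the countries.
  output <- do.call(rbind, picked)
  rownames(output) <- names(pieces)
  output
}

second <- rank_by_country_split(data, 3, 2)
second
```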


--------------------------------------------------------------------------------
/sortcolumns.md:
--------------------------------------------------------------------------------
 1 | # How to order a data frame by the values in its columns
 2 | 
 3 | Let's first read the cities.csv file:
 4 | 
 5 |     > data <- read.csv("cities.csv")
 6 |     > data
 7 |           cities countries areakm2 populationk
 8 |     1   Shanghai     China    2643       21766
 9 |     2    Beijing     China    1368       21500
10 |     3        NYC       USA    1214        8406
11 |     4         LA       USA    1302        3884
12 |     5     London        UK    1737        9789
13 |     6 Manchester        UK     116         255
14 |     > class(data)
15 |     [1] "data.frame"
16 | 
17 | As we can see, data is a data.frame with 6 rows and 4 columns, holding both character and numeric values.
18 | Below is the function to order the data frame by column:
19 | 
20 |     ## orderdata: output data.frame with the ordered rows
21 |     ## order(): sorts the values in increasing order by default: numbers
22 |     ## from the smallest to the biggest, characters from A to Z. This
23 |     ## function returns a vector of row indexes in sorted order.
24 |     ## > order(data[,1])
25 |     ## [1] 2 4 5 6 3 1
26 |     ## data[order(),] subsets the data frame using the indexes above
27 |     sort_by_column <- function (data, column){
28 |         orderdata <- data[order(data[,column]),]
29 |         return(orderdata)
30 |     }
31 | 
32 | Examples:
33 | 
34 |     > sort_by_column (data,1)
35 |           cities countries areakm2 populationk
36 |     2    Beijing     China    1368       21500
37 |     4         LA       USA    1302        3884
38 |     5     London        UK    1737        9789
39 |     6 Manchester        UK     116         255
40 |     3        NYC       USA    1214        8406
41 |     1   Shanghai     China    2643       21766
42 |     > sort_by_column (data,3)
43 |           cities countries areakm2 populationk
44 |     6 Manchester        UK     116         255
45 |     3        NYC       USA    1214        8406
46 |     4         LA       USA    1302        3884
47 |     2    Beijing     China    1368       21500
48 |     5     London        UK    1737        9789
49 |     1   Shanghai     China    2643       21766
50 | 
51 | In case of ties, we can pass a second vector to order() to act as a second sorting criterion:
52 | 
53 |     sort_by_columns <- function (data, col1, col2){
54 |         orderdata <- data[order(data[,col1],data[,col2]),]
55 |         return(orderdata)
56 |     }
57 | 
58 | Examples:
59 | 
60 |     > sort_by_columns (data,2,3)
61 |           cities countries areakm2 populationk
62 |     2    Beijing     China    1368       21500
63 |     1   Shanghai     China    2643       21766
64 |     6 Manchester        UK     116         255
65 |     5     London        UK    1737        9789
66 |     3        NYC       USA    1214        8406
67 |     4         LA       USA    1302        3884
68 |     > sort_by_columns (data,2,1)
69 |           cities countries areakm2 populationk
70 |     2    Beijing     China    1368       21500
71 |     1   Shanghai     China    2643       21766
72 |     5     London        UK    1737        9789
73 |     6 Manchester        UK     116         255
74 |     4         LA       USA    1302        3884
75 |     3        NYC       USA    1214        8406
76 | 
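The two keys do not have to share a direction. Below is a small sketch, not part of the functions above, that sorts countries from A to Z and, within each country, puts the largest area first by negating the numeric key. The inline `csv_text` string stands in for cities.csv.

```r
# Stand-in for cities.csv so the snippet runs without the file on disk.
csv_text <- "cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,1214,8406
LA,USA,1302,3884
London,UK,1737,9789
Manchester,UK,116,255"
data <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# First key: countries in increasing order (A to Z).
# Second key: areakm2 negated, so larger areas come first within a country.
mixed <- data[order(data$countries, -data$areakm2), ]
mixed
```

The minus sign works only for numeric keys, so a character or factor key cannot be reversed this way.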


--------------------------------------------------------------------------------
/sortna.md:
--------------------------------------------------------------------------------
 1 | # How to manage NA when ordering data frames
 2 | 
 3 | Let's read the citiesNA.csv file:
 4 | 
 5 |     > data <- read.csv("citiesNA.csv")
 6 |     > data
 7 |           cities countries areakm2 populationk
 8 |     1   Shanghai     China    2643       21766
 9 |     2    Beijing     China    1368       21500
10 |     3        NYC       USA Unknown        8406
11 |     4         LA       USA    1302        3884
12 |     5     London        UK    1737     Unknown
13 |     6 Manchester        UK     116         255
14 |     > class(data)
15 |     [1] "data.frame"
16 | 
17 | In this case the areakm2 and populationk columns contain some character data ("Unknown") that we want to treat as NA.
18 | Below is the function to order the data frame by column, considering only complete rows with no NA values.
19 | This means that we will exclude NYC and London from our data frame.
20 | 
21 |     ## Subsetting data by column, we get a factor:
22 |     ## > class(data[,2])
23 |     ## [1] "factor"
24 |     ## One way to extract a vector from the factor is by subsetting it by its levels.
25 |     ## levels(data[,2]) returns a vector of the levels:
26 |     ## [1] "China" "UK"    "USA"
27 |     ## levels(data[,2])[data[,2]] returns a vector with the content of [data[,2]]
28 |     ## [1] "China" "China" "USA"   "USA"   "UK"    "UK"  
29 |     ## data[,2] would return a factor, that for our purposes is harder to handle
30 |     ## [1] China China USA   USA   UK    UK   
31 |     ## Levels: China UK USA
32 |     ## suppressWarnings() silences the warnings from R. When we coerce a mix of numeric and character values
33 |     ## into a numeric vector, text automatically becomes NA, but it's a forced coercion and R sends a warning.
34 |     ## This is the case for our columns 3 and 4, where "Unknown" becomes NA.
35 |     ## complete.cases() returns a logical vector that is TRUE for the rows without any NA. By subsetting the
36 |     ## data frame with this vector we get a data frame with only complete cases.
37 |     sort_by_column_NA <- function(data,column){
38 |         for (i in 3:4){
39 |             data[,i] <- suppressWarnings(as.numeric(levels(data[,i])[data[,i]]))
40 |         }
41 |         orderdata <- data[order(data[,column]),]
42 |         orderdata <- orderdata[complete.cases(orderdata),] 
43 |         return(orderdata)
44 |     }
45 | 
46 | Examples:
47 | 
48 |     > sort_by_column_NA(data,3)
49 |           cities countries areakm2 populationk
50 |     6 Manchester        UK     116         255
51 |     4         LA       USA    1302        3884
52 |     2    Beijing     China    1368       21500
53 |     1   Shanghai     China    2643       21766
54 |     > sort_by_column_NA(data,1)
55 |           cities countries areakm2 populationk
56 |     2    Beijing     China    1368       21500
57 |     4         LA       USA    1302        3884
58 |     6 Manchester        UK     116         255
59 |     1   Shanghai     China    2643       21766
60 | 
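Another way to reach the same point is to declare the placeholder when reading the file. Below is a sketch, assuming "Unknown" is the only marker for missing values. `read.csv()` accepts an `na.strings` argument, and `order()` accepts `na.last`. The inline `csv_text_na` string stands in for citiesNA.csv.

```r
# Stand-in for citiesNA.csv so the snippet runs without the file on disk.
csv_text_na <- "cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,Unknown,8406
LA,USA,1302,3884
London,UK,1737,Unknown
Manchester,UK,116,255"

# "Unknown" is read as NA, so columns 3 and 4 are numeric from the start.
data_na <- read.csv(text = csv_text_na, na.strings = "Unknown")

# na.last = NA drops the rows whose sort key is NA (here only NYC).
by_area <- data_na[order(data_na$areakm2, na.last = NA), ]

# complete.cases() additionally drops rows with NA in any other column (London).
complete <- by_area[complete.cases(by_area), ]
complete
```

Note the difference between the two steps: `na.last = NA` looks only at the sort key, while `complete.cases()` looks at every column.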


--------------------------------------------------------------------------------
/sortsubset.md:
--------------------------------------------------------------------------------
 1 | # How to order only a specific subset of a data frame
 2 | 
 3 | Let's read cities.csv.
 4 | 
 5 | If our goal is to see which city in China is larger, Shanghai or Beijing, we might not be interested
 6 | in the other cities' areas.
 7 | To do so, we subset our initial data frame and then order it by the criterion we prefer.
 8 | It doesn't make much sense with 2 items, but it is far more useful when we are analysing thousands of items.
 9 | 
10 |     ## The grep function searches for a pattern (e.g. "China") in the
11 |     ## data$countries factor, and returns a vector of the matching indexes.
12 |     ## > data$countries
13 |     ## [1] China China USA   USA   UK    UK   
14 |     ## Levels: China UK USA
15 |     ## > grep("China",data$countries)
16 |     ## [1] 1 2
17 |     ## We then subset the main data frame, data, by these indexes
18 |     ## > data [grep("China",data$countries),]
19 |     ##     cities countries areakm2 populationk
20 |     ## 1 Shanghai     China    2643       21766
21 |     ## 2  Beijing     China    1368       21500
22 |     sort_country <- function (data, country, column){
23 |       countrydata <- data [grep(country,data$countries),]
24 |       orderdata <- countrydata[order(countrydata[,column]),]
25 |       return (orderdata)
26 |     }
27 | 
28 | Examples:
29 | 
30 |     > sort_country(data, "USA", 4)
31 |       cities countries areakm2 populationk
32 |     4     LA       USA    1302        3884
33 |     3    NYC       USA    1214        8406
34 |     > sort_country(data, "UK", 1)
35 |           cities countries areakm2 populationk
36 |     5     London        UK    1737        9789
37 |     6 Manchester        UK     116         255
38 | 
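Keep in mind that `grep()` treats its first argument as a regular expression and matches substrings, so a short pattern can select more than one country. Below is a sketch of an exact-match variant. The name `sort_country_exact` is hypothetical, and the inline `csv_text` string stands in for cities.csv.

```r
# Stand-in for cities.csv so the snippet runs without the file on disk.
csv_text <- "cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,1214,8406
LA,USA,1302,3884
London,UK,1737,9789
Manchester,UK,116,255"
data <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# "U" occurs in both "USA" and "UK", so grep() returns rows 3 to 6.
loose <- grep("U", data$countries)

# Hypothetical variant of sort_country() that compares with == instead,
# so only rows whose country equals the argument exactly are kept.
sort_country_exact <- function(data, country, column) {
  countrydata <- data[data$countries == country, ]
  countrydata[order(countrydata[, column]), ]
}

uk <- sort_country_exact(data, "UK", 3)
uk
```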


--------------------------------------------------------------------------------