├── Data
    ├── README
    ├── C_elegans_DAPI.xlsx
    ├── Calvin_cycle_edges.xlsx
    ├── Calvin_cycle_nodes.xlsx
    ├── Example_network_edges.xlsx
    ├── LU-Matand - Stem Data.xlsx
    ├── dataset2.csv
    ├── dataset1.csv
    └── USDA_2022_matrix.csv
├── Results
    ├── README
    ├── 08_A.png
    ├── 08_A1.png
    ├── 08_A2.png
    ├── 08_A3.png
    ├── 08_B.png
    ├── 08_C.png
    ├── 03_bar.png
    ├── 04_bar.png
    ├── 04_box.png
    ├── 04_dots.png
    ├── 05_dot1.png
    ├── 05_dot2.png
    ├── 05_pie1.png
    ├── 06_abc_1.png
    ├── 06_abc_2.png
    ├── 08_A_1.png
    ├── 08_A_2.png
    ├── 05_stack1.png
    ├── 05_stack2.png
    ├── 06_genes_1.png
    ├── 06_genes_2.png
    ├── 06_genes_3.png
    ├── 03_line_plot.png
    ├── 03_scatter_1.png
    ├── 03_scatter_2.png
    ├── 03_scatter_3.png
    ├── 03_scatter_4.png
    ├── 03_scatter_5.png
    ├── 04_multi_bar.png
    ├── 04_multi_dots.png
    ├── 08_assembled.png
    ├── 02_scatter_plot.png
    ├── 04_dots_violin.png
    ├── 07_my_network1.png
    ├── 07_my_network2.png
    ├── 07_my_network3.png
    ├── 07_small_network1.png
    └── 08_assembled_label.png
├── Scripts
    ├── README
    ├── 05_Intro_proportions.Rmd
    ├── 06_Intro_to_heatmaps.Rmd
    ├── 08_Intro_plot_assembly.Rmd
    ├── 03_Intro_to_data_vis.Rmd
    ├── 01_Intro_to_R.Rmd
    ├── 04_Intro_mean_separation.Rmd
    └── 07_Intro_elationship_data.Rmd
├── Lessons
    ├── README
    ├── 05_Intro_proportions.md
    ├── 08_Intro_plot_assembly.md
    ├── 06_Intro_to_heatmaps.md
    ├── 01_Intro_to_R.md
    ├── 03_Intro_to_data_vis.md
    ├── 04_Intro_mean_separation.md
    └── 07_Intro_elationship_data.md
├── Answer_key
    ├── readme
    ├── Lesson3_answer1.png
    ├── Lesson3_answer2.png
    ├── Lesson4_answer1.png
    ├── Lesson4_answer2.png
    ├── Lesson5_answer1.png
    ├── Lesson7_answer1.png
    ├── Lesson7_answer2.png
    ├── Lesson8_answer1.png
    ├── 05_Intro_proportions_answer_key.Rmd
    ├── 01_Intro_to_R_answer_key.Rmd
    ├── 06_Intro_to_heatmaps_answer_key.Rmd
    ├── 03_Intro_to_data_vis_answer_key.Rmd
    ├── 07_Intro_relationship_data_answer_key.Rmd
    ├── 08_Intro_plot_assembly_answer_key.Rmd
    ├── 04_Intro_mean_separation_answer_key.Rmd
    └── 02_Intro_to_tidy_data_answer_key.Rmd
├── Graphical_abstract_2023_01_14.png
├── LICENSE
└── README.md


/Data/README:
--------------------------------------------------------------------------------
1 | Data files for the course will be here. 
2 | 


--------------------------------------------------------------------------------
/Results/README:
--------------------------------------------------------------------------------
1 | Results from the R scripts will be here. 
2 | 


--------------------------------------------------------------------------------
/Scripts/README:
--------------------------------------------------------------------------------
1 | .Rmd scripts for the course will be here. 
2 | 


--------------------------------------------------------------------------------
/Lessons/README:
--------------------------------------------------------------------------------
1 | Markdown formatted text for the lessons will be here. 
2 | 


--------------------------------------------------------------------------------
/Answer_key/readme:
--------------------------------------------------------------------------------
1 | Answer keys for the homework questions are available in this folder. 
2 | 


--------------------------------------------------------------------------------
/Results/08_A.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A.png


--------------------------------------------------------------------------------
/Results/08_A1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A1.png


--------------------------------------------------------------------------------
/Results/08_A2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A2.png


--------------------------------------------------------------------------------
/Results/08_A3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A3.png


--------------------------------------------------------------------------------
/Results/08_B.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_B.png


--------------------------------------------------------------------------------
/Results/08_C.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_C.png


--------------------------------------------------------------------------------
/Results/03_bar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_bar.png


--------------------------------------------------------------------------------
/Results/04_bar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_bar.png


--------------------------------------------------------------------------------
/Results/04_box.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_box.png


--------------------------------------------------------------------------------
/Results/04_dots.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_dots.png


--------------------------------------------------------------------------------
/Results/05_dot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_dot1.png


--------------------------------------------------------------------------------
/Results/05_dot2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_dot2.png


--------------------------------------------------------------------------------
/Results/05_pie1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_pie1.png


--------------------------------------------------------------------------------
/Results/06_abc_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_abc_1.png


--------------------------------------------------------------------------------
/Results/06_abc_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_abc_2.png


--------------------------------------------------------------------------------
/Results/08_A_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A_1.png


--------------------------------------------------------------------------------
/Results/08_A_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A_2.png


--------------------------------------------------------------------------------
/Results/05_stack1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_stack1.png


--------------------------------------------------------------------------------
/Results/05_stack2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_stack2.png


--------------------------------------------------------------------------------
/Results/06_genes_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_genes_1.png


--------------------------------------------------------------------------------
/Results/06_genes_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_genes_2.png


--------------------------------------------------------------------------------
/Results/06_genes_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_genes_3.png


--------------------------------------------------------------------------------
/Data/C_elegans_DAPI.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/C_elegans_DAPI.xlsx


--------------------------------------------------------------------------------
/Results/03_line_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_line_plot.png


--------------------------------------------------------------------------------
/Results/03_scatter_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_1.png


--------------------------------------------------------------------------------
/Results/03_scatter_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_2.png


--------------------------------------------------------------------------------
/Results/03_scatter_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_3.png


--------------------------------------------------------------------------------
/Results/03_scatter_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_4.png


--------------------------------------------------------------------------------
/Results/03_scatter_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_5.png


--------------------------------------------------------------------------------
/Results/04_multi_bar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_multi_bar.png


--------------------------------------------------------------------------------
/Results/04_multi_dots.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_multi_dots.png


--------------------------------------------------------------------------------
/Results/08_assembled.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_assembled.png


--------------------------------------------------------------------------------
/Results/02_scatter_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/02_scatter_plot.png


--------------------------------------------------------------------------------
/Results/04_dots_violin.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_dots_violin.png


--------------------------------------------------------------------------------
/Results/07_my_network1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/07_my_network1.png


--------------------------------------------------------------------------------
/Results/07_my_network2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/07_my_network2.png


--------------------------------------------------------------------------------
/Results/07_my_network3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/07_my_network3.png


--------------------------------------------------------------------------------
/Answer_key/Lesson3_answer1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson3_answer1.png


--------------------------------------------------------------------------------
/Answer_key/Lesson3_answer2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson3_answer2.png


--------------------------------------------------------------------------------
/Answer_key/Lesson4_answer1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson4_answer1.png


--------------------------------------------------------------------------------
/Answer_key/Lesson4_answer2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson4_answer2.png


--------------------------------------------------------------------------------
/Answer_key/Lesson5_answer1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson5_answer1.png


--------------------------------------------------------------------------------
/Answer_key/Lesson7_answer1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson7_answer1.png


--------------------------------------------------------------------------------
/Answer_key/Lesson7_answer2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson7_answer2.png


--------------------------------------------------------------------------------
/Answer_key/Lesson8_answer1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson8_answer1.png


--------------------------------------------------------------------------------
/Data/Calvin_cycle_edges.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/Calvin_cycle_edges.xlsx


--------------------------------------------------------------------------------
/Data/Calvin_cycle_nodes.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/Calvin_cycle_nodes.xlsx


--------------------------------------------------------------------------------
/Results/07_small_network1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/07_small_network1.png


--------------------------------------------------------------------------------
/Results/08_assembled_label.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_assembled_label.png


--------------------------------------------------------------------------------
/Data/Example_network_edges.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/Example_network_edges.xlsx


--------------------------------------------------------------------------------
/Data/LU-Matand - Stem Data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/LU-Matand - Stem Data.xlsx


--------------------------------------------------------------------------------
/Data/dataset2.csv:
--------------------------------------------------------------------------------
1 | "t1p1","t1p2","t2p1","t2p2","outcome"
2 | 75,80,90,90,"pass"
3 | 15,15,5,4,"fail"
4 | 10,5,5,6,"ambiguous"
5 | 


--------------------------------------------------------------------------------
/Graphical_abstract_2023_01_14.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Graphical_abstract_2023_01_14.png


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 C. Li
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/Answer_key/05_Intro_proportions_answer_key.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "05_Intro_proportions_ANSWER_KEY"
 3 | author: "Chenxin Li"
 4 | date: "2023-01-14"
 5 | output: html_notebook 
 6 | ---
 7 | 
 8 | ```{r setup, include=FALSE}
 9 | knitr::opts_chunk$set(echo = TRUE)
10 | ```
11 | 
12 | # Packages 
13 | ```{r}
14 | library(tidyverse)
15 | library(readxl)
16 | library(RColorBrewer)
17 | ```
18 |  
19 | 
20 | # Exercise 
21 | As an exercise, let's visualize this example. 
22 | We have two groups, each containing 4 categories. 
23 | ```{r}
24 | group1 <- data.frame(
25 |   "Type" = c("Type I", "Type II", "Type III", "Type IV"),
26 |   "Percentage" = c(15, 35, 30, 20)
27 | )
28 | group2 <- data.frame(
29 |   "Type" = c("Type I", "Type II", "Type III", "Type IV"),
30 |   "Percentage" = c(10, 25, 35, 30)
31 | )
32 | ```
33 | 
34 | ```{r}
35 | group1_and_group2 <- rbind(
36 |   group1 %>% 
37 |     mutate(group = "group1"), 
38 |   group2 %>% 
39 |     mutate(group = "group2")
40 | )
41 | 
42 | head(group1_and_group2)
43 | ```
44 | 
45 | ```{r}
46 | group1_and_group2 %>% 
47 |   ggplot(aes(x = group, y = Percentage, fill = Type)) +  
48 |   geom_bar(stat = "identity", width = 0.5) +
49 |   scale_fill_manual(values = brewer.pal(8, "Accent"))+
50 |   theme_classic()
51 | 
52 | ggsave("Lesson5_answer1.png", width = 3, height = 3)
53 | ```
54 | 
55 | Use `ggsave` to save your visualization. 
56 | 
57 | Hint: look in this lesson to see what I did to combine two entities into one data frame while giving each a unique identifier. 
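
As an alternative to the `rbind()` plus `mutate()` pattern used in the answer above, the same identifier column can be produced in one call with `dplyr::bind_rows()` and its `.id` argument. This is only a sketch of another route, not necessarily the approach taught in the lesson; the list names `group1` and `group2` become the values of the `group` column, and the object name `combined` is chosen here for illustration.

```r
library(dplyr)

group1 <- data.frame(
  Type = c("Type I", "Type II", "Type III", "Type IV"),
  Percentage = c(15, 35, 30, 20)
)
group2 <- data.frame(
  Type = c("Type I", "Type II", "Type III", "Type IV"),
  Percentage = c(10, 25, 35, 30)
)

# The names of the list elements are written into the new "group" column.
combined <- bind_rows(list(group1 = group1, group2 = group2), .id = "group")

combined
```

The resulting data frame can be passed to the same `ggplot()` call as `group1_and_group2`, since it carries the `group`, `Type`, and `Percentage` columns.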


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # "Quick data vis"
 2 | [![DOI](https://zenodo.org/badge/876821586.svg)](https://doi.org/10.5281/zenodo.13973004)
 3 | 
 4 | This is an 8-part quick course in data visualization using R. 
 5 | 
 6 | **Graphical Abstract** 
 7 | 
 8 | ![Graphical Abstract](https://github.com/cxli233/Quick_data_vis/blob/main/Graphical_abstract_2023_01_14.png)
 9 | 
10 | * Author: Chenxin Li, Ph.D., Assistant Professor at the Department of Plant Biology, Michigan State University.
11 | * Contact: lichen27@msu.edu | [@chenxinli2.bsky.social](https://bsky.app/profile/chenxinli2.bsky.social)
12 | 
13 | # Browse lessons using an internet browser 
14 | 1. Navigate to [Quick_data_vis/Lessons/](https://github.com/cxli233/Quick_data_vis/tree/main/Lessons)
15 | 2. Click on any .md files
16 | 3. GitHub will render the file as a formatted HTML page. 
17 | 
18 | # Content
19 | This repository has 8 activities: 
20 | 1. Very basics of R coding
21 | 2. Introduction to tidy data frames
22 | 3. Introduction to data visualization using ggplot 
23 | 4. Introduction to mean separation 
24 | 5. Introduction to proportional data 
25 | 6. Introduction to heatmaps
26 | 7. Introduction to relationship data and networks
27 | 8. Introduction to plot composition/assembly 
28 | 
29 | # Required software:
30 | 
31 | * R: [R Download](https://cran.r-project.org/bin/)
32 | * RStudio: [RStudio Download](https://www.rstudio.com/products/rstudio/download/)
33 | * rmarkdown can be installed using the install packages interface in RStudio.
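
A quick way to see which packages still need installing is sketched below. The package list is an assumption drawn from the `library()` calls in the answer-key scripts; other lessons may load additional packages, and the names `needed` and `missing_pkgs` are chosen here for illustration.

```r
# Packages loaded by the answer-key scripts (assumed list; other lessons
# may require more).
needed <- c("rmarkdown", "tidyverse", "readxl", "RColorBrewer", "viridis")

# requireNamespace() returns FALSE, rather than raising an error, when a
# package is not installed.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Print the command to run instead of installing automatically.
if (length(missing_pkgs) > 0) {
  message(
    "Run: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
} else {
  message("All listed packages are installed.")
}
```

Printing the command rather than calling `install.packages()` directly leaves the choice of when to install with the reader.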
34 | 
35 | # Getting started
36 | 1. [Clone](https://github.com/cxli233/Quick_data_vis.git) the repository to your machine, or download it as a zip file.
37 | 2. If you downloaded the zip file, unzip it and move the folder to whichever location you like on your machine.
38 | 3. Open RStudio.  
39 | 4. Open .Rmd files under the Scripts folder. 
40 | 5. Enjoy! 
41 | 
42 | 
43 | 
44 | 


--------------------------------------------------------------------------------
/Answer_key/01_Intro_to_R_answer_key.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "Very basics of R coding ANSWER KEY"
 3 | author: "Chenxin Li"
 4 | date: "01/06/2023"
 5 | output:
 6 |   html_notebook:
 7 |     number_sections: yes
 8 |     toc: yes
 9 |     toc_float: yes
10 | ---
11 | 
12 | ```{r setup, include=FALSE}
13 | knitr::opts_chunk$set(echo = TRUE)
14 | ```
15 | 
16 | # Packages 
17 | ```{r}
18 | library(dplyr) 
19 | ```
20 | 
21 | # Exercise
22 | 
23 | Today you have learned some basic syntax of R.
24 | Now it's time for you to practice.
25 | 
26 | ## Q1:
27 | 
28 | 1.  Insert a new code chunk
29 | 2.  Make this matrix
30 | 
31 | | 1  | 1 | 2 | 2 |
32 | | 2  | 2 | 1 | 2 |
33 | | 2  | 3 | 3 | 4 |
34 | | 1  | 2 | 3 | 4 |
35 | 
36 | and save it as an item called `my_mat2`.
37 | ```{r}
38 | my_mat2 <- rbind(
39 |   c(1, 1, 2, 2),
40 |   c(2, 2, 1, 2),
41 |   c(2, 3, 3, 4),
42 |   c(1, 2, 3, 4)
43 |   )
44 | 
45 | my_mat2
46 | ```
47 | 
48 | 
49 | 3. Select the 1st and 3rd rows and the 1st, 2nd and 4th columns, and save it as an item.
50 | ```{r}
51 | item <- my_mat2[c(1,3), c(1,2,4)]   
52 | item
53 | ```
54 | 
55 | 4. Take the square root for each member of my_mat2, then take log2(), and lastly find the maximum value.
56 | Use the pipe syntax. The command for maximum is `max()`.
57 | 
58 | ```{r}
59 | my_mat2 %>%
60 |   sqrt() %>% 
61 |   log2() %>% 
62 |   max() 
63 | ```
64 | 
65 | ## Q2:
66 | 
67 | 1.  Use the following info to make a data frame and save it as an item called "grade".
68 |     Adel got 85 on the exam, Bren got 83, and Cecil got 93.
69 |     Their letter grades are B, B, and A, respectively.
70 |     (Hint: How many columns do you have to have?)
71 |     
72 | ```{r}
73 | grade <- data.frame(
74 |   name = c("Adel", "Bren", "Cecil"),
75 |   score = c(85, 83, 93),
76 |   letters = c("B", "B", "A")
77 | )
78 | 
79 | grade
80 | ```
81 | 
82 | 2. Pull out the column with the scores.
83 |     Use the `$` syntax.
84 | ```{r}
85 | grade$score
86 | ```
87 | 
88 | 


--------------------------------------------------------------------------------
/Data/dataset1.csv:
--------------------------------------------------------------------------------
 1 | "treatment","tensile_strength","temperature","pressure","resistance"
 2 | "t1p1",11.506622092435565,"t1","p1",17.128494668708832
 3 | "t2p1",14.100085235689477,"t2","p1",2.5589702596091675
 4 | "t1p2",13.615801427173414,"t1","p2",15.527962563665543
 5 | "t2p2",13.5099923273713,"t2","p2",4.721478705270513
 6 | "t1p1",14.028709331397305,"t1","p1",16.203734568157543
 7 | "t2p1",6.259538323593588,"t2","p1",5.978400931935168
 8 | "t1p2",12.633912927374507,"t1","p2",15.961706984662447
 9 | "t2p2",10.717022205761431,"t2","p2",4.714634074022134
10 | "t1p1",9.289731079256217,"t1","p1",17.728534497569896
11 | "t2p1",11.52930718913078,"t2","p1",4.535213990213128
12 | "t1p2",17.537701412126225,"t1","p2",14.89671060171909
13 | "t2p2",14.862264653605669,"t2","p2",3.514235497735813
14 | "t1p1",14.056335685284438,"t1","p1",15.222575582702989
15 | "t2p1",6.3596882036778215,"t2","p1",6.334807900162937
16 | "t1p2",14.3748969079777,"t1","p2",15.282977444075684
17 | "t2p2",10.750898568933968,"t2","p2",3.967891905343131
18 | "t1p1",5.566251097715127,"t1","p1",18.141227761742456
19 | "t2p1",10.068251336282362,"t2","p1",4.350111449210299
20 | "t1p2",15.06114251819578,"t1","p2",16.05765273364922
21 | "t2p2",12.457989671827754,"t2","p2",4.794965577505662
22 | "t1p1",11.516792356002085,"t1","p1",15.750018418634559
23 | "t2p1",9.648346870695256,"t2","p1",4.450955602610659
24 | "t1p2",12.035450174518523,"t1","p2",15.334780663377284
25 | "t2p2",12.526361047296032,"t2","p2",5.28441492721479
26 | "t1p1",7.387629481976594,"t1","p1",18.326161223857568
27 | "t2p1",11.516601088751841,"t2","p1",3.919718778665725
28 | "t1p2",12.746703113255409,"t1","p2",16.324424180528453
29 | "t2p2",12.858436597291924,"t2","p2",3.890657377240126
30 | "t1p1",8.394960862592415,"t1","p1",18.79670808927312
31 | "t2p1",10.48980069620454,"t2","p1",4.7053804564748045
32 | "t1p2",11.472313423671851,"t1","p2",16.305175757716956
33 | "t2p2",9.031559755128118,"t2","p2",5.367688131005203
34 | "t1p1",6.415518331077725,"t1","p1",18.638230241629685
35 | "t2p1",8.635094619784446,"t2","p1",5.835800882546767
36 | "t1p2",12.874761841578259,"t1","p2",15.15526760876225
37 | "t2p2",12.360016747779738,"t2","p2",3.767237799186369
38 | "t1p1",9.915935091954513,"t1","p1",16.396518137608442
39 | "t2p1",11.372340751850805,"t2","p1",1.9592745082574727
40 | "t1p2",12.314001521011349,"t1","p2",15.80367212667379
41 | "t2p2",16.122120677779193,"t2","p2",1.4964095536105786
42 | 


--------------------------------------------------------------------------------
/Answer_key/06_Intro_to_heatmaps_answer_key.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "Intro_to_heatmap_ANSWER_KEY"
 3 | author: "Chenxin Li"
 4 | date: "2023-01-15"
 5 | output: html_notebook
 6 | ---
 7 | 
 8 | ```{r setup, include=FALSE}
 9 | knitr::opts_chunk$set(echo = TRUE)
10 | ```
11 |  
12 | # Packages 
13 | ```{r}
14 | library(tidyverse)
15 | library(RColorBrewer)
16 | library(viridis) 
17 | ```
18 | 
19 | # Data from lecture 
20 | ```{r}
21 | abc_1 <- expand.grid(
22 |   a = c(1:10),
23 |   b = c(1:10)
24 | ) %>% 
25 |   mutate(c = a + b)
26 | 
27 | head(abc_1)
28 | ```
29 | 
30 | # Exercise 
31 | ## Q1
32 | Here is the rainbow color scale: 
33 | ```{r}
34 | abc_1 %>% 
35 |   ggplot(aes(x = a, y = b)) +
36 |   geom_tile(aes(fill = c)) +
37 |   scale_fill_gradientn(colors = rainbow(20)) +
38 |   labs(title = "c = a + b") +
39 |   theme_classic() +
40 |   coord_fixed()
41 | 
42 | ggsave("../Results/06_abc_2.png", height = 3, width = 3)
43 | ```
44 | Can you explain why the rainbow color scale is inappropriate for heatmaps?  
45 | 
46 | There are multiple issues with the rainbow color scale.
47 | 1. It is not red/green colorblind friendly.
48 | 2. It is neither a unidirectional nor a bidirectional color scale. 
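
The second point can be checked numerically. The sketch below assumes that "unidirectional" means brightness changes in only one direction along the scale, and it approximates brightness with the common luma weights (0.299, 0.587, 0.114). The helper names `luma` and `is_monotonic` are hypothetical and use base R only.

```r
# Approximate the perceived brightness (luma) of each color in a palette.
luma <- function(cols) {
  rgb_values <- col2rgb(cols)
  as.numeric(
    0.299 * rgb_values["red", ] +
      0.587 * rgb_values["green", ] +
      0.114 * rgb_values["blue", ]
  )
}

# TRUE when brightness only ever increases or only ever decreases.
is_monotonic <- function(x) all(diff(x) > 0) || all(diff(x) < 0)

# Rainbow: brightness rises and falls several times along the scale.
is_monotonic(luma(rainbow(20)))

# A plain dark-to-light gray ramp, for comparison.
is_monotonic(luma(gray.colors(20)))
```

The same check can be applied to any vector of colors, for example the output of `brewer.pal()` or `viridis()`.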
49 | 
50 | ## Q2
51 | You can ignore the following chunk. It generates the data. 
52 | ```{r}
53 | Q2_data <- rbind(
54 |   c(10, 20, 30, 40, 50, 60), # 6
55 |   c(0, 10, 30, 10, 5, 0), # 3
56 |   c(100, 200, 300, 400, 500, 550), # 6
57 |   c(50, 45, 40, 30, 20, 0),  # 1
58 |   c(0, 200, 300, 500, 200, 100), # 4
59 |   c(10, 500, 200, 100, 10, 0), # 2 
60 |   c(0, 0, 0, 10, 500, 0) # 5
61 | ) %>%
62 |   as.data.frame() %>% 
63 |   cbind(gene = c("a", "b", "c", "d", "e", "f", "g")) %>% 
64 |   pivot_longer(cols = !gene, names_to = "V", values_to = "expression") %>% 
65 |   mutate(stage = str_remove(V, "V"))
66 | 
67 | Q2_data %>% 
68 |   ggplot(aes(x = stage, y = gene)) +
69 |   geom_tile(aes(fill = expression)) +
70 |   geom_text(aes(label = expression)) +
71 |   scale_fill_gradientn(colors = rev(brewer.pal(11, "RdBu")))+
72 |   theme_classic()
73 | 
74 | ggsave("../Results/06_genes_3.png", height = 3.5, width = 4.2)
75 | ```
76 | What is wrong with this heatmap? 
77 | Hint: there is more than one problem. 
78 | 
79 | 1. Values are not scaled to z-score. 
80 | 2. Bidirectional color scale used for unidirectional data.
81 | 3. Potential issue with outlier(s).  
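
To illustrate the first point, here is a minimal sketch of z-score scaling within each gene, using base R on a small made-up table in the same long format as `Q2_data`. The data frame `toy` and its values are hypothetical and are not part of the exercise.

```r
# Hypothetical toy data: two genes whose expression differs tenfold in magnitude.
toy <- data.frame(
  gene = rep(c("a", "c"), each = 3),
  stage = rep(c("1", "2", "3"), times = 2),
  expression = c(10, 20, 30, 100, 200, 300)
)

# z-score within each gene: subtract the gene's mean, then divide by the
# gene's standard deviation.
toy$z <- ave(
  toy$expression, toy$gene,
  FUN = function(x) (x - mean(x)) / sd(x)
)

toy
```

After scaling, both genes follow the same pattern across stages despite the difference in raw values, so filling the tiles by `z` instead of `expression` puts the genes on a comparable footing. With dplyr, the same calculation can be written as `group_by(gene)` followed by `mutate()`.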
82 | 


--------------------------------------------------------------------------------
/Answer_key/03_Intro_to_data_vis_answer_key.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "Intro_to_data_vis ANSWER KEY"
 3 | author: "Chenxin Li"
 4 | date: "2023-01-06"
 5 | output:
 6 |   html_notebook:
 7 |     number_sections: yes
 8 |     toc: yes
 9 |     toc_float: yes
10 | ---
11 | 
12 | ```{r setup, include=FALSE}
13 | knitr::opts_chunk$set(echo = TRUE)
14 | ```
15 | 
16 | 
17 | # Required packages
18 | ```{r}
19 | library(tidyverse)
20 | library(RColorBrewer)
21 | ```
22 | 
23 | # Data from lecture  
24 | ```{r}
25 | child_mortality <- read_csv("../Data/child_mortality_0_5_year_olds_dying_per_1000_born.csv", col_types = cols()) 
26 | babies_per_woman <- read_csv("../Data/children_per_woman_total_fertility.csv", col_types = cols()) 
27 | income <- read_csv("../Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", col_types = cols()) 
28 | ```
29 | 
30 | Re-shape data into tidy format. 
31 | ```{r}
32 | babies_per_woman_tidy <- babies_per_woman %>% 
33 |   pivot_longer(names_to = "year", values_to = "birth", cols = c(2:302))  
34 | 
35 | child_mortality_tidy <- child_mortality %>% 
36 |   pivot_longer(names_to = "year", values_to = "death_per_1000_born", cols = c(2:302))  
37 | 
38 | income_tidy <- income %>% 
39 |   pivot_longer(names_to = "year", values_to = "income", cols = c(2:242))  
40 | ```
41 | 
42 | Join them together. 
43 | ```{r}
44 | example2_data <- babies_per_woman_tidy %>% 
45 |   inner_join(child_mortality_tidy, by = c("country", "year")) %>% 
46 |   inner_join(income_tidy, by = c("country", "year"))
47 | 
48 | head(example2_data)
49 | ```
50 | 
51 | # Exercise 
52 | Graph income (in log10 scale) on x axis, child mortality on y axis, and color with children/woman in year 2010. 
53 | Was the trend similar to that of year 1945? 
54 | Save the graph using `ggsave()`. 
55 | 
56 |  
57 | ```{r}
58 | example2_data %>% 
59 |   filter(year == 2010) %>% 
60 |   ggplot(aes(x = log10(income), y = death_per_1000_born)) +
61 |   geom_point(aes(color = birth)) +
62 |   scale_color_gradientn(colours = brewer.pal(9, "YlGnBu")) + 
63 |   labs(x = "log10 income",
64 |        y = "death per 1000 born",
65 |        title = "2010") +
66 |   theme_classic()
67 | 
68 | ggsave("Lesson3_answer1.png", width = 3, height = 3)
69 | ```
70 | ```{r}
71 | example2_data %>% 
72 |   filter(year == 1945) %>% 
73 |   ggplot(aes(x = log10(income), y = death_per_1000_born)) +
74 |   geom_point(aes(color = birth)) +
75 |   scale_color_gradientn(colours = brewer.pal(9, "YlGnBu")) + 
76 |   labs(x = "log10 income",
77 |        y = "death per 1000 born",
78 |        title = "1945") +
79 |   theme_classic()
80 | 
81 | ggsave("Lesson3_answer2.png", width = 3, height = 3)
82 | ```
83 | 
84 | 


--------------------------------------------------------------------------------
/Data/USDA_2022_matrix.csv:
--------------------------------------------------------------------------------
 1 | "Crop","2022","2021","2020","2019","2018","2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007","2006","2005","2004","2003","2002","2001","2000","1999","1998","1997","1996","1995","1994","1993","1992","1991","1990","1989","1988","1987","1986","1985","1984","1983","1982","1981","1980","1979","1978","1977","1976","1975","1974","1973","1972","1971","1970","1969","1968","1967","1966","1965","1964","1963","1962","1961","1960","1959","1958","1957","1956","1955","1954","1953","1952","1951","1950","1949","1948","1947","1946","1945"
 2 | "barley",71.7,60.3,77.2,77.7,77.5,73,77.9,69.1,72.7,71.3,66.9,69.1,73.1,72.8,63.3,60,61.1,64.8,69.6,58.9,55,58.1,61.1,59.5,60.1,58.1,58.5,57.2,56.2,58.9,62.5,55.2,56.1,48.6,38,52.4,50.8,50.9,53.3,52.3,57.2,52.4,49.7,50.9,49.2,44,45.4,44,37.7,40.5,43.7,45.8,42.8,44.7,43.8,40.5,38.3,42.9,37.6,35,35,30.6,31,28.3,32.3,29.8,29.3,27.8,28.4,28.4,27.7,27.3,27.2,24,26.5,25.7,25.5,25.5
 3 | "canola",1762,1302,1931,1781,1861,1526,1825,1679,1610,1742,1392,1479,1711,1813,1461,1238,1366,1419,1618,1416,1197,1374,1334,1306,1448,1237,1385,1278,1316,1350,1286,1300,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
 4 | "corn",173.3,176.7,171.4,167.5,176.4,176.6,174.6,168.4,171,158.1,123.1,146.8,152.6,164.4,153.3,150.7,149.1,147.9,160.3,142.2,129.3,138.2,136.9,133.8,134.4,126.7,127.1,113.5,138.6,100.7,131.5,108.6,118.5,116.3,84.6,119.8,119.4,118,106.7,81.1,113.2,108.9,91,109.5,101,90.8,88,86.4,71.9,91.3,97,88.1,72.4,85.9,79.5,80.1,73.1,74.1,62.9,67.9,64.7,62.4,54.7,53.1,52.8,48.3,47.4,42,39.4,40.7,41.8,36.9,38.2,38.2,43,28.6,37.2,33.1
 5 | "cotton",950,819,853,831,882,905,867,766,838,822,892,790,812,776,813,879,814,831,855,730,665,705,632,607,625,673,705,537,708,606,700,652,634,614,619,706,552,630,600,508,590,542,404,547,420,520,465,453,442,520,507,438,438,434,516,447,480,527,517,517,457,438,446,461,466,388,409,417,341,324,280,269,269,282,311,267,236,254
 6 | "rice",7383,7709,7619,7473,7692,7507,7237,7472,7576,7694,7463,7067,6725,7085,6846,7219,6898,6624,6988,6670,6578,6496,6281,5866,5663,5897,6120,5621,5964,5510,5736,5731,5529,5749,5514,5555,5651,5414,4954,4598,4710,4819,4413,4599,4484,4412,4663,4558,4440,4274,4700,4718,4618,4318,4425,4537,4322,4255,4098,3968,3726,3411,3423,3382,3164,3204,3151,3061,2517,2447,2413,2309,2371,2194,2122,2062,2054,2046
 7 | "sorghum",41.1,69,73.2,73,72.1,71.7,77.9,76,67.6,59.6,49.6,54,71.9,69.4,65.1,73.2,56.1,68.5,69.6,52.7,50.6,59.9,60.9,69.7,67.3,69.2,67.3,55.6,72.7,59.9,72.6,59.3,63.1,55.4,63.8,69.4,67.7,66.8,56.4,48.7,59.1,64,46.3,62.6,54.5,56.6,49.1,49,45.1,58.8,60.7,53.8,50.4,54.3,52.6,50.4,55.8,51.6,41.7,43.9,44.1,43.7,39.7,36.1,35.2,28.8,22.2,18.8,20.1,18.4,17,19.1,22.6,22.5,18,17,15.9,15.2
 8 | "soybeans",49.5,51.7,51,47.4,50.6,49.3,51.9,48,47.5,44,40,42,43.5,44,39.7,41.7,42.9,43.1,42.2,33.9,38,39.6,38.1,36.6,38.9,38.9,37.6,35.3,41.4,32.6,37.6,34.2,34.1,32.3,27,33.9,33.3,34.1,28.1,26.2,31.5,30.1,26.5,32.1,29.4,30.6,26.1,28.9,23.7,27.8,27.8,27.5,26.7,27.4,26.7,24.5,25.4,24.5,22.8,24.4,24.2,25.1,23.5,23.5,24.2,23.2,21.8,20.1,20,18.2,20.7,20.8,21.7,22.3,21.3,16.3,20.5,18
 9 | "wheat",46.5,44.3,49.7,51.7,47.6,46.4,52.7,43.6,43.7,47.1,46.2,43.6,46.1,44.3,44.8,40.2,38.6,42,43.2,44.2,35,40.2,42,42.7,43.2,39.5,36.3,35.8,37.6,38.2,39.3,34.3,39.5,32.7,34.1,37.7,34.4,37.5,38.8,39.4,35.5,34.5,33.5,34.2,31.4,30.7,30.3,30.6,27.3,31.6,32.7,33.9,31,30.6,28.4,25.8,26.3,26.5,25.8,25.2,25,23.9,26.1,21.6,27.5,21.8,20.2,19.8,18.1,17.3,18.4,16,16.5,14.5,17.9,18.2,17.2,17
10 | 


--------------------------------------------------------------------------------
/Answer_key/07_Intro_relationship_data_answer_key.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Intro_to_network ANSWER KEY"
  3 | author: "Chenxin Li"
  4 | date: "2023-01-15"
  5 | output: html_notebook
  6 | ---
  7 | 
  8 | ```{r setup, include=FALSE}
  9 | knitr::opts_chunk$set(echo = TRUE)
 10 | ```
 11 | 
 12 | # Packages 
 13 | ```{r}
 14 | library(tidyverse)
 15 | library(igraph)
 16 | library(ggraph)
 17 | library(readxl)
 18 | library(RColorBrewer)
 19 | ```
 20 | # Homework 
 21 | ## Q1 
 22 | Metabolic pathways can be modeled by networks, where each node is a metabolite and each edge is a metabolic enzyme. 
 23 | You can read more about how to represent metabolic pathways [here](https://github.com/cxli233/ggpathway). 
 24 | 
 25 | Here is an example:
 26 | ```{r}
 27 | calvin_cycle_edges <- read_excel("../Data/Calvin_cycle_edges.xlsx")
 28 | calvin_cycle_nodes <- read_excel("../Data/Calvin_cycle_nodes.xlsx")
 29 | 
 30 | head(calvin_cycle_edges)
 31 | head(calvin_cycle_nodes)
 32 | ```
 33 | Make a network diagram for the Calvin cycle. 
 34 | Color each node by how many carbons it has (the `carbon` column in the node table).
 35 | Label each metabolite using `geom_node_text(aes(label = name), hjust = 0.5, repel = T)`. 
 36 | Save the figure using `ggsave()`.
 37 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file.  
 38 | 
 39 | ```{r}
 40 | network <- graph_from_data_frame(
 41 |   d = calvin_cycle_edges,
 42 |   vertices = calvin_cycle_nodes,
 43 |   directed = F)
 44 | ```
 45 | 
 46 | ```{r}
 47 | ggraph(network, layout = "kk") +
 48 |   geom_edge_link(color = "grey70", arrow = arrow(length = unit(3, "mm")))+
 49 |   geom_node_point(size = 3, shape = 21, color = "white", aes(fill = as.factor(carbon))) + 
 50 |   geom_node_text(aes(label = name), repel = T, hjust = 0.5) +
 51 |   scale_fill_manual(values = brewer.pal(8, "Set2")) +
 52 |   labs(fill = "carbon") +
 53 |   theme_void() 
 54 | 
 55 | ggsave("Lesson7_answer1.png", height = 4, width = 6, bg = "white")
 56 | ```
 57 |  
 58 | ## Q2 
 59 | [Phylogenetic trees](https://en.wikipedia.org/wiki/Phylogenetic_tree) and [dendrogram](https://en.wikipedia.org/wiki/Dendrogram) can be modeled by a network. 
 60 | Say we have a tree like this: 
 61 | 
 62 | ```    
 63 |     I
 64 |     |
 65 |     G - H
 66 |     |
 67 | A - B - C 
 68 |     |   
 69 |     D - E
 70 |     |
 71 |     F
 72 | ``` 
 73 | 
 74 | Make a network diagram and label each node.
 75 | Save the diagram using `ggsave()`.
 76 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file.  
 77 | Hint: You can write the edge table in Excel and import it into R if that's easier. 
 78 | ```{r}
 79 | edgetable <- data.frame(
 80 |   from = c("G", "G", "A", "B", "B","B", "D", "D"),
 81 |   to = c("I", "H", "B", "C", "G","D", "E", "F")
 82 | )
 83 | head(edgetable)
 84 | ```
 85 | 
 86 | ```{r}
 87 | nodes <- data.frame(
 88 |   nodes = unique(c(edgetable$from, edgetable$to))
 89 | )
 90 | 
 91 | nodes
 92 | ```
 93 | 
 94 | ```{r}
 95 | treenetwork <- graph_from_data_frame(
 96 |   d = edgetable,
 97 |   vertices = nodes,
 98 |   directed = F)
 99 | ```
100 | 
101 | ```{r}
102 | ggraph(treenetwork, layout = "tree") +
103 |   geom_edge_link(color = "grey70") +
104 |   geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") +
105 |   geom_node_text(repel = T, aes(label = name)) +
106 |   theme_void()
107 | 
108 | ggsave("Lesson7_answer2.png", height = 3, width = 3.5, bg = "white")
109 | ```
110 | 
111 | 


--------------------------------------------------------------------------------
/Answer_key/08_Intro_plot_assembly_answer_key.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "08_Plot_assembly_ANSWER_KEY"
  3 | author: "Chenxin Li"
  4 | date: "2023-07-09"
  5 | output: html_notebook
  6 | ---
  7 | 
  8 | ```{r setup, include=FALSE}
  9 | knitr::opts_chunk$set(echo = TRUE)
 10 | ```
 11 | 
 12 |  
 13 | ```{r}
 14 | library(tidyverse)
 15 | library(ggbeeswarm)
 16 | library(patchwork)
 17 | 
 18 | library(RColorBrewer)
 19 | ```
 20 | 
 21 | # Exercise
 22 | Use your own data from your research to generate a composite figure. 
 23 | Save it using `ggsave()`. 
 24 | 
 25 | For those of you who do not have your own data, please use these datasets from a hypothetical project.  
 26 | 
 27 | You measured tensile strength and resistance of a material at two temperatures (t1 and t2) and two pressures (p1 and p2) (dataset1). 
 28 | You also tested samples of this material in the above-mentioned temperature-pressure combinations. You recorded the percentages of samples that passed, failed, or were unclear (dataset2). 
 29 | 
 30 | * What is the condition that resulted in the highest tensile strength? 
 31 | * What is the condition that resulted in the lowest resistance? 
 32 | * Is there any correlation between tensile strength and resistance across various conditions? 
 33 | * What is the condition that resulted in the lowest percentage of failure? What is the breakdown of test outcomes across treatment? 
 34 | * Based on all the data available, which treatment is the best? 
 35 | 
 36 | ## Datasets 
 37 | ```{r}
 38 | dataset1 <- read.csv("../Data/dataset1.csv")
 39 | dataset2 <- read.csv("../Data/dataset2.csv")
 40 | 
 41 | head(dataset1)
 42 | head(dataset2)
 43 | ```
 44 | ### Panel A 
 45 | ```{r}
 46 | panel_A <- dataset1 %>% 
 47 |   ggplot(aes(x = temperature, y = tensile_strength)) +
 48 |   facet_grid(.~ pressure) +
 49 |   ggbeeswarm::geom_quasirandom(alpha = 0.8, aes(color = treatment)) +
 50 |   scale_color_manual(values = brewer.pal(8, "Set2")) +
 51 |   labs(x = "Temperature",
 52 |        y = "Tensile strength") +
 53 |   theme_classic()
 54 | 
 55 | panel_A
 56 | ```
 57 | ### Panel B 
 58 | ```{r}
 59 | panel_B <- dataset1 %>% 
 60 |   ggplot(aes(x = temperature, y = resistance)) +
 61 |   facet_grid(.~ pressure) +
 62 |   ggbeeswarm::geom_quasirandom(alpha = 0.8, aes(color = treatment)) +
 63 |   scale_color_manual(values = brewer.pal(8, "Set2")) +
 64 |   labs(x = "Temperature",
 65 |        y = "Resistance") +
 66 |   theme_classic()
 67 | 
 68 | panel_B
 69 | ```
 70 | ### Panel C
 71 | ```{r}
 72 | panel_C <- dataset1 %>% 
 73 |   ggplot(aes(x = resistance, y = tensile_strength)) +
 74 |   facet_wrap(~ treatment) +
 75 |   geom_point(aes(color = treatment)) +
 76 |   labs(x = "Resistance",
 77 |        y = "Tensile strength") +
 78 |   scale_color_manual(values = brewer.pal(8, "Set2")) +
 79 |   theme_classic()
 80 | 
 81 | panel_C
 82 | ```
 83 | 
 84 | ### Panel D
 85 | ```{r}
 86 | panel_D <- dataset2 %>% 
 87 |   pivot_longer(names_to = "treatment", values_to = "percent_samples", cols = !outcome) %>% 
 88 |   ggplot(aes(x = treatment, y = percent_samples)) +
 89 |   geom_bar(stat = "identity", position = "stack", aes(fill = outcome)) +
 90 |   scale_fill_manual(values = c("grey20", "tomato1", "dodgerblue4")) +
 91 |   labs(y = "Percent samples",
 92 |        x = "Treatment") + 
 93 |   theme_classic()
 94 | 
 95 | panel_D
 96 | ```
 97 | 
 98 | ### Assemble them
 99 | ```{r}
100 |  wrap_plots(
101 |   panel_A + labs(tag = "A") + theme(legend.position = "none"),
102 |   panel_B + labs(tag = "B") + theme(legend.position = "none"),
103 |   panel_C + labs(tag = "C"),
104 |   panel_D + labs(tag = "D"),
105 |   guides = "collect",
106 |   design = "AB
107 |             CD")
108 | 
109 | ggsave("Lesson8_answer1.svg", height = 6, width = 8)
110 | ggsave("Lesson8_answer1.png", height = 6, width = 8)
111 | ```
112 | 
113 | Based on all the data available, t2p2 appeared to be the best performer across all metrics studied. 
114 | 


--------------------------------------------------------------------------------
/Answer_key/04_Intro_mean_separation_answer_key.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "mean_separation_ANSWER_KEY"
  3 | author: "Chenxin Li"
  4 | date: "2023-01-07"
  5 | output:
  6 |   html_notebook:
  7 |     number_sections: yes
  8 |     toc: yes
  9 |     toc_float: yes
 10 | ---
 11 | 
 12 | ```{r setup, include=FALSE}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | ```
 15 | 
 16 | 
 17 | # Packages 
 18 | ```{r}
 19 | library(tidyverse)
 20 | library(readxl)
 21 | library(RColorBrewer)
 22 | library(ggbeeswarm)
 23 | ```
 24 | 
 25 | 
 26 | # Data from lecture
 27 | ```{r}
 28 | tissue_culture <- read_excel("../Data/LU-Matand - Stem Data.xlsx")
 29 | head(tissue_culture)
 30 | ```
 31 | 
 32 | ```{r}
 33 | tissue_culture %>% 
 34 |   mutate(Treatment = factor(Treatment, 
 35 |                             levels = c("T1", "T5", "T10"))) %>% 
 36 |   ggplot(aes(x = Treatment, y = Buds_Shoots)) +
 37 |   facet_wrap(~ Variety, scales = "free") +
 38 |   geom_quasirandom(aes(color = Explant), alpha = 0.8) +
 39 |   scale_colour_manual(values = brewer.pal(8, "Set2")) +
 40 |   labs(x = "Hormone Treatment",
 41 |        y = "Num buds and shoots",
 42 |        fill = NULL) +
 43 |   theme_classic() +
 44 |   theme(legend.position = "bottom") 
 45 | 
 46 | ggsave("../Results/04_multi_dots.png", height = 6, width = 9)
 47 | ```
 48 |  
 49 | 
 50 | # Exercise
 51 | ## Q1
 52 | Notice the `scales = "free"` inside `facet_wrap()`. 
 53 | Make a new graph but remove the `scales = "free"` inside `facet_wrap()`.
 54 | 
 55 | 
 56 | ```{r}
 57 | tissue_culture %>% 
 58 |   mutate(Treatment = factor(Treatment, 
 59 |                             levels = c("T1", "T5", "T10"))) %>% 
 60 |   ggplot(aes(x = Treatment, y = Buds_Shoots)) +
 61 |   facet_wrap(~ Variety) +
 62 |   geom_quasirandom(aes(color = Explant), alpha = 0.8) +
 63 |   scale_colour_manual(values = brewer.pal(8, "Set2")) +
 64 |   labs(x = "Hormone Treatment",
 65 |        y = "Num buds and shoots",
 66 |        fill = NULL) +
 67 |   theme_classic() +
 68 |   theme(legend.position = "bottom") 
 69 | 
 70 | ggsave("Lesson4_answer1.png", height = 6, width = 9)
 71 | ```
 72 | 
 73 | Can you explain what happened? 
 74 | Now all the varieties have the same y axis. 
 75 | 
 76 | Can you explain in which scenarios it is more appropriate to have `scales = "free"` turned on vs. off? 
 77 | When we are only comparing within subplots, it's better to use `scales = "free"`. 
 78 | When we are comparing across subplots, it's better to not use `scales = "free"`.
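
As a middle ground, `facet_wrap()` also accepts `scales = "free_y"` (or `"free_x"`), which frees only one axis. 
The sketch below illustrates the difference using a small made-up data frame (`toy`, with hypothetical columns `variety`, `treatment`, and `value`), not the tissue culture data. 

```{r}
library(ggplot2)

# Hypothetical toy data: two varieties on very different scales
toy <- data.frame(
  variety = rep(c("V1", "V2"), each = 6),
  treatment = rep(c("T1", "T5", "T10"), times = 4),
  value = c(1, 2, 3, 1.5, 2.5, 3.5, 100, 200, 300, 150, 250, 350)
)

# Shared y axis: easy to compare across panels, but V1 gets squashed
shared_axis <- ggplot(toy, aes(x = treatment, y = value)) +
  facet_wrap(~ variety) +
  geom_point() +
  theme_classic()

# Free y axis only: each panel gets its own y range, x axis stays shared
free_y_axis <- ggplot(toy, aes(x = treatment, y = value)) +
  facet_wrap(~ variety, scales = "free_y") +
  geom_point() +
  theme_classic()

shared_axis
free_y_axis
```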
 79 | 
 80 | ## Q2 
 81 | Here is an example dataset for you to practice with: 
 82 | ```{r}
 83 | M1 <- data.frame(
 84 |   conc = rnorm(n = 8, mean = 0.03, sd = 0.01)
 85 | ) %>% 
 86 |   mutate(group = "ctrl") %>% 
 87 |   rbind(
 88 |     data.frame(
 89 |       conc = rnorm(n = 6, mean = 0.25, sd = 0.02)
 90 |     ) %>% 
 91 |       mutate(group = "trt")
 92 |   ) %>% 
 93 |   mutate(pest = "Pest 1")
 94 | 
 95 | M2 <- data.frame(
 96 |   conc = rnorm(n = 8, mean = 6, sd = 1)
 97 | ) %>% 
 98 |   mutate(group = "ctrl") %>% 
 99 |   rbind(
100 |     data.frame(
101 |       conc = rnorm(n = 6, mean = 5.5, sd = 1.1)
102 |     ) %>% 
103 |       mutate(group = "trt")
104 |   ) %>% 
105 |   mutate(pest = "Pest 2")
106 | 
107 | M3 <- data.frame(
108 |   conc = rnorm(n = 8, mean = 20, sd = 0.5)
109 | ) %>% 
110 |   mutate(group = "ctrl") %>% 
111 |   rbind(
112 |     data.frame(
113 |       conc = rnorm(n = 6, mean = 19.5, sd = 1.2)
114 |     ) %>% 
115 |       mutate(group = "trt")
116 |   ) %>% 
117 |   mutate(pest = "Pest 3")
118 | 
119 | Spray <- rbind(
120 |   M1, M2, M3
121 | )
122 | 
123 | head(Spray)
124 | ```
125 | 
126 | In this hypothetical experiment, I have two treatments: ctrl vs. trt. 
127 | And I measured the occurrence of three different pests after spraying either treatment. 
128 | Make a mean separation plot for this experiment. 
129 | 
130 | Hints: 
131 | 
132 | * Is this a multifactorial experiment? Yes! 
133 | * How many observations are there in each group? 
134 | 
135 | ```{r}
136 | Spray %>% 
137 |   group_by(group, pest) %>% 
138 |   count()
139 | ```
140 | Each ctrl group has 8 observations, while each trt group has 6 observations.  
141 | 
142 | Was the treatment effective in controlling any of the three pests? 
143 | ```{r}
144 | Spray %>% 
145 |   ggplot(aes(x = group, y = conc)) +
146 |   facet_wrap(~ pest, scales = "free") +
147 |   geom_quasirandom(aes(color = group), alpha = 0.8) +
148 |   scale_colour_manual(values = brewer.pal(8, "Set2")) +
149 |   labs(x = "Treatment",
150 |        y = "concentration",
151 |        fill = NULL) +
152 |   theme_classic() +
153 |   theme(legend.position = "bottom") 
154 | 
155 | ggsave("Lesson4_answer2.png", width = 4, height = 3)
156 | ```
157 | 
158 | Save the graph using `ggsave()`. 
159 | 
160 |  
161 | 
162 | 
163 | 
164 | 
165 | 
166 | 
167 | 
168 | 
169 | 
170 | 
171 | 
172 | 


--------------------------------------------------------------------------------
/Answer_key/02_Intro_to_tidy_data_answer_key.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Data Arrangement ANSWER KEY"
  3 | author: "Chenxin Li"
  4 | date: "01/06/2023"
  5 | output:
  6 |   html_notebook:
  7 |     number_sections: yes
  8 |     toc: yes
  9 |     toc_float: yes
 10 | 
 11 | ---
 12 | 
 13 | ```{r setup, include=FALSE}
 14 | knitr::opts_chunk$set(echo = TRUE)
 15 | ```
 16 | 
 17 | # Load packages
 18 | ```{r}
 19 | library(tidyverse)
 20 | library(readxl)
 21 | ```
 22 | ## Data from lecture 
 23 | ```{r}
 24 | child_mortality <- read_csv("../Data/child_mortality_0_5_year_olds_dying_per_1000_born.csv", col_types = cols()) 
 25 | babies_per_woman <- read_csv("../Data/children_per_woman_total_fertility.csv", col_types = cols()) 
 26 | ```
 27 | 
 28 | These are two datasets downloaded from the [Gapminder foundation](https://www.gapminder.org/data/).
 29 | The Gapminder foundation has datasets on life expectancy, economy, education, and population across countries and years.
 30 | The goal is to remind us not only of the "gaps" between developed and developing worlds, but also of the amazing continuous improvements in quality of life through time.
 31 | 
 32 | 1.  Child mortality (0 - 5 year old) dying per 1000 born.
 33 | 2.  Births per woman.
 34 | 
 35 | These were recorded from year 1800 and projected all the way to 2100.
 36 | 
 37 | Let's look at them.
 38 | 
 39 | ```{r}
 40 | head(child_mortality)
 41 | head(babies_per_woman)
 42 | ```
 43 | 
 44 | 
 45 | ```{r}
 46 | babies_per_woman_tidy <- babies_per_woman %>% 
 47 |   pivot_longer(names_to = "year", values_to = "birth", cols = c(2:302)) 
 48 | 
 49 | head(babies_per_woman_tidy)
 50 | 
 51 | child_mortality_tidy <- child_mortality %>% 
 52 |   pivot_longer(names_to = "year", values_to = "death_per_1000_born", cols = c(2:302)) 
 53 | 
 54 | head(child_mortality_tidy)
 55 | ```
 56 | 
 57 | ```{r}
 58 | birth_and_mortality <- babies_per_woman_tidy %>% 
 59 |   inner_join(child_mortality_tidy, by = c("country", "year"))
 60 | 
 61 | head(birth_and_mortality)
 62 | ```
 63 | 
 64 | # Exercise
 65 | 
 66 | You have learned data arrangement! Let's do an exercise to practice what
 67 | you have learned today. 
 68 | As in the example, this time we will use the income per person dataset from the Gapminder foundation.
 69 | 
 70 | ```{r}
 71 | income <- read_csv("../Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", col_types = cols()) 
 72 | head(income)
 73 | ```
 74 | 
 75 | ## Tidy data
 76 | Is this a tidy data frame?
 77 | NO! 
 78 | 
 79 | Make it a tidy data frame using this code chunk.
 80 | ```{r}
 81 | income_tidy <- income %>% 
 82 |   pivot_longer(names_to = "year", values_to = "income", cols = !country)
 83 | 
 84 | head(income_tidy)
 85 | ```
 86 | 
 87 | ## Joining data
 88 | 
 89 | Combine the income data with birth per woman and child mortality data using this code chunk.
 90 | Name the new data frame "birth_and_mortality_and_income".
 91 | 
 92 | ```{r}
 93 |  birth_and_mortality_and_income <- income_tidy %>% 
 94 |   inner_join(babies_per_woman_tidy, by = c("country", "year")) %>% 
 95 |   inner_join(child_mortality_tidy, by = c("country", "year"))
 96 | 
 97 | head(birth_and_mortality_and_income)
 98 | ```
 99 |  
100 | 
101 | ## Filtering data
102 | 
103 | Filter the data to keep only Bangladesh and Sweden, in years 1945 (when WWII ended) and 2010.
104 | Name the new data frame BS_1945_2010.
105 | How have income, births per woman, and child mortality rate changed during this 65-year period?
106 | 
107 | ```{r}
108 | BS_1945_2010 <- birth_and_mortality_and_income %>% 
109 |  filter(country == "Bangladesh" | 
110 |           country == "Sweden") %>% 
111 |  filter(year == 1945 | 
112 |          year == 2010)
113 |  
114 | 
115 | head(BS_1945_2010)
116 | ```
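
The two `filter()` calls above can also be written more compactly with `%in%`. 
The sketch below uses a small made-up data frame (`toy_countries`, with placeholder income values) so that it runs on its own. 
It assumes `year` is stored as character, which is what `pivot_longer()` produces by default for column names. 

```{r}
library(dplyr)

# Hypothetical toy data; the income values are placeholders, not real data
toy_countries <- data.frame(
  country = c("Bangladesh", "Sweden", "Norway", "Sweden"),
  year = c("1945", "2010", "1945", "1800"),
  income = c(1, 2, 3, 4)
)

# Keep rows where country is one of the two AND year is one of the two
toy_filtered <- toy_countries %>% 
  filter(country %in% c("Bangladesh", "Sweden"),
         year %in% c("1945", "2010"))

toy_filtered
```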
117 |  
118 | 
119 | ## Mutate data
120 | 
121 | Let's say countries with income between 1,000 and 10,000 dollars per year are called "fed".
122 | Countries with income above 10,000 dollars per year are called "wealthy".
123 | Countries with income below 1,000 dollars per year are called "poor".
124 | 
125 | Use this info to make a new column called "status".
126 | Hint: you will have to use case_when() and the "&" logic somewhere in this chunk.
127 | 
128 | ```{r}
129 | birth_and_mortality_and_income <- birth_and_mortality_and_income %>% 
130 |   mutate(status = case_when(
131 |     income >= 1000 & income <= 10000 ~ "fed",       
132 |     income > 10000 ~ "wealthy",                     
133 |     income < 1000 ~ "poor"                        
134 | ))
135 | 
136 | head(birth_and_mortality_and_income)
137 | ```
138 | 
139 | ## Summarise the data
140 | 
141 | Let's look at the average child mortality and its sd in year 2010, 
142 | across countries, for each status that we just defined. 
143 | Name the new data frame "child_mortality_summary_2010".
144 | 
145 | ```{r}
146 | child_mortality_summary_2010 <- birth_and_mortality_and_income %>% 
147 |   filter(year == 2010) %>% 
148 |   group_by(status) %>%
149 |   summarize(
150 |     avg = mean(death_per_1000_born), 
151 |     sd = sd(death_per_1000_born))
152 | 
153 | head(child_mortality_summary_2010)
154 | ```
155 |  
156 | How does child mortality compare across income groups in year 2010?
157 | Child mortality is higher for lower income groups.
158 | 


--------------------------------------------------------------------------------
/Scripts/05_Intro_proportions.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "05_Intro_proportions.Rmd"
  3 | author: "Chenxin Li"
  4 | date: "2023-01-14"
  5 | output: html_notebook 
  6 | ---
  7 | 
  8 | ```{r setup, include=FALSE}
  9 | knitr::opts_chunk$set(echo = TRUE)
 10 | ```
 11 | 
 12 | # Introduction 
 13 | In this lesson we will explore best practices for presenting proportional data. 
 14 | The presentation of proportional data is very common, and thus warrants a lesson of its own. 
 15 | Proportional data are data that range from 0 to 1. 
 16 | More importantly, proportional data should add up to 1 if they are derived from the same entity. 
 17 | For example, let's say on a corn ear, there are yellow and white kernels (and no other colors). 
 18 | The proportion of yellow and white kernels should add up to 1. 
 19 | If 75% of the kernels are yellow, it implies that the proportion of white ones is 25%. 
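
In practice, proportions are often computed from raw counts. 
Here is a minimal sketch, assuming hypothetical kernel counts (`kernel_counts` is made up for illustration), that computes proportions and checks that they add up to 1. 

```{r}
library(dplyr)

# Hypothetical kernel counts for one ear
kernel_counts <- data.frame(
  colors = c("yellow", "white"),
  count = c(300, 100)
)

# Proportion = count of each color / total count on the ear
kernel_props <- kernel_counts %>% 
  mutate(proportion = count / sum(count))

kernel_props

# Sanity check: proportions from the same entity should add up to 1
sum(kernel_props$proportion)
```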
 20 | 
 21 | Goals might differ when we are presenting proportional data. Here are two common objectives:
 22 | 
 23 | 1. Showing different proportions of an entity add up to 1. 
 24 | 2. Comparing proportions of different entities. 
 25 | 
 26 | These two objectives may not be mutually exclusive. 
 27 | We will explore the best practices for achieving these objectives. 
 28 | 
 29 | # Packages 
 30 | ```{r}
 31 | library(tidyverse)
 32 | library(readxl)
 33 | library(RColorBrewer)
 34 | ```
 35 |  
 36 | # What really is a pie chart? 
 37 | The most common visualization for proportional data is a pie chart. 
 38 | You literally see them all the time. 
 39 | A pie chart shows proportions that add up to 100%.
 40 | This is achieved by dividing a circle into sectors, and the sectors add up to a full circle.
 41 | This all sounds fine, but how do we construct a pie chart using ggplot? 
 42 | The short answer is we make a stacked bar chart and wrap it into a circle. 
 43 | 
 44 | Let's make a pie chart. 
 45 | Let's say we have a corn ear that has been open pollinated. 
 46 | As a result, there are purple, yellow, and white kernels on the ear. 
 47 | Let's say the proportions are 15% purple, 70% yellow, and 15% white. 
 48 | 
 49 | ```{r}
 50 | ear_1 <- data.frame(
 51 |   colors = c("purple", "yellow", "white"),
 52 |   proportions = c(0.15, 0.7, 0.15)
 53 | )
 54 | 
 55 | head(ear_1)
 56 | ```
 57 | 
 58 | ```{r}
 59 | example_1_stacked <- ear_1 %>% 
 60 |   ggplot(aes(x = "A Corn Ear", y = proportions)) +
 61 |   geom_bar(stat = "identity", aes(fill = colors),
 62 |            color = "black", width = 0.5) +
 63 |   scale_fill_identity() +
 64 |   labs(x = NULL) +
 65 |   theme_classic()
 66 | 
 67 | example_1_stacked
 68 | 
 69 | ggsave("../Results/05_stack1.png", width = 2, height = 2.5) 
 70 | ```
 71 | 
 72 | Note that there are two coloring options in `ggplot`.
 73 | There is `color`, which is the color of the outline. 
 74 | In this case, the outline of the stacks is black. 
 75 | There is also `fill`, which is the color of the interior. 
 76 | In this case, the interior of the stacks is colored by the color of the corn kernel. 
 77 | 
 78 | So now we have a stacked bar plot. 
 79 | Technically, we are done. The proportions are presented by the heights of the stacks within the bar.
 80 | And this is a perfectly fine visualization of our proportional data. 
 81 | But what if we want to make a pie chart? 
 82 | Well, we will use the polar coordinate.
 83 | 
 84 | ```{r}
 85 | example_1_stacked +
 86 |   coord_polar(theta = "y") +
 87 |   theme_void()
 88 | 
 89 | ggsave("../Results/05_pie1.png", width = 2, height = 2.5) 
 90 | ```
 91 | There are just two extra lines of code to convert a stacked bar into a pie chart.
 92 | First, `coord_polar(theta = "y")` wraps the y axis into a circle. 
 93 | Second, `theme_void()` turns off the x and y axes. 
 94 | Axes are not all that informative for pie charts anyway. 
 95 | So if you want to make a pie chart, this is what you do. 
 96 | 
 97 | A final note about pie chart: what are the data represented by? 
 98 | In this case, the data are represented by the arc length. 
 99 | So a pie chart is a length-based visualization. 
100 | Due to the properties of a circle, the data are also presented by the arc angle and sector area.   
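
To make the geometry concrete, here is a small worked sketch that converts the proportions of the first ear into arc angles, arc lengths, and sector areas. 
The data frame `pie_geometry` and the radius `r` are introduced here only for illustration. 
For this ear, the angles come out to 54, 252, and 54 degrees. 

```{r}
# Same proportions as ear_1, recreated here so the chunk runs on its own
pie_geometry <- data.frame(
  colors = c("purple", "yellow", "white"),
  proportions = c(0.15, 0.7, 0.15)
)

# A full circle is 360 degrees, so each sector's angle is proportion * 360
pie_geometry$angle_degrees <- pie_geometry$proportions * 360

# On a circle of radius r, arc length = 2 * pi * r * proportion
r <- 1 # hypothetical radius
pie_geometry$arc_length <- 2 * pi * r * pie_geometry$proportions

# Sector area = pi * r^2 * proportion
pie_geometry$sector_area <- pi * r^2 * pie_geometry$proportions

pie_geometry
```

All three quantities are proportional to the data, which is why any of them can be read off a single pie. 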
101 | 
102 | # What is the best visualization for comparing proportional data? 
103 | However, oftentimes we don't want to just present proportions of one entity. 
104 | Instead, we want to compare the proportions of multiple entities.
105 | For this purpose, a pie chart is probably not the best option. 
106 | Pie charts have been criticized, because we are much worse at reading angles, areas, and lengths of arcs than reading lengths of straight lines. 
107 | 
108 | The best way to visualize proportions from multiple entities is stacked bars. 
109 | Here is an example. 
110 | ![Just make a stacked bar chart](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/dont_pie_chart.svg)   
111 | 
112 | In this example, we have two groups, each containing 4 sub-categories. 
113 | In classic pie charts, the angles (and thus arc lengths & sector area) represent the data. 
114 | The problem is that it is very difficult to compare between entities. 
115 | We can visually simplify the pie chart into donut charts, where the data are now represented by arc lengths. 
116 | However, if we want to use lengths to represent the data, why don't we just unwrap the donut and make stacked bars? 
117 | In stacked bar graphs, bars are shown side-by-side and thus easier to compare across entities. 
118 | 
119 | Let's try an example on our own. 
120 | Let's say we have another open pollinated corn ear. 
121 | This ear has 20% purple kernels, 75% yellow kernels, and 5% white kernels.
122 | ```{r}
123 | ear_2 <- data.frame(
124 |   colors = c("purple", "yellow", "white"),
125 |   proportions = c(0.20, 0.75, 0.05)
126 | )
127 | 
128 | ears_1_and_2 <- rbind(
129 |   ear_1 %>% 
130 |     mutate(ear = "ear1"), 
131 |   ear_2 %>% 
132 |     mutate(ear = "ear2")
133 | )
134 | 
135 | head(ears_1_and_2)
136 | ```
137 | Say now we want to make a graph to compare the proportions of these two ears. 
138 | The code is in fact quite straightforward. 
139 | 
140 | ```{r}
141 | ears_1_and_2 %>% 
142 |   ggplot(aes(x = ear, y = proportions)) +
143 |   geom_bar(stat = "identity", aes(fill = colors), 
144 |            color = "black", width = 0.5) +
145 |   scale_fill_identity() +
146 |   theme_classic()
147 | 
148 | ggsave("../Results/05_stack2.png", width = 2, height = 2.5)
149 | ```
150 | That's it! 
151 | I would say stacked bars are my go-to visualization for comparing proportions of multiple entities. 
152 | A key advantage is that side-by-side stacked bars are more space efficient. 
153 | Imagine you are comparing proportions of hundreds of entities. 
154 | It is unrealistic to present hundreds of pie or donut charts, let alone comparing across them. 
155 | But stacked bars make the task easy. 
156 | 
157 | # Donut charts? 
158 | Donut charts are good alternatives of pie charts for presenting proportions of a small number of entities. 
159 | We won't cover how to make donut charts here. 
160 | If you are interested, you can explore on your own. 
161 | But what you should _never_ do is make concentric donuts. 
162 | Here is an example. 
163 | ![concentric donuts](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/dont_concentric_donuts.svg) 
164 | For concentric donuts, you might be tempted to say the data are represented by the arc lengths, 
165 | which is in fact inaccurate. 
166 | The arc lengths on the outer rings are much longer than those in the inner rings. 
167 | Group 2 and Group 3 have the same exact values, but the arc lengths of Group 3 are much longer. 
168 | In fact the data are represented by the arc angles, which we are bad at reading.
169 | 
170 | Since outer rings are longer, the ordering of the groups (which group goes to which ring) has a big impact on the impression of the plot. 
171 | It can lead to the apparent paradox where larger values have shorter arcs. 
172 | The better (and simpler!) alternative is to just unwrap the donuts and make a good old stacked bar plot. 
173 | 
174 | # Should you use log scale when presenting proportional data? 
175 | This might be something you have not thought about. 
176 | Let's look at an example. 
177 | In this hypothetical example, we quantified biodiversity of a rain forest relative to year 1960. 
178 | The data are normalized to the value of year 1960. 
179 | 
180 | ```{r}
181 | example_2 <- data.frame(
182 |  year = c(1960, 1970, 1980, 1990, 2000, 2010, 2020),
183 |  relative_biodiversity = c(1, 0.6, 0.3, 0.2, 0.15, 0.1, 0.01)
184 | )
185 | 
186 | head(example_2)
187 | ```
188 | 
189 | ```{r}
190 | example_2 %>% 
191 |   ggplot(aes(x = year, y = relative_biodiversity)) +
192 |   geom_point(size = 3) +
193 |   labs(y = "biodiversity\n(relative to 1960)") +
194 |   theme_classic()
195 | 
196 | ggsave("../Results/05_dot1.png", width = 3, height = 2.5)
197 | ```
198 | Looking at this graph, you would probably say the loss of biodiversity has stabilized in the last 4 decades.
199 | But is that so? 
200 | 
201 | A concept related to proportion is odds. 
202 | Odds = proportion / (1 - proportion). 
203 | For example, if p = 0.5, then the odds are 1:1. 
204 | If p = 0.1, then the odds are 1:9. 
205 | If p = 0.01, then the odds are 1:99.  
206 | A property of proportions is that they are bounded between 0 and 1. 
207 | Thus, any changes near 0 and 1 will appear small by definition. 
208 | However, if you think about it, going from 0.1 to 0.01 is a 10-fold change in proportion and an 11-fold change in odds. 
209 | We can capture the relative changes using the log scale. 
210 | We can use the log10 or natural log scale. It doesn't matter, because the two differ only by a constant factor. 
211 | 
212 | ```{r}
213 | example_2 %>% 
214 |   mutate(log_odds = log(relative_biodiversity/(1-relative_biodiversity))) %>% 
215 |   ggplot(aes(x = year, y = log_odds)) +
216 |   geom_point(size = 3) +
217 |   labs(y = "biodiversity\n(relative to 1960\nlog odds scale)") +
218 |   theme_classic()
219 | 
220 | ggsave("../Results/05_dot2.png", width = 3, height = 2.5)
221 | ```
222 | In this case, presenting proportional data in the log odds scale paints a different picture. 
223 | In the last decade, biodiversity decreased sharply relative to the previous decade.
224 | It makes sense: 
225 | From 2000 to 2010, the change was 0.15 to 0.1, or 0.1/0.15 = 0.67, or a 33% decrease from 2000. 
226 | However, from 2010 to 2020, the change was 0.1 to 0.01, which is a 90% decrease from 2010. 
227 | In practice, you will need to think very carefully about whether to present your proportional data on a log scale. 
228 | 
229 | # Exercise 
230 | As an exercise, let's visualize this example. 
231 | We have two groups, each containing 4 categories. 
232 | ```{r}
233 | group1 <- data.frame(
234 |   "Type" = c("Type I", "Type II", "Type III", "Type IV"),
235 |   "Percentage" = c(15, 35, 30, 20)
236 | )
237 | group2 <- data.frame(
238 |   "Type" = c("Type I", "Type II", "Type III", "Type IV"),
239 |   "Percentage" = c(10, 25, 35, 30)
240 | )
241 | ```
242 | 
243 | Use `ggsave` to save your visualization. 
244 | 
245 | Hint: look in this lesson to see what I did to combine two entities into one data frame while giving each a unique identifier. 


--------------------------------------------------------------------------------
/Lessons/05_Intro_proportions.md:
--------------------------------------------------------------------------------
  1 | # Introduction 
  2 | In this lesson we will explore the best practice for presenting proportional data. 
  3 | The presentation of proportional data is very common, and thus warrants a lesson of its own. 
  4 | Proportional data are data that range from 0 to 1. 
  5 | More importantly, proportional data should add up to 1 if they are derived from the same entity. 
  6 | For example, let's say on a corn ear, there are yellow and white kernels (and no other colors). 
  7 | The proportion of yellow and white kernels should add up to 1. 
  8 | For instance, if 75% of the kernels are yellow, then the proportion of white ones must be 25%. 
  9 | 
 10 | Goals might differ when we are presenting proportional data. Here are two common objectives:
 11 | 
 12 | 1. Showing different proportions of an entity add up to 1. 
 13 | 2. Comparing proportions of different entities. 
 14 | 
 15 | These two objectives may not be mutually exclusive. 
 16 | We will explore the best practices for achieving these objectives. 
 17 | 
 18 | # Packages 
 19 | ```{r}
 20 | library(tidyverse)
 21 | library(readxl)
 22 | library(RColorBrewer)
 23 | ```
 24 |  
 25 | # What really is a pie chart? 
 26 | The most common visualization for proportional data is a pie chart. 
 27 | You literally see them all the time. 
 28 | A pie chart is a common type of visualization for proportional data, where proportions add up to 100%.
 29 | This is achieved by dividing a circle into sectors, and the sectors add up to a full circle.
 30 | This all sounds fine, but how do we construct a pie chart using ggplot? 
 31 | The short answer is we make a stacked bar chart and wrap it into a circle. 
 32 | 
 33 | Let's make a pie chart. 
 34 | Let's say we have a corn ear that has been open pollinated. 
 35 | As a result, there are purple, yellow, and white kernels on the ear. 
 36 | Let's say the proportions are 15% purple, 70% yellow, and 15% white. 
 37 | 
 38 | ```{r}
 39 | ear_1 <- data.frame(
 40 |   colors = c("purple", "yellow", "white"),
 41 |   proportions = c(0.15, 0.7, 0.15)
 42 | )
 43 | 
 44 | head(ear_1)
 45 | ```
 46 | 
 47 | ```{r}
 48 | example_1_stacked <- ear_1 %>% 
 49 |   ggplot(aes(x = "A Corn Ear", y = proportions)) +
 50 |   geom_bar(stat = "identity", aes(fill = colors),
 51 |            color = "black", width = 0.5) +
 52 |   scale_fill_identity() +
 53 |   labs(x = NULL) +
 54 |   theme_classic()
 55 | 
 56 | example_1_stacked
 57 | 
 58 | ggsave("../Results/05_stack1.png", width = 2, height = 2.5) 
 59 | ```
 60 | ![stacked_bar1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_stack1.png)
 61 | 
 62 | Note that there are two coloring options in `ggplot`.
 63 | There is `color`, which is the color of the outline. 
 64 | In this case, the outline of the stacks is black. 
 65 | There is also `fill`, which is the color of the interior. 
 66 | In this case, the interior of the stacks is colored by the color of the corn kernel. 
 67 | 
 68 | So now we have a stacked bar plot. 
 69 | Technically, we are done. The proportions are presented by the heights of the stacks within the bar.
 70 | And this is a perfectly fine visualization of our proportional data. 
 71 | But what if we want to make a pie chart? 
 72 | Well, we will use the polar coordinate.
 73 | 
 74 | ```{r}
 75 | example_1_stacked +
 76 |   coord_polar(theta = "y") +
 77 |   theme_void()
 78 | 
 79 | ggsave("../Results/05_pie1.png", width = 2, height = 2.5) 
 80 | ```
 81 | ![pie1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_pie1.png)
 82 | 
 83 | There are just two extra lines of code to convert a stacked bar into a pie chart.
 84 | First, `coord_polar(theta = "y")` wraps the y axis into a circle. 
 85 | Second, `theme_void()` turns off the x and y axes. 
 86 | Axes are not all that informative for pie charts anyway. 
 87 | So if you want to make a pie chart, this is what you do. 
 88 | 
 89 | A final note about pie charts: what are the data represented by? 
 90 | In this case, the data are represented by the arc length. 
 91 | So a pie chart is a length-based visualization. 
 92 | Due to the properties of a circle, the data are also represented by the arc angle and sector area.   
 93 | 
 94 | # What is the best visualization for comparing proportional data? 
 95 | However, oftentimes we don't want to just present proportions of one entity. 
 96 | Instead, we want to compare the proportions of multiple entities.
 97 | For this purpose, a pie chart is probably not the best option. 
 98 | Pie charts have been criticized because we are much worse at reading angles, areas, and lengths of arcs than at reading lengths of straight lines. 
 99 | 
100 | The best way to visualize proportions from multiple entities is stacked bars. 
101 | Here is an example. 
102 | ![Just make a stacked bar chart](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/dont_pie_chart.svg)   
103 | 
104 | In this example, we have two groups, each containing 4 sub-categories. 
105 | In classic pie charts, the angles (and thus arc lengths & sector area) represent the data. 
106 | The problem is that it is very difficult to compare between entities. 
107 | We can visually simplify the pie chart into donut charts, where the data are now represented by arc lengths. 
108 | However, if we want to use lengths to represent the data, why don't we just unwrap the donut and make stacked bars? 
109 | In stacked bar graphs, the bars are shown side by side and are thus easier to compare across entities. 
110 | 
111 | Let's try an example on our own. 
112 | Let's say we have another open pollinated corn ear. 
113 | This ear has 20% purple kernels, 75% yellow kernels, and 5% white kernels.
114 | ```{r}
115 | ear_2 <- data.frame(
116 |   colors = c("purple", "yellow", "white"),
117 |   proportions = c(0.20, 0.75, 0.05)
118 | )
119 | 
120 | ears_1_and_2 <- rbind(
121 |   ear_1 %>% 
122 |     mutate(ear = "ear1"), 
123 |   ear_2 %>% 
124 |     mutate(ear = "ear2")
125 | )
126 | 
127 | head(ears_1_and_2)
128 | ```
129 | Say now we want to make a graph to compare the proportions of these two ears. 
130 | The code is in fact quite straightforward. 
131 | 
132 | ```{r}
133 | ears_1_and_2 %>% 
134 |   ggplot(aes(x = ear, y = proportions)) +
135 |   geom_bar(stat = "identity", aes(fill = colors), 
136 |            color = "black", width = 0.5) +
137 |   scale_fill_identity() +
138 |   theme_classic()
139 | 
140 | ggsave("../Results/05_stack2.png", width = 2, height = 2.5)
141 | ```
142 | ![stack bar2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_stack2.png)
143 | 
144 | That's it! 
145 | I would say stacked bars are my go-to visualization for comparing proportions of multiple entities. 
146 | A key advantage is that side-by-side stacked bars are more space efficient. 
147 | Imagine you are comparing proportions of hundreds of entities. 
148 | It is unrealistic to present hundreds of pie or donut charts, let alone compare across them. 
149 | But stacked bars make the task easy. 
150 | 
151 | # Donut charts? 
152 | Donut charts are good alternatives to pie charts for presenting proportions of a small number of entities. 
153 | We won't cover how to make donut charts here. 
154 | If you are interested, you can explore on your own. 
155 | But what you should _never_ do is make concentric donuts. 
156 | Here is an example. 
157 | ![concentric donuts](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/dont_concentric_donuts.svg) 
158 | 
159 | For concentric donuts, you might be tempted to say the data are represented by the arc lengths, 
160 | which is in fact inaccurate. 
161 | The arc lengths on the outer rings are much longer than those in the inner rings. 
162 | Group 2 and Group 3 have the same exact values, but the arc lengths of Group 3 are much longer. 
163 | In fact the data are represented by the arc angles, which we are bad at reading.
164 | 
165 | Since outer rings are longer, the ordering of the groups (which group goes to which ring) has a big impact on the impression of the plot. 
166 | It can lead to the apparent paradox where larger values have shorter arcs. 
167 | The better (and simpler!) alternative is to just unwrap the donuts and make a good old stacked bar plot. 
168 | 
169 | # Should you use log scale when presenting proportional data? 
170 | This might be something you have not thought about. 
171 | Let's look at an example. 
172 | In this hypothetical example, we quantified biodiversity of a rain forest relative to year 1960. 
173 | The data are normalized to the value of year 1960. 
174 | 
175 | ```{r}
176 | example_2 <- data.frame(
177 |  year = c(1960, 1970, 1980, 1990, 2000, 2010, 2020),
178 |  relative_biodiversity = c(1, 0.6, 0.3, 0.2, 0.15, 0.1, 0.01)
179 | )
180 | 
181 | head(example_2)
182 | ```
183 | 
184 | ```{r}
185 | example_2 %>% 
186 |   ggplot(aes(x = year, y = relative_biodiversity)) +
187 |   geom_point(size = 3) +
188 |   labs(y = "biodiversity\n(relative to 1960)") +
189 |   theme_classic()
190 | 
191 | ggsave("../Results/05_dot1.png", width = 3, height = 2.5)
192 | ```
193 | ![actual scale](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_dot1.png)
194 | 
195 | Looking at this graph, you would probably say the loss of biodiversity has stabilized in the last 4 decades.
196 | But is that so? 
197 | 
198 | A concept related to proportion is odds. 
199 | Odds = proportion / (1 - proportion). 
200 | For example, if p = 0.5, then the odds are 1:1. 
201 | If p = 0.1, then the odds are 1:9. 
202 | If p = 0.01, then the odds are 1:99.  
203 | A property of proportions is that they are bounded between 0 and 1. 
204 | Thus, any changes near 0 and 1 will appear small by definition. 
205 | However, if you think about it, going from 0.1 to 0.01 is a 10-fold change in proportion and an 11-fold change in odds. 
206 | We can capture the relative changes using the log scale. 
207 | We can use the log10 or natural log scale. It doesn't matter, because the two differ only by a constant factor. 
208 | 
209 | ```{r}
210 | example_2 %>% 
211 |   mutate(log_odds = log(relative_biodiversity/(1-relative_biodiversity))) %>% 
212 |   ggplot(aes(x = year, y = log_odds)) +
213 |   geom_point(size = 3) +
214 |   labs(y = "biodiversity\n(relative to 1960\nlog odds scale)") +
215 |   theme_classic()
216 | 
217 | ggsave("../Results/05_dot2.png", width = 3, height = 2.5)
218 | ```
219 | ![log odds scale](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_dot2.png)
220 | 
221 | In this case, presenting proportional data in the log odds scale paints a different picture. 
222 | In the last decade, biodiversity decreased sharply relative to the previous decade.
223 | It makes sense: 
224 | From 2000 to 2010, the change was 0.15 to 0.1, or 0.1/0.15 = 0.67, or a 33% decrease from 2000. 
225 | However, from 2010 to 2020, the change was 0.1 to 0.01, which is a 90% decrease from 2010. 
226 | In practice, you will need to think very carefully about whether to present your proportional data on a log scale. 
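
The arithmetic in this section can be checked directly in R. The following is a minimal sketch; the helper `odds()` and the variable names are hypothetical and not part of the lesson code.

```r
# Hypothetical helper (an assumption for illustration): odds for a proportion p.
odds <- function(p) p / (1 - p)

# Odds for the three proportions mentioned in the text: 1:1, 1:9, 1:99.
odds(c(0.5, 0.1, 0.01))

# Fold changes when a proportion drops from 0.1 to 0.01.
fold_proportion <- 0.1 / 0.01           # change in the proportion itself
fold_odds <- odds(0.1) / odds(0.01)     # change in the odds

# Decade-over-decade ratios in the example data.
ratio_2000_2010 <- 0.1 / 0.15
ratio_2010_2020 <- 0.01 / 0.1

# Note: a proportion of exactly 1 (the 1960 reference) has infinite odds.
is.infinite(log(odds(1)))
```

Printing `fold_proportion` and `fold_odds` lets you compare the two fold changes discussed above, and the two ratios correspond to the 33% and 90% decreases.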
227 | 
228 | # Exercise 
229 | As an exercise, let's visualize this example. 
230 | We have two groups, each containing 4 categories. 
231 | ```{r}
232 | group1 <- data.frame(
233 |   "Type" = c("Type I", "Type II", "Type III", "Type IV"),
234 |   "Percentage" = c(15, 35, 30, 20)
235 | )
236 | group2 <- data.frame(
237 |   "Type" = c("Type I", "Type II", "Type III", "Type IV"),
238 |   "Percentage" = c(10, 25, 35, 30)
239 | )
240 | ```
241 | 
242 | Use `ggsave` to save your visualization. 
243 | 
244 | Hint: look in this lesson to see what I did to combine two entities into one data frame while giving each a unique identifier. 
245 | 


--------------------------------------------------------------------------------
/Scripts/06_Intro_to_heatmaps.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Intro_to_heatmap.Rmd"
  3 | author: "Chenxin Li"
  4 | date: "2023-01-15"
  5 | output: html_notebook
  6 | ---
  7 | 
  8 | ```{r setup, include=FALSE}
  9 | knitr::opts_chunk$set(echo = TRUE)
 10 | ```
 11 |  
 12 | # Introduction
 13 | Heatmaps are another very common type of visualization in scientific publications. 
 14 | In biology they are extremely common in omics papers. 
 15 | However, extra care must be taken to not produce an uninterpretable or misleading heatmap. 
 16 | 
 17 | We will cover: 
 18 | 
 19 | 1. What is a heat map?
 20 | 2. Are you using an appropriate color scale? 
 21 | 3. Are you scaling the data correctly to best visualize your data? 
 22 | 4. How to handle outliers in heatmaps? 
 23 | 5. Should you reorder rows and columns of a heatmap? 
 24 | 
 25 | # Packages 
 26 | ```{r}
 27 | library(tidyverse)
 28 | library(RColorBrewer)
 29 | library(viridis) 
 30 | ```
 31 | We have a new package, `viridis`. 
 32 | `viridis` is a collection of beautiful color scales for heatmaps. 
 33 | Read more [here](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). 
 34 | 
 35 | # What is a heat map? 
 36 | A heatmap is a representation of 3-dimensional data. 
 37 | Say we have 3 dimensions x, y, z. 
 38 | In a heatmap, the z dimension is visualized by different colors. 
 39 | Let's look at an example: 
 40 | 
 41 | ```{r}
 42 | abc_1 <- expand.grid(
 43 |   a = c(1:10),
 44 |   b = c(1:10)
 45 | ) %>% 
 46 |   mutate(c = a + b)
 47 | 
 48 | head(abc_1)
 49 | ```
 50 | In this example, we have 3 dimensions: a, b, and c. 
 51 | a and b range from 1 to 10. 
 52 | And c is the sum of a and b. 
 53 | 
 54 | ```{r}
 55 | abc_1 %>% 
 56 |   ggplot(aes(x = a, y = b)) +
 57 |   geom_tile(aes(fill = c)) +
 58 |   scale_fill_gradientn(colors = viridis(100)) +
 59 |   labs(title = "c = a + b") +
 60 |   theme_classic() +
 61 |   coord_fixed()
 62 | 
 63 | ggsave("../Results/06_abc_1.png", height = 3, width = 3)
 64 | ```
 65 | 
 66 | Now this is a heatmap. 
 67 | In ggplot, a heatmap can be made using `geom_tile()`. 
 68 | To map values to the tiles, use `aes(fill = ...)`. 
 69 | In this case, we want to color the interior of tiles using the values of `c`. 
 70 | We do `aes(fill = c)` inside `geom_tile()`. 
 71 | 
 72 | # Choosing the correct color scale 
 73 | Note the `scale_fill_gradientn(colors = viridis(100))` line in the previous code chunk. 
 74 | It maps the values of `c` (the z dimension) to the `viridis` color scale, a purple-blue-green-yellow color gradient, where smaller values are darker purple, and larger values are mapped to progressively lighter and yellower colors. 
 75 | 
 76 | Color scales are pretty, but we must be extra careful. 
 77 | There are a couple important concepts regarding choosing a color scale. 
 78 | 
 79 | ## 1. Never use both red/green colors for heatmaps. 
 80 | Yes, I repeat, _NEVER_ use red and green in the same heat map. 
 81 | Red/green color blindness occurs in about 1 in 16 males and 1 in 256 females. 
 82 | People who are red/green colorblind cannot distinguish between red and green; they see brown instead. 
 83 | They will not be able to read a heatmap where numerical values are represented by a gradient of red and green.  
 84 | 
 85 | ## 2. Correctly use unidirectional and bidirectional color scales.
 86 | There are two types of color scales:
 87 | 
 88 | 1. Unidirectional or sequential: the color gradient goes from darkest to lightest. The viridis color gradient is a unidirectional color scale. 
 89 | 2. Bidirectional or divergent: the color gradient has a single lightest point at the center and two darkest points at the extremes. 
 90 | 
 91 | You should _NEVER_ use a bidirectional color scale for unidirectional data. 
 92 | ![Color scales](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/ColorScales.svg)
 93 | You might not have thought about this one, and people make this mistake all the time. 
 94 | In this example, the two plots on the right use a bidirectional color scale. 
 95 | Why? Because the blue and red are darker than white. 
 96 | The first three plots are fine, but the last one is not okay. 
 97 | You might ask, what's wrong with it? 
 98 | 
 99 | When color scales (or color gradients) are used to represent numerical data, 
100 | the darkest and lightest colors should have special meanings. 
101 | You can decide what those special meanings are: e.g., max, min, mean, zero. 
102 | But they should represent something meaningful. 
103 | A data visualization sin for heat maps/color gradients is when the lightest or darkest colors are some arbitrary numbers. 
104 | _This is as bad as the longest bar in a bar chart not being the largest value._ Can you imagine that?
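
One way to give the lightest color a meaning is to center a bidirectional scale on a meaningful value. The sketch below assumes the data are z scores, so zero means "equal to the mean", and it uses symmetric limits so that the lightest color of the `RdBu` palette lands on zero. The data frame `toy` and the object names are made up for illustration.

```{r}
library(ggplot2)
library(RColorBrewer)

# Made-up z scores for illustration only.
toy <- data.frame(
  x = rep(1:3, times = 2),
  y = rep(c("row1", "row2"), each = 3),
  z = c(-2, -0.5, 0, 0.5, 1, 1.5)
)

# Symmetric limits around zero: the midpoint of the palette maps to z = 0.
lim <- max(abs(toy$z))

centered_heatmap <- ggplot(toy, aes(x = x, y = y)) +
  geom_tile(aes(fill = z)) +
  scale_fill_gradientn(
    colors = rev(brewer.pal(11, "RdBu")),
    limits = c(-lim, lim)
  ) +
  theme_classic()

centered_heatmap
```

Without the symmetric `limits`, the lightest color would sit at the midpoint of the observed range (here -0.25), which is an arbitrary number.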
105 | 
106 | # Same heatmap, same scale  
107 | The next issue with heatmap is scaling. 
108 | Let's use an example to illustrate this. 
109 | In this hypothetical example, I have the expression level of 3 genes across 5 stages of development.
110 | I want to visualize their expression pattern across development. 
111 | You can ignore this chunk. 
112 | ```{r}
113 | genes123 <- data.frame(
114 |   stages = c(1:5),
115 |   gene1 = c(15, 25, 35, 45, 55),
116 |   gene2 = c(100, 200, 300, 400, 500),
117 |   gene3 = c(100, 80, 70, 60, 50)
118 | ) %>% 
119 |   pivot_longer(cols = !stages, names_to = "gene", values_to = "expression") 
120 | 
121 | head(genes123)
122 | ```
123 | 
124 | If I just make a heatmap, what will happen? 
125 | ```{r}
126 | genes123 %>% 
127 |   ggplot(aes(x = stages, y = gene)) +
128 |   geom_tile(aes(fill = expression), color = "grey90") +
129 |   scale_fill_gradientn(colors = brewer.pal(9, "YlGnBu")) +
130 |   theme_classic() +
131 |   theme(
132 |     legend.position = "top"
133 |   )
134 | 
135 | ggsave("../Results/06_genes_1.png", height = 2.5, width = 3)
136 | ```
137 | We have stages on x axis, genes on y axis, and tiles colored by expression. 
138 | If you look at the heatmap, you might be tempted to say, gene2 is going up during development, 
139 | and genes 1 and 3 are not changing that much. 
140 | But is that so? 
141 | 
142 | In this example, gene2 ranges from 100-500, whereas the other two genes range from 10-100. 
143 | The large difference in range implies that the lower range values are not visualized well. 
144 | 
145 | A good way to scale the data is to use z scores. 
146 | A z score is the difference between a value and the mean, divided by the standard deviation. 
147 | In this example, we will calculate z score for each gene. 
148 | Remember our good friend `group_by()`? 
149 | 
150 | ```{r}
151 | genes123 <- genes123 %>% 
152 |   group_by(gene) %>% 
153 |   mutate(z = (expression - mean(expression))/sd(expression)) %>% 
154 |   ungroup()
155 | 
156 | head(genes123)
157 | ```
158 | 
159 | What happened here is we `group_by(gene)` first,
160 | such that z score calculation is done relative to each gene, not the entire data. 
161 | Then `mutate(z = (expression - mean(expression))/sd(expression))` quite literally calculates the z score. 
162 | Lastly, we `ungroup()`, and we are done. 
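
As a sanity check, the same calculation can be done on plain vectors. This is a minimal sketch using the gene1 and gene2 values from the example; the variable names are hypothetical. Base R's `scale()` performs the same centering and scaling on a numeric vector.

```{r}
gene1 <- c(15, 25, 35, 45, 55)
gene2 <- c(100, 200, 300, 400, 500)

# z score by hand: (value - mean) / standard deviation
z_gene1 <- (gene1 - mean(gene1)) / sd(gene1)
z_gene2 <- (gene2 - mean(gene2)) / sd(gene2)

# scale() centers and scales a numeric vector the same way
z_gene2_scale <- as.numeric(scale(gene2))

# Compare the hand calculation with scale(), and gene1 with gene2
all.equal(z_gene2, z_gene2_scale)
all.equal(z_gene1, z_gene2)
```

Although the two genes differ about tenfold in raw expression, their z scores can be compared on the same scale.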
163 | 
164 | Now we make a new heatmap, but color tiles with z scores. 
165 | ```{r}
166 | genes123 %>% 
167 |   ggplot(aes(x = stages, y = gene)) +
168 |   geom_tile(aes(fill = z), color = "grey90") +
169 |   scale_fill_gradientn(colors = brewer.pal(9, "YlGnBu")) +
170 |   theme_classic() +
171 |   theme(legend.position = "top")
172 | 
173 | ggsave("../Results/06_genes_2.png", height = 2.5, width = 3)
174 | ```
175 | In this version of the heatmap, it is obvious that genes 1 and 2 have the same expression pattern.
176 | They both go up during development, while gene3 has the opposite pattern. 
177 | 
178 | # Be aware of outliers 
179 | This is related to the previous point. 
180 | Outliers can drastically change the appearance of a heatmap. 
181 | Here is an example: 
182 | ![check outliers for heatmap](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Check_outliers_for_heatmap.svg) 
183 | 
184 | In this example, I have 2 observations. 
185 | For each observation, I measured 20 features. 
186 | Without checking for outliers, it may appear that the 2 observations are overall similar, except at 2 features. 
187 | However, after maxing out the color scale around the 95th percentile of the data, it reveals that the two observations are distinct across all features.
188 | 
189 | The point here is always check for outliers when making heatmaps. 
190 | If there are outliers, it may be a good idea to clip outliers at the extremes. 
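
Clipping can be done before plotting. The sketch below uses made-up values with one extreme outlier and caps everything at the 95th percentile; the vector and variable names are assumptions for illustration, and the 95th percentile is just one possible cutoff.

```{r}
# Made-up data: 19 ordinary values and 1 extreme outlier.
values <- c(1:19, 500)

# Upper cap at the 95th percentile of the data.
upper <- unname(quantile(values, 0.95))

# pmin() replaces anything above the cap with the cap itself.
clipped <- pmin(values, upper)

range(values)
range(clipped)
```

The clipped values can then be mapped to `fill` in place of the raw values. An alternative that leaves the data untouched is to set `limits` on the fill scale together with `oob = scales::squish`, which squishes out-of-range values to the nearest limit.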
191 | 
192 | # Reordering rows and columns 
193 | In most cases, the ordering of rows and columns is arbitrary.
194 | For example, nothing says the y axis has to be in order of gene1, followed by gene2, then gene3. 
195 | However, if the heatmap reflects a physical map (x and y axes map to physical locations in space), 
196 | then you can't reorder rows and columns. 
197 | 
198 | The ordering of the x and y axis has a _HUGE_ impact on how useful a heatmap is, especially a large heatmap.  
199 | Let's look at an example: 
200 | ![reorder rows and columns](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Reorder_rows_and_columns_for_heatmap.png) 
201 | 
202 | In this example, I have cells as columns and features as rows. 
203 | Each tile is showing z scores. 
204 | It is impossible to get anything useful out of the heatmap without reordering rows and columns.
205 | We can reorder rows and columns, such that features that are enriched in cell type 1 appear first.
206 | Then features that are enriched in cell type 2 appear second. Data from: [Li et al., 2022, BioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.04.498697v1.abstract)
207 | 
208 | When rows and columns of a heatmap are properly ordered, you can see a clear signal across the diagonal. For example: 
209 | 
210 | ![Beautiful heatmap](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Abstract_R_2022_11_24.svg)
211 | 
212 | The code underlying reordering rows and columns of a heatmap is actually quite complex. 
213 | I would say this is a bit outside the scope of this introductory lesson. 
214 | That being said, there are packages that produce heatmaps and also handle reordering rows and columns automatically.  
215 | For example, [tidyHeatmap](https://github.com/stemangiola/tidyHeatmap) is a good one.
216 | If you ever need to make heatmaps, you should explore how to reorder rows and columns, or explore how to use a package like `tidyHeatmap`. 
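
For a small heatmap, one common approach (not necessarily the one used for the figures above) is to order rows by hierarchical clustering and then turn the row variable into a factor with that order. The sketch below is a minimal illustration with base R's `dist()` and `hclust()` on the three genes from earlier; the object names are hypothetical.

```{r}
# Wide matrix: one row per gene, one column per stage.
expr_mat <- rbind(
  gene1 = c(15, 25, 35, 45, 55),
  gene2 = c(100, 200, 300, 400, 500),
  gene3 = c(100, 80, 70, 60, 50)
)

# Row-wise z scores: scale() works on columns, so transpose twice.
z_mat <- t(scale(t(expr_mat)))

# Cluster genes by the similarity of their z score profiles.
gene_clusters <- hclust(dist(z_mat))

# Gene names in the order suggested by the clustering.
gene_levels <- rownames(z_mat)[gene_clusters$order]
gene_levels
```

In a long data frame such as `genes123`, the order could then be applied with `mutate(gene = factor(gene, levels = gene_levels))` before plotting, so that genes with similar patterns sit next to each other on the y axis.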
217 | 
218 | # Exercise 
219 | ## Q1
220 | Here is the rainbow color scale: 
221 | ```{r}
222 | abc_1 %>% 
223 |   ggplot(aes(x = a, y = b)) +
224 |   geom_tile(aes(fill = c)) +
225 |   scale_fill_gradientn(colors = rainbow(20)) +
226 |   labs(title = "c = a + b") +
227 |   theme_classic() +
228 |   coord_fixed()
229 | 
230 | ggsave("../Results/06_abc_2.png", height = 3, width = 3)
231 | ```
232 | Can you explain why the rainbow color scale is inappropriate for heatmaps?  
233 | 
234 | ## Q2
235 | You can ignore the following chunk. It generates the data. 
236 | ```{r}
237 | Q2_data <- rbind(
238 |   c(10, 20, 30, 40, 50, 60), # 6
239 |   c(0, 10, 30, 10, 5, 0), # 3
240 |   c(100, 200, 300, 400, 500, 550), # 6
241 |   c(50, 45, 40, 30, 20, 0),  # 1
242 |   c(0, 200, 300, 500, 200, 100), # 4
243 |   c(10, 500, 200, 100, 10, 0), # 2 
244 |   c(0, 0, 0, 10, 500, 0) # 5
245 | ) %>%
246 |   as.data.frame() %>% 
247 |   cbind(gene = c("a", "b", "c", "d", "e", "f", "g")) %>% 
248 |   pivot_longer(cols = !gene, names_to = "V", values_to = "expression") %>% 
249 |   mutate(stage = str_remove(V, "V"))
250 | 
251 | Q2_data %>% 
252 |   ggplot(aes(x = stage, y = gene)) +
253 |   geom_tile(aes(fill = expression)) +
254 |   geom_text(aes(label = expression)) +
255 |   scale_fill_gradientn(colors = rev(brewer.pal(11, "RdBu")))+
256 |   theme_classic()
257 | 
258 | ggsave("../Results/06_genes_3.png", height = 3.5, width = 4.2)
259 | ```
260 | What is (are) wrong with this heatmap? 
261 | Hint: there is a lot going wrong with this heatmap. 
262 | 


--------------------------------------------------------------------------------
/Lessons/08_Intro_plot_assembly.md:
--------------------------------------------------------------------------------
  1 | # Introduction 
  2 | In this lesson, we will create publication-ready composite plots. 
  3 | That is, each composite plot (or figure) is assembled from multiple panels. 
  4 | And the panels themselves could also be a composite of multiple plots. 
  5 | 
  6 | You will learn:
  7 | 
  8 | 1. How to arrange and assemble composite plots.
  9 | 2. How to control relative heights and widths of panels.
 10 | 3. How to generate hierarchical composite plots.
 11 | 4. How to control the panel labels. 
 12 | 
 13 | As an example, we will be using the "mtcars" dataset from R. 
 14 | Don't worry about what the data or these plots mean. 
 15 | This is just an example. 
 16 | 
 17 | # Package 
 18 | We will be using the `patchwork` package. 
 19 | Documentation and tutorials can be found [here](https://github.com/thomasp85/patchwork). 
 20 | While it is not the only package for plot assembly, it is my go-to. 
 21 | 
 22 | ```{r}
 23 | library(tidyverse)
 24 | library(ggbeeswarm)
 25 | library(patchwork)
 26 | 
 27 | library(RColorBrewer)
 28 | ```
 29 | 
 30 | # Making the panels 
 31 | You can ignore this entire section. 
 32 | This section generates the parts that we will use to assemble our final figure. 
 33 | 
 34 | ## A
 35 | ```{r}
 36 | A2 <- mtcars %>% 
 37 |   mutate(AM = case_when(
 38 |     am == 1 ~ "manual",
 39 |     am == 0 ~ "automatic"
 40 |   )) %>% 
 41 |   ggplot(aes(x = mpg, y = disp)) +
 42 |   geom_point(size = 3, alpha = 0.8, color = "white",
 43 |              shape = 21, aes(fill = AM)) +
 44 |   scale_fill_manual(values = c("tomato1", "grey20")) +
 45 |   labs(x = "Miles/(US) gallon",
 46 |        y = "Displacement (cu.in.)",
 47 |        fill = "Automatic") +
 48 |   theme_classic() +
 49 |   theme(
 50 |     legend.position = c(0.8, 0.8)
 51 |   )
 52 | 
 53 | A1 <- mtcars %>% 
 54 |   mutate(AM = case_when(
 55 |     am == 1 ~ "manual",
 56 |     am == 0 ~ "automatic"
 57 |   )) %>% 
 58 |   ggplot(aes(y = mpg, x = "")) +
 59 |   geom_quasirandom(size = 3, alpha = 0.8, color = "white",
 60 |              shape = 21, aes(fill = AM)) +
 61 |   scale_fill_manual(values = c("tomato1", "grey20")) +
 62 |   labs(y = "Miles/(US) gallon",
 63 |        x = NULL) +
 64 |   theme_classic() +
 65 |   theme(
 66 |     legend.position = "none"
 67 |   ) +
 68 |   coord_flip()
 69 | 
 70 | A3 <- mtcars %>% 
 71 |   mutate(AM = case_when(
 72 |     am == 1 ~ "manual",
 73 |     am == 0 ~ "automatic"
 74 |   )) %>% 
 75 |   ggplot(aes(y = disp, x = "" )) +
 76 |   geom_quasirandom(size = 3, alpha = 0.8, color = "white",
 77 |              shape = 21, aes(fill = AM)) +
 78 |   scale_fill_manual(values = c("tomato1", "grey20")) +
 79 |   labs(y = "Displacement (cu.in.)",
 80 |        x = NULL) +
 81 |   theme_classic() +
 82 |    theme(
 83 |     legend.position = "none"
 84 |   ) 
 85 | 
 86 | # A_scatter
 87 | ggsave(plot = A2, filename = "../Results/08_A2.png", height = 5, width = 5)
 88 | 
 89 | # A_hist_mpg
 90 | ggsave(plot = A1, filename = "../Results/08_A1.png", height = 3, width = 5)
 91 | 
 92 | # A_hist_disp
 93 | ggsave(plot = A3, filename = "../Results/08_A3.png", height = 3, width = 5)
 94 | ```
 95 | 
 96 | ## B
 97 | ```{r}
 98 | Panel_B <- mtcars %>% 
 99 |   mutate(VS = case_when(
100 |     vs == 0 ~ "V shaped",
101 |     vs == 1 ~ "Straight"
102 |   )) %>% 
103 |   ggplot(aes(x = VS, y = hp)) +
104 |   geom_quasirandom(size = 3, alpha = 0.8, 
105 |                    aes(color = VS)) +
106 |   scale_color_manual(values = brewer.pal(8, "Accent")[5:6]) +
107 |   labs(x = "Engine",
108 |        y = "Horsepower") +
109 |   theme_classic() +
110 |   theme(legend.position = "none") +
111 |   coord_flip()
112 | 
113 | # Panel_B
114 | ggsave("../Results/08_B.png", height = 2, width = 5)
115 | ```
116 | 
117 | ## C 
118 | ```{r}
119 | Panel_C <- mtcars %>% 
120 |   mutate(rank_wt = rank(wt, ties.method = "first")) %>% 
121 |   ggplot(aes(x = "", y = rank_wt)) +
122 |   geom_tile(aes(fill = mpg)) +
123 |   scale_fill_gradientn(colours = brewer.pal(9, "YlGnBu")) +
124 |   labs(x = NULL,
125 |        y = "Rank of Weight",
126 |        fill = "Miles/(US) gallon") +
127 |   theme_classic()
128 | 
129 | # Panel_C
130 | ggsave("../Results/08_C.png", height = 6, width = 3)
131 | ```
132 | 
133 | 
134 | # Putting plots together 
135 | Let's say we want to make a figure with three panels: A, B & C. 
136 | And we want a layout like this: 
137 | 
138 | | Column 1    | Column 2 |
139 | | :---------: | :------: |
140 | | Panel A     | Panel C  |
141 | | Panel B     | Panel C  |
142 | 
143 | The top left corner is panel A.
144 | The lower left corner is panel B.
145 | The right side is panel C. 
146 | 
147 | ## Panel A
148 | For the sake of this lesson, let's make panel A itself a composite of three sub-panels. 
149 | The 3 sub-panels are called "A1", "A2", and "A3". 
150 | 
151 | Let's say I want panel A to look like this: 
152 | 
153 | | Column 1    | Column 2  |
154 | | :---------: | :-------: |
155 | | A1          | Blank     |
156 | | A2          | A3        |
157 | 
158 | Notice that we only have 3 sub-panels to fill this 2x2 layout. 
159 | In this case the upper right corner is left blank. 
160 | What should we do? 
161 | 
162 | Let's look at each sub-panel for panel A first. 
163 | I have already saved each of them as an R object in memory.
164 | ```{r}
165 | A1
166 | A2
167 | A3
168 | ```
169 | 
170 | ![A1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A1.png)
171 | 
172 | ![A2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A2.png)
173 | 
174 | ![A3](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A3.png)
175 | 
176 | ### The design of the composite 
177 | To create a composite, we use the `wrap_plots()` function from `patchwork`. 
178 | The name is fairly self-explanatory: it wraps several plots into a larger one.
179 | `patchwork` has an intuitive way to arrange plots using the `design = ` argument. 
180 | In this case, the design looks like this: 
181 | 
182 | ```{r}
183 | design <- c(
184 |   "A#
185 |    BC"
186 | )
187 | ```
188 | 
189 | The `design` argument is a character string made of uppercase letters (A, B, C, and so on).
190 | Within this string, each new line marks a new row in the composite.
191 | In this case, the `design` has two rows, so the composite will have two rows.
192 | It specifies that "part A" will be at the top left, and that "part B" will be at the lower left.
193 | Finally, "part C" will be at the lower right.
194 | The "#" sign indicates a blank space.
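
One way to sanity-check a design string before plotting is to split it into a grid. The helper below, `design_to_matrix()`, is a hypothetical name used only for illustration (it is not part of `patchwork`). It is a minimal base-R sketch that assumes one character per cell.

```r
# Hypothetical helper, not part of patchwork: turn a design string into a
# character matrix with one row per line and one column per character.
design_to_matrix <- function(design) {
  rows <- trimws(strsplit(design, "\n")[[1]])  # split into lines, drop indentation
  rows <- rows[nzchar(rows)]                   # drop empty lines
  do.call(rbind, strsplit(rows, ""))           # one character per cell
}

grid <- design_to_matrix("A#
                          BC")
grid        # "#" marks the blank cell
dim(grid)   # number of rows, number of columns
```

If the printed grid does not match the layout you had in mind, the design string is the place to fix it.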
195 | 
196 | To set it up, we simply call `wrap_plots()` and list the sub-panels in order.
197 | ```{r}
198 | wrap_plots(A1, A2, A3,
199 |            design = c("A#
200 |                        BC"))
201 | 
202 | ggsave("../Results/08_A_1.png", height = 6, width = 6)
203 | ```
204 | 
205 | ![A_1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A_1.png)
206 | 
207 | `wrap_plots()` matches sub-panels to parts A, B, and C by the order in which they are supplied.
208 | In this case, the first sub-panel provided (A1) becomes part A, and so on.
209 | 
210 | ### Controlling relative heights and widths of sub-panels
211 | `patchwork` allows you to control the relative heights and widths of sub-panels.
212 | In this example, we have a 2x2 design.
213 | Let's say I want the 1st row to be 30% of the height of the 2nd row.
214 | And I want the 2nd column to be 30% of the width of the 1st.
215 | This can be specified using the `heights = ` and `widths = ` arguments within `wrap_plots()`.
216 | 
217 | ```{r}
218 | wrap_plots(A1, A2, A3,
219 |            design = c("A#
220 |                        BC"), 
221 |            heights = c(0.3, 1),
222 |            widths = c(1, 0.3))
223 | 
224 | ggsave("../Results/08_A_2.png", height = 6, width = 6)
225 | ```
226 | 
227 | ![A_2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A_2.png)
228 | 
229 | For example, the vector `c(0.3, 1)` passed to the `heights` argument specifies that the 1st row should be 0.3 times the height of the 2nd row.
230 | Notice we only have 2 rows and 2 columns, so the heights and widths vectors each have 2 numbers.
231 | If the design has different dimensions, the heights and widths vectors should match them.
232 | For example, if your design has 3 rows and 4 columns, the heights vector should have 3 numbers, and the widths vector should have 4 numbers.
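
Because these numbers are relative, it can help to convert them into shares of the whole. The sketch below is plain base-R arithmetic, not a `patchwork` call, and it uses the same values as the chunk above.

```r
heights <- c(0.3, 1)   # row 1, row 2
widths  <- c(1, 0.3)   # column 1, column 2

# Share of the total height taken by each row
round(heights / sum(heights), 2)

# Share of the total width taken by each column
round(widths / sum(widths), 2)
```

So `c(0.3, 1)` and `c(3, 10)` describe the same proportions; only the ratio matters.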
233 | 
234 | Finally, we can control where we put panel labels. 
235 | In publication ready figures, each panel is labeled with a letter (e.g., A, B, C and so on).
236 | In this example, panel A has 3 sub-panels. How do we control which sub-panel gets labeled?
237 | We can control this by adding `+ labs(tag = "...")` to the sub-panel that should carry the label before passing it to `wrap_plots()`.
238 | Let's say we want the letter "A" to appear at the upper left of panel A.
239 | The best way to do this is to append `+ labs(tag = "A")` to the first sub-panel, in this case A1.
240 | 
241 | 
242 | ```{r}
243 | Panel_A <- wrap_plots(A1 +
244 |                         labs(tag = "A"), 
245 |            A2, A3, 
246 |            design = c("A#
247 |                        BC"),
248 |            heights = c(0.3, 1),
249 |            widths = c(1, 0.3))
250 | 
251 | Panel_A
252 | ggsave("../Results/08_A.png", height = 6, width = 6)
253 | ```
254 | 
255 | ![A](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A.png)
256 | 
257 | Now we have panel A! 
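
If the tag needs to stand out more, its appearance can be adjusted through the ggplot2 theme. The sketch below uses a throwaway scatter plot of `mtcars` (the object name `p_tagged` is made up for this example) together with the `plot.tag` and `plot.tag.position` theme elements. Treat the specific size and face as placeholders rather than recommendations.

```r
library(ggplot2)

# A throwaway plot, tagged the same way as sub-panel A1 above
p_tagged <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  labs(tag = "A") +
  theme_classic() +
  theme(
    plot.tag = element_text(face = "bold", size = 14),  # bold, larger letter
    plot.tag.position = "topleft"                       # the default position
  )

p_tagged
```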
258 | 
259 | ## Hierarchical assembly 
260 | We made panel A, but we still need to assemble panel A with panels B and C.
261 | Well, we don't have to stop here. 
262 | We can call `wrap_plots()` again, now using "Panel_A" as one of the parts. 
263 | 
264 | Here are panels B and C:
265 | ```{r}
266 | Panel_B
267 | Panel_C
268 | ```
269 | 
270 | ![B](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_B.png)
271 | 
272 | ![C](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_C.png)
273 | 
274 | Let's say we want panel A and panel B to be in the left column, and panel C alone to take up the entire right column. What should the design argument be?
275 | ```{r}
276 | c("AC
277 |    BC")
278 | ```
279 | In this case, the design string has two rows.
280 | Because "C" appears in the 2nd column of both rows, part C occupies the entire right column.
281 | 
282 | And let's say we want the 2nd column to be 1/10 of the width of the 1st.
283 | And we want the 2nd row to be 30% of the height of the 1st.
284 | Again, we can control that by setting `widths = ` and `heights = ` within `wrap_plots()`.
285 | 
286 | ```{r}
287 | wrap_plots(Panel_A, 
288 |            Panel_B, 
289 |            Panel_C, 
290 |            design = "AC
291 |                      BC",
292 |            widths = c(1, 0.1),
293 |            heights = c(1, 0.3))
294 | 
295 | ggsave("../Results/08_assembled.png", height = 7, width = 8)
296 | ```
297 | 
298 | ![assembled](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_assembled.png)
299 | 
300 | Now we are really close. 
301 | Notice that only panel A is labeled? 
302 | We want panel B to be labeled with a "B" and panel C to be labeled with a "C". 
303 | We just need to append `+ labs(tag = "B")` and `+ labs(tag = "C")` to the respective panels. 
304 | 
305 | ## Adding panel labels
306 | ```{r}
307 | wrap_plots(Panel_A, 
308 |            Panel_B +
309 |              labs(tag = "B"), 
310 |            Panel_C +
311 |              labs(tag = "C"), 
312 |            design = "AC
313 |                      BC",
314 |            widths = c(1, 0.1),
315 |            heights = c(1, 0.3))
316 | 
317 | ggsave("../Results/08_assembled_label.png", height = 7, width = 8)
318 | ```
319 | 
320 | ![assembled_labeled](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_assembled_label.png)
321 | 
322 | Done! 
323 | Great, we have a publication ready figure! 
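
As an aside, `patchwork` also provides the `|` (side by side) and `/` (stacked) operators together with `plot_layout()`, which can express a similar nested layout without a design string. The sketch below is an alternative under stated assumptions, not the approach used in this lesson: `p_a`, `p_b`, and `p_c` are made-up stand-ins for Panel_A, Panel_B, and Panel_C.

```r
library(ggplot2)
library(patchwork)

# Made-up stand-in panels, each carrying its own tag
p_a <- ggplot(mtcars, aes(x = mpg, y = disp)) + geom_point() + labs(tag = "A")
p_b <- ggplot(mtcars, aes(x = hp, y = factor(vs))) + geom_point() + labs(tag = "B")
p_c <- ggplot(mtcars, aes(x = wt, y = qsec)) + geom_point() + labs(tag = "C")

# Left column: A stacked on top of B, with B at 30% of A's height
left_column <- (p_a / p_b) + plot_layout(heights = c(1, 0.3))

# Full figure: the left column next to C, with C at 1/10 of the width
figure <- (left_column | p_c) + plot_layout(widths = c(1, 0.1))
figure
```

The design string tends to be easier to read for irregular grids, while the operators are compact for simple stacks and rows.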
324 | 
325 | # Exercise
326 | Use your own data from your research to generate a composite figure. 
327 | Save it using `ggsave()`. 
328 | 
329 | 


--------------------------------------------------------------------------------
/Lessons/06_Intro_to_heatmaps.md:
--------------------------------------------------------------------------------
  1 | # Introduction
  2 | Heatmaps are another very common type of visualization in scientific publications.
  3 | In biology they are extremely common in omics papers. 
  4 | However, extra care must be taken to not produce an uninterpretable or misleading heatmap. 
  5 | 
  6 | We will cover: 
  7 | 
  8 | 1. What is a heat map?
  9 | 2. Are you using an appropriate color scale? 
 10 | 3. Are you scaling the data correctly to best visualize your data? 
 11 | 4. How to handle outliers in heatmaps? 
 12 | 5. Should you reorder rows and columns of a heatmap? 
 13 | 
 14 | # Packages 
 15 | ```{r}
 16 | library(tidyverse)
 17 | library(RColorBrewer)
 18 | library(viridis) 
 19 | ```
 20 | We have a new package: `viridis`.
 21 | `viridis` is a collection of beautiful color scales for heatmaps. 
 22 | Read more [here](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). 
 23 | 
 24 | # What is a heat map? 
 25 | A heatmap is a representation of 3-dimensional data.
 26 | Say we have 3 dimensions x, y, z. 
 27 | In a heatmap, the z dimension is visualized by different colors. 
 28 | Let's look at an example: 
 29 | 
 30 | ```{r}
 31 | abc_1 <- expand.grid(
 32 |   a = c(1:10),
 33 |   b = c(1:10)
 34 | ) %>% 
 35 |   mutate(c = a + b)
 36 | 
 37 | head(abc_1)
 38 | ```
 39 | In this example, we have 3 dimensions: a, b, and c. 
 40 | a and b range from 1 to 10.
 41 | And c is the sum of a and b.
 42 | 
 43 | ```{r}
 44 | abc_1 %>% 
 45 |   ggplot(aes(x = a, y = b)) +
 46 |   geom_tile(aes(fill = c)) +
 47 |   scale_fill_gradientn(colors = viridis(100)) +
 48 |   labs(title = "c = a + b") +
 49 |   theme_classic() +
 50 |   coord_fixed()
 51 | 
 52 | ggsave("../Results/06_abc_1.png", height = 3, width = 3)
 53 | ```
 54 | ![heatmap1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_abc_1.png)
 55 | 
 56 | Now this is a heatmap. 
 57 | In ggplot, a heatmap can be made using `geom_tile()`.
 58 | To map values to the tiles, use `aes(fill = ...)`. 
 59 | In this case, we want to color the interior of tiles using the values of `c`. 
 60 | We do `aes(fill = c)` inside `geom_tile()`. 
 61 | 
 62 | # Choosing the correct color scale 
 63 | Note the `scale_fill_gradientn(colors = viridis(100))` line in the previous code chunk. 
 64 | It maps the values of the z dimension (here, `c`) to the `viridis` color scale, a purple-blue-green-yellow gradient in which smaller values are darker purple and larger values are progressively lighter and yellower.
 65 | 
 66 | Color scales are pretty, but we must be extra careful. 
 67 | There are a couple of important concepts to keep in mind when choosing a color scale.
 68 | 
 69 | ## 1. Never use red and green together in a heatmap.
 70 | Yes, I repeat, _NEVER_ use red and green in the same heat map. 
 71 | Red/green color blindness occurs in about 1 in 16 males and 1 in 256 females.
 72 | People who are red/green colorblind cannot distinguish between red and green; they see brown instead. 
 73 | They will not be able to read a heatmap where numerical values are represented by a gradient of red and green.  
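
`RColorBrewer`, which is already loaded above, records which of its palettes are considered colorblind friendly. A small sketch for listing the sequential ones, relying on the `colorblind` and `category` columns of `brewer.pal.info` (the object name `cb_sequential` is made up):

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per palette
cb_sequential <- subset(brewer.pal.info, colorblind & category == "seq")
rownames(cb_sequential)

# Preview only the palettes flagged as colorblind friendly
display.brewer.all(colorblindFriendly = TRUE)
```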
 74 | 
 75 | ## 2. Correctly use unidirectional and bidirectional color scales.
 76 | There are two types of color scales:
 77 | 
 78 | 1. Unidirectional or sequential: the color gradient goes from darkest to lightest. The viridis color gradient is a unidirectional color scale. 
 79 | 2. Bidirectional or divergent: the color gradient has a single lightest point at the center and two darkest points at the extremes. 
 80 | 
 81 | You should _NEVER_ use a bidirectional color scale for unidirectional data. 
 82 | ![Color scales](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/ColorScales.svg)
 83 | 
 84 | You might not have thought about this one, and people make this mistake all the time. 
 85 | In this example, the two plots on the right use a bidirectional color scale. 
 86 | Why? Because the blue and red at the extremes are both darker than the white in the middle. 
 87 | The first three plots are fine, but the last one is not okay. 
 88 | You might ask, what's wrong with it? 
 89 | 
 90 | When color scales (or color gradients) are used to represent numerical data, 
 91 | the darkest and lightest colors should have special meanings. 
 92 | You can decide what those special meanings are: e.g., max, min, mean, zero. 
 93 | But they should represent something meaningful. 
 94 | A data visualization sin for heat maps/color gradients is when the lightest or darkest colors are some arbitrary numbers. 
 95 | _This is as bad as the longest bar in a bar chart not being the largest value._ Can you imagine that?
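
A bidirectional scale does make sense when the data have a meaningful center, for example values centered on zero. The sketch below uses made-up values (`toy`, `signed_value`) and `scale_fill_gradient2()` so that white sits exactly at zero. Making the limits symmetric is a design choice, so that equal distances from zero get equally dark colors.

```r
library(ggplot2)

# Made-up data: signed values centered on zero
toy <- expand.grid(a = 1:5, b = 1:5)
toy$signed_value <- toy$a - toy$b      # ranges from -4 to 4

lim <- max(abs(toy$signed_value))      # symmetric limits around zero

p_div <- ggplot(toy, aes(x = a, y = b)) +
  geom_tile(aes(fill = signed_value)) +
  scale_fill_gradient2(low = "blue4", mid = "white", high = "red4",
                       midpoint = 0, limits = c(-lim, lim)) +
  theme_classic() +
  coord_fixed()

p_div
```

Here the lightest color means "zero" and the darkest colors mean "largest magnitude", so both extremes of the scale carry a clear meaning.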
 96 | 
 97 | # Same heatmap, same scale  
 98 | The next issue with heatmaps is scaling. 
 99 | Let's use an example to illustrate this. 
100 | In this hypothetical example, I have the expression level of 3 genes across 5 stages of development.
101 | I want to visualize their expression pattern across development. 
102 | You can ignore this chunk. 
103 | ```{r}
104 | genes123 <- data.frame(
105 |   stages = c(1:5),
106 |   gene1 = c(15, 25, 35, 45, 55),
107 |   gene2 = c(100, 200, 300, 400, 500),
108 |   gene3 = c(100, 80, 70, 60, 50)
109 | ) %>% 
110 |   pivot_longer(cols = !stages, names_to = "gene", values_to = "expression") 
111 | 
112 | head(genes123)
113 | ```
114 | 
115 | If I just make a heatmap, what will happen? 
116 | ```{r}
117 | genes123 %>% 
118 |   ggplot(aes(x = stages, y = gene)) +
119 |   geom_tile(aes(fill = expression), color = "grey90") +
120 |   scale_fill_gradientn(colors = brewer.pal(9, "YlGnBu")) +
121 |   theme_classic() +
122 |   theme(
123 |     legend.position = "top"
124 |   )
125 | 
126 | ggsave("../Results/06_genes_1.png", height = 2.5, width = 3)
127 | ```
128 | ![gene_heatmap1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_genes_1.png)
129 | 
130 | We have stages on x axis, genes on y axis, and tiles colored by expression. 
131 | If you look at the heatmap, you might be tempted to say, gene2 is going up during development, 
132 | and genes 1 and 3 are not changing that much. 
133 | But is that so? 
134 | 
135 | In this example, gene2 ranges from 100 to 500, whereas the other two genes range from 15 to 100. 
136 | Because of this large difference in range, the lower values are not visualized well. 
137 | 
138 | A good way to scale the data is to use z scores. 
139 | A z score is the difference between a value and the mean, divided by the standard deviation. 
140 | In this example, we will calculate z score for each gene. 
141 | Remember our good friend `group_by()`? 
142 | 
143 | ```{r}
144 | genes123 <- genes123 %>% 
145 |   group_by(gene) %>% 
146 |   mutate(z = (expression - mean(expression))/sd(expression)) %>% 
147 |   ungroup()
148 | 
149 | head(genes123)
150 | ```
151 | 
152 | What happened here is we `group_by(gene)` first,
153 | such that z score calculation is done relative to each gene, not the entire data. 
154 | Then `mutate(z = (expression - mean(expression))/sd(expression))` quite literally calculates the z score. 
155 | Lastly, we `ungroup()`, and we are done. 
156 | 
157 | Now we make a new heatmap, but color tiles with z scores. 
158 | ```{r}
159 | genes123 %>% 
160 |   ggplot(aes(x = stages, y = gene)) +
161 |   geom_tile(aes(fill = z), color = "grey90") +
162 |   scale_fill_gradientn(colors = brewer.pal(9, "YlGnBu")) +
163 |   theme_classic() +
164 |   theme(legend.position = "top")
165 | 
166 | ggsave("../Results/06_genes_2.png", height = 2.5, width = 3)
167 | ```
168 | ![gene_heatmap2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_genes_2.png)
169 | 
170 | In this version of the heatmap, it is obvious that genes 1 and 2 have the same expression pattern.
171 | They both go up during development, while gene3 has the opposite pattern. 
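
A quick base-R check of why genes 1 and 2 end up with the same pattern: their raw values differ by a lot, but both rise in equal steps, so their z scores coincide. The vectors below copy the values used for `gene1` and `gene2` above; `z_score()` is a small helper defined only for this check.

```r
gene1 <- c(15, 25, 35, 45, 55)
gene2 <- c(100, 200, 300, 400, 500)

# z score: (value - mean) / standard deviation
z_score <- function(x) (x - mean(x)) / sd(x)

round(z_score(gene1), 2)
round(z_score(gene2), 2)
all.equal(z_score(gene1), z_score(gene2))   # TRUE
```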
172 | 
173 | # Be aware of outliers 
174 | This is related to the previous point. 
175 | Outliers can drastically change the appearance of a heatmap. 
176 | Here is an example: 
177 | ![check outliers for heatmap](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Check_outliers_for_heatmap.svg) 
178 | 
179 | In this example, I have 2 observations. 
180 | For each observation, I measured 20 features. 
181 | Without checking for outliers, it may appear that the 2 observations are overall similar, except at 2 features. 
182 | However, after maxing out the color scale around the 95th percentile of the data, it reveals that the two observations are distinct across all features.
183 | 
184 | The point here is to always check for outliers when making heatmaps. 
185 | If there are outliers, it may be a good idea to clip outliers at the extremes. 
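
One way to clip is to cap values at a chosen percentile before plotting. The sketch below uses made-up numbers and the 95th percentile mentioned above. The cutoff is an assumption that should be adjusted to the data. Within ggplot2, a comparable effect can be had by setting `limits` on the fill scale together with `oob = scales::squish`.

```r
# Made-up values: 99 ordinary observations and one extreme outlier
x <- c(1:99, 10000)

cap <- unname(quantile(x, 0.95))   # 95th percentile of the data
x_clipped <- pmin(x, cap)          # values above the cap are set to the cap

range(x)           # dominated by the outlier
range(x_clipped)   # the color scale would now span the bulk of the data
```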
186 | 
187 | # Reordering rows and columns 
188 | In most cases, the ordering of rows and columns is arbitrary.
189 | For example, nothing says the y axis has to be in order of gene1, followed by gene2, then gene3. 
190 | However, if the heatmap reflects a physical map (the x and y axes map to physical locations in space), 
191 | then you can't reorder rows and columns. 
192 | 
193 | The ordering of the x and y axes has a _HUGE_ impact on how useful a heatmap is, especially a large heatmap.  
194 | Let's look at an example: 
195 | ![reorder rows and columns](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Reorder_rows_and_columns_for_heatmap.png) 
196 | 
197 | In this example, I have cells as columns and features as rows. 
198 | Each tile is showing z scores. 
199 | It is impossible to get anything useful out of the heatmap without reordering rows and columns.
200 | We can reorder rows and columns, such that features that are enriched in cell type 1 appear first.
201 | Then features that are enriched in cell type 2 appear second. Data from: [Li et al., 2022, BioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.04.498697v1.abstract)
202 | 
203 | When the rows and columns of a heatmap are properly ordered, you can see a clear signal across the diagonal. For example: 
204 | 
205 | ![Beautiful heatmap](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Abstract_R_2022_11_24.svg)
206 | 
207 | The code underlying reordering rows and columns of a heatmap is actually quite complex. 
208 | I would say this is a bit outside the scope of this introductory lesson. 
209 | That being said, there are packages that produce heatmaps and also handle reordering rows and columns automatically.  
210 | For example, [tidyHeatmap](https://github.com/stemangiola/tidyHeatmap) is a good one.
211 | If you ever need to make heatmaps, you should explore how to reorder rows and columns, or explore how to use a package like `tidyHeatmap`. 
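
To give a flavor of what reordering involves, here is a minimal base-R sketch using hierarchical clustering (`dist()` and `hclust()`) on a small made-up matrix. This is one common approach, not necessarily what `tidyHeatmap` or the figures above use. The resulting order can then be used as factor levels for the y axis.

```r
# Made-up expression matrix: rows are genes, columns are stages
m <- rbind(
  g1 = c(1, 2, 3, 4, 5),
  g2 = c(5, 4, 3, 2, 1),
  g3 = c(1, 2, 3, 4, 6),
  g4 = c(6, 4, 3, 2, 1)
)

# Cluster rows by Euclidean distance, then read off the dendrogram order
hc <- hclust(dist(m))
row_order <- rownames(m)[hc$order]
row_order   # similar genes (g1 and g3, g2 and g4) end up next to each other

# Use the order as factor levels so ggplot draws the rows in this order
gene_factor <- factor(rownames(m), levels = row_order)
levels(gene_factor)
```

The same idea applies to columns: cluster `t(m)` and use that order for the x axis.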
212 | 
213 | # Exercise 
214 | ## Q1
215 | Here is the rainbow color scale: 
216 | ```{r}
217 | abc_1 %>% 
218 |   ggplot(aes(x = a, y = b)) +
219 |   geom_tile(aes(fill = c)) +
220 |   scale_fill_gradientn(colors = rainbow(20)) +
221 |   labs(title = "c = a + b") +
222 |   theme_classic() +
223 |   coord_fixed()
224 | 
225 | ggsave("../Results/06_abc_2.png", height = 3, width = 3)
226 | ```
227 | ![rainbow](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_abc_2.png)
228 | 
229 | Can you explain why the rainbow color scale is inappropriate for heatmaps?  
230 | 
231 | ## Q2
232 | You can ignore the following chunk. It generates the data. 
233 | ```{r}
234 | Q2_data <- rbind(
235 |   c(10, 20, 30, 40, 50, 60), # 6
236 |   c(0, 10, 30, 10, 5, 0), # 3
237 |   c(100, 200, 300, 400, 500, 550), # 6
238 |   c(50, 45, 40, 30, 20, 0),  # 1
239 |   c(0, 200, 300, 500, 200, 100), # 4
240 |   c(10, 500, 200, 100, 10, 0), # 2 
241 |   c(0, 0, 0, 10, 500, 0) # 5
242 | ) %>%
243 |   as.data.frame() %>% 
244 |   cbind(gene = c("a", "b", "c", "d", "e", "f", "g")) %>% 
245 |   pivot_longer(cols = !gene, names_to = "V", values_to = "expression") %>% 
246 |   mutate(stage = str_remove(V, "V"))
247 | 
248 | Q2_data %>% 
249 |   ggplot(aes(x = stage, y = gene)) +
250 |   geom_tile(aes(fill = expression)) +
251 |   geom_text(aes(label = expression)) +
252 |   scale_fill_gradientn(colors = rev(brewer.pal(11, "RdBu")))+
253 |   theme_classic()
254 | 
255 | ggsave("../Results/06_genes_3.png", height = 3.5, width = 4.2)
256 | ```
257 | ![bad heatmap](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_genes_3.png)
258 | 
259 | What is wrong with this heatmap? 
260 | Hint: there is a lot going wrong with this heatmap. 
261 | 


--------------------------------------------------------------------------------
/Scripts/08_Intro_plot_assembly.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "08_Plot_assembly"
  3 | author: "Chenxin Li"
  4 | date: "2023-07-09"
  5 | output: html_notebook
  6 | ---
  7 | 
  8 | ```{r setup, include=FALSE}
  9 | knitr::opts_chunk$set(echo = TRUE)
 10 | ```
 11 | 
 12 | # Introduction 
 13 | In this lesson, we will create publication ready composite plots. 
 14 | I.e., each composite plot (or figure) is assembled from multiple panels. 
 15 | And the panels themselves could also be a composite of multiple plots. 
 16 | 
 17 | You will learn:
 18 | 
 19 | 1. How to arrange and assemble composite plots.
 20 | 2. How to control relative heights and widths of panels.
 21 | 3. How to generate hierarchical composite plots.
 22 | 4. How to control the panel labels. 
 23 | 
 24 | As an example, we will be using the "mtcars" dataset from R. 
 25 | Don't worry about what the data or these plots mean. 
 26 | This is just an example. 
 27 | 
 28 | # Package 
 29 | We will be using the `patchwork` package. 
 30 | Documentation and tutorials can be found [here](https://github.com/thomasp85/patchwork). 
 31 | While it is not the only package for plot assembly, it is my go-to. 
 32 | 
 33 | ```{r}
 34 | library(tidyverse)
 35 | library(ggbeeswarm)
 36 | library(patchwork)
 37 | 
 38 | library(RColorBrewer)
 39 | ```
 40 | 
 41 | # Making the panels 
 42 | You can ignore this entire section. 
 43 | This section generates the parts that we will use to assemble our final figure. 
 44 | 
 45 | ## A
 46 | ```{r}
 47 | A2 <- mtcars %>% 
 48 |   mutate(AM = case_when(
 49 |     am == 1 ~ "manual",
 50 |     am == 0 ~ "automatic"
 51 |   )) %>% 
 52 |   ggplot(aes(x = mpg, y = disp)) +
 53 |   geom_point(size = 3, alpha = 0.8, color = "white",
 54 |              shape = 21, aes(fill = AM)) +
 55 |   scale_fill_manual(values = c("tomato1", "grey20")) +
 56 |   labs(x = "Miles/(US) gallon",
 57 |        y = "Displacement (cu.in.)",
 58 |        fill = "Automatic") +
 59 |   theme_classic() +
 60 |   theme(
 61 |     legend.position = c(0.8, 0.8)
 62 |   )
 63 | 
 64 | A1 <- mtcars %>% 
 65 |   mutate(AM = case_when(
 66 |     am == 1 ~ "manual",
 67 |     am == 0 ~ "automatic"
 68 |   )) %>% 
 69 |   ggplot(aes(y = mpg, x = "")) +
 70 |   geom_quasirandom(size = 3, alpha = 0.8, color = "white",
 71 |              shape = 21, aes(fill = AM)) +
 72 |   scale_fill_manual(values = c("tomato1", "grey20")) +
 73 |   labs(y = "Miles/(US) gallon",
 74 |        x = NULL) +
 75 |   theme_classic() +
 76 |   theme(
 77 |     legend.position = "none"
 78 |   ) +
 79 |   coord_flip()
 80 | 
 81 | A3 <- mtcars %>% 
 82 |   mutate(AM = case_when(
 83 |     am == 1 ~ "manual",
 84 |     am == 0 ~ "automatic"
 85 |   )) %>% 
 86 |   ggplot(aes(y = disp, x = "" )) +
 87 |   geom_quasirandom(size = 3, alpha = 0.8, color = "white",
 88 |              shape = 21, aes(fill = AM)) +
 89 |   scale_fill_manual(values = c("tomato1", "grey20")) +
 90 |   labs(y = "Displacement (cu.in.)",
 91 |        x = NULL) +
 92 |   theme_classic() +
 93 |    theme(
 94 |     legend.position = "none"
 95 |   ) 
 96 | 
 97 | # A_scatter
 98 | ggsave(plot = A2, filename = "../Results/08_A2.png", height = 5, width = 5)
 99 | 
100 | # A_hist_mpg
101 | ggsave(plot = A1, filename = "../Results/08_A1.png", height = 3, width = 5)
102 | 
103 | # A_hist_disp
104 | ggsave(plot = A3, filename = "../Results/08_A3.png", height = 3, width = 5)
105 | ```
106 | 
107 | ## B
108 | ```{r}
109 | Panel_B <- mtcars %>% 
110 |   mutate(VS = case_when(
111 |     vs == 0 ~ "V shaped",
112 |     vs == 1 ~ "Straight"
113 |   )) %>% 
114 |   ggplot(aes(x = VS, y = hp)) +
115 |   geom_quasirandom(size = 3, alpha = 0.8, 
116 |                    aes(color = VS)) +
117 |   scale_color_manual(values = brewer.pal(8, "Accent")[5:6]) +
118 |   labs(x = "Engine",
119 |        y = "Horsepower") +
120 |   theme_classic() +
121 |   theme(legend.position = "none") +
122 |   coord_flip()
123 | 
124 | # Panel_B
125 | ggsave("../Results/08_B.png", height = 2, width = 5)
126 | ```
127 | 
128 | ## C 
129 | ```{r}
130 | Panel_C <- mtcars %>% 
131 |   mutate(rank_wt = rank(wt, ties.method = "first")) %>% 
132 |   ggplot(aes(x = "", y = rank_wt)) +
133 |   geom_tile(aes(fill = mpg)) +
134 |   scale_fill_gradientn(colours = brewer.pal(9, "YlGnBu")) +
135 |   labs(x = NULL,
136 |        y = "Rank of Weight",
137 |        fill = "Miles/(US) gallon") +
138 |   theme_classic()
139 | 
140 | # Panel_C
141 | ggsave("../Results/08_C.png", height = 6, width = 3)
142 | ```
143 | 
144 | 
145 | # Putting plots together 
146 | Let's say we want to make a figure with three panels: A, B & C. 
147 | And we want a layout like this: 
148 | 
149 | | Column 1    | Column 2 |
150 | | :---------: | :------: |
151 | | Panel A     | Panel C  |
152 | | Panel B     | Panel C  |
153 | 
154 | The top left corner is panel A.
155 | The lower left corner is panel B.
156 | The right side is panel C. 
157 | 
158 | ## Panel A
159 | For the sake of this lesson, let's make panel A itself a composite of three sub-panels. 
160 | The 3 sub-panels are called "A1", "A2", and "A3". 
161 | 
162 | Let's say I want panel A to look like this: 
163 | 
164 | | Column 1    | Column 2  |
165 | | :---------: | :-------: |
166 | | A1          | Blank     |
167 | | A2          | A3        |
168 | 
169 | Notice that we only have 3 sub-panels to fill this 2x2 layout. 
170 | In this case the upper right corner is left blank. 
171 | What should we do? 
172 | 
173 | Let's look at each sub-panel for panel A first. 
174 | I have already saved each of them as an R object in memory.
175 | ```{r}
176 | A1
177 | A2
178 | A3
179 | ```
180 | 
181 | ### The design of the composite 
182 | To create a composite, we use the `wrap_plots()` function from `patchwork`. 
183 | The name is fairly self-explanatory: it wraps several plots into a larger one.
184 | `patchwork` has an intuitive way to arrange plots using the `design = ` argument. 
185 | In this case, the design looks like this: 
186 | 
187 | ```{r}
188 | design <- c(
189 |   "A#
190 |    BC"
191 | )
192 | ```
193 | 
194 | The `design` argument is a character string made of uppercase letters (A, B, C, and so on).
195 | Within this string, each new line marks a new row in the composite.
196 | In this case, the `design` has two rows, so the composite will have two rows.
197 | It specifies that "part A" will be at the top left, and that "part B" will be at the lower left.
198 | Finally, "part C" will be at the lower right.
199 | The "#" sign indicates a blank space.
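
The same layout can also be written with `patchwork`'s `area()` helper, which takes the top, left, bottom, and right grid positions of each part. This is a sketch of an alternative to the string form, not something the lesson relies on; the object name `layout_areas` is made up.

```r
library(patchwork)

layout_areas <- c(
  area(t = 1, l = 1),   # part 1: row 1, column 1 (top left)
  area(t = 2, l = 1),   # part 2: row 2, column 1 (lower left)
  area(t = 2, l = 2)    # part 3: row 2, column 2 (lower right)
)

# Row 1, column 2 is not covered by any area, so it stays blank
plot(layout_areas)      # draws a preview of the layout
```

An object like this can be passed to `wrap_plots()` as `design = layout_areas` in place of the string.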
200 | 
201 | To set it up, we simply call `wrap_plots()` and list the sub-panels in order.
202 | ```{r}
203 | wrap_plots(A1, A2, A3,
204 |            design = c("A#
205 |                        BC"))
206 | 
207 | ggsave("../Results/08_A_1.png", height = 6, width = 6)
208 | ```
209 | `wrap_plots()` matches sub-panels to parts A, B, and C by the order in which they are supplied.
210 | In this case, the first sub-panel provided (A1) becomes part A, and so on.
211 | 
212 | ### Controlling relative heights and widths of sub-panels
213 | `patchwork` allows you to control the relative heights and widths of sub-panels.
214 | In this example, we have a 2x2 design.
215 | Let's say I want the 1st row to be 30% of the height of the 2nd row.
216 | And I want the 2nd column to be 30% of the width of the 1st.
217 | This can be specified using the `heights = ` and `widths = ` arguments within `wrap_plots()`.
218 | 
219 | ```{r}
220 | wrap_plots(A1, A2, A3,
221 |            design = c("A#
222 |                        BC"), 
223 |            heights = c(0.3, 1),
224 |            widths = c(1, 0.3))
225 | 
226 | ggsave("../Results/08_A_2.png", height = 6, width = 6)
227 | ```
228 | For example, the vector `c(0.3, 1)` passed to the `heights` argument specifies that the 1st row should be 0.3 times the height of the 2nd row.
229 | Notice we only have 2 rows and 2 columns, so the heights and widths vectors each have 2 numbers.
230 | If the design has different dimensions, the heights and widths vectors should match them.
231 | For example, if your design has 3 rows and 4 columns, the heights vector should have 3 numbers, and the widths vector should have 4 numbers.
232 | 
233 | Finally, we can control where we put panel labels. 
234 | In publication ready figures, each panel is labeled with a letter (e.g., A, B, C and so on).
235 | In this example, panel A has 3 sub-panels. How do we control which sub-panel gets labeled?
236 | We can control this by adding `+ labs(tag = "...")` to the sub-panel that should carry the label before passing it to `wrap_plots()`.
237 | Let's say we want the letter "A" to appear at the upper left of panel A.
238 | The best way to do this is to append `+ labs(tag = "A")` to the first sub-panel, in this case A1.
239 | 
240 | 
241 | ```{r}
242 | Panel_A <- wrap_plots(A1 +
243 |                         labs(tag = "A"), 
244 |            A2, A3, 
245 |            design = c("A#
246 |                        BC"),
247 |            heights = c(0.3, 1),
248 |            widths = c(1, 0.3))
249 | 
250 | Panel_A
251 | ggsave("../Results/08_A.png", height = 6, width = 6)
252 | ```
253 | Now we have panel A! 
254 | 
255 | ## Hierarchical assembly 
256 | We made panel A, but we still need to assemble panel A with panels B and C.
257 | Well, we don't have to stop here. 
258 | We can call `wrap_plots()` again, now using "Panel_A" as one of the parts. 
259 | 
260 | Here are panels B and C:
261 | ```{r}
262 | Panel_B
263 | Panel_C
264 | ```
265 | 
266 | Let's say we want panel A and panel B to be in the left column, and panel C alone to take up the entire right column. What should the design argument be? 
267 | ```{r}
268 | c("AC
269 |    BC")
270 | ```
271 | In this case, the design vector has two rows. 
272 | Because "C" appears in the 2nd column of both the 1st and 2nd rows, part C takes up the entire 2nd column. 
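
The same layout can also be written with patchwork's `area()` function instead of a text string, which some people find easier to check. This is only an alternative sketch, not the approach this lesson uses; the object name `layout` is made up for illustration. Areas are matched to plots in the order they are listed.

```r
library(patchwork)

# area(t, l, b, r): top row, left column, bottom row, right column
layout <- c(
  area(t = 1, l = 1),               # 1st plot: row 1, column 1
  area(t = 2, l = 1),               # 2nd plot: row 2, column 1
  area(t = 1, l = 2, b = 2, r = 2)  # 3rd plot: rows 1 to 2, column 2
)

layout
```

An object like this can be passed to `wrap_plots()` as `design = layout` in place of the text string.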
273 | 
274 | And let's say we want the 2nd column to be 1/10 of the width of the 1st.
275 | And we want the 2nd row to be 30% of the height of the 1st. 
276 | Again, we can control that by calling `widths = ` and `heights = ` within `wrap_plots()`. 
277 | 
278 | ```{r}
279 |  wrap_plots(Panel_A, 
280 |                      Panel_B, 
281 |                      Panel_C, 
282 |                      design = "AC
283 |                                BC",
284 |                      widths = c(1, 0.1),
285 |                      heights = c(1, 0.3))
286 | 
287 | ggsave("../Results/08_assembled.png", height = 7, width = 8)
288 | ```
289 | Now we are really close. 
290 | Notice that only panel A is labeled? 
291 | We want panel B to be labeled with a "B" and panel C to be labeled with a "C". 
292 | We just need to append `+ labs(tag = "B")` and `+ labs(tag = "C")` to the respective panels. 
293 | 
294 | ## Adding panel labels 
295 | ```{r}
296 |  wrap_plots(Panel_A, 
297 |               Panel_B +
298 |               labs(tag = "B"), 
299 |                      Panel_C +
300 |               labs(tag = "C"), 
301 |                      design = "AC
302 |                                BC",
303 |                      widths = c(1, 0.1),
304 |                      heights = c(1, 0.3))
305 | 
306 | ggsave("../Results/08_assembled_label.png", height = 7, width = 8)
307 | ```
308 | Done! 
309 | Great, we have a publication ready figure! 
310 | 
311 | # Exercise
312 | Use your own data from your research to generate a composite figure. 
313 | Save it using `ggsave()`. 
314 | 
315 | For those of you who do not have your own data, please use these datasets from a hypothetical project.  
316 | 
317 | You measured tensile strength and resistance of a material at two temperatures (t1 and t2) and two pressures (p1 and p2) (dataset1). 
318 | You also tested samples of this material in the above-mentioned temperature-pressure combinations. You recorded the percentages of samples that passed, failed, or were unclear (dataset2). 
319 | 
320 | * What is the condition that resulted in the highest tensile strength? 
321 | * What is the condition that resulted in the lowest resistance? 
322 | * Is there any correlation between tensile strength and resistance across various conditions? 
323 | * What is the condition that resulted in the lowest percentage of failure? What is the breakdown of test outcomes across treatments? 
324 | * Based on all the data available, which treatment is the best? 
325 | 
326 | ## Datasets 
327 | ```{r}
328 | dataset1 <- read.csv("../Data/dataset1.csv")
329 | dataset2 <- read.csv("../Data/dataset2.csv")
330 | 
331 | head(dataset1)
332 | head(dataset2)
333 | ```
334 | 
335 | 
336 | 


--------------------------------------------------------------------------------
/Scripts/03_Intro_to_data_vis.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Intro_to_data_vis"
  3 | author: "Chenxin Li"
  4 | date: "2023-01-06"
  5 | output:
  6 |   html_notebook:
  7 |     number_sections: yes
  8 |     toc: yes
  9 |     toc_float: yes
 10 | ---
 11 | 
 12 | ```{r setup, include=FALSE}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | ```
 15 | 
 16 | # Introduction 
 17 | 
 18 | In this lesson, we will cover: 
 19 | 
 20 | 1. How are data represented in visualizations? 
 21 | 2. What is "the grammar of graphics"? 
 22 | 3. Introduction to ggplot: mapping variables to axes and aesthetics. 
 23 | 
 24 | # Required packages
 25 | ```{r}
 26 | library(tidyverse)
 27 | library(RColorBrewer)
 28 | ```
 29 | 
 30 | The `tidyverse` will be doing all the heavy lifting. 
 31 | The `RColorBrewer` will be providing some nice color sets for our graphs. 
 32 | 
 33 | # How are data represented in visualizations?
 34 | As an example, let's simulate some data: 
 35 | ```{r}
 36 | set.seed(666)
 37 | Example_data1 <- rbind(
 38 |   rnorm(3, mean = 5, sd = 0.5) %>%
 39 |     as.data.frame() %>% 
 40 |     mutate(time = 1),
 41 |   rnorm(3, mean = 8, sd = 0.5) %>% 
 42 |     as.data.frame() %>% 
 43 |     mutate(time = 2),
 44 |   rnorm(3, mean = 10, sd = 0.5) %>%
 45 |     as.data.frame() %>% 
 46 |     mutate(time = 3)
 47 | ) %>% 
 48 |   rename(response = ".")
 49 | 
 50 | Example_data1
 51 | ```
 52 | |response|time|
 53 | |--------|----|
 54 | |5.376656|   1|
 55 | |6.007177|   1|
 56 | |4.822433|   1|
 57 | |9.014084|   2|
 58 | |6.891563|   2|
 59 | |8.379198|   2|
 60 | |9.346907|   3|
 61 | |9.598740|   3|
 62 | |9.103880|   3|
 63 | 
 64 | As you can see, the data are stored in this table. This is already a tidy data frame. 
 65 | It has two columns, "response" and "time". It appears each time point has 3 observations. 
 66 | 
 67 | If you were to just print this table, it would be a perfectly correct representation of the data.  
 68 | For example, if you read the numbers, you probably notice the response variable goes up as time progresses.  
 69 | 
 70 | But that is NOT the point of data visualization. 
 71 | The purpose of data visualization is to create a visual and intuitive representation of the data. 
 72 | A good data visualization allows us to gain insight into the data at a glance. 
 73 | That being said, what are the ways to visually represent data? 
 74 | 
 75 | ## Represent data by position
 76 | The first way to represent data is by position: each value is encoded by where it falls along an axis. 
 77 | ```{r}
 78 | Example_data1 %>% 
 79 |   ggplot(aes(x = time, y = response)) +
 80 |   stat_summary(geom = "line", fun.data = mean_se, 
 81 |                size = 1.1, alpha = 0.8, color = "grey60") +
 82 |   geom_point(aes(fill = as.factor(time)), shape = 21, color = "grey20",
 83 |              alpha = 0.8, size = 3, position = position_jitter(0.1, seed = 666)) +
 84 |   stat_summary(geom = "errorbar", fun.data = mean_se,
 85 |                width = 0.05, alpha = 0.8) +
 86 |   scale_fill_manual(values = brewer.pal(8, "Accent")) +
 87 |   scale_x_continuous(breaks = c(1, 2, 3)) +
 88 |   labs(x = "Time point",
 89 |        y = "Response",
 90 |        title = "Dot/line graphs are\nposition based") +
 91 |   theme_classic() +
 92 |   theme(
 93 |     legend.position = "none",
 94 |     text = element_text(size = 14, color = "black"),
 95 |     axis.text = element_text(color = "black"),
 96 |     plot.title = element_text(size = 12)
 97 |   )
 98 | 
 99 | ggsave("../Results/03_line_plot.png", height = 2.5, width = 2)
100 | ```
101 | In this case, I graphed time on x-axis, and response on y-axis. 
102 | This visualization is position based. 
103 | 
104 | ## Represent data by length/size 
105 | ```{r}
106 | Example_data1 %>% 
107 |   ggplot(aes(x = time, y = response)) +
108 |   stat_summary(geom = "bar", fun.data = mean_se, aes(color = as.factor(time)),
109 |                size = 1.1, alpha = 0.8, fill = NA, width = 0.5) +
110 |   geom_point(aes(fill = as.factor(time)), shape = 21, color = "grey20",
111 |              alpha = 0.8, size = 3, position = position_jitter(0.1, seed = 666)) +
112 |   stat_summary(geom = "errorbar", fun.data = mean_se,
113 |                width = 0.05, alpha = 0.8) +
114 |   scale_fill_manual(values = brewer.pal(8, "Accent")) +
115 |   scale_color_manual(values = brewer.pal(8, "Accent")) +
116 |   scale_x_continuous(breaks = c(1, 2, 3)) +
117 |   labs(x = "Time point",
118 |        y = "\nResponse",
119 |        title = "Bar graphs are\nlength based") +
120 |   theme_classic() +
121 |   theme(
122 |     legend.position = "none",
123 |     text = element_text(size = 14, color = "black"),
124 |     axis.text = element_text(color = "black"),
125 |     plot.title = element_text(size = 12) 
126 |   )
127 | 
128 | ggsave("../Results/03_bar.png", height = 2.5, width = 2)
129 | ```
130 | In this case, I graphed time on x-axis, and response on y-axis. 
131 | This visualization is length based. Why?
132 | In bar plots, values are represented by the distance from the x axis, and thus the length of the bar. 
133 | 
134 | Never confuse position and length based visualizations! 
135 | Can you tell what is wrong with the graph on the right? 
136 | 
137 | ![What is wrong?](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Position_and_length_based_visualizations.svg)
138 | 
139 | ## Represent data by color 
140 | Finally, data can be represented by color. 
141 | We will explore this more when we talk about heatmaps and whatnot. 
142 | 
143 | # ggplot: data visualization using the grammar of graphics
144 | As you can see in the examples, I've been using `ggplot` to produce graphs. 
145 | The "gg" in `ggplot` stands for grammar of graphics, a theoretical framework to visualize data. 
146 | In grammar of graphics, each visualization has the following layers: 
147 | 
148 | 1. Coordinates: what is on the x and y axes? 
149 | 2. Geometric objects ("geom"): e.g., bars, points, lines... 
150 | 3. Mapping: mapping colors, size, and transparency to geometric objects.
151 | 
152 | For more resources, you can look at the [ggplot cheat sheet](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf). 
153 | 
154 | Let's build a scatter plot for our example data. 
155 | Usually I start with the data table and pipe it into `ggplot()` using the pipe `%>%` syntax.
156 | Inside the `ggplot()` function, we call `aes()` (short for aesthetics). 
157 | Inside `aes()`, we specify the x and y variables. 
158 | Not intuitive, but it is what it is. 
159 | 
160 | ```{r}
161 | Example_data1 %>% 
162 |   ggplot(aes(x = time, y = response)) +
163 |   theme_classic()
164 | 
165 | ggsave("../Results/03_scatter_1.png", height = 2.5, width = 2)
166 | ```
167 | If we just do that, we get a blank graph with time on x axis and response on y axis. 
168 | The `theme_classic()` layer only changes the appearance of the background; read more [here](https://ggplot2.tidyverse.org/reference/ggtheme.html).
169 | ggplot comes with a collection of themes. My go-to is "classic". It has white background and axis lines, as you can see in this example. 
170 | 
171 | The graph is blank because we haven't added any geometric objects (geom) yet. 
172 | For the sake of this example only, let's make a scatter plot. So we need some points. 
173 | The command to add points in `ggplot` is `geom_point()`. 
174 | 
175 | ```{r}
176 | Example_data1 %>% 
177 |   ggplot(aes(x = time, y = response)) +
178 |   geom_point() +
179 |   theme_classic()
180 | 
181 | ggsave("../Results/03_scatter_2.png", height = 2.5, width = 2)
182 | ```
183 | This is what the graph looks like after we added points. 
184 | To include more layers, you literally add (using the `+` syntax) layers to the existing code. 
185 | After the `ggplot()` line which initiates the graph, I usually add the geom layers first, then other layers, then lastly the `theme` layer. 
186 | 
187 | For a minimal plot, we are done! 
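
To make the layering idea concrete, here is a small sketch that saves each stage of a plot as its own item. The toy data frame and the names `toy`, `base_plot`, `with_points` and `finished` are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for Example_data1
toy <- data.frame(time = c(1, 1, 2, 2, 3, 3),
                  response = c(5.1, 4.8, 8.2, 7.9, 10.1, 9.7))

base_plot <- ggplot(toy, aes(x = time, y = response))  # blank canvas
with_points <- base_plot + geom_point()                # add a geom layer
finished <- with_points +
  labs(x = "Time point", y = "Response") +             # then other layers
  theme_classic()                                      # theme layer last

finished
```

Each `+` returns a new plot, so you can inspect `base_plot`, `with_points` and `finished` one at a time to see what each layer contributes.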
188 | 
189 | # Mapping variables to ggplot geom
190 | The example data we are using only has 2 variables: time and response. What if we have more than 2? 
191 | To demonstrate that, let's use another example. 
192 | 
193 | I am going to read in child mortality, children/women, and income data. 
194 | We will use those data for another example. For the sake of this lesson, we will just use year 1945. 
195 | 
196 | First, read in data. 
197 | ```{r}
198 | child_mortality <- read_csv("../Data/child_mortality_0_5_year_olds_dying_per_1000_born.csv", col_types = cols()) 
199 | babies_per_woman <- read_csv("../Data/children_per_woman_total_fertility.csv", col_types = cols()) 
200 | income <- read_csv("../Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", col_types = cols()) 
201 | ```
202 | 
203 | Second, re-shape data into tidy format. 
204 | ```{r}
205 | babies_per_woman_tidy <- babies_per_woman %>% 
206 |   pivot_longer(names_to = "year", values_to = "birth", cols = c(2:302))  
207 | 
208 | child_mortality_tidy <- child_mortality %>% 
209 |   pivot_longer(names_to = "year", values_to = "death_per_1000_born", cols = c(2:302))  
210 | 
211 | income_tidy <- income %>% 
212 |   pivot_longer(names_to = "year", values_to = "income", cols = c(2:242))  
213 | ```
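
The column ranges `c(2:302)` and `c(2:242)` above depend on the exact number of year columns in each file. As a hedged alternative sketch, you can tell `pivot_longer()` to reshape every column except `country`. The tiny table below is made up for illustration (`toy_wide`, `toy_tidy`, and the countries "X" and "Y" are not from the real data).

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: one row per country, one column per year
toy_wide <- tibble(
  country = c("X", "Y"),
  `1944` = c(5.1, 3.2),
  `1945` = c(5.0, 3.1)
)

toy_tidy <- toy_wide %>%
  pivot_longer(cols = -country,   # every column except country
               names_to = "year",
               values_to = "birth")

toy_tidy

# Column names become character strings, hence the quotes around "1945"
toy_tidy %>%
  filter(year == "1945")
```

Because the year values come from column names, they are stored as characters, which is why a year is compared as a string such as `"1945"`.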
214 | 
215 | Third, join them together. 
216 | ```{r}
217 | example2_data <- babies_per_woman_tidy %>% 
218 |   inner_join(child_mortality_tidy, by = c("country", "year")) %>% 
219 |   inner_join(income_tidy, by = c("country", "year"))
220 | 
221 | head(example2_data)
222 | ```
223 | 
224 | Fourth, filter for year 1945.
225 | ```{r}
226 | example2_data_1945 <- example2_data %>% 
227 |   filter(year == "1945")
228 | 
229 | head(example2_data_1945)
230 | ```
231 | 
232 | 
233 | You should be able to do all that relatively easily after completing [Intro_to_tidy_data](https://github.com/cxli233/Online_R_learning/blob/master/Quick_data_vis/Lessons/02_Intro_to_tidy_data.md). 
234 | If not, practice more! 
235 | 
236 | Say we want birth rate on x axis, mortality on y axis, make a scatter plot, and color the dots by income, what should we do? 
237 | Let's make a plain scatter plot first. 
238 | 
239 | ```{r}
240 | example2_data_1945 %>% 
241 |   ggplot(aes(x = birth, y = death_per_1000_born)) +
242 |   geom_point() +
243 |   labs(x = "No. children per woman",
244 |        y = "Child mortality/1000 born",
245 |        title = "Year 1945") +
246 |   theme_classic()
247 | 
248 | ggsave("../Results/03_scatter_3.png", width = 2.5, height = 2.5)
249 | ```
250 | That looks fine. 
251 | To color dots (or any geom) based on a variable in the data frame, you call `aes()` inside the `geom_*()` function.
252 | 
253 | ```{r}
254 | example2_data_1945 %>% 
255 |   ggplot(aes(x = birth, y = death_per_1000_born)) +
256 |   geom_point(aes(color = log10(income))) +
257 |   labs(x = "No. children per woman",
258 |        y = "Child mortality/1000 born",
259 |        title = "Year 1945") +
260 |   theme_classic()
261 | 
262 | ggsave("../Results/03_scatter_4.png", width = 3, height = 2.5)
263 | ```
264 | Note: income is quite unevenly distributed, so I log-transformed it. 
265 | Now we have dots colored by income (in log10 scale).
266 | The problem is the default color scale in ggplot looks bad. 
267 | We can use a different set of colors from the `RColorBrewer` package. 
268 | You can look at all the available color scales in `RColorBrewer` [here](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). 
269 | Let's use something easier to see than the default ggplot blue. 
270 | 
271 | ```{r}
272 | example2_data_1945 %>% 
273 |   ggplot(aes(x = birth, y = death_per_1000_born)) +
274 |   geom_point(aes(color = log10(income))) +
275 |   scale_color_gradientn(colours = brewer.pal(9, "YlGnBu")) +
276 |   labs(x = "No. children per woman",
277 |        y = "Child mortality/1000 born",
278 |        title = "Year 1945") +
279 |   theme_classic()
280 | 
281 | ggsave("../Results/03_scatter_5.png", width = 3, height = 2.5)
282 | ```
283 | To provide a custom color scale that produces a color gradient, use `scale_color_gradientn()`.
284 | There is a whole family of `scale` functions in ggplot, which allow you to control the scales for colors and axes.  
285 | You can read more about scales on the [ggplot cheat sheet](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf). 
286 | I quite like the yellow-green-blue color gradient from `RColorBrewer`, which is what I used here. 
287 | As you can see, low income countries have more children/woman and also higher child mortality. 
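
To see what `scale_color_gradientn()` receives, it can help to look at the palette on its own. This is a small sketch with made-up data; `pal`, `toy` and `p_scale` are names invented for illustration.

```r
library(ggplot2)
library(RColorBrewer)

# brewer.pal() returns a character vector of hex color codes
pal <- brewer.pal(9, "YlGnBu")
pal

# Hypothetical toy data: z is the variable mapped to color
toy <- data.frame(x = 1:10, y = (1:10)^2, z = 1:10)

p_scale <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = z), size = 3) +
  scale_color_gradientn(colours = pal) +  # interpolate between the 9 colors
  theme_classic()

p_scale
```

The 9 colors act as anchor points, and ggplot fills in the shades between them for values of `z` that fall in between.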
288 | 
289 | # Exercise 
290 | Graph income (in log10 scale) on x axis, child mortality on y axis, and color with children/woman in year 2010. 
291 | Was the trend similar to year 1945? 
292 | Save the graph using `ggsave()`. 
293 | 
294 |  
295 | 
296 | 


--------------------------------------------------------------------------------
/Lessons/01_Intro_to_R.md:
--------------------------------------------------------------------------------
  1 | # Introduction
  2 | 
  3 | This is a quick introduction to R! 
  4 | 
  5 | As a resource, here's the link to the [base R cheat sheet](https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf)
  6 | 
  7 | # R, RStudio, packages and R Markdown
  8 | ## R packages
  9 | 
 10 | A key feature of R is that it has a lot of packages.
 11 | A package is a collection of R functions that other people wrote and published.
 12 | We can use pre-existing packages, so that we don't have to reinvent the wheel.
 13 | A difference between R and many other coding languages is that in R you learn more about how to use other people's packages, and less about coding things yourself.
 14 | This is very advantageous for those who do not have a traditional computer science background.
 15 | 
 16 | When you need to install packages, you use the `install.packages()` command.
 17 | For example, say you need to install a package called `dplyr`,
 18 | which is a package that we'll use.
 19 | You would type `install.packages("dplyr")` at the console.
 20 | 
 21 | Alternatively, you can click the Packages tab at the lower right.
 22 | Then you select Install.
 23 | In the pop-up window, type in the package you need to install and click install.
 24 | We will talk about how to load an installed package later.
 25 | 
 26 | ## The R Markdown (.Rmd) file
 27 | 
 28 | R Markdown is an enhanced R script file.
 29 | You will need the `rmarkdown` package. 
 30 | R Markdown files are more user-friendly than a common R script.
 31 | 
 32 | In a .Rmd file, there are two kinds of text.
 33 | One is plain text, which is what you are reading right now.
 34 | 
 35 | The other is a code chunk, which R will actually run.
 36 | 
 37 | To insert a code chunk, you can click the Code drop-down menu from the top.
 38 | Then select Insert Chunk.
 39 | 
 40 | Alternatively, you can do Ctrl+Alt+I on a PC, or Cmd+Opt+I on a Mac.
 41 | When you do that, this will show up:
 42 | 
 43 | ```{r}
 44 | 
 45 | ```
 46 | 
 47 | The above is a code chunk.
 48 | If you click the green triangle at the upper right, it will run the entire chunk.
 49 | The advantage of having chunks is that you can run code chunk by chunk, and debug chunk by chunk.
 50 | This is very user-friendly.
 51 | If you want to run a particular line in the chunk, you do Ctrl+Enter on a PC or Cmd+Enter on a Mac.
 52 | I think Opt+Enter also works on a Mac.
 53 | To run the whole chunk, do Ctrl+Shift+Enter on PC or Cmd+Shift+Enter on a Mac.
 54 | 
 55 | When you start a new script, the first thing you want to do is to load your packages.
 56 | Say we want to load the dplyr that we just installed, we will do:
 57 | 
 58 | ```{r}
 59 | library(dplyr) 
 60 | ```
 61 | 
 62 | The command is `library()`.
 63 | It's not very intuitive, but this is the command to load packages that we already installed.
 64 | The dplyr package is very useful in data arrangement, which we will cover in depth in the next lesson.
 65 | 
 66 | ## Sections in a .Rmd file
 67 | 
 68 | You can see the section headers are in blue color.
 69 | To make a new section header, you can type `#` followed by a space, and then any text at the first position of any line in the plain text area.
 70 | In addition, `##` will be a sub-header, and `###` will be a sub-sub-header, and so on.
 71 | 
 72 | You can view all the sections at the file outline.
 73 | The shortcut is Ctrl+Shift+O for PC and Cmd+Shift+O for Mac.
 74 | You'll see that section titles will show up to the right of the script area.
 75 | If you click on any section titles, RStudio will jump to that section.
 76 | 
 77 | # Basic syntax in R
 78 | 
 79 | ## Single value items
 80 | 
 81 | In R, there are 3 common types of single value items: logical, numeric and character.
 82 | For example,
 83 | 
 84 | *   TRUE or FALSE are logical
 85 | *   1, 2, 3 ... are numeric 
 86 | *   "a", "b", "c" ... are character (note the quotes)
 87 | 
 88 | This is pretty self-explanatory.
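
If you are ever unsure what type an item is, you can ask R with the `class()` command. A quick sketch:

```r
class(TRUE)  # "logical"
class(1)     # "numeric"
class("a")   # "character"
```

Note that `"1"` in quotes is a character, not a number, so `class("1")` returns `"character"`.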
 89 | 
 90 | ## Vectors
 91 | 
 92 | The term vector in R is a bit different from the traditional sense of vector in math.
 93 | It stands for a list of items that are the same type,
 94 | i.e., a list that is all numeric, or all character, but not mixed.
 95 | 
 96 | To create a vector, you use the `c()` command.
 97 | For example, I want to create a vector of 1, 2, 3, the code will be
 98 | 
 99 | ```{r}
100 | c(1, 2, 3)
101 | ```
102 | 
103 | Now, to save my vector (or anything) as an R item, you will use the `<-` syntax.
104 | This is called the left arrow.
105 | Let's say we are saving this vector of c(1, 2, 3) as an item called `a` then
106 | 
107 | ```{r}
108 | a <- c(1, 2, 3)
109 | ```
110 | 
111 | When you run this, something happens.
112 | If you look at the environment at the upper right, you should see 'a' shows up as a saved item.
113 | Remember to use a different name for different items.
114 | If you save them as the same name, the later item will just overwrite the previous.
115 | 
116 | Say I want a vector consisting of apple, banana, and orange, and save it as an item named `b`, what do I do?
117 | 
118 | ```{r}
119 | b <- c("apple", "banana", "orange")
120 | b
121 | ```
122 | 
123 | Note that apple, banana and orange all have to be in quotes `" "`.
124 | The quotes `" "` specify that the terms apple, banana and orange are characters,
125 | not items in the environment.
126 | Forgetting to use quotes is a very common source of bugs.
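
Since a vector can only hold one type, R quietly converts mixed inputs to a single type. Here is a small sketch of what happens when you mix types (the item name `mixed` is made up for illustration):

```r
mixed <- c(1, "apple", TRUE)

mixed         # "1" "apple" "TRUE"
class(mixed)  # "character"
```

Everything was converted to character, because that is the only type that can represent all three members.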
127 | 
128 | ### Subsetting vectors
129 | 
130 | Say you want to select a member of a vector by position,
131 | you will use the square bracket `[ ]` syntax.
132 | Say I want the 2nd member in vector `a`, the code will be
133 | 
134 | ```{r}
135 | a[2]
136 | ```
137 | 
138 | If I want anything but the 2nd member, it will be
139 | 
140 | ```{r}
141 | a[-2]
142 | ```
143 | 
144 | If I want the 1st and 2nd member, it will be
145 | 
146 | ```{r}
147 | a[c(1, 2)]
148 | ```
149 | 
150 | If I want anything but the 1st and 2nd member, it will be
151 | 
152 | ```{r}
153 | a[-c(1, 2)]
154 | ```
155 | 
156 | You don't have to memorize how to do anything.
157 | If you ever get confused, you can always go look at the base R cheat sheet.
158 | 
159 | ## Matrix
160 | 
161 | A matrix is rectangular data that is all the same type, e.g., all numeric.
162 | For example, if I want a matrix like this:
163 | 
164 | | 1 | 2 | 3 | 4 |   
165 | | 1 | 2 | 1 | 2 |   
166 | | 2 | 3 | 2 | 4 |   
167 | 
168 | What do I do?
169 | An easy way to do it is to combine different vectors.
170 | 
171 | ```{r}
172 | my_matrix <- rbind(
173 |   c(1, 2, 3, 4),
174 |   c(1, 2, 1, 2),
175 |   c(2, 3, 2, 4)
176 | )
177 | 
178 | my_matrix
179 | ```
180 | 
181 | The `rbind()` command takes vectors of the same length, and binds them as rows.
182 | The matrix has 3 rows, so I bind 3 vectors together.
183 | (Note: there is a `cbind()` command, which binds vectors as columns.)
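
To see the difference between the two commands, here is a small sketch that binds the same two vectors both ways. The item names `by_rows` and `by_cols` are made up for illustration.

```r
by_rows <- rbind(c(1, 2, 3), c(4, 5, 6))  # each vector becomes a row
by_cols <- cbind(c(1, 2, 3), c(4, 5, 6))  # each vector becomes a column

dim(by_rows)  # 2 3 (2 rows, 3 columns)
dim(by_cols)  # 3 2 (3 rows, 2 columns)

# One is the transpose of the other
all(by_cols == t(by_rows))  # TRUE
```

The `dim()` command reports the number of rows first, then the number of columns.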
184 | 
185 | ### Subsetting a matrix
186 | 
187 | Now let's talk about selecting members of a matrix.
188 | We will use the `[ ]` syntax again.
189 | Since a matrix has two dimensions, now the `[ ]` takes two inputs, which is [rows, columns]
190 | 
191 | Say I want the 1st and 2nd rows of this matrix, what do I do?
192 | 
193 | ```{r}
194 | my_matrix[c(1,2), ]
195 | ```
196 | 
197 | In the rows area of the `[ ]`, I put in `c(1,2)`, which selects the 1st and 2nd rows.
198 | I want all the columns, so I left the column area blank.
199 | 
200 | Now if I want the 2nd and 4th columns of this matrix, what do I do?
201 | 
202 | ```{r}
203 | my_matrix[, c(2,4)] 
204 | ```
205 | 
206 | Now I left the rows area blank, and specified `c(2, 4)` in the columns area.
207 | 
208 | So if I want the 1st and 2nd rows and 2nd and 4th columns, what do I do?
209 | 
210 | ```{r}
211 | my_matrix[c(1,2), c(2,4)]    
212 | ```
213 | 
214 | Pretty straightforward.
215 | 
216 | ## Data frame
217 | 
218 | The last type of items I will cover today is data frame.
219 | Data frame is the technical jargon for data tables in R.
220 | We will cover data frames more in depth in the next lesson.
221 | Today we will just have a brief introduction. 
222 | 
223 | Say I have 3 apples, 4 bananas and 2 oranges,
224 | and I want to organize this info into a table, and call this table fruits.
225 | What do I do?
226 | 
227 | ```{r}
228 | fruits <- data.frame(
229 |   name = c("apple", "banana", "orange"),
230 |   count = c(3, 4, 2)
231 | )
232 | 
233 | fruits
234 | ```
235 | 
236 | That will do it.
237 | 
238 | To make a new table, you can use the `data.frame()` command.
239 | Inside `data.frame()`, you then type out each column as a vector.
240 | In this example, the columns are name and count.
241 | 
242 | Notice that in a data frame, all members of a column are the same type.
243 | For example, the count column is all numeric.
244 | This is different from a matrix,
245 | because in a matrix, all members are the same type across rows and columns.
246 | Also notice data frames have column names.
247 | In this case, the column names are name and count.
248 | 
249 | You can pull out a particular column using its name.
250 | For example, say I want to pull out the count column, what do I do?
251 | 
252 | You will use the `$` syntax.
253 | 
254 | ```{r}
255 | fruits$count
256 | ```
257 | 
258 | If I want the name column, it would be `fruits$name` instead.
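
Because `$` returns a column as a plain vector, you can pass it straight to other commands. A short sketch using the same fruits table (the type shown for the name column assumes R 4.0 or later, where text columns stay as characters by default):

```r
fruits <- data.frame(
  name = c("apple", "banana", "orange"),
  count = c(3, 4, 2)
)

sapply(fruits, class)  # type of each column: name is character, count is numeric
fruits[["count"]]      # another way to write fruits$count
sum(fruits$count)      # 9
```
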
259 | 
260 | # Operations
261 | 
262 | Now let's go over some basic operations.
263 | Say I want to add 0.5 to each member of my matrix,
264 | then take the log2 of each member then take the square root, what do I do?
265 | 
266 | ```{r}
267 | sqrt(log2(my_matrix + 0.5))
268 | ```
269 | 
270 | See, nothing complicated, just an interactive calculator.
271 | You see how the parentheses of `log2()` and `sqrt()` are nested.
272 | You can imagine if you do this for many steps, there will be too many parentheses to keep track of.
273 | To avoid the too many nested parentheses problem, you can use the pipe syntax.
274 | 
275 | The pipe is `%>%`
276 | The pipe comes from the magrittr package and is re-exported by the dplyr package,
277 | so loading the dplyr package is enough to use the pipe.
278 | We already did that earlier when we did `library(dplyr)`.
279 | 
280 | The shortcut is Ctrl+Shift+M on a PC and Cmd+Shift+M on a Mac.
281 | I think Ctrl+Shift+M also works on a Mac.
282 | When you use the pipe, it takes the output of the previous line and feeds it as the input for the next line.
283 | In our example, it should look like:
284 | 
285 | ```{r}
286 | (my_matrix + 0.5) %>% 
287 |   log2() %>% 
288 |   sqrt() 
289 | ```
290 | 
291 | See how this is much clearer and easier to read than nested parentheses?
292 | It's a good practice to start a new line after `%>%`, which makes your code easier to read.
293 | `%>%` is a good trick.
294 | Missing parentheses are another common source of bugs, and `%>%` really reduces that.
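
If you want to convince yourself that the piped and nested versions do the same thing, you can save both results and compare them. A quick sketch (the item names `nested` and `piped` are made up for illustration):

```r
library(dplyr)

my_matrix <- rbind(
  c(1, 2, 3, 4),
  c(1, 2, 1, 2),
  c(2, 3, 2, 4)
)

nested <- sqrt(log2(my_matrix + 0.5))

piped <- (my_matrix + 0.5) %>%
  log2() %>%
  sqrt()

identical(nested, piped)  # TRUE
```

As a side note, R version 4.1 and later also has a built-in pipe written as `|>`, which works similarly for simple cases like this one. These lessons use `%>%` throughout.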
295 | 
296 | ## Add comments or skip lines in the code chunk
297 | 
298 | Sometimes it's nice to annotate your code in-line as well.
299 | The way to do it is using the `#` symbol in the code chunk.
300 | Any text on the same line after a `#` sign will be ignored by R in a code chunk.
301 | For example:
302 | 
303 | ```{r}
304 | (my_matrix + 0.5) %>% #add 0.5 to every value first
305 |   log2() %>% #then take log2()
306 |   sqrt() #lastly take square root 
307 | ```
308 | 
309 | If you add the `#` sign at the very start of the line, it will skip that entire line.
310 | 
311 | ```{r}
312 | c(4, 4, 4) %>% 
313 |   # log2() %>%
314 |   # mean() %>% 
315 |   sqrt()
316 | ```
317 | 
318 | This is helpful when you realize you want to skip a step (or steps),
319 | but you don't want to remove the code.
320 | You can simply add a `#` sign to the start of the lines that you want to skip, and R will ignore them.
321 | The shortcut is Ctrl+Shift+C on a PC and Cmd+Shift+C on a Mac.
322 | 
323 | # Exercise
324 | 
325 | Today you have learned some basic syntax of R.
326 | Now it's time for you to practice.
327 | 
328 | ## Q1:
329 | 
330 | 1.  Insert a new code chunk
331 | 2.  Make this matrix
332 | 
333 | | 1  | 1 | 2 | 2 |   
334 | | 2  | 2 | 1 | 2 |   
335 | | 2  | 3 | 3 | 4 |   
336 | | 1  | 2 | 3 | 4 |   
337 | 
338 | and save it as an item called `my_mat2`.
339 | 
340 | 3. Select the 1st and 3rd rows and the 1st, 2nd and 4th columns, and save it as an item.
341 | 
342 | 4. Take the square root for each member of my_mat2, then take log2(), and lastly find the maximum value.
343 | Use the pipe syntax. The command for maximum is `max()`.
344 | 
345 | ## Q2:
346 | 
347 | 1.  Use the following info to make a data frame and save it as an item called "grade".
348 |     Adel got 85 on the exam, Bren got 83, and Cecil got 93.
349 |     Their letter grades are B, B, and A, respectively.
350 |     (Hint: How many columns do you have to have?)
351 | 
352 | 2. Pull out the column with the scores.
353 |     Use the `$` syntax.
354 | 


--------------------------------------------------------------------------------
/Lessons/03_Intro_to_data_vis.md:
--------------------------------------------------------------------------------
  1 | # Introduction 
  2 | 
  3 | In this lesson, we will cover: 
  4 | 
  5 | 1. How are data represented in visualizations? 
  6 | 2. What is "the grammar of graphics"? 
  7 | 3. Introduction to ggplot: mapping variables to axes and aesthetics. 
  8 | 
  9 | # Required packages
 10 | ```{r}
 11 | library(tidyverse)
 12 | library(RColorBrewer)
 13 | ```
 14 | 
 15 | The `tidyverse` will be doing all the heavy lifting. 
 16 | The `RColorBrewer` will be providing some nice color sets for our graphs. 
 17 | 
 18 | # How are data represented in visualizations?
 19 | As an example, let's simulate some data: 
 20 | ```{r}
 21 | set.seed(666)
 22 | Example_data1 <- rbind(
 23 |   rnorm(3, mean = 5, sd = 0.5) %>%
 24 |     as.data.frame() %>% 
 25 |     mutate(time = 1),
 26 |   rnorm(3, mean = 8, sd = 0.5) %>% 
 27 |     as.data.frame() %>% 
 28 |     mutate(time = 2),
 29 |   rnorm(3, mean = 10, sd = 0.5) %>%
 30 |     as.data.frame() %>% 
 31 |     mutate(time = 3)
 32 | ) %>% 
 33 |   rename(response = ".")
 34 | 
 35 | Example_data1
 36 | ```
 37 | |response|time|
 38 | |--------|----|
 39 | |5.376656|   1|
 40 | |6.007177|   1|
 41 | |4.822433|   1|
 42 | |9.014084|   2|
 43 | |6.891563|   2|
 44 | |8.379198|   2|
 45 | |9.346907|   3|
 46 | |9.598740|   3|
 47 | |9.103880|   3|
 48 | 
 49 | As you can see, the data are stored in this table. This is already a tidy data frame. 
 50 | It has two columns, "response" and "time". It appears each time point has 3 observations. 
 51 | 
 52 | If you were to just print this table, it would be a perfectly correct representation of the data.  
 53 | For example, if you read the numbers, you probably notice the response variable goes up as time progresses.  
 54 | 
 55 | But that is NOT the point of data visualization. 
 56 | The purpose of data visualization is to create a visual and intuitive representation of the data. 
 57 | A good data visualization allows us to gain insight into the data at a glance. 
 58 | That being said, what are the ways to visually represent data? 
 59 | 
 60 | ## Represent data by position
 61 | The first way to represent data is by position: each value is shown as a location along an axis. 
 62 | ```{r}
 63 | Example_data1 %>% 
 64 |   ggplot(aes(x = time, y = response)) +
 65 |   stat_summary(geom = "line", fun.data = mean_se, 
 66 |                size = 1.1, alpha = 0.8, color = "grey60") +
 67 |   geom_point(aes(fill = as.factor(time)), shape = 21, color = "grey20",
 68 |              alpha = 0.8, size = 3, position = position_jitter(0.1, seed = 666)) +
 69 |   stat_summary(geom = "errorbar", fun.data = mean_se,
 70 |                width = 0.05, alpha = 0.8) +
 71 |   scale_fill_manual(values = brewer.pal(8, "Accent")) +
 72 |   scale_x_continuous(breaks = c(1, 2, 3)) +
 73 |   labs(x = "Time point",
 74 |        y = "Response",
 75 |        title = "Dot/line graphs are\nposition based") +
 76 |   theme_classic() +
 77 |   theme(
 78 |     legend.position = "none",
 79 |     text = element_text(size = 14, color = "black"),
 80 |     axis.text = element_text(color = "black"),
 81 |     plot.title = element_text(size = 12)
 82 |   )
 83 | 
 84 | ggsave("../Results/03_line_plot.png", height = 2.5, width = 2)
 85 | ```
 86 | ![line graph](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_line_plot.png)
 87 | 
 88 | In this case, I graphed time on the x-axis and response on the y-axis. 
 89 | This visualization is position based. 
 90 | 
 91 | ## Represent data by length/size 
 92 | ```{r}
 93 | Example_data1 %>% 
 94 |   ggplot(aes(x = time, y = response)) +
 95 |   stat_summary(geom = "bar", fun.data = mean_se, aes(color = as.factor(time)),
 96 |                size = 1.1, alpha = 0.8, fill = NA, width = 0.5) +
 97 |   geom_point(aes(fill = as.factor(time)), shape = 21, color = "grey20",
 98 |              alpha = 0.8, size = 3, position = position_jitter(0.1, seed = 666)) +
 99 |   stat_summary(geom = "errorbar", fun.data = mean_se,
100 |                width = 0.05, alpha = 0.8) +
101 |   scale_fill_manual(values = brewer.pal(8, "Accent")) +
102 |   scale_color_manual(values = brewer.pal(8, "Accent")) +
103 |   scale_x_continuous(breaks = c(1, 2, 3)) +
104 |   labs(x = "Time point",
105 |        y = "\nResponse",
106 |        title = "Bar graphs are\nlength based") +
107 |   theme_classic() +
108 |   theme(
109 |     legend.position = "none",
110 |     text = element_text(size = 14, color = "black"),
111 |     axis.text = element_text(color = "black"),
112 |     plot.title = element_text(size = 12) 
113 |   )
114 | 
115 | ggsave("../Results/03_bar.png", height = 2.5, width = 2)
116 | ```
117 | ![bar graph](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_bar.png) 
118 | 
119 | In this case, I graphed time on the x-axis and response on the y-axis. 
120 | This visualization is length based. Why?
121 | In bar plots, values are represented by the distance from the x axis, and thus the length of the bar. 
122 | 
123 | Never confuse position and length based visualizations! 
124 | Can you tell what is wrong with the graph on the right? 
125 | 
126 | ![What is wrong?](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Position_and_length_based_visualizations.svg)
127 | 
128 | ## Represent data by color 
129 | Finally, data can be represented by color. 
130 | We will explore this more when we talk about heatmaps and whatnot. 
131 | 
132 | # ggplot: data visualization using the grammar of graphics
133 | As you can see in the examples, I've been using `ggplot` to produce graphs. 
134 | The "gg" in `ggplot` stands for grammar of graphics, a theoretical framework to visualize data. 
135 | In grammar of graphics, each visualization has the following layers: 
136 | 
137 | 1. Coordinates: what is on the x and y axes? 
138 | 2. Geometric objects ("geom"): e.g., bars, points, lines... 
139 | 3. Mapping: mapping colors, size, and transparency to geometric objects.
140 | 
141 | For more resources, you can look at the [ggplot cheat sheet](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf). 
142 | 
143 | Let's build a scatter plot for our example data. 
144 | Usually I start with the data table and pipe it into `ggplot()` using the pipe `%>%` syntax.
145 | Inside the `ggplot()` function, there is a call to `aes()` (short for aesthetics). 
146 | Inside `aes()`, you specify which columns go on the x and y axes. 
147 | Not intuitive, but it is what it is. 
148 | 
149 | ```{r}
150 | Example_data1 %>% 
151 |   ggplot(aes(x = time, y = response)) +
152 |   theme_classic()
153 | 
154 | ggsave("../Results/03_scatter_1.png", height = 2.5, width = 2)
155 | ```
156 | ![step1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_1.png)
157 | 
158 | If we just do that, we get a blank graph with time on x axis and response on y axis. 
159 | The `theme_classic()` layer only changes the appearance of the background; read more [here](https://ggplot2.tidyverse.org/reference/ggtheme.html).
160 | ggplot comes with a collection of themes. My go-to is "classic". It has a white background and axis lines, as you can see in this example. 
161 | 
162 | The graph is blank because we haven't added any geometric objects (geom) yet. 
163 | For the sake of this example only, let's make a scatter plot. So we need some points. 
164 | The command to add points in `ggplot` is `geom_point()`. 
165 | 
166 | ```{r}
167 | Example_data1 %>% 
168 |   ggplot(aes(x = time, y = response)) +
169 |   geom_point() +
170 |   theme_classic()
171 | 
172 | ggsave("../Results/03_scatter_2.png", height = 2.5, width = 2)
173 | ```
174 | ![step2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_2.png)
175 | 
176 | This is what the graph looks like after we add points. 
177 | To include more layers, you literally add (using the `+` syntax) layers to the existing code. 
178 | After the `ggplot()` line which initiates the graph, I usually add the geom layers first, then other layers, then lastly the `theme` layer. 
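
Because layers are added with `+`, a plot can also be saved as an item and built up one step at a time. The sketch below is only an illustration: it assumes the built-in `mtcars` data set (not part of this lesson's data), and the item names `p0`, `p1` and `p2` are made up.

```{r}
library(ggplot2)

# mtcars ships with R; it stands in for the lesson's data here.
p0 <- ggplot(mtcars, aes(x = wt, y = mpg))  # initiates the graph; still blank
p1 <- p0 + geom_point()                     # add a geom layer
p2 <- p1 + theme_classic()                  # add the theme layer last

p2  # printing the item draws the plot
```

Each step returns a complete plot, so you can print `p0`, `p1` or `p2` to see how the graph changes as layers are added.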
179 | 
180 | For a minimal plot, we are done! 
181 | 
182 | # Mapping variables to ggplot geom
183 | The example data we are using has only 2 variables: time and response. What if we have more than 2? 
184 | To demonstrate that, let's use another example. 
185 | 
186 | I am going to read in child mortality, children/women, and income data. 
187 | We will use those data for another example. For the sake of this lesson, we will just use year 1945. 
188 | 
189 | First, read in data. 
190 | ```{r}
191 | child_mortality <- read_csv("../Data/child_mortality_0_5_year_olds_dying_per_1000_born.csv", col_types = cols()) 
192 | babies_per_woman <- read_csv("../Data/children_per_woman_total_fertility.csv", col_types = cols()) 
193 | income <- read_csv("../Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", col_types = cols()) 
194 | ```
195 | 
196 | Second, re-shape data into tidy format. 
197 | ```{r}
198 | babies_per_woman_tidy <- babies_per_woman %>% 
199 |   pivot_longer(names_to = "year", values_to = "birth", cols = c(2:302))  
200 | 
201 | child_mortality_tidy <- child_mortality %>% 
202 |   pivot_longer(names_to = "year", values_to = "death_per_1000_born", cols = c(2:302))  
203 | 
204 | income_tidy <- income %>% 
205 |   pivot_longer(names_to = "year", values_to = "income", cols = c(2:242))  
206 | ```
207 | 
208 | Third, join them together. 
209 | ```{r}
210 | example2_data <- babies_per_woman_tidy %>% 
211 |   inner_join(child_mortality_tidy, by = c("country", "year")) %>% 
212 |   inner_join(income_tidy, by = c("country", "year"))
213 | 
214 | head(example2_data)
215 | ```
216 | 
217 | Fourth, filter for year 1945.
218 | ```{r}
219 | example2_data_1945 <- example2_data %>% 
220 |   filter(year == "1945")
221 | 
222 | head(example2_data_1945)
223 | ```
224 | 
225 | 
226 | You should be able to do all that relatively easily after completing [Intro_to_tidy_data](https://github.com/cxli233/Quick_data_vis/blob/main/Lessons/02_Intro_to_tidy_data.md). 
227 | If not, practice more! 
228 | 
229 | Say we want birth rate on x axis, mortality on y axis, make a scatter plot, and color the dots by income, what should we do? 
230 | Let's make a plain scatter plot first. 
231 | 
232 | ```{r}
233 | example2_data_1945 %>% 
234 |   ggplot(aes(x = birth, y = death_per_1000_born)) +
235 |   geom_point() +
236 |   labs(x = "No. children per woman",
237 |        y = "Child mortality/1000 born",
238 |        title = "Year 1945") +
239 |   theme_classic()
240 | 
241 | ggsave("../Results/03_scatter_3.png", width = 2.5, height = 2.5)
242 | ```
243 | ![basic scatter](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_3.png)
244 | 
245 | That looks fine. 
246 | To color dots (or any geom) based on a variable in the data frame, you call `aes()` inside the geom function (here, `geom_point()`).
247 | 
248 | ```{r}
249 | example2_data_1945 %>% 
250 |   ggplot(aes(x = birth, y = death_per_1000_born)) +
251 |   geom_point(aes(color = log10(income))) +
252 |   labs(x = "No. children per woman",
253 |        y = "Child mortality/1000 born",
254 |        title = "Year 1945") +
255 |   theme_classic()
256 | 
257 | ggsave("../Results/03_scatter_4.png", width = 3, height = 2.5)
258 | ```
259 | ![scatter, colored](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_4.png)
260 | 
261 | Note: income is quite unevenly distributed, so I log-transformed it. 
262 | Now we have dots colored by income (in log10 scale).
263 | The problem is the default color scale in ggplot looks bad. 
264 | We can use a different set of colors from the `RColorBrewer` package. 
265 | You can look at all the available color scales in `RColorBrewer` [here](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). 
266 | Let's use something easier to see than the default ggplot blue. 
267 | 
268 | ```{r}
269 | example2_data_1945 %>% 
270 |   ggplot(aes(x = birth, y = death_per_1000_born)) +
271 |   geom_point(aes(color = log10(income))) +
272 |   scale_color_gradientn(colours = brewer.pal(9, "YlGnBu")) +
273 |   labs(x = "No. children per woman",
274 |        y = "Child mortality/1000 born",
275 |        title = "Year 1945") +
276 |   theme_classic()
277 | 
278 | ggsave("../Results/03_scatter_5.png", width = 3, height = 2.5)
279 | ```
280 | ![scatter, colored, better](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_5.png)
281 | 
282 | To provide a custom color scale that produces a color gradient, use `scale_color_gradientn()`.
283 | There is a whole family of `scale` functions in ggplot, which allow you to control the scales for colors and axes.  
284 | You can read more about scales on the [ggplot cheat sheet](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf). 
285 | I quite like the yellow-green-blue color gradient from `RColorBrewer`, which is what I used here. 
286 | As you can see, low income countries have more children/woman and also higher child mortality. 
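
A common point of confusion is the difference between mapping a variable to color inside `aes()` and setting one fixed color outside of it. The sketch below uses the built-in `mtcars` data set as a stand-in (an assumption for illustration only; it is not part of this lesson's data), and the item names `mapped` and `fixed` are made up.

```{r}
library(ggplot2)

# Inside aes(): color is mapped to a variable, so each dot's color depends on hp.
mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = hp)) +
  theme_classic()

# Outside aes(): color is a fixed setting, so every dot gets the same color.
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue") +
  theme_classic()

mapped
fixed
```

Only the mapped version gets a legend, because only there does color carry information from the data.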
287 | 
288 | # Exercise 
289 | Graph income (in log10 scale) on x axis, child mortality on y axis, and color with children/woman in year 2010. 
290 | Was the trend similar to year 1945? 
291 | Save the graph using `ggsave()`. 
292 | 
293 |  
294 | 
295 | 


--------------------------------------------------------------------------------
/Scripts/01_Intro_to_R.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Very basics of R coding"
  3 | author: "Chenxin Li"
  4 | date: "01/06/2023"
  5 | output:
  6 |   html_notebook:
  7 |     number_sections: yes
  8 |     toc: yes
  9 |     toc_float: yes
 10 | ---
 11 | 
 12 | ```{r setup, include=FALSE}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | ```
 15 | 
 16 | # Introduction
 17 | 
 18 | This is a quick introduction to R! 
 19 | 
 20 | As a resource, here's the link to the [base R cheat sheet](https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf)
 21 | 
 22 | # R, RStudio, packages and R Markdown
 23 | ## R packages
 24 | 
 25 | A key feature of R is that it has a lot of packages.
 26 | A package is a collection of R functions that other people wrote and published.
 27 | We can use pre-existing packages, so that we don't have to reinvent the wheel.
 28 | A difference between R and many other coding languages is that in R you learn more about how to use other packages, and less about coding stuff yourself.
 29 | This is very advantageous for those who do not have a traditional computer science background.
 30 | 
 31 | When you need to install packages, you use the `install.packages()` command.
 32 | For example, say you need to install a package called `dplyr`,
 33 | which is a package that we'll use.
 34 | You will type `install.packages("dplyr")` at the console.
 35 | 
 36 | Alternatively, you can click the Packages tab at the lower right.
 37 | Then you select Install.
 38 | In the pop-up window, type in the package you need to install and click install.
 39 | We will talk about how to load an installed package later.
 40 | 
 41 | ## The R Markdown (.Rmd) file
 42 | 
 43 | R Markdown is an enhanced R script file.
 44 | You will need the `rmarkdown` package. 
 45 | R markdown files are more user-friendly than a common R script.
 46 | 
 47 | In a .Rmd file, there are two kinds of text.
 48 | One is plain text, which is what you are reading right now.
 49 | 
 50 | The other is code chunk, which R will actually run.
 51 | 
 52 | To insert a code chunk, you can click the Code drop-down menu from the top.
 53 | Then select Insert Chunk.
 54 | 
 55 | Alternatively, you can do Ctrl+Alt+I on a PC, or Cmd+Opt+I on a Mac.
 56 | When you do that, this will show up:
 57 | 
 58 | ```{r}
 59 | 
 60 | ```
 61 | 
 62 | The above is a code chunk.
 63 | If you click the green triangle at the upper right, it will run the entire chunk.
 64 | The advantage of having chunks is that you can run code chunk by chunk, and debug chunk by chunk.
 65 | This is very user-friendly.
 66 | If you want to run a particular line in the chunk, you do Ctrl+Enter on a PC or Cmd+Enter on a Mac.
 67 | I think Opt+Enter also works on a Mac.
 68 | To run the whole chunk, do Ctrl+Shift+Enter on PC or Cmd+Shift+Enter on a Mac.
 69 | 
 70 | When you start a new script, the first thing you want to do is to load your packages.
 71 | Say we want to load the dplyr that we just installed, we will do:
 72 | 
 73 | ```{r}
 74 | library(dplyr) 
 75 | ```
 76 | 
 77 | The command is `library()`.
 78 | It's not very intuitive, but this is the command to load packages that we already installed.
 79 | The dplyr package is very useful in data arrangement, which we will cover in depth in the next lesson.
 80 | 
 81 | ## Sections in a .Rmd file
 82 | 
 83 | You can see the section headers are in blue color.
 84 | To make a new section header, you can type `#` followed by a space, and then any text at the first position of any line in the plain text area.
 85 | In addition, `##` will be a sub-header, and `###` will be a sub-sub-header, and so on.
 86 | 
 87 | You can view all the sections at the file outline.
 88 | The shortcut is Ctrl+Shift+O for PC and Cmd+Shift+O for Mac.
 89 | You'll see that section titles will show up to the right of the script area.
 90 | If you click on any section titles, RStudio will jump to that section.
 91 | 
 92 | # Basic syntax in R
 93 | 
 94 | ## Single value items
 95 | 
 96 | In R, there are 3 common types of single value items: logical, numeric and character.
 97 | For example,
 98 | 
 99 | *   `TRUE` or `FALSE` are logical
100 | *   `1`, `2`, `3` ... are numeric 
101 | *   `"a"`, `"b"`, `"c"` ... are character (note the quotes)
102 | 
103 | This is pretty self-explanatory.
104 | 
105 | ## Vectors
106 | 
107 | The term vector in R is a bit different from the traditional sense of vector in math.
108 | It stands for a list of items that are the same type,
109 | i.e., a list that is all numeric, or all character, but not mixed.
110 | 
111 | To create a vector, you use the `c()` command.
112 | For example, I want to create a vector of 1, 2, 3, the code will be
113 | 
114 | ```{r}
115 | c(1, 2, 3)
116 | ```
117 | 
118 | Now, to save my vector (or anything) as an R item, you will use the `<-` syntax.
119 | This is called the left arrow.
120 | Let's say we are saving this vector of c(1, 2, 3) as an item called `a` then
121 | 
122 | ```{r}
123 | a <- c(1, 2, 3)
124 | ```
125 | 
126 | When you run this, something happens.
127 | If you look at the environment at the upper right, you should see 'a' shows up as a saved item.
128 | Remember to use a different name for different items.
129 | If you save them as the same name, the later item will just overwrite the previous.
130 | 
131 | Say I want a vector consisting of apple, banana, and orange, and save it as an item named `b`, what do I do?
132 | 
133 | ```{r}
134 | b <- c("apple", "banana", "orange")
135 | b
136 | ```
137 | 
138 | Note that apple, banana and orange all have to be in quotes `" "`.
139 | The quotes `" "` specify that the terms apple, banana and orange are characters,
140 | not items in the environment.
141 | Forgetting to use quotes is a very common source of bugs.
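
To check what type a vector is, you can use `class()`. The short sketch below also shows what happens if you try to mix types in one vector: R does not give an error, it quietly converts everything to a single type. The item name `mixed` is made up for this illustration.

```{r}
a <- c(1, 2, 3)
b <- c("apple", "banana", "orange")

class(a)  # "numeric"
class(b)  # "character"

# Mixing types: everything is converted to character.
mixed <- c(1, "apple", TRUE)
class(mixed)  # "character"
mixed
```

This is why a vector is always all one type, even if you did not intend it to be.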
142 | 
143 | ### Subsetting vectors
144 | 
145 | Say if you want to select a member of a vector by position,
146 | you will use the square bracket `[ ]` syntax.
147 | Say I want the 2nd member in vector `a`, the code will be
148 | 
149 | ```{r}
150 | a[2]
151 | ```
152 | 
153 | If I want anything but the 2nd member, it will be
154 | 
155 | ```{r}
156 | a[-2]
157 | ```
158 | 
159 | If I want the 1st and 2nd member, it will be
160 | 
161 | ```{r}
162 | a[c(1, 2)]
163 | ```
164 | 
165 | If I want anything but the 1st and 2nd member, it will be
166 | 
167 | ```{r}
168 | a[-c(1, 2)]
169 | ```
170 | 
171 | You don't have to memorize how to do anything.
172 | If you ever get confused, you can always go look at the base R cheat sheet.
173 | 
174 | ## Matrix
175 | 
176 | A matrix is rectangular data that is all the same type, e.g., all numeric.
177 | For example, if I want a matrix like this:
178 | 
179 | | 1 | 2 | 3 | 4 |
180 | | 1 | 2 | 1 | 2 |
181 | | 2 | 3 | 2 | 4 |
182 | 
183 | What do I do?
184 | An easy way to do it is to combine different vectors.
185 | 
186 | ```{r}
187 | my_matrix <- rbind(
188 |   c(1, 2, 3, 4),
189 |   c(1, 2, 1, 2),
190 |   c(2, 3, 2, 4)
191 | )
192 | 
193 | my_matrix
194 | ```
195 | 
196 | The `rbind()` command takes vectors of the same length and binds them as rows.
197 | The matrix has 3 rows, so I bind 3 vectors together.
198 | (Note: there is a `cbind()` command, which binds vectors as columns.)
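
As a quick sketch of the `cbind()` alternative, the same matrix can be built column by column. The item name `my_matrix_by_col` is made up for this illustration.

```{r}
# Row by row, as above
my_matrix <- rbind(
  c(1, 2, 3, 4),
  c(1, 2, 1, 2),
  c(2, 3, 2, 4)
)

# Column by column: each vector is now one column, read top to bottom
my_matrix_by_col <- cbind(
  c(1, 1, 2),
  c(2, 2, 3),
  c(3, 1, 2),
  c(4, 2, 4)
)

my_matrix_by_col
identical(my_matrix, my_matrix_by_col)  # TRUE
```

Which one to use depends on whether it is easier to think of your data as rows or as columns.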
199 | 
200 | ### Subsetting a matrix
201 | 
202 | Now let's talk about selecting members of a matrix.
203 | We will use the `[ ]` syntax again.
204 | Since a matrix has two dimensions, the `[ ]` now takes two inputs: `[rows, columns]`.
205 | 
206 | Say I want the 1st and 2nd rows of this matrix, what do I do?
207 | 
208 | ```{r}
209 | my_matrix[c(1,2), ]
210 | ```
211 | 
212 | In the rows area of the `[ ]`, I put in `c(1,2)`, which selects the 1st and 2nd rows.
213 | I want all the columns, so I left the column area blank.
214 | 
215 | Now if I want the 2nd and 4th column of this matrix, what do I do?
216 | 
217 | ```{r}
218 | my_matrix[, c(2,4)] 
219 | ```
220 | 
221 | Now I left the rows area blank, and specify `c(2, 4)` in the columns area.
222 | 
223 | So if I want the 1st and 2nd rows and 2nd and 4th columns, what do I do?
224 | 
225 | ```{r}
226 | my_matrix[c(1,2), c(2,4)]    
227 | ```
228 | 
229 | Pretty straightforward.
230 | 
231 | ## Data frame
232 | 
233 | The last type of item I will cover today is the data frame.
234 | Data frame is the technical jargon for data tables in R.
235 | We will cover data frames more in depth in the next lesson.
236 | Today we will just have a brief introduction. 
237 | 
238 | Say I have 3 apples, 4 bananas and 2 oranges,
239 | and I want to organize this info into a table, and call this table fruits.
240 | What do I do?
241 | 
242 | ```{r}
243 | fruits <- data.frame(
244 |   name = c("apple", "banana", "orange"),
245 |   count = c(3, 4, 2)
246 | )
247 | 
248 | fruits
249 | ```
250 | 
251 | That will do it.
252 | 
253 | To make a new table, you can use the `data.frame()` command.
254 | Inside `data.frame()`, you then type out each column as a vector.
255 | In this example, the columns are name and count.
256 | 
257 | Notice that in a data frame, all members of a column are the same type.
258 | For example, the count column is all numeric.
259 | This is different from a matrix,
260 | because in a matrix, all members are the same type across rows and columns.
261 | Also notice data frames have column names.
262 | In this case, the column names are name and count.
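
A quick way to see the column names and the type of each column is `str()`. This is a minimal sketch; the comments assume R 4.0 or later, where text columns stay as character by default.

```{r}
fruits <- data.frame(
  name = c("apple", "banana", "orange"),
  count = c(3, 4, 2)
)

str(fruits)          # one line per column, showing its name and type
class(fruits$name)   # "character" (assuming R 4.0 or later)
class(fruits$count)  # "numeric"
```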
263 | 
264 | You can pull out a particular column using its name.
265 | For example, say I want to pull out the count column, what do I do?
266 | 
267 | You will use the `$` syntax.
268 | 
269 | ```{r}
270 | fruits$count
271 | ```
272 | 
273 | If I want the name column, it would be `fruits$name` instead.
274 | 
275 | # Operations
276 | 
277 | Now let's go over some basic operations.
278 | Say I want to add 0.5 to each member of my matrix,
279 | then take the log2 of each member then take the square root, what do I do?
280 | 
281 | ```{r}
282 | sqrt(log2(my_matrix + 0.5))
283 | ```
284 | 
285 | See, nothing complicated, just an interactive calculator.
286 | You see how the parentheses of `log2()` and `sqrt()` are nested.
287 | You can imagine if you do this for many steps, there will be too many parentheses to keep track of.
288 | To avoid the too many nested parentheses problem, you can use the pipe syntax.
289 | 
290 | The pipe is `%>%`.
291 | The pipe comes from the magrittr package and is also made available by the dplyr package,
292 | so if you want to use the pipe, you can load the dplyr package first.
293 | We already did that earlier when we did `library(dplyr)`.
294 | 
295 | The shortcut is Ctrl+Shift+M on a PC and Cmd+Shift+M on a Mac.
296 | I think Ctrl+Shift+M also works on a Mac.
297 | When you use the pipe, it takes the output of the previous line and feeds it as the input for the next line.
298 | In our example, it should look like:
299 | 
300 | ```{r}
301 | (my_matrix + 0.5) %>% 
302 |   log2() %>% 
303 |   sqrt() 
304 | ```
305 | 
306 | See how this is much clearer and easier to read than nested parentheses?
307 | It's a good practice to start a new line after `%>%`, which makes your code easier to read.
308 | `%>%` is a good trick.
309 | Missing parentheses are another common source of bugs, and `%>%` really reduces that.
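
If you want to convince yourself that the nested version and the piped version do the same thing, you can save both results and compare them. This is a small sketch; the item names `nested` and `piped` are made up for the illustration.

```{r}
library(dplyr)

my_matrix <- rbind(
  c(1, 2, 3, 4),
  c(1, 2, 1, 2),
  c(2, 3, 2, 4)
)

# Nested parentheses: read from the inside out
nested <- sqrt(log2(my_matrix + 0.5))

# Pipe: read from top to bottom
piped <- (my_matrix + 0.5) %>% 
  log2() %>% 
  sqrt()

identical(nested, piped)  # TRUE: same calculation, written two ways
```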
310 | 
311 | ## Add comments or skip lines in the code chunk
312 | 
313 | Sometimes it's nice to annotate your code in-line as well.
314 | The way to do it is using the `#` symbol in the code chunk.
315 | Any text on the same line after a `#` sign will be ignored by R in a code chunk.
316 | For example:
317 | 
318 | ```{r}
319 | (my_matrix + 0.5) %>% #add 0.5 to every value first
320 |   log2() %>% #then take log2()
321 |   sqrt() #lastly take square root 
322 | ```
323 | 
324 | If you add the `#` sign at the very start of the line, it will skip that entire line.
325 | 
326 | ```{r}
327 | c(4, 4, 4) %>% 
328 |   # log2() %>%
329 |   # mean() %>% 
330 |   sqrt()
331 | ```
332 | 
333 | This is helpful when you realize you want to skip a step (or steps),
334 | but you don't want to remove the code.
335 | You can simply add a `#` sign to the start of the lines that you want to skip, and R will ignore them.
336 | The shortcut is Ctrl+Shift+C on a PC and Cmd+Shift+C on a Mac.
337 | 
338 | # Exercise
339 | 
340 | Today you have learned some basic syntax of R.
341 | Now it's time for you to practice.
342 | 
343 | ## Q1:
344 | 
345 | 1.  Insert a new code chunk
346 | 2.  Make this matrix
347 | 
348 | | 1  | 1 | 2 | 2 |
349 | | 2  | 2 | 1 | 2 |
350 | | 2  | 3 | 3 | 4 |
351 | | 1  | 2 | 3 | 4 |
352 | 
353 | and save it as an item called `my_mat2`.
354 | 
355 | 3. Select the 1st and 3rd rows and the 1st, 2nd and 4th columns, and save it as an item.
356 | 
357 | 4. Take the square root for each member of my_mat2, then take log2(), and lastly find the maximum value.
358 | Use the pipe syntax. The command for maximum is `max()`.
359 | 
360 | ## Q2:
361 | 
362 | 1.  Use the following info to make a data frame and save it as an item called "grade".
363 |     Adel got 85 on the exam, Bren got 83, and Cecil got 93.
364 |     Their letter grades are B, B, and A, respectively.
365 |     (Hint: How many columns do you have to have?)
366 | 
367 | 2. Pull out the column with the scores.
368 |     Use the `$` syntax.
369 | 


--------------------------------------------------------------------------------
/Scripts/04_Intro_mean_separation.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "mean_separation"
  3 | author: "Chenxin Li"
  4 | date: "2023-01-07"
  5 | output:
  6 |   html_notebook:
  7 |     number_sections: yes
  8 |     toc: yes
  9 |     toc_float: yes
 10 | ---
 11 | 
 12 | ```{r setup, include=FALSE}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | ```
 15 | 
 16 | # Introduction
 17 | In this lesson we will explore the best practice for mean separation visualizations. 
 18 | Mean separation is one of the most common graph types in scientific publications. 
 19 | In these graphs, you have two or more groups.
 20 | These groups can be experimental treatments, genotypes, conditions, and so on. 
 21 | Within each group, there are multiple observations. 
 22 | The goal of the visualization is to represent the mean (the central tendency) and spread of the data. 
 23 | 
 24 | We will explore: 
 25 | 
 26 | 1. Is your bar plot hiding data from you?  
 27 | 2. What is the optimal graph type given the number of observations? 
 28 | 3. How to handle mean separation plots in multifactorial experiments? 
 29 | 
 30 | # Packages 
 31 | ```{r}
 32 | library(tidyverse)
 33 | library(readxl)
 34 | library(RColorBrewer)
 35 | library(ggbeeswarm)
 36 | ```
 37 | 
 38 | We have a new package `ggbeeswarm`.
 39 | `ggbeeswarm` is useful for mean separation graphs with many overlapping dots. 
 40 | You will see that in a moment. 
 41 | 
 42 | # Is your bar plot hiding data from you? 
 43 | For a first example, let's simulate some data, then make a plain old bar plot. 
 44 | You can ignore this code chunk. 
 45 | ```{r}
 46 | set.seed(666)
 47 | group1 <- rnorm(n = 100, mean = 1, sd = 1)
 48 | group2 <- rlnorm(n = 100, 
 49 |                  meanlog = log(1^2/sqrt(1^2 + 1^2)), 
 50 |                  sdlog = sqrt(log(1+(1^2/1^2))))
 51 | 
 52 | groups_long <- cbind(
 53 |   group1,
 54 |   group2
 55 | ) %>% 
 56 |   as.data.frame() %>% 
 57 |   pivot_longer(cols = 1:2, names_to = "group", values_to = "response")
 58 | 
 59 | head(groups_long)
 60 | ```
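
For the curious, the `meanlog` and `sdlog` values in the chunk above appear to be chosen so that the log-normal group has a mean of 1 and a standard deviation of 1 on the original scale, matching group1. The sketch below checks this with the standard log-normal formulas; the item names `m`, `s`, `lnorm_mean` and `lnorm_sd` are made up for this illustration.

```{r}
m <- 1  # target mean on the original scale
s <- 1  # target standard deviation on the original scale

meanlog <- log(m^2 / sqrt(s^2 + m^2))
sdlog <- sqrt(log(1 + s^2 / m^2))

# Theoretical mean and sd of a log-normal distribution with these parameters
lnorm_mean <- exp(meanlog + sdlog^2 / 2)
lnorm_sd <- sqrt((exp(sdlog^2) - 1) * exp(2 * meanlog + sdlog^2))

c(mean = lnorm_mean, sd = lnorm_sd)
```

So the two groups are designed to share the same mean and standard deviation while having different shapes.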
 61 | 
 62 | I simulated two groups, each with 100 observations. Now let's make a bar plot. 
 63 | To make a bar plot, we will use `geom_bar()`. 
 64 | 
 65 | ```{r}
 66 | groups_long %>% 
 67 |   ggplot(aes(x = group, y = response)) +
 68 |   geom_bar(stat = "summary", fun = mean, width = 0.5,
 69 |            aes(fill = group)) +
 70 |   stat_summary(geom = "errorbar", fun.data = mean_se, 
 71 |                width = 0.1) +
 72 |   scale_fill_manual(values = brewer.pal(8, "Accent")) +
 73 |   theme_classic() +
 74 |   theme(
 75 |     legend.position = "none"
 76 |   )
 77 | 
 78 | ggsave("../Results/04_bar.png", height = 2.5, width = 2)
 79 | ```
 80 | `stat = "summary", fun = mean` inside `geom_bar()` calculates the mean for each group,
 81 | such that the bar lengths are the mean, not the individual observations. 
 82 | The `stat_summary()` layer adds the means +/- standard errors (SE) as error bars,
 83 | which is specified by `fun.data = mean_se`. 
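
To make it concrete what `mean_se` returns, here is a minimal sketch that computes the mean +/- SE by hand and compares it with the helper from `ggplot2`. The values in `x` and the item names `se` and `by_hand` are made up for illustration.

```{r}
library(ggplot2)

x <- c(4.8, 5.4, 6.0)  # made-up values

se <- sd(x) / sqrt(length(x))  # standard error of the mean
by_hand <- c(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)

by_hand
mean_se(x)  # same three numbers, returned as a one-row data frame
```

The `ymin` and `ymax` values are the two ends of each error bar.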
 84 | 
 85 | As you can see in this simple bar plot, the two groups are _very_ similar in terms of mean and SE.
 86 | But is this the full story? Let's make another plot. 
 87 | 
 88 | ```{r}
 89 | groups_long %>% 
 90 |   ggplot(aes(x = group, y = response)) +
 91 |   geom_boxplot(aes(fill = group), width = 0.5) +
 92 |   scale_fill_manual(values = brewer.pal(8, "Accent")) +
 93 |   theme_classic() +
 94 |   theme(
 95 |     legend.position = "none"
 96 |   )
 97 | 
 98 | ggsave("../Results/04_box.png", height = 2.5, width = 2)
 99 | ```
100 | Now I've made a box plot. 
101 | In a box plot, the center line is median. 
102 | The box spans interquartile range (from 25th percentile to 75th percentile). 
103 | The whiskers extend to the most extreme observations within 1.5 IQR of the box edges (or to the min and max, whichever is less extreme). 
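
As a rough sketch of where these numbers come from, the chunk below computes the box and whisker positions by hand for a small made-up vector. The values in `x` and all item names are made up for illustration, and it assumes R's default quantile method.

```{r}
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # made-up values; 30 is far from the rest

q <- quantile(x, probs = c(0.25, 0.5, 0.75))  # box edges and center line
iqr <- q[["75%"]] - q[["25%"]]                # interquartile range

# Whiskers stop at the most extreme observations within 1.5 * IQR of the box
upper_whisker <- max(x[x <= q[["75%"]] + 1.5 * iqr])
lower_whisker <- min(x[x >= q[["25%"]] - 1.5 * iqr])

c(lower_whisker = lower_whisker, q, upper_whisker = upper_whisker)
x[x > upper_whisker | x < lower_whisker]  # anything beyond the whiskers
```

Observations beyond the whiskers are the ones a box plot draws as individual dots.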
104 | 
105 | Looking at the box plots, we can notice that group2 is more skewed (more outliers on the upper end).
106 | In this case, our simple bar plot was hiding data from us. 
107 | 
108 | For mean separation plots, my go-to is plotting all dots with lateral offsets based on density. 
109 | What does that even mean? Let's graph it first and I will explain. 
110 | ```{r}
111 | groups_long %>% 
112 |   ggplot(aes(x = group, y = response)) +
113 |   geom_quasirandom(aes(color = group), alpha = 0.8) +
114 |   scale_color_manual(values = brewer.pal(8, "Accent")) +
115 |   theme_classic() +
116 |   theme(
117 |     legend.position = "none"
118 |   )
119 | 
120 | ggsave("../Results/04_dots.png", height = 2.5, width = 2)
121 | ```
122 | 
123 | The offset dots layer is provided by `geom_quasirandom()` of the `ggbeeswarm` package.
124 | The name comes from the way the spread of dots looks like a swarm of bees.
125 | As you can see, this is not just a dot plot. 
126 | The dots are offset to the left or right when there are multiple observations around the same value.
127 | The term "quasirandom" means the dots are offset more if the density is higher. 
128 | The result is that the dots lay themselves out into the distribution of the data. 
129 | 
130 | A final note is that we have 100 observations per group. 
131 | Even with `geom_quasirandom()` and offsets, there are still overlapping dots. 
132 | This is called "overplotting" in data visualization. 
133 | I turned on `alpha = 0.8` inside `geom_quasirandom()` to make the dots a bit transparent. 
134 | `alpha = 0.8` means 80% opacity, or 20% transparency. 
135 | This way, when dots do overlap, we can at least see through them. 
136 | 
137 | Comparing the bar plot, the box plot, and the dot plot, 
138 | we can see that while the mean and SE are similar between the groups, 
139 | their underlying distributions are very different. 
140 | The bar plot itself cannot convey this aspect of our data. 
141 | 
142 | **Take home message**: I _highly_ recommend plotting all data points in mean separation graphs. 
143 | 
144 | # What is the optimal graph type given the number of observations? 
145 | Be very careful with data that have a low number of observations (n < 30) or a very large number of observations (n > 100). 
146 | 
147 | For data with large number of observations, as we just saw in the dot plot, 
148 | even with lateral offset, you might still get overlapping dots.
149 | A good practice is to turn on transparency of the dots. 
150 | Another solution is to make a violin plot. 
151 | 
152 | ```{r}
153 | groups_long %>% 
154 |   ggplot(aes(x = group, y = response)) +
155 |   geom_quasirandom(aes(color = group), alpha = 0.8) +
156 |   geom_violin(fill = NA) +
157 |   geom_boxplot(width = 0.1, fill = NA) +
158 |   scale_color_manual(values = brewer.pal(8, "Accent")) +
159 |   theme_classic() +
160 |   theme(
161 |     legend.position = "none"
162 |   )
163 | 
164 | ggsave("../Results/04_dots_violin.png", height = 2.5, width = 2)
165 | ```
166 | In this case, I overlaid both a violin plot and a box plot onto the dots. 
167 | The width of the violin plot shows the distribution, 
168 | which as you can see follows the offset of the dots.
169 | A quick note: I turned on `fill = NA` in `geom_violin()` and `geom_boxplot()` so that we can see through them and see the dots under them. 
170 | The default has white (non-transparent) fill. 
171 | 
172 | You can read more about fancy mean separation plots (such as "raincloud plot") [here](https://z3tt.github.io/Rainclouds/). 
173 | 
174 | 
175 | Violin and box plots are great for data with many observations. 
176 | However, they can be misleading in small datasets. 
177 | Here is an example: 
178 | 
179 | ![what's wrong here?](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Beware_of_small_n_box_violin_plot.svg) 
180 | 
181 | Violin plots (or any sort of smoothed distribution curves) and box plots make no sense for small n.
182 | Distributions and quartiles can vary widely with small n, 
183 | even if the underlying observations are similar.
184 | Distribution and quartiles are only meaningful with large n. 
185 | I did an experiment before, where I sampled the same normal distribution several times and computed the quartiles for each sample. 
186 | The quartiles only stabilize when n gets larger than 50.
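
To get a feel for this, here is a minimal base-R sketch in the same spirit. It is not the original experiment: the seed, the sample sizes, the number of repeats, and the helper `q3_spread()` are all assumptions made for illustration. It draws repeated samples from a standard normal distribution and measures how much the estimated 75th percentile varies from sample to sample.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical helper: SD of the estimated 75th percentile across repeated samples
q3_spread <- function(n, reps = 500) {
  q3 <- replicate(reps, quantile(rnorm(n), probs = 0.75))
  sd(q3)
}

sizes  <- c(5, 10, 30, 50, 100)
spread <- sapply(sizes, q3_spread)

# Smaller spread means the quartile estimate is more stable
data.frame(n = sizes, sd_of_q3 = round(spread, 3))
```

The spread shrinks as n grows, which is why boxes and violins drawn from a handful of points can look very different from one sample to the next.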
187 | 
188 | **Take home message**: pay attention to the data set size when designing a mean separation plot! 
189 | 
190 | # Mean separation in multifactorial experiments. 
191 | Multifactorial experiments are very common in science. 
192 | To explain this, let's use a real world example. 
193 | Data from [Matand et al., 2020, BMC Plant Biology](https://link.springer.com/article/10.1186/s12870-020-2243-7).
194 | 
195 | ```{r}
196 | tissue_culture <- read_excel("../Data/LU-Matand - Stem Data.xlsx")
197 | head(tissue_culture)
198 | ```
199 | 
200 | In this experiment, the authors looked at the ability of stem pieces of [daylily](https://en.wikipedia.org/wiki/Daylily) to regrow into a full plant. 
201 | 
202 | They have 19 cultivars of the plant, 3 explant types, and 3 hormone treatments. 
203 | They want to look at the best regeneration condition for each cultivar that they have. 
204 | The ability to regenerate is quantified by the number of buds or shoots. 
205 | 
206 | Multifactorial experiments like this are _very_ common. 
207 | However, we need to pay extra attention to the design of the graphs. 
208 | 
209 | ```{r}
210 | tissue_culture %>% 
211 |   ggplot(aes(x = Variety, y = Buds_Shoots)) +
212 |   geom_bar(stat = "summary", 
213 |            aes(fill = interaction(Explant, Treatment)), 
214 |            fun = mean,
215 |            position = position_dodge2()) +
216 |   labs(x = NULL,
217 |        y = "Num buds and shoots",
218 |        fill = NULL) +
219 |   theme_classic() + 
220 |   theme(axis.text.x = element_text(angle = 45, hjust = 1))
221 | 
222 | ggsave("../Results/04_multi_bar.png", height = 3, width = 6)
223 | ```
224 | I have cultivar on x axis, buds/shoots on y axis, 
225 | and color bars by the combination of explant and treatment. 
226 | This is horrendous! It is extremely difficult to interpret what is going on. 
227 | 
228 | To overcome this problem, I am introducing you to your new friend, facets! 
229 | Faceting is a way to lay out your plot as subplots and better direct the attention of your readers. 
230 | Let me show you an example: 
231 | 
232 | ```{r}
233 | tissue_culture %>% 
234 |   mutate(Treatment = factor(Treatment, 
235 |                             levels = c("T1", "T5", "T10"))) %>% 
236 |   ggplot(aes(x = Treatment, y = Buds_Shoots)) +
237 |   facet_wrap(~ Variety, scales = "free") +
238 |   geom_quasirandom(aes(color = Explant), alpha = 0.8) +
239 |   scale_colour_manual(values = brewer.pal(8, "Set2")) +
240 |   labs(x = "Hormone Treatment",
241 |        y = "Num buds and shoots",
242 |        fill = NULL) +
243 |   theme_classic() +
244 |   theme(legend.position = "bottom") 
245 | 
246 | ggsave("../Results/04_multi_dots.png", height = 6, width = 9)
247 | ```
248 | Now this is a lot better. 
249 | Since the authors want to find the best conditions for each cultivar, we make each cultivar a subplot. 
250 | This is provided by `facet_wrap()`. 
251 | The value after `~` is the name of the column to make subplots from, which is "Variety" in the table. 
252 | From this new plot, it is obvious that the split stem explant and T10 treatment gave the best results for most cultivars, with some exceptions in a cultivar-specific manner. 
253 | 
254 | **Take home message**: facets are very powerful and you should learn to use them! Read more about facets [here](https://ggplot2.tidyverse.org/reference/facet_grid.html). 
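
The linked page documents `facet_grid()`, a sibling of `facet_wrap()` that arranges panels by two variables at once (rows ~ columns). Below is a minimal sketch on made-up data; the data frame `toy` and its columns are hypothetical and only stand in for a table shaped like `tissue_culture`.

```r
library(ggplot2)

# Hypothetical toy data: 2 cultivars x 2 explant types x 2 treatments, 5 reps each
toy <- expand.grid(
  cultivar  = c("A", "B"),
  explant   = c("E1", "E2"),
  treatment = c("T1", "T2"),
  rep       = 1:5
)
set.seed(1)  # arbitrary seed for the simulated counts
toy$buds <- rpois(nrow(toy), lambda = 5)

p <- ggplot(toy, aes(x = treatment, y = buds)) +
  geom_point(alpha = 0.8) +
  facet_grid(explant ~ cultivar) +  # explant types as rows, cultivars as columns
  theme_classic()

p
```

A grid like this works best when both faceting variables have only a few levels; with 19 cultivars, wrapping the panels as done above is likely the more readable choice.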
255 | 
256 | # Exercise
257 | ## Q1
258 | Notice the `scales = "free"` inside `facet_wrap()`. 
259 | Make a new graph but remove the `scales = "free"` inside `facet_wrap()`.
260 | 
261 | Can you explain what happened? 
262 | Can you explain in what scenario it is more appropriate to have `scales = "free"` turned on vs off? 
263 | 
264 | ## Q2 
265 | Here is an example dataset for you to practice with: 
266 | ```{r}
267 | M1 <- data.frame(
268 |   conc = rnorm(n = 8, mean = 0.03, sd = 0.01)
269 | ) %>% 
270 |   mutate(group = "ctrl") %>% 
271 |   rbind(
272 |     data.frame(
273 |       conc = rnorm(n = 6, mean = 0.25, sd = 0.02)
274 |     ) %>% 
275 |       mutate(group = "trt")
276 |   ) %>% 
277 |   mutate(pest = "Pest 1")
278 | 
279 | M2 <- data.frame(
280 |   conc = rnorm(n = 8, mean = 6, sd = 1)
281 | ) %>% 
282 |   mutate(group = "ctrl") %>% 
283 |   rbind(
284 |     data.frame(
285 |       conc = rnorm(n = 6, mean = 5.5, sd = 1.1)
286 |     ) %>% 
287 |       mutate(group = "trt")
288 |   ) %>% 
289 |   mutate(pest = "Pest 2")
290 | 
291 | M3 <- data.frame(
292 |   conc = rnorm(n = 8, mean = 20, sd = 0.5)
293 | ) %>% 
294 |   mutate(group = "ctrl") %>% 
295 |   rbind(
296 |     data.frame(
297 |       conc = rnorm(n = 6, mean = 19.5, sd = 1.2)
298 |     ) %>% 
299 |       mutate(group = "trt")
300 |   ) %>% 
301 |   mutate(pest = "Pest 3")
302 | 
303 | Spray <- rbind(
304 |   M1, M2, M3
305 | )
306 | 
307 | head(Spray)
308 | ```
309 | 
310 | In this hypothetical experiment, I have two treatments: ctrl vs. trt. 
311 | And I measured the occurrence of three different pests after spraying either treatment. 
312 | Make a mean separation plot for this experiment. 
313 | 
314 | Hints: 
315 | 
316 | * Is this a multifactorial experiment? 
317 | * How many observations are there in each group? 
318 | 
319 | Was the treatment effective in controlling any of the three pests? 
320 | Save the graph using `ggsave()`. 
321 | 
322 |  
323 | 
324 | 
325 | 
326 | 
327 | 
328 | 
329 | 
330 | 
331 | 
332 | 
333 | 
334 | 


--------------------------------------------------------------------------------
/Lessons/04_Intro_mean_separation.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # Introduction
  3 | In this lesson we will explore best practices for mean separation visualizations. 
  4 | Mean separation is one of the most common graph types in scientific publications. 
  5 | In these graphs, you have two or more groups.
  6 | These groups can be experimental treatments, genotypes, conditions, and so on. 
  7 | Within each group, there are multiple observations. 
  8 | The goal of the visualization is to represent the mean (the central tendency) and spread of the data. 
  9 | 
 10 | We will explore: 
 11 | 
 12 | 1. Is your bar plot hiding data from you?  
 13 | 2. What is the optimal graph type given the number of observations? 
 14 | 3. How to handle mean separation plots in multifactorial experiments? 
 15 | 
 16 | # Packages 
 17 | ```{r}
 18 | library(tidyverse)
 19 | library(readxl)
 20 | library(RColorBrewer)
 21 | library(ggbeeswarm)
 22 | ```
 23 | 
 24 | We have a new package `ggbeeswarm`.
 25 | `ggbeeswarm` is useful for mean separation graphs with many overlapping dots. 
 26 | You will see that in a moment. 
 27 | 
 28 | # Is your bar plot hiding data from you? 
 29 | For a first example, let's simulate some data, then make a plain old bar plot. 
 30 | You can ignore this code chunk. 
 31 | ```{r}
 32 | set.seed(666)
 33 | group1 <- rnorm(n = 100, mean = 1, sd = 1)
 34 | group2 <- rlnorm(n = 100, 
 35 |                  meanlog = log(1^2/sqrt(1^2 + 1^2)), 
 36 |                  sdlog = sqrt(log(1+(1^2/1^2))))
 37 | 
 38 | groups_long <- cbind(
 39 |   group1,
 40 |   group2
 41 | ) %>% 
 42 |   as.data.frame() %>% 
 43 |   pivot_longer(cols = 1:2, names_to = "group", values_to = "response")
 44 | 
 45 | head(groups_long)
 46 | ```
 47 | 
 48 | I simulated two groups, each with 100 observations. Now let's make a bar plot. 
 49 | To make a bar plot, we will use `geom_bar()`. 
 50 | 
 51 | ```{r}
 52 | groups_long %>% 
 53 |   ggplot(aes(x = group, y = response)) +
 54 |   geom_bar(stat = "summary", fun = mean, width = 0.5,
 55 |            aes(fill = group)) +
 56 |   stat_summary(geom = "errorbar", fun.data = mean_se, 
 57 |                width = 0.1) +
 58 |   scale_fill_manual(values = brewer.pal(8, "Accent")) +
 59 |   theme_classic() +
 60 |   theme(
 61 |     legend.position = "none"
 62 |   )
 63 | 
 64 | ggsave("../Results/04_bar.png", height = 2.5, width = 2)
 65 | ```
 66 | 
 67 | ![bar plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_bar.png)
 68 | 
 69 | `stat = "summary", fun = mean` inside `geom_bar()` calculates the mean for each group,
 70 | such that the bar heights are the group means, not the individual observations. 
 71 | The `stat_summary()` layer adds the means +/- standard errors (SE) as error bars,
 72 | which is specified by `fun.data = mean_se`. 
 73 | 
 74 | As you can see in this simple bar plot, the two groups are _very_ similar in terms of mean and SE.
 75 | But is this the full story? Let's make another plot. 
 76 | 
 77 | ```{r}
 78 | groups_long %>% 
 79 |   ggplot(aes(x = group, y = response)) +
 80 |   geom_boxplot(aes(fill = group), width = 0.5) +
 81 |   scale_fill_manual(values = brewer.pal(8, "Accent")) +
 82 |   theme_classic() +
 83 |   theme(
 84 |     legend.position = "none"
 85 |   )
 86 | 
 87 | ggsave("../Results/04_box.png", height = 2.5, width = 2)
 88 | ```
 89 | 
 90 | ![box plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_box.png)
 91 | 
 92 | Now I've made a box plot. 
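 93 | In a box plot, the center line is the median. 
 94 | The box spans the interquartile range (IQR, from the 25th percentile to the 75th percentile). 
 95 | The whiskers extend from the box to the most extreme observation within 1.5 × IQR of the box edge, so they stop at the min or max when those are less extreme. 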
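
As a quick check of these definitions, base R's `boxplot.stats()` reports the five numbers behind a box plot. This is a sketch on made-up data; also note that `geom_boxplot()` computes its hinges with quantiles, so its box edges can differ slightly from what `boxplot.stats()` reports.

```r
# Toy data (hypothetical) with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

s <- boxplot.stats(x)  # default whisker rule: 1.5 x the box length

s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # observations beyond the whiskers, drawn as individual points
```

Here the upper whisker stops at the largest observation inside the 1.5 × IQR limit, and the extreme value is reported separately as an outlier.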
 96 | 
 97 | Looking at the box plots, we can notice that group2 is more skewed (more outliers on the upper end).
 98 | In this case, our simple bar plot was hiding data from us. 
 99 | 
100 | For mean separation plots, my go-to is plotting all dots with lateral offsets based on density. 
101 | What does that even mean? Let's graph it first and I will explain. 
102 | ```{r}
103 | groups_long %>% 
104 |   ggplot(aes(x = group, y = response)) +
105 |   geom_quasirandom(aes(color = group), alpha = 0.8) +
106 |   scale_color_manual(values = brewer.pal(8, "Accent")) +
107 |   theme_classic() +
108 |   theme(
109 |     legend.position = "none"
110 |   )
111 | 
112 | ggsave("../Results/04_dots.png", height = 2.5, width = 2)
113 | ```
114 | 
115 | ![dot plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_dots.png)
116 | 
117 | The offset dots layer is provided by `geom_quasirandom()` of the `ggbeeswarm` package.
118 | The name comes from the way the spread of dots looks like a swarm of bees.
119 | As you can see, this is not just a dot plot. 
120 | The dots are offset to the left or right when there are multiple observations around the same value.
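121 | The term "quasirandom" refers to how the offsets are picked: a quasirandom sequence spreads the dots evenly within a width set by the local density, so the dots are offset more where the density is higher. 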
122 | The result is that the dots lay themselves out into the distribution of the data. 
123 | 
124 | A final note is that we have 100 observations per group. 
125 | Even with `geom_quasirandom()` and offsets, there are still overlapping dots. 
126 | This is called "overplotting" in data visualization. 
127 | I turned on `alpha = 0.8` inside `geom_quasirandom()` to make the dots a bit transparent. 
128 | `alpha = 0.8` means 80% opacity, or 20% transparency. 
129 | This way, when dots do overlap, we can at least see through them. 
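
How much can we see through stacked dots? A small arithmetic sketch, assuming the graphics device blends overlapping dots with the standard "over" compositing rule: each dot lets through `1 - alpha` of what is behind it, so `k` stacked dots have a combined opacity of `1 - (1 - alpha)^k`.

```r
alpha <- 0.8
k <- 1:3                       # number of dots stacked on the same spot

# Combined opacity under the assumed "over" compositing rule
opacity <- 1 - (1 - alpha)^k

round(opacity, 3)
```

At `alpha = 0.8`, two overlapping dots are already close to fully opaque, so for denser data a lower `alpha` makes the overlaps easier to tell apart.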
130 | 
131 | Comparing the bar plot, the box plot, and the dot plot, 
132 | we can see that while the mean and SE are similar between the groups, 
133 | their underlying distributions are very different. 
134 | The bar plot itself cannot convey this aspect of our data. 
135 | 
136 | **Take home message**: I _highly_ recommend plotting all data points in mean separation graphs. 
137 | 
138 | # What is the optimal graph type given the number of observations? 
139 | Be very careful with data that have a low number of observations (n < 30) or a very large number of observations (n > 100). 
140 | 
141 | For data with large number of observations, as we just saw in the dot plot, 
142 | even with lateral offset, you might still get overlapping dots.
143 | A good practice is to turn on transparency of the dots. 
144 | Another solution is to make a violin plot. 
145 | 
146 | ```{r}
147 | groups_long %>% 
148 |   ggplot(aes(x = group, y = response)) +
149 |   geom_quasirandom(aes(color = group), alpha = 0.8) +
150 |   geom_violin(fill = NA) +
151 |   geom_boxplot(width = 0.1, fill = NA) +
152 |   scale_color_manual(values = brewer.pal(8, "Accent")) +
153 |   theme_classic() +
154 |   theme(
155 |     legend.position = "none"
156 |   )
157 | 
158 | ggsave("../Results/04_dots_violin.png", height = 2.5, width = 2)
159 | ```
160 | 
161 | ![dot and violin plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_dots_violin.png)
162 | 
163 | In this case, I overlaid both a violin plot and a box plot onto the dots. 
164 | The width of the violin plot shows the distribution, 
165 | which as you can see follows the offset of the dots.
166 | A quick note: I turned on `fill = NA` in `geom_violin()` and `geom_boxplot()` so that we can see through them and see the dots under them. 
167 | The default has white (non-transparent) fill. 
168 | 
169 | You can read more about fancy mean separation plots (such as "raincloud plot") [here](https://z3tt.github.io/Rainclouds/). 
170 | 
171 | 
172 | Violin and box plots are great for data with many observations. 
173 | However, they can be misleading in small datasets. 
174 | Here is an example: 
175 | 
176 | ![what's wrong here?](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Beware_of_small_n_box_violin_plot.svg) 
177 | 
178 | Violin plots (or any sort of smoothed distribution curves) and box plots make no sense for small n.
179 | Distributions and quartiles can vary widely with small n, 
180 | even if the underlying observations are similar.
181 | Distribution and quartiles are only meaningful with large n. 
182 | I did an experiment before, where I sampled the same normal distribution several times and computed the quartiles for each sample. 
183 | The quartiles only stabilize when n gets larger than 50.
184 | 
185 | **Take home message**: pay attention to the data set size when designing a mean separation plot! 
186 | 
187 | # Mean separation in multifactorial experiments. 
188 | Multifactorial experiments are very common in science. 
189 | To explain this, let's use a real world example. 
190 | Data from [Matand et al., 2020, BMC Plant Biology](https://link.springer.com/article/10.1186/s12870-020-2243-7).
191 | 
192 | ```{r}
193 | tissue_culture <- read_excel("../Data/LU-Matand - Stem Data.xlsx")
194 | head(tissue_culture)
195 | ```
196 | 
197 | In this experiment, the authors looked at the ability of stem pieces of [daylily](https://en.wikipedia.org/wiki/Daylily) to regrow into a full plant. 
198 | 
199 | They have 19 cultivars of the plant, 3 explant types, and 3 hormone treatments. 
200 | They want to look at the best regeneration condition for each cultivar that they have. 
201 | The ability to regenerate is quantified by the number of buds or shoots. 
202 | 
203 | Multifactorial experiments like this are _very_ common. 
204 | However, we need to pay extra attention to the design of the graphs. 
205 | 
206 | ```{r}
207 | tissue_culture %>% 
208 |   ggplot(aes(x = Variety, y = Buds_Shoots)) +
209 |   geom_bar(stat = "summary", 
210 |            aes(fill = interaction(Explant, Treatment)), 
211 |            fun = mean,
212 |            position = position_dodge2()) +
213 |   labs(x = NULL,
214 |        y = "Num buds and shoots",
215 |        fill = NULL) +
216 |   theme_classic() + 
217 |   theme(axis.text.x = element_text(angle = 45, hjust = 1))
218 | 
219 | ggsave("../Results/04_multi_bar.png", height = 3, width = 6)
220 | ```
221 | ![very bad bar plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_multi_bar.png)
222 | 
223 | I have cultivar on x axis, buds/shoots on y axis, 
224 | and color bars by the combination of explant and treatment. 
225 | This is horrendous! It is extremely difficult to interpret what is going on. 
226 | 
227 | To overcome this problem, I am introducing you to your new friend, facets! 
228 | Faceting is a way to lay out your plot as subplots and better direct the attention of your readers. 
229 | Let me show you an example: 
230 | 
231 | ```{r}
232 | tissue_culture %>% 
233 |   mutate(Treatment = factor(Treatment, 
234 |                             levels = c("T1", "T5", "T10"))) %>% 
235 |   ggplot(aes(x = Treatment, y = Buds_Shoots)) +
236 |   facet_wrap(~ Variety, scales = "free") +
237 |   geom_quasirandom(aes(color = Explant), alpha = 0.8) +
238 |   scale_colour_manual(values = brewer.pal(8, "Set2")) +
239 |   labs(x = "Hormone Treatment",
240 |        y = "Num buds and shoots",
241 |        fill = NULL) +
242 |   theme_classic() +
243 |   theme(legend.position = "bottom") 
244 | 
245 | ggsave("../Results/04_multi_dots.png", height = 6, width = 9)
246 | ```
247 | 
248 | ![much better](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_multi_dots.png) 
249 | 
250 | Now this is a lot better. 
251 | Since the authors want to find the best conditions for each cultivar, we make each cultivar a subplot. 
252 | This is provided by `facet_wrap()`. 
253 | The value after `~` is the name of the column to make subplots from, which is "Variety" in the table. 
254 | From this new plot, it is obvious that the split stem explant and T10 treatment gave the best results for most cultivars, with some exceptions in a cultivar-specific manner. 
255 | 
256 | **Take home message**: facets are very powerful and you should learn to use them! Read more about facets [here](https://ggplot2.tidyverse.org/reference/facet_grid.html). 
257 | 
258 | # Exercise
259 | ## Q1
260 | Notice the `scales = "free"` inside `facet_wrap()`. 
261 | Make a new graph but remove the `scales = "free"` inside `facet_wrap()`.
262 | 
263 | Can you explain what happened? 
264 | Can you explain in what scenario it is more appropriate to have `scales = "free"` turned on vs off? 
265 | 
266 | ## Q2 
267 | Here is an example dataset for you to practice with: 
268 | ```{r}
269 | M1 <- data.frame(
270 |   conc = rnorm(n = 8, mean = 0.03, sd = 0.01)
271 | ) %>% 
272 |   mutate(group = "ctrl") %>% 
273 |   rbind(
274 |     data.frame(
275 |       conc = rnorm(n = 6, mean = 0.25, sd = 0.02)
276 |     ) %>% 
277 |       mutate(group = "trt")
278 |   ) %>% 
279 |   mutate(pest = "Pest 1")
280 | 
281 | M2 <- data.frame(
282 |   conc = rnorm(n = 8, mean = 6, sd = 1)
283 | ) %>% 
284 |   mutate(group = "ctrl") %>% 
285 |   rbind(
286 |     data.frame(
287 |       conc = rnorm(n = 6, mean = 5.5, sd = 1.1)
288 |     ) %>% 
289 |       mutate(group = "trt")
290 |   ) %>% 
291 |   mutate(pest = "Pest 2")
292 | 
293 | M3 <- data.frame(
294 |   conc = rnorm(n = 8, mean = 20, sd = 0.5)
295 | ) %>% 
296 |   mutate(group = "ctrl") %>% 
297 |   rbind(
298 |     data.frame(
299 |       conc = rnorm(n = 6, mean = 19.5, sd = 1.2)
300 |     ) %>% 
301 |       mutate(group = "trt")
302 |   ) %>% 
303 |   mutate(pest = "Pest 3")
304 | 
305 | Spray <- rbind(
306 |   M1, M2, M3
307 | )
308 | 
309 | head(Spray)
310 | ```
311 | 
312 | In this hypothetical experiment, I have two treatments: ctrl vs. trt. 
313 | And I measured the occurrence of three different pests after spraying either treatment. 
314 | Make a mean separation plot for this experiment. 
315 | 
316 | Hints: 
317 | 
318 | * Is this a multifactorial experiment? 
319 | * How many observations are there in each group? 
320 | 
321 | Was the treatment effective in controlling any of the three pests? 
322 | Save the graph using `ggsave()`. 
323 | 
324 |  
325 | 
326 | 
327 | 
328 | 
329 | 
330 | 
331 | 
332 | 
333 | 
334 | 
335 | 
336 | 


--------------------------------------------------------------------------------
/Scripts/07_Intro_elationship_data.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Intro_to_network.Rmd"
  3 | author: "Chenxin Li"
  4 | date: "2023-01-15"
  5 | output: html_notebook
  6 | ---
  7 | 
  8 | ```{r setup, include=FALSE}
  9 | knitr::opts_chunk$set(echo = TRUE)
 10 | ```
 11 | 
 12 | # Introduction
 13 | Network analyses are useful in many fields. 
 14 | They are very powerful in modeling relationship data, which have a wide range of applications. 
 15 | We will explore some of the applications below.
 16 | Given their prevalence and usefulness, networks (and the underlying relationship data) deserve a lesson of their own.  
 17 | 
 18 | In this lesson, we will explore: 
 19 | 
 20 | 1. What underlies a network?
 21 | 2. How to visualize a network? 
 22 | 
 23 | # Packages 
 24 | ```{r}
 25 | library(tidyverse)
 26 | library(igraph)
 27 | library(ggraph)
 28 | library(readxl)
 29 | library(RColorBrewer)
 30 | ```
 31 | 
 32 | We have two new packages, `igraph` and `ggraph`. 
 33 | `igraph` is a network analysis package that can do a lot of heavy lifting when it comes to dealing with networks. 
 34 | `ggraph` is a `ggplot` extension for network visualization. 
 35 | 
 36 | # What underlies a network? 
 37 | ![insert network here]()
 38 | 
 39 | A network consists of nodes and edges. 
 40 | Nodes are points, and edges are lines connecting nodes. 
 41 | Nodes are in fact observations or entities. 
 42 | Edges are the relationship between nodes. 
 43 | When nodes and edges are combined to make a network, 
 44 | the network becomes an object that contains both observational data (from the nodes) and relationship data (from the edges). 
 45 | (In mathematics, a network is also referred to as a "graph", thus the "graph" in `igraph`.
 46 | Nodes are sometimes also referred to as "vertices".) 
 47 | 
 48 | Thus, a network is fundamentally two tables: 
 49 | 
 50 | 1. an edge table - a data frame that records how nodes are connected to each other. 
 51 | 2. a node table - a data frame that records attributes of nodes. 
 52 | 
 53 | An additional requirement for the node table is that 
 54 | each node in the edge table must appear in the node table once and only once. 
 55 | 
 56 | An edge table can look like this:
 57 | 
 58 | | From | To | Additional info |  
 59 | |:----:|:--:|:---------------:|  
 60 | | Robin| Li | Started 09/2021 |  
 61 | 
 62 | * The first column of the edge table must be `from`: where the edge starts.
 63 | * The second column of the edge table must be `to`: where the edge ends.
 64 | * Additional information can be added as additional columns.
 65 | 
 66 | In this case, this table has a single edge going from `Robin` to `Li`. 
 67 | 
 68 | A node table can look like this: 
 69 | 
 70 | | Node | Additional info |  
 71 | |:----:|:---------------:|  
 72 | |Robin |    PI           |  
 73 | |   Li |    Postdoc      |  
 74 | 
 75 | In a node table, the first column must be the names of the nodes. 
 76 | Each row is a node. In this case, we have a table of two nodes. 
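
The "once and only once" requirement is easy to check before building a network. Here is a minimal base-R sketch using the Robin and Li tables above; the object names `edges`, `nodes`, `all_present`, and `no_repeats` are made up for illustration.

```r
# The two example tables from above, written as data frames
edges <- data.frame(from = "Robin", to = "Li", info = "Started 09/2021")
nodes <- data.frame(node = c("Robin", "Li"), role = c("PI", "Postdoc"))

# Every name that appears anywhere in the edge table
edge_members <- unique(c(edges$from, edges$to))

all_present <- all(edge_members %in% nodes$node)  # each one is in the node table
no_repeats  <- !any(duplicated(nodes$node))       # and no node is listed twice

all_present && no_repeats
```

If either check fails, it is worth fixing the tables first rather than debugging the network object later.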
 77 | 
 78 | # How to visualize a network? 
 79 | ## Make a network object from edges and nodes 
 80 | We will use an example. Here is a hypothetical network that I made up. 
 81 | ```{r}
 82 | example_network_edges <- read_excel("../Data/Example_network_edges.xlsx")
 83 | ```
 84 | Without a given node table, 
 85 | you can actually produce a node table from the edge table. 
 86 | The node table is just a non-redundant list of all members of `from` and `to` in the edge table. 
 87 | 
 88 | ```{r}
 89 | example_network_nodes <- data.frame(
 90 |   nodes = unique(c(example_network_edges$From, example_network_edges$To))
 91 | ) %>% 
 92 |   mutate(nodes = as.character(nodes))
 93 | ```
 94 | 
 95 | `unique(c(example_network_edges$From, example_network_edges$To))` takes the members of `From` and `To` from `example_network_edges` and keeps the non-redundant set. 
 96 | 
 97 | To make a network, use the `graph_from_data_frame()` function from the `igraph` package. 
 98 | ```{r}
 99 | my_network <- graph_from_data_frame(
100 |   d = example_network_edges,
101 |   vertices = example_network_nodes,
102 |   directed = F
103 | )
104 | ```
105 | 
106 | The `d` in `graph_from_data_frame()` specifies the edge table.
107 | `vertices` specifies the node table. 
108 | In this example I set `directed` to `FALSE`.
109 | This means the edge from node 1 to node 2 is the same as the edge from node 2 to node 1. 
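
To see what `directed` changes in practice, here is a small sketch that builds the same one-edge table both ways and asks for the neighbors of `Li`. The one-row edge table reuses the Robin and Li example and is only for illustration.

```r
library(igraph)

# Hypothetical one-edge table: Robin -> Li
edges <- data.frame(from = "Robin", to = "Li")

g_undirected <- graph_from_data_frame(edges, directed = FALSE)
g_directed   <- graph_from_data_frame(edges, directed = TRUE)

# Undirected: the edge works both ways, so Robin is a neighbor of Li
names(neighbors(g_undirected, v = "Li"))

# Directed: following the edge direction (mode = "out"), Li points to no one
length(neighbors(g_directed, v = "Li", mode = "out"))
```

Which setting is right depends on the data: use a directed network when the order of `from` and `to` carries meaning, and an undirected one when it does not.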
110 | 
111 | ## Using ggraph 
112 | To make a network diagram, we will use `ggraph`.
113 | ```{r}
114 | ggraph(my_network, layout = "kk") +
115 |   geom_edge_diagonal(color = "grey70") +
116 |   geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") +
117 |   geom_node_text(repel = T, aes(label = example_network_nodes$nodes)) +
118 |   theme_void()
119 | 
120 | ggsave("../Results/07_my_network1.png", height = 3, width = 3.5, bg = "white")
121 | ```
122 | There are 3 fundamental elements to a network diagram. 
123 | 
124 | 1. Layout, which controls how nodes are placed in the 2-D plane. In this diagram I used "kk" or "Kamada-Kawai" layout algorithm. More about layout algorithms is discussed below. 
125 | 2. Edges. In this example, I used `geom_edge_diagonal()`, which draws curves between nodes.
126 | 3. Nodes. `geom_node_point()` draws each node as a point. 
127 | 
128 | ## Try different network layouts
129 | There are multiple layout algorithms for network diagrams. 
130 | Layouts can drastically change the appearance of networks, making them easier or harder to interpret. 
131 | Here are 3 network diagrams from the same data using different layout algorithms. 
132 | They look very different from each other. Data from: [Li et al., 2022, BioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.04.498697v1.abstract) 
133 | 
134 | My go-to is "kk", which generally gives good results.
135 | But I encourage you to read more about different layouts [here](https://www.data-imaginist.com/2017/ggraph-introduction-layouts/) and [here](https://r-graph-gallery.com/247-network-chart-layouts.html). 
136 | 
137 | ## Mapping variables to networks
138 | ![insert network here]()
139 | 
140 | If you look at the network that we made earlier, all the dots are just black. 
141 | What if we want to color nodes based on additional attributes? 
142 | This is when additional columns of the node table become useful. 
143 | The method is the same as for any other `ggplot` layer: you add an `aes()` call inside `geom_node_point()`. 
144 | 
145 | As a simple example, let's use Robin and Li. 
146 | ```{r}
147 | small_network_edges <- data.frame(
148 |   from = "Robin",
149 |   to = "Li",
150 |   info = "started 09/2021"
151 | )
152 | 
153 | small_network_nodes <- data.frame(
154 |   node = c("Robin", "Li"),
155 |   role = c("PI", "Postdoc")
156 | )
157 | 
158 | small_network <- graph_from_data_frame(
159 |   d = small_network_edges,
160 |   vertices = small_network_nodes,
161 |   directed = T
162 | )
163 | ```
164 | 
165 | Here I wrote the edge and node tables, and made a network object from them. 
166 | Now let's make a network diagram, and color nodes by `role`.
167 | ```{r}
168 | ggraph(small_network, 
169 |        layout = rbind(c(1, 1),
170 |                       c(1, 0))) +
171 |   geom_edge_link(arrow = arrow(length = unit(4, "mm")), 
172 |                  end_cap = circle(4, "mm"), 
173 |                  start_cap = circle(4, "mm"),
174 |                  aes(label = info),
175 |                  angle_calc = "along",
176 |                  vjust = -0.5) +
177 |   geom_node_point(size = 3, aes(color = role)) +
178 |   geom_node_text(repel = T, aes(label = small_network_nodes$node)) +
179 |   theme_void() 
180 | 
181 | ggsave("../Results/07_small_network1.png", height = 2.5, width = 2, bg = "white")
182 | ```
183 | 
184 | Now the network diagram is colored by our roles, PI or postdoc. 
185 | `geom_edge_link()` makes straight lines between nodes. 
186 | There are multiple additional parameters in `geom_edge_link()` to make the arrow and edge label. 
187 | You can find out what these arguments do by removing them and seeing what happens to the diagram. 
188 | 
189 | Note that you can provide manual layout for the network if you want to. 
190 | In this example, I provided a matrix using `layout = rbind(c(1, 1),c(1, 0))`.
191 | This specifies the node `Robin` to be at coordinate x = 1, y = 1;
192 | and node `Li` to be at x = 1, y = 0. 
193 | For a small network you can get away with this. 
194 | 
195 | # How to deconvolute a large network into smaller ones? 
196 | Networks can be very large. To better understand them, we might have to deconstruct them into smaller sub-networks. 
197 | These sub-networks are often referred to as modules. 
198 | There are multiple ways to achieve that. I will introduce the two most common approaches below. 
199 | 
200 | ## Deconvolute using neighbors of nodes of interest
201 | The first approach is to pull out neighbors of nodes of interest. 
202 | Let's look at the example network we made earlier. 
203 | ![insert network here]() 
204 | 
205 | Say we are interested in nodes 1, 14, and 23.
206 | And we want to see what their direct network neighbors are. 
207 | We can do that using the `neighbors()` function in the `igraph` package. 
208 | 
209 | ```{r}
210 | neighbors_of_interest <- c(
211 |   names(neighbors(my_network, v = "1")),  # pull neighbors of node 1
212 |   names(neighbors(my_network, v = "14")), # pull neighbors of node 14 
213 |   names(neighbors(my_network, v = "23")), # pull neighbors of node 23
214 |   c("1", "14", "23") # don't forget 1, 14, 23 themselves (as character, like the names above)
215 | ) %>% 
216 |   unique() # remove redundant nodes, if any. 
217 |   
218 | neighbors_of_interest
219 | ```
220 | We can make a sub-network by filtering nodes from the node table,
221 | as well as filtering for edges that connect them. 
222 | 
223 | ```{r}
224 | my_subnetwork_nodes <- example_network_nodes %>% 
225 |   filter(nodes %in% neighbors_of_interest)
226 | 
227 | my_subnetwork_edges <- example_network_edges %>% 
228 |   filter(From %in% neighbors_of_interest &
229 |            To %in% neighbors_of_interest)
230 | 
231 | my_subnetwork <- graph_from_data_frame(
232 |   d = my_subnetwork_edges,
233 |   vertices = my_subnetwork_nodes,
234 |   directed = F
235 | )
236 | ```
237 | 
238 | We made the sub-network object. 
239 | Now we can make a new diagram. 
240 | ```{r}
241 | ggraph(my_subnetwork, layout = "kk") +
242 |   geom_edge_diagonal(color = "grey70") +
243 |   geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") +
244 |   geom_node_text(repel = T, aes(label = my_subnetwork_nodes$nodes)) +
245 |   theme_void()
246 | 
247 | ggsave("../Results/07_my_network2.png", height = 3, width = 3, bg = "white")
248 | ```
249 | Now we are left with nodes 1, 14, and 23 and their direct neighbors. 
250 | 
251 | ## Deconvolute using module detection 
252 | A very common method of network deconvolution is clustering. 
253 | Clustering looks for highly interconnected nodes and assigns modules accordingly. 
254 | The `igraph` package comes with several clustering algorithms, such as Leiden. 
255 | You can break a network into multiple highly interconnected parts using `cluster_leiden()`. 
256 | 
257 | ```{r}
258 | my_network_modules <- cluster_leiden(graph = my_network,  
259 |                                      objective_function = "modularity") 
260 | 
261 | node_table_clustered <- data.frame(
262 |   nodes = my_network_modules$names,   # Pull node names 
263 |   Module = my_network_modules$membership # Pull membership  
264 | ) %>% 
265 |   arrange(as.numeric(nodes)) %>% 
266 |   inner_join(example_network_nodes, by = "nodes") # add membership info to original node table  
267 | 
268 | head(node_table_clustered)
269 | ```
270 | In this chunk, I ran the Leiden clustering algorithm on my network.
271 | Then I made a new data frame that contains the nodes and which module they belong to.
272 | Finally, I joined the membership information to the original node table. 
273 | 
274 | Now we can make a new network object with the updated node table. 
275 | ```{r}
276 | my_network_clustered <- graph_from_data_frame(d = example_network_edges,
277 |                                          vertices = node_table_clustered,
278 |                                          directed = F)
279 | ```
280 | 
281 | ```{r}
282 | ggraph(my_network_clustered, layout = "kk") +
283 |   geom_edge_diagonal(color = "grey70") +
284 |   geom_node_point(size = 3, shape = 21, color = "white", 
285 |                   aes(fill = as.factor(Module))) +
286 |   geom_node_text(repel = T, aes(label = node_table_clustered$nodes)) +
287 |   scale_fill_manual(values = brewer.pal(8, "Set2")) +
288 |   labs(fill = "Module") +
289 |   theme_void() +
290 |   theme(
291 |     legend.position = c(0.2, 0.25)
292 |   )
293 | 
294 | ggsave("../Results/07_my_network3.png", height = 3, width = 3.5, bg = "white")
295 | ```
296 | The key addition in this code chunk is `aes(fill = as.factor(Module))` inside `geom_node_point()`. 
297 | Now the nodes are colored by which module they belong to. 
298 | You can see that the clustering actually makes sense: 
299 | Nodes that are highly interconnected are assigned to the same module. 
300 | Incidentally, this network has 3 modules. 
301 | While there are multiple connections _within_ each module, 
302 | there is only one edge _between_ each module. 
303 | 
304 | One application of network analysis is in gene expression. 
305 | In gene co-expression networks, each node is a gene, 
306 | and an edge is drawn between two genes if two genes are co-expressed (having very similar expression patterns). 
307 | Gene co-expression modules represent groups of genes that may function in the same biological process. 
308 | 
309 | # Homework 
310 | ## Q1 
311 | Metabolic pathways can be modeled by networks, where each node is a metabolite and each edge is a metabolic enzyme. 
312 | You can read more about how to represent metabolic pathways [here](https://github.com/cxli233/ggpathway). 
313 | 
314 | Here is an example:
315 | ```{r}
316 | calvin_cycle_edges <- read_excel("../Data/Calvin_cycle_edges.xlsx")
317 | calvin_cycle_nodes <- read_excel("../Data/Calvin_cycle_nodes.xlsx")
318 | 
319 | head(calvin_cycle_edges)
320 | head(calvin_cycle_nodes)
321 | ```
322 | Make a network diagram for the Calvin cycle. 
323 | Color each node by how many carbons it has (the `carbon` column in the node table).
324 | Label each metabolite using `geom_node_text(aes(label = name), hjust = 0.5, repel = T)`. 
325 | Save the figure using `ggsave()`.
326 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file.  
327 | 
328 |  
329 | ## Q2 
330 | [Phylogenetic trees](https://en.wikipedia.org/wiki/Phylogenetic_tree) and [dendrograms](https://en.wikipedia.org/wiki/Dendrogram) can be modeled as networks. 
331 | Say we have a tree like this: 
332 | 
333 | ```    
334 |     I
335 |     |
336 |     G - H
337 |     |
338 | A - B - C 
339 |     |   
340 |     D - E
341 |     |
342 |     F
343 | ``` 
344 | 
345 | Make a network diagram and label each node.
346 | Save the diagram using `ggsave()`.
347 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file.  
348 | Hint: You can write the edge table in Excel and import it into R if that's easier. 
349 | 


--------------------------------------------------------------------------------
/Lessons/07_Intro_elationship_data.md:
--------------------------------------------------------------------------------
  1 | # Introduction
  2 | Network analyses are useful in many fields. 
  3 | They are very powerful in modeling relationship data, which have a wide range of applications. 
  4 | We will explore some of the applications below.
  5 | Given their prevalence and usefulness, networks (and the underlying relationship data) deserve a lesson of their own.  
  6 | 
  7 | In this lesson, we will explore: 
  8 | 
  9 | 1. What underlies a network?
 10 | 2. How to handle the visualization of networks? 
 11 | 
 12 | # Packages 
 13 | ```{r}
 14 | library(tidyverse)
 15 | library(igraph)
 16 | library(ggraph)
 17 | library(readxl)
 18 | library(RColorBrewer)
 19 | ```
 20 | 
 21 | We have two new packages, `igraph` and `ggraph`. 
 22 | `igraph` is a network analysis package that does a lot of the heavy lifting when it comes to dealing with networks. 
 23 | `ggraph` is a `ggplot` extension for network visualization. 
 24 | 
 25 | # What underlies a network? 
 26 | ![insert network here](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network1.png)
 27 | 
 28 | A network consists of nodes and edges. 
 29 | Nodes are points, and edges are lines connecting nodes. 
 30 | Nodes are in fact observations or entities. 
 31 | Edges are the relationship between nodes. 
 32 | When nodes and edges are combined to make a network, 
 33 | the network becomes an object that contains both observational data (from the nodes) and relationship data (from the edges). 
 34 | (In mathematics, a network is also referred to as a "graph", thus the "graph" in `igraph`.
 35 | Nodes are sometimes also referred to as "vertices".) 
 36 | 
 37 | Thus, a network is fundamentally two tables: 
 38 | 
 39 | 1. an edge table - a data frame that records how nodes are connected to each other. 
 40 | 2. a node table - a data frame that records attributes of nodes. 
 41 | 
 42 | An additional requirement for the node table is that 
 43 | each node in the edge table must appear in the node table once and only once. 
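
The requirement above can be checked in a few lines of base R before building a network. This is only a minimal sketch using the Robin and Li example; the object names `edge_table` and `node_table` are made up for illustration.

```r
# Hypothetical edge and node tables (names are for illustration only)
edge_table <- data.frame(from = "Robin", to = "Li")
node_table <- data.frame(node = c("Robin", "Li"),
                         role = c("PI", "Postdoc"))

# All nodes mentioned anywhere in the edge table
nodes_in_edges <- unique(c(edge_table$from, edge_table$to))

# Check 1: every node in the edge table is present in the node table
all_present <- all(nodes_in_edges %in% node_table$node)

# Check 2: no node is listed more than once in the node table
no_duplicates <- !any(duplicated(node_table$node))

all_present
no_duplicates
```

If either check comes back `FALSE`, the node table needs fixing before the two tables are combined into a network.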
 44 | 
 45 | An edge table can look like this:
 46 | 
 47 | | From | To | Additional info |  
 48 | |:----:|:--:|:---------------:|  
 49 | | Robin| Li | Started 09/2021 |  
 50 | 
 51 | * The first column of the edge table is read as `from` (where the edge starts), whatever the column is named.
 52 | * The second column of the edge table is read as `to` (where the edge ends), whatever the column is named.
 53 | * Additional information can be added as additional columns.
 54 | 
 55 | In this case, this table has a single edge going from `Robin` to `Li`. 
 56 | 
 57 | A node table can look like this: 
 58 | 
 59 | | Node | Additional info |  
 60 | |:----:|:---------------:|  
 61 | |Robin |    PI           |  
 62 | |   Li |    Postdoc      |  
 63 | 
 64 | In a node table, the first column must be the names of the nodes. 
 65 | Each row is a node. In this case, we have a table of two nodes. 
 66 | 
 67 | # How to visualize a network? 
 68 | ## Make a network object from edges and nodes 
 69 | We will use an example. Here is a hypothetical network that I made up. 
 70 | ```{r}
 71 | example_network_edges <- read_excel("../Data/Example_network_edges.xlsx")
 72 | ```
 73 | Without a given node table, 
 74 | you can actually produce a node table from the edge table. 
 75 | The node table is just a non-redundant list of all members of `from` and `to` in the edge table. 
 76 | 
 77 | ```{r}
 78 | example_network_nodes <- data.frame(
 79 |   nodes = unique(c(example_network_edges$From, example_network_edges$To))
 80 | ) %>% 
 81 |   mutate(nodes = as.character(nodes))
 82 | ```
 83 | 
 84 | `unique(c(example_network_edges$From, example_network_edges$To))` takes the members of `From` and `To` from `example_network_edges` and keeps the non-redundant set. 
 85 | 
 86 | To make a network, use the `graph_from_data_frame()` function from the `igraph` package. 
 87 | ```{r}
 88 | my_network <- graph_from_data_frame(
 89 |   d = example_network_edges,
 90 |   vertices = example_network_nodes,
 91 |   directed = F
 92 | )
 93 | ```
 94 | 
 95 | The `d` in `graph_from_data_frame()` specifies the edge table.
 96 | `vertices` specifies the node table. 
 97 | In this example I set `directed` to `FALSE`.
 98 | This means an edge from node 1 to node 2 is the same as an edge from node 2 to node 1. 
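
To see what `directed` changes, here is a small sketch that builds the same one-edge network both ways and asks `igraph` whether the edge exists in each direction. The objects `one_edge`, `undirected_net`, and `directed_net` are hypothetical and not part of the lesson data.

```r
library(igraph)

# A single hypothetical edge going from Robin to Li
one_edge <- data.frame(from = "Robin", to = "Li")

undirected_net <- graph_from_data_frame(one_edge, directed = FALSE)
directed_net <- graph_from_data_frame(one_edge, directed = TRUE)

# Undirected: the edge is found no matter which node comes first
are_adjacent(undirected_net, "Robin", "Li")
are_adjacent(undirected_net, "Li", "Robin")

# Directed: the edge only goes from Robin to Li, not the other way around
are_adjacent(directed_net, "Robin", "Li")
are_adjacent(directed_net, "Li", "Robin")
```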
 99 | 
100 | ## Using ggraph 
101 | To make a network diagram, we will use `ggraph`.
102 | ```{r}
103 | ggraph(my_network, layout = "kk") +
104 |   geom_edge_diagonal(color = "grey70") +
105 |   geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") +
106 |   geom_node_text(repel = T, aes(label = example_network_nodes$nodes)) +
107 |   theme_void()
108 | 
109 | ggsave("../Results/07_my_network1.png", height = 3, width = 3.5, bg = "white")
110 | ```
111 | ![network 1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network1.png) 
112 | 
113 | There are 3 fundamental elements to a network diagram. 
114 | 
115 | 1. Layout, which controls how nodes are placed in the 2-D plane. In this diagram I used "kk" or "Kamada-Kawai" layout algorithm. More about layout algorithms is discussed below. 
116 | 2. Edges. In this example, I used `geom_edge_diagonal()`, which draws curves between nodes.
117 | 3. Nodes. `geom_node_point()` draws each node as a point. 
118 | 
119 | ## Try different network layouts
120 | There are multiple layout algorithms for network diagrams. 
121 | Layouts can drastically change the appearance of networks, making them easier or harder to interpret. 
122 | Here are 3 network diagrams from the same data using different layout algorithms. 
123 | ![Try different layouts](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/TryDifferentLayouts.svg) 
124 | 
125 | They look very different from each other. Data from: [Li et al., 2022, BioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.04.498697v1.abstract) 
126 | 
127 | My go-to is "kk", which generally gives good results.
128 | But I encourage you to read more about different layouts [here](https://www.data-imaginist.com/2017/ggraph-introduction-layouts/) and [here](https://r-graph-gallery.com/247-network-chart-layouts.html). 
129 | 
130 | ## Mapping variables to networks
131 | ![insert network here](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network1.png)
132 | 
133 | If you look at the network that we made earlier, all the dots are just black. 
134 | What if we want to color nodes based on additional attributes? 
135 | This is when additional columns of the node table become useful. 
136 | The method is the same as for any other `ggplot` layer: you map the variable inside `aes()` within `geom_node_point()`. 
137 | 
138 | As a simple example, let's use Robin and Li. 
139 | ```{r}
140 | small_network_edges <- data.frame(
141 |   from = "Robin",
142 |   to = "Li",
143 |   info = "started 09/2021"
144 | )
145 | 
146 | small_network_nodes <- data.frame(
147 |   node = c("Robin", "Li"),
148 |   role = c("PI", "Postdoc")
149 | )
150 | 
151 | small_network <- graph_from_data_frame(
152 |   d = small_network_edges,
153 |   vertices = small_network_nodes,
154 |   directed = T
155 | )
156 | ```
157 | 
158 | Here I wrote the edge and node tables, and made a network object from them. 
159 | Now let's make a network diagram, and color nodes by `role`.
160 | ```{r}
161 | ggraph(small_network, 
162 |        layout = rbind(c(1, 1),
163 |                       c(1, 0))) +
164 |   geom_edge_link(arrow = arrow(length = unit(4, "mm")), 
165 |                  end_cap = circle(4, "mm"), 
166 |                  start_cap = circle(4, "mm"),
167 |                  aes(label = info),
168 |                  angle_calc = "along",
169 |                  vjust = -0.5) +
170 |   geom_node_point(size = 3, aes(color = role)) +
171 |   geom_node_text(repel = T, aes(label = small_network_nodes$node)) +
172 |   theme_void() 
173 | 
174 | ggsave("../Results/07_small_network1.png", height = 2.5, width = 2, bg = "white")
175 | ```
176 | ![small_network](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_small_network1.png)
177 | 
178 | Now the network diagram is colored by our roles, PI or postdoc. 
179 | `geom_edge_link()` makes straight lines between nodes. 
180 | There are multiple additional parameters in `geom_edge_link()` to make the arrow and edge label. 
181 | You can see what these arguments do by removing them one at a time and watching how the diagram changes. 
182 | 
183 | Note that you can provide a manual layout for the network if you want to. 
184 | In this example, I provided a matrix using `layout = rbind(c(1, 1),c(1, 0))`.
185 | This specifies the node `Robin` to be at coordinate x = 1, y = 1;
186 | and node `Li` to be at x = 1, y = 0. 
187 | For a small network you can get away with this. 
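
The same idea extends to a few more nodes. Below is a rough sketch with a made-up three-node network (`toy_edges`, `toy_nodes`, and `manual_layout` are hypothetical names). It assumes, as in the Robin and Li example, that each row of the layout matrix gives the x and y position of one node, in the order the nodes appear in the node table.

```r
library(igraph)
library(ggraph)

# Hypothetical three-node network: A is connected to B and to C
toy_edges <- data.frame(from = c("A", "A"), to = c("B", "C"))
toy_nodes <- data.frame(node = c("A", "B", "C"))
toy_net <- graph_from_data_frame(toy_edges, vertices = toy_nodes, directed = FALSE)

# One row per node, in node table order; first column is x, second is y
manual_layout <- rbind(c(0, 1),   # A at the top
                       c(-1, 0),  # B at the bottom left
                       c(1, 0))   # C at the bottom right

toy_plot <- ggraph(toy_net, layout = manual_layout) +
  geom_edge_link(color = "grey70") +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), vjust = -1) +
  theme_void()

toy_plot
```

Writing coordinates by hand stops being practical quickly, which is why a layout algorithm such as "kk" is the better choice for anything larger.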
188 | 
189 | # How to deconvolute a large network into smaller ones? 
190 | Networks can be very large. To better understand them, we might have to deconstruct them into smaller sub-networks. 
191 | These sub-networks are often referred to as modules. 
192 | There are multiple ways to achieve that. I will introduce the two most common approaches below. 
193 | 
194 | ## Deconvolute using neighbors of nodes of interest
195 | The first approach is to pull out neighbors of nodes of interest. 
196 | Let's look at the example network we made earlier. 
197 | ![insert network here](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network1.png) 
198 | 
199 | Say we are interested in nodes 1, 14, and 23,
200 | and we want to find their direct network neighbors. 
201 | We can do that using the `neighbors()` function in the `igraph` package. 
202 | 
203 | ```{r}
204 | neighbors_of_interest <- c(
205 |   names(neighbors(my_network, v = "1")),  # pull neighbors of node 1
206 |   names(neighbors(my_network, v = "14")), # pull neighbors of node 14 
207 |   names(neighbors(my_network, v = "23")), # pull neighbors of node 23
208 |   c("1", "14", "23") # don't forget 1, 14, 23 themselves (as character, like the names above)
209 | ) %>% 
210 |   unique() # remove redundant nodes, if any. 
211 |   
212 | neighbors_of_interest
213 | ```
214 | We can make a sub-network by filtering nodes from the node table,
215 | as well as filtering for edges that connect them. 
216 | 
217 | ```{r}
218 | my_subnetwork_nodes <- example_network_nodes %>% 
219 |   filter(nodes %in% neighbors_of_interest)
220 | 
221 | my_subnetwork_edges <- example_network_edges %>% 
222 |   filter(From %in% neighbors_of_interest &
223 |            To %in% neighbors_of_interest)
224 | 
225 | my_subnetwork <- graph_from_data_frame(
226 |   d = my_subnetwork_edges,
227 |   vertices = my_subnetwork_nodes,
228 |   directed = F
229 | )
230 | ```
231 | 
232 | We made the sub-network object. 
233 | Now we can make a new diagram. 
234 | ```{r}
235 | ggraph(my_subnetwork, layout = "kk") +
236 |   geom_edge_diagonal(color = "grey70") +
237 |   geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") +
238 |   geom_node_text(repel = T, aes(label = my_subnetwork_nodes$nodes)) +
239 |   theme_void()
240 | 
241 | ggsave("../Results/07_my_network2.png", height = 3, width = 3, bg = "white")
242 | ```
243 | ![network2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network2.png)
244 | 
245 | Now we are left with nodes 1, 14, and 23 and their direct neighbors. 
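
As an aside, `igraph` can also do the filtering for us. The sketch below is an alternative to the table-filtering approach above, not what the lesson itself uses. It relies on `ego()` and `induced_subgraph()` and runs on a made-up chain network (`toy_edges` and `toy_network` are hypothetical) so that it works without the lesson data.

```r
library(igraph)

# Hypothetical toy network: a chain 1 - 2 - 3 - 4 - 5
toy_edges <- data.frame(from = c("1", "2", "3", "4"),
                        to   = c("2", "3", "4", "5"))
toy_network <- graph_from_data_frame(toy_edges, directed = FALSE)

# ego() returns, for each node of interest, the node itself plus
# all nodes within `order` steps (order = 1 means direct neighbors)
ego_list <- ego(toy_network, order = 1, nodes = c("1", "2"))
nodes_to_keep <- unique(unlist(lapply(ego_list, names)))

# Keep only those nodes and the edges that run among them
toy_subnetwork <- induced_subgraph(toy_network, vids = nodes_to_keep)

V(toy_subnetwork)$name
```

The table-filtering approach has the advantage that you end up with sub-network node and edge tables you can inspect or export, while this version is shorter when all you need is the sub-network object.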
246 | 
247 | ## Deconvolute using module detection 
248 | A very common method of network deconvolution is clustering. 
249 | Clustering looks for highly interconnected nodes and assigns modules accordingly. 
250 | The `igraph` package comes with several clustering algorithms, such as Leiden. 
251 | You can break a network into multiple highly interconnected parts using `cluster_leiden()`. 
252 | 
253 | ```{r}
254 | my_network_modules <- cluster_leiden(graph = my_network,  
255 |                                      objective_function = "modularity") 
256 | 
257 | node_table_clustered <- data.frame(
258 |   nodes = my_network_modules$names,   # Pull node names 
259 |   Module = my_network_modules$membership # Pull membership  
260 | ) %>% 
261 |   arrange(as.numeric(nodes)) %>% 
262 |   inner_join(example_network_nodes, by = "nodes") # add membership info to original node table  
263 | 
264 | head(node_table_clustered)
265 | ```
266 | In this chunk, I ran the Leiden clustering algorithm on my network.
267 | Then I made a new data frame that contains the nodes and which module they belong to.
268 | Finally, I joined the membership information to the original node table. 
269 | 
270 | Now we can make a new network object with the updated node table. 
271 | ```{r}
272 | my_network_clustered <- graph_from_data_frame(d = example_network_edges,
273 |                                          vertices = node_table_clustered,
274 |                                          directed = F)
275 | ```
276 | 
277 | ```{r}
278 | ggraph(my_network_clustered, layout = "kk") +
279 |   geom_edge_diagonal(color = "grey70") +
280 |   geom_node_point(size = 3, shape = 21, color = "white", 
281 |                   aes(fill = as.factor(Module))) +
282 |   geom_node_text(repel = T, aes(label = node_table_clustered$nodes)) +
283 |   scale_fill_manual(values = brewer.pal(8, "Set2")) +
284 |   labs(fill = "Module") +
285 |   theme_void() +
286 |   theme(
287 |     legend.position = c(0.2, 0.25)
288 |   )
289 | 
290 | ggsave("../Results/07_my_network3.png", height = 3, width = 3.5, bg = "white")
291 | ```
292 | ![network3](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network3.png) 
293 | 
294 | The key addition in this code chunk is `aes(fill = as.factor(Module))` inside `geom_node_point()`. 
295 | Now the nodes are colored by which module they belong to. 
296 | You can see that the clustering actually makes sense: 
297 | Nodes that are highly interconnected are assigned to the same module. 
298 | Incidentally, this network has 3 modules. 
299 | While there are multiple connections _within_ each module, 
300 | there is only one edge _between_ each module. 
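
Instead of judging this by eye, you can count the edges within and between modules. Here is a minimal sketch on a made-up network of two triangles joined by one edge; `toy_edges`, `toy_network`, and `toy_modules` are hypothetical names. Because Leiden involves some randomness, I do not promise a particular number of modules here.

```r
library(igraph)

# Hypothetical toy network: two triangles joined by a single edge (c - d)
toy_edges <- data.frame(
  from = c("a", "a", "b", "d", "d", "e", "c"),
  to   = c("b", "c", "c", "e", "f", "f", "d")
)
toy_network <- graph_from_data_frame(toy_edges, directed = FALSE)

toy_modules <- cluster_leiden(toy_network, objective_function = "modularity")

# How many nodes ended up in each module
sizes(toy_modules)

# For every edge: TRUE if it connects two different modules.
# igraph:: is spelled out because tidyr also has a crossing() function.
is_between <- igraph::crossing(toy_modules, toy_network)

sum(!is_between) # number of edges within modules
sum(is_between)  # number of edges between modules
```

The same two `sum()` lines can be run on `my_network` and its modules to check the within versus between pattern described above.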
301 | 
302 | One application of network analysis is in gene expression. 
303 | In gene co-expression networks, each node is a gene, 
304 | and an edge is drawn between two genes if two genes are co-expressed (having very similar expression patterns). 
305 | Gene co-expression modules represent groups of genes that may function in the same biological process. 
306 | 
307 | # Homework 
308 | ## Q1 
309 | Metabolic pathways can be modeled by networks, where each node is a metabolite and each edge is a metabolic enzyme. 
310 | You can read more about how to represent metabolic pathways [here](https://github.com/cxli233/ggpathway). 
311 | 
312 | Here is an example:
313 | ```{r}
314 | calvin_cycle_edges <- read_excel("../Data/Calvin_cycle_edges.xlsx")
315 | calvin_cycle_nodes <- read_excel("../Data/Calvin_cycle_nodes.xlsx")
316 | 
317 | head(calvin_cycle_edges)
318 | head(calvin_cycle_nodes)
319 | ```
320 | Make a network diagram for the Calvin cycle. 
321 | Color each node by how many carbons it has (the `carbon` column in the node table).
322 | Label each metabolite using `geom_node_text(aes(label = name), hjust = 0.5, repel = T)`. 
323 | Save the figure using `ggsave()`.
324 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file.  
325 | 
326 |  
327 | ## Q2 
328 | [Phylogenetic trees](https://en.wikipedia.org/wiki/Phylogenetic_tree) and [dendrograms](https://en.wikipedia.org/wiki/Dendrogram) can be modeled as networks. 
329 | Say we have a tree like this: 
330 | 
331 | ```    
332 |     I
333 |     |
334 |     G - H
335 |     |
336 | A - B - C 
337 |     |   
338 |     D - E
339 |     |
340 |     F
341 | ``` 
342 | 
343 | Make a network diagram and label each node.
344 | Save the diagram using `ggsave()`.
345 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file.  
346 | Hint: You can write the edge table in Excel and import it into R if that's easier. 
347 | 


--------------------------------------------------------------------------------