├── Data ├── README ├── C_elegans_DAPI.xlsx ├── Calvin_cycle_edges.xlsx ├── Calvin_cycle_nodes.xlsx ├── Example_network_edges.xlsx ├── LU-Matand - Stem Data.xlsx ├── dataset2.csv ├── dataset1.csv └── USDA_2022_matrix.csv ├── Results ├── README ├── 08_A.png ├── 08_A1.png ├── 08_A2.png ├── 08_A3.png ├── 08_B.png ├── 08_C.png ├── 03_bar.png ├── 04_bar.png ├── 04_box.png ├── 04_dots.png ├── 05_dot1.png ├── 05_dot2.png ├── 05_pie1.png ├── 06_abc_1.png ├── 06_abc_2.png ├── 08_A_1.png ├── 08_A_2.png ├── 05_stack1.png ├── 05_stack2.png ├── 06_genes_1.png ├── 06_genes_2.png ├── 06_genes_3.png ├── 03_line_plot.png ├── 03_scatter_1.png ├── 03_scatter_2.png ├── 03_scatter_3.png ├── 03_scatter_4.png ├── 03_scatter_5.png ├── 04_multi_bar.png ├── 04_multi_dots.png ├── 08_assembled.png ├── 02_scatter_plot.png ├── 04_dots_violin.png ├── 07_my_network1.png ├── 07_my_network2.png ├── 07_my_network3.png ├── 07_small_network1.png └── 08_assembled_label.png ├── Scripts ├── README ├── 05_Intro_proportions.Rmd ├── 06_Intro_to_heatmaps.Rmd ├── 08_Intro_plot_assembly.Rmd ├── 03_Intro_to_data_vis.Rmd ├── 01_Intro_to_R.Rmd ├── 04_Intro_mean_separation.Rmd └── 07_Intro_elationship_data.Rmd ├── Lessons ├── README ├── 05_Intro_proportions.md ├── 08_Intro_plot_assembly.md ├── 06_Intro_to_heatmaps.md ├── 01_Intro_to_R.md ├── 03_Intro_to_data_vis.md ├── 04_Intro_mean_separation.md └── 07_Intro_elationship_data.md ├── Answer_key ├── readme ├── Lesson3_answer1.png ├── Lesson3_answer2.png ├── Lesson4_answer1.png ├── Lesson4_answer2.png ├── Lesson5_answer1.png ├── Lesson7_answer1.png ├── Lesson7_answer2.png ├── Lesson8_answer1.png ├── 05_Intro_proportions_answer_key.Rmd ├── 01_Intro_to_R_answer_key.Rmd ├── 06_Intro_to_heatmaps_answer_key.Rmd ├── 03_Intro_to_data_vis_answer_key.Rmd ├── 07_Intro_relationship_data_answer_key.Rmd ├── 08_Intro_plot_assembly_answer_key.Rmd ├── 04_Intro_mean_separation_answer_key.Rmd └── 02_Intro_to_tidy_data_answer_key.Rmd ├── Graphical_abstract_2023_01_14.png ├── LICENSE 
└── README.md /Data/README: -------------------------------------------------------------------------------- 1 | Data files for the course will be here. 2 | -------------------------------------------------------------------------------- /Results/README: -------------------------------------------------------------------------------- 1 | Results from the R scripts will be here. 2 | -------------------------------------------------------------------------------- /Scripts/README: -------------------------------------------------------------------------------- 1 | .Rmd Scripts for the course will be here. 2 | -------------------------------------------------------------------------------- /Lessons/README: -------------------------------------------------------------------------------- 1 | Markdown formatted text for the lessons will be here. 2 | -------------------------------------------------------------------------------- /Answer_key/readme: -------------------------------------------------------------------------------- 1 | Answer keys for homework questions are available in this folder. 
2 | -------------------------------------------------------------------------------- /Results/08_A.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A.png -------------------------------------------------------------------------------- /Results/08_A1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A1.png -------------------------------------------------------------------------------- /Results/08_A2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A2.png -------------------------------------------------------------------------------- /Results/08_A3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A3.png -------------------------------------------------------------------------------- /Results/08_B.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_B.png -------------------------------------------------------------------------------- /Results/08_C.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_C.png -------------------------------------------------------------------------------- /Results/03_bar.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_bar.png -------------------------------------------------------------------------------- /Results/04_bar.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_bar.png -------------------------------------------------------------------------------- /Results/04_box.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_box.png -------------------------------------------------------------------------------- /Results/04_dots.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_dots.png -------------------------------------------------------------------------------- /Results/05_dot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_dot1.png -------------------------------------------------------------------------------- /Results/05_dot2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_dot2.png -------------------------------------------------------------------------------- /Results/05_pie1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_pie1.png -------------------------------------------------------------------------------- /Results/06_abc_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_abc_1.png -------------------------------------------------------------------------------- /Results/06_abc_2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_abc_2.png -------------------------------------------------------------------------------- /Results/08_A_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A_1.png -------------------------------------------------------------------------------- /Results/08_A_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_A_2.png -------------------------------------------------------------------------------- /Results/05_stack1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_stack1.png -------------------------------------------------------------------------------- /Results/05_stack2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/05_stack2.png -------------------------------------------------------------------------------- /Results/06_genes_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_genes_1.png -------------------------------------------------------------------------------- /Results/06_genes_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_genes_2.png -------------------------------------------------------------------------------- /Results/06_genes_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/06_genes_3.png 
-------------------------------------------------------------------------------- /Data/C_elegans_DAPI.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/C_elegans_DAPI.xlsx -------------------------------------------------------------------------------- /Results/03_line_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_line_plot.png -------------------------------------------------------------------------------- /Results/03_scatter_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_1.png -------------------------------------------------------------------------------- /Results/03_scatter_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_2.png -------------------------------------------------------------------------------- /Results/03_scatter_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_3.png -------------------------------------------------------------------------------- /Results/03_scatter_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_4.png -------------------------------------------------------------------------------- /Results/03_scatter_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/03_scatter_5.png 
-------------------------------------------------------------------------------- /Results/04_multi_bar.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_multi_bar.png -------------------------------------------------------------------------------- /Results/04_multi_dots.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_multi_dots.png -------------------------------------------------------------------------------- /Results/08_assembled.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_assembled.png -------------------------------------------------------------------------------- /Results/02_scatter_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/02_scatter_plot.png -------------------------------------------------------------------------------- /Results/04_dots_violin.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/04_dots_violin.png -------------------------------------------------------------------------------- /Results/07_my_network1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/07_my_network1.png -------------------------------------------------------------------------------- /Results/07_my_network2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/07_my_network2.png 
-------------------------------------------------------------------------------- /Results/07_my_network3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/07_my_network3.png -------------------------------------------------------------------------------- /Answer_key/Lesson3_answer1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson3_answer1.png -------------------------------------------------------------------------------- /Answer_key/Lesson3_answer2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson3_answer2.png -------------------------------------------------------------------------------- /Answer_key/Lesson4_answer1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson4_answer1.png -------------------------------------------------------------------------------- /Answer_key/Lesson4_answer2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson4_answer2.png -------------------------------------------------------------------------------- /Answer_key/Lesson5_answer1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson5_answer1.png -------------------------------------------------------------------------------- /Answer_key/Lesson7_answer1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson7_answer1.png -------------------------------------------------------------------------------- /Answer_key/Lesson7_answer2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson7_answer2.png -------------------------------------------------------------------------------- /Answer_key/Lesson8_answer1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Answer_key/Lesson8_answer1.png -------------------------------------------------------------------------------- /Data/Calvin_cycle_edges.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/Calvin_cycle_edges.xlsx -------------------------------------------------------------------------------- /Data/Calvin_cycle_nodes.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/Calvin_cycle_nodes.xlsx -------------------------------------------------------------------------------- /Results/07_small_network1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/07_small_network1.png -------------------------------------------------------------------------------- /Results/08_assembled_label.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Results/08_assembled_label.png -------------------------------------------------------------------------------- /Data/Example_network_edges.xlsx: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/Example_network_edges.xlsx -------------------------------------------------------------------------------- /Data/LU-Matand - Stem Data.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Data/LU-Matand - Stem Data.xlsx -------------------------------------------------------------------------------- /Data/dataset2.csv: -------------------------------------------------------------------------------- 1 | "t1p1","t1p2","t2p1","t2p2","outcome" 2 | 75,80,90,90,"pass" 3 | 15,15,5,4,"fail" 4 | 10,5,5,6,"ambiguous" 5 | -------------------------------------------------------------------------------- /Graphical_abstract_2023_01_14.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cxli233/Quick_data_vis/HEAD/Graphical_abstract_2023_01_14.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 C. Li 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Answer_key/05_Intro_proportions_answer_key.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "05_Intro_proportions_ANSWER_KEY" 3 | author: "Chenxin Li" 4 | date: "2023-01-14" 5 | output: html_notebook 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | # Packages 13 | ```{r} 14 | library(tidyverse) 15 | library(readxl) 16 | library(RColorBrewer) 17 | ``` 18 | 19 | 20 | # Exercise 21 | As an exercise, let's visualize this example. 22 | We have two groups, each contains 4 categories. 
23 | ```{r} 24 | group1 <- data.frame( 25 | "Type" = c("Type I", "Type II", "Type III", "Type IV"), 26 | "Percentage" = c(15, 35, 30, 20) 27 | ) 28 | group2 <- data.frame( 29 | "Type" = c("Type I", "Type II", "Type III", "Type IV"), 30 | "Percentage" = c(10, 25, 35, 30) 31 | ) 32 | ``` 33 | 34 | ```{r} 35 | group1_and_group2 <- rbind( 36 | group1 %>% 37 | mutate(group = "group1"), 38 | group2 %>% 39 | mutate(group = "group2") 40 | ) 41 | 42 | head(group1_and_group2) 43 | ``` 44 | 45 | ```{r} 46 | group1_and_group2 %>% 47 | ggplot(aes(x = group, y = Percentage, fill = Type)) + 48 | geom_bar(stat = "identity", width = 0.5) + 49 | scale_fill_manual(values = brewer.pal(8, "Accent"))+ 50 | theme_classic() 51 | 52 | ggsave("Lesson5_answer1.png", width = 3, height = 3) 53 | ``` 54 | 55 | Use `ggsave` to save your visualization. 56 | 57 | Hint: look in this lesson to see what I did to combine two entities into one data frame while giving each a unique identifier. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # "Quick data vis" 2 | [![DOI](https://zenodo.org/badge/876821586.svg)](https://doi.org/10.5281/zenodo.13973004) 3 | 4 | This is an 8-part quick course in data visualization using R. 5 | 6 | **Graphical Abstract** 7 | 8 | ![Graphical Abstract](https://github.com/cxli233/Quick_data_vis/blob/main/Graphical_abstract_2023_01_14.png) 9 | 10 | * Author: Chenxin Li, Ph.D., Assistant Professor at Department of Plant Biology, Michigan State University. 11 | * Contact: lichen27@msu.edu | [@chenxinli2.bsky.social](https://bsky.app/profile/chenxinli2.bsky.social) 12 | 13 | # Browse lessons using an internet browser 14 | 1. Navigate to [Quick_data_vis/Lessons/](https://github.com/cxli233/Quick_data_vis/tree/main/Lessons) 15 | 2. Click on any .md file 16 | 3. 
GitHub will render a nice html 17 | 18 | # Content 19 | This repository has 8 activities: 20 | 1. Very basics of R coding 21 | 2. Introduction to tidy data frames 22 | 3. Introduction to data visualization using ggplot 23 | 4. Introduction to mean separation 24 | 5. Introduction to proportional data 25 | 6. Introduction to heatmaps 26 | 7. Introduction to relationship data and networks 27 | 8. Introduction to plot composition/assembly 28 | 29 | # Required software: 30 | 31 | * R: [R Download](https://cran.r-project.org/bin/) 32 | * RStudio: [RStudio Download](https://www.rstudio.com/products/rstudio/download/) 33 | * rmarkdown can be installed using the install packages interface in RStudio. 34 | 35 | # Getting started 36 | 1. [Clone](https://github.com/cxli233/Quick_data_vis.git) the repository to your machine by downloading the zip file. 37 | 2. Unzip and move folder to whichever location you like on your machine. 38 | 3. Open RStudio. 39 | 4. Open .Rmd files under the Scripts folder. 40 | 5. Enjoy! 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Answer_key/01_Intro_to_R_answer_key.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Very basics of R coding ANSWER KEY" 3 | author: "Chenxin Li" 4 | date: "01/06/2023" 5 | output: 6 | html_notebook: 7 | number_sections: yes 8 | toc: yes 9 | toc_float: yes 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | # Packages 17 | ```{r} 18 | library(dplyr) 19 | ``` 20 | 21 | # Exercise 22 | 23 | Today you have learned some basic syntax of R. 24 | Now it's time for you to practice. 25 | 26 | ## Q1: 27 | 28 | 1. Insert a new code chunk 29 | 2. Make this matrix 30 | 31 | | 1 | 1 | 2 | 2 | 32 | | 2 | 2 | 1 | 2 | 33 | | 2 | 3 | 3 | 4 | 34 | | 1 | 2 | 3 | 4 | 35 | 36 | and save it as an item called `my_mat2`. 
37 | ```{r} 38 | my_mat2 <- rbind( 39 | c(1, 1, 2, 2), 40 | c(2, 2, 1, 2), 41 | c(2, 3, 3, 4), 42 | c(1, 2, 3, 4) 43 | ) 44 | 45 | my_mat2 46 | ``` 47 | 48 | 49 | 3. Select the 1st and 3rd rows and the 1st, 2nd and 4th columns, and save it as an item. 50 | ```{r} 51 | item <- my_mat2[c(1,3), c(1,2,4)] 52 | item 53 | ``` 54 | 55 | 4. Take the square root for each member of my_mat2, then take log2(), and lastly find the maximum value. 56 | Use the pipe syntax. The command for maximum is `max()`. 57 | 58 | ```{r} 59 | my_mat2 %>% 60 | sqrt() %>% 61 | log2() %>% 62 | max() 63 | ``` 64 | 65 | ## Q2: 66 | 67 | 1. Use the following info to make a data frame and save it as an item called "grade". 68 | Adel got 85 on the exam, Bren got 83, and Cecil got 93. 69 | Their letter grades are B, B, and A, respectively. 70 | (Hint: How many columns do you have to have?) 71 | 72 | ```{r} 73 | grade <- data.frame( 74 | name = c("Adel", "Bren", "Cecil"), 75 | score = c(85, 83, 93), 76 | letters = c("B", "B", "A") 77 | ) 78 | 79 | grade 80 | ``` 81 | 82 | 2. Pull out the column with the scores. 83 | Use the `$` syntax. 
84 | ```{r} 85 | grade$score 86 | ``` 87 | 88 | -------------------------------------------------------------------------------- /Data/dataset1.csv: -------------------------------------------------------------------------------- 1 | "treatment","tensile_strength","temperature","pressure","resistance" 2 | "t1p1",11.506622092435565,"t1","p1",17.128494668708832 3 | "t2p1",14.100085235689477,"t2","p1",2.5589702596091675 4 | "t1p2",13.615801427173414,"t1","p2",15.527962563665543 5 | "t2p2",13.5099923273713,"t2","p2",4.721478705270513 6 | "t1p1",14.028709331397305,"t1","p1",16.203734568157543 7 | "t2p1",6.259538323593588,"t2","p1",5.978400931935168 8 | "t1p2",12.633912927374507,"t1","p2",15.961706984662447 9 | "t2p2",10.717022205761431,"t2","p2",4.714634074022134 10 | "t1p1",9.289731079256217,"t1","p1",17.728534497569896 11 | "t2p1",11.52930718913078,"t2","p1",4.535213990213128 12 | "t1p2",17.537701412126225,"t1","p2",14.89671060171909 13 | "t2p2",14.862264653605669,"t2","p2",3.514235497735813 14 | "t1p1",14.056335685284438,"t1","p1",15.222575582702989 15 | "t2p1",6.3596882036778215,"t2","p1",6.334807900162937 16 | "t1p2",14.3748969079777,"t1","p2",15.282977444075684 17 | "t2p2",10.750898568933968,"t2","p2",3.967891905343131 18 | "t1p1",5.566251097715127,"t1","p1",18.141227761742456 19 | "t2p1",10.068251336282362,"t2","p1",4.350111449210299 20 | "t1p2",15.06114251819578,"t1","p2",16.05765273364922 21 | "t2p2",12.457989671827754,"t2","p2",4.794965577505662 22 | "t1p1",11.516792356002085,"t1","p1",15.750018418634559 23 | "t2p1",9.648346870695256,"t2","p1",4.450955602610659 24 | "t1p2",12.035450174518523,"t1","p2",15.334780663377284 25 | "t2p2",12.526361047296032,"t2","p2",5.28441492721479 26 | "t1p1",7.387629481976594,"t1","p1",18.326161223857568 27 | "t2p1",11.516601088751841,"t2","p1",3.919718778665725 28 | "t1p2",12.746703113255409,"t1","p2",16.324424180528453 29 | "t2p2",12.858436597291924,"t2","p2",3.890657377240126 30 | 
"t1p1",8.394960862592415,"t1","p1",18.79670808927312 31 | "t2p1",10.48980069620454,"t2","p1",4.7053804564748045 32 | "t1p2",11.472313423671851,"t1","p2",16.305175757716956 33 | "t2p2",9.031559755128118,"t2","p2",5.367688131005203 34 | "t1p1",6.415518331077725,"t1","p1",18.638230241629685 35 | "t2p1",8.635094619784446,"t2","p1",5.835800882546767 36 | "t1p2",12.874761841578259,"t1","p2",15.15526760876225 37 | "t2p2",12.360016747779738,"t2","p2",3.767237799186369 38 | "t1p1",9.915935091954513,"t1","p1",16.396518137608442 39 | "t2p1",11.372340751850805,"t2","p1",1.9592745082574727 40 | "t1p2",12.314001521011349,"t1","p2",15.80367212667379 41 | "t2p2",16.122120677779193,"t2","p2",1.4964095536105786 42 | -------------------------------------------------------------------------------- /Answer_key/06_Intro_to_heatmaps_answer_key.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Intro_to_heatmap_ANSWER_KEY" 3 | author: "Chenxin Li" 4 | date: "2023-01-15" 5 | output: html_notebook 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | # Packages 13 | ```{r} 14 | library(tidyverse) 15 | library(RColorBrewer) 16 | library(viridis) 17 | ``` 18 | 19 | # Data from lecture 20 | ```{r} 21 | abc_1 <- expand.grid( 22 | a = c(1:10), 23 | b = c(1:10) 24 | ) %>% 25 | mutate(c = a + b) 26 | 27 | head(abc_1) 28 | ``` 29 | 30 | # Exercise 31 | ## Q1 32 | Here is the rainbow color scale: 33 | ```{r} 34 | abc_1 %>% 35 | ggplot(aes(x = a, y = b)) + 36 | geom_tile(aes(fill = c)) + 37 | scale_fill_gradientn(colors = rainbow(20)) + 38 | labs(title = "c = a + b") + 39 | theme_classic() + 40 | coord_fixed() 41 | 42 | ggsave("../Results/06_abc_2.png", height = 3, width = 3) 43 | ``` 44 | Can you explain why the rainbow color scale is inappropriate for heatmaps? 45 | 46 | There are multiple issues with the rainbow color scale. 47 | 1. It is not red/green colorblind friendly. 48 | 2. 
It is neither a unidirectional nor a bidirectional color scale. 49 | 50 | ## Q2 51 | You can ignore the following chunk. It generates the data. 52 | ```{r} 53 | Q2_data <- rbind( 54 | c(10, 20, 30, 40, 50, 60), # 6 55 | c(0, 10, 30, 10, 5, 0), # 3 56 | c(100, 200, 300, 400, 500, 550), # 6 57 | c(50, 45, 40, 30, 20, 0), # 1 58 | c(0, 200, 300, 500, 200, 100), # 4 59 | c(10, 500, 200, 100, 10, 0), # 2 60 | c(0, 0, 0, 10, 500, 0) # 5 61 | ) %>% 62 | as.data.frame() %>% 63 | cbind(gene = c("a", "b", "c", "d", "e", "f", "g")) %>% 64 | pivot_longer(cols = !gene, names_to = "V", values_to = "expression") %>% 65 | mutate(stage = str_remove(V, "V")) 66 | 67 | Q2_data %>% 68 | ggplot(aes(x = stage, y = gene)) + 69 | geom_tile(aes(fill = expression)) + 70 | geom_text(aes(label = expression)) + 71 | scale_fill_gradientn(colors = rev(brewer.pal(11, "RdBu")))+ 72 | theme_classic() 73 | 74 | ggsave("../Results/06_genes_3.png", height = 3.5, width = 4.2) 75 | ``` 76 | What is wrong with this heatmap? 77 | Hint: there is a lot going wrong with this heatmap. 78 | 79 | 1. Values are not scaled to z-score. 80 | 2. Bidirectional color scale used for unidirectional data. 81 | 3. Potential issue with outlier(s). 
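The three issues listed for Q2 can be illustrated with a short sketch. This is not part of the original answer key; it is one possible fix, assuming that per-gene z-scores are the intended scaling. The names `Q2_z`, `z_limit`, and `Q2_plot` are hypothetical and introduced only for this example. Once each gene is centered at 0, the values are bidirectional, so a diverging palette such as "RdBu" becomes a reasonable choice, provided the limits are symmetric around 0.

```r
library(tidyverse)
library(RColorBrewer)

# Re-create the exercise data (same values as the Q2 chunk above).
Q2_z <- rbind(
  c(10, 20, 30, 40, 50, 60),
  c(0, 10, 30, 10, 5, 0),
  c(100, 200, 300, 400, 500, 550),
  c(50, 45, 40, 30, 20, 0),
  c(0, 200, 300, 500, 200, 100),
  c(10, 500, 200, 100, 10, 0),
  c(0, 0, 0, 10, 500, 0)
) %>%
  as.data.frame() %>%
  cbind(gene = c("a", "b", "c", "d", "e", "f", "g")) %>%
  pivot_longer(cols = !gene, names_to = "V", values_to = "expression") %>%
  mutate(stage = str_remove(V, "V")) %>%
  # Assumption: scale each gene (row) to a z-score across stages.
  group_by(gene) %>%
  mutate(z = (expression - mean(expression)) / sd(expression)) %>%
  ungroup()

# Symmetric limits keep z = 0 at the midpoint of the diverging palette.
z_limit <- max(abs(Q2_z$z))

Q2_plot <- Q2_z %>%
  ggplot(aes(x = stage, y = gene)) +
  geom_tile(aes(fill = z)) +
  scale_fill_gradientn(
    colors = rev(brewer.pal(11, "RdBu")),
    limits = c(-z_limit, z_limit)
  ) +
  theme_classic()

Q2_plot
```

Scaling within each gene also limits how much a single extreme value (such as the 500 in gene g) can dominate the color range across the whole heatmap. If outliers remain a concern after scaling, clipping the z-scores to a chosen range before plotting is another option.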
82 | -------------------------------------------------------------------------------- /Answer_key/03_Intro_to_data_vis_answer_key.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Intro_to_data_vis ANSWER KEY" 3 | author: "Chenxin Li" 4 | date: "2023-01-06" 5 | output: 6 | html_notebook: 7 | number_sections: yes 8 | toc: yes 9 | toc_float: yes 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | 17 | # Required packages 18 | ```{r} 19 | library(tidyverse) 20 | library(RColorBrewer) 21 | ``` 22 | 23 | # Data from lecture 24 | ```{r} 25 | child_mortality <- read_csv("../Data/child_mortality_0_5_year_olds_dying_per_1000_born.csv", col_types = cols()) 26 | babies_per_woman <- read_csv("../Data/children_per_woman_total_fertility.csv", col_types = cols()) 27 | income <- read_csv("../Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", col_types = cols()) 28 | ``` 29 | 30 | Re-shape data into tidy format. 31 | ```{r} 32 | babies_per_woman_tidy <- babies_per_woman %>% 33 | pivot_longer(names_to = "year", values_to = "birth", cols = c(2:302)) 34 | 35 | child_mortality_tidy <- child_mortality %>% 36 | pivot_longer(names_to = "year", values_to = "death_per_1000_born", cols = c(2:302)) 37 | 38 | income_tidy <- income %>% 39 | pivot_longer(names_to = "year", values_to = "income", cols = c(2:242)) 40 | ``` 41 | 42 | Join them together. 43 | ```{r} 44 | example2_data <- babies_per_woman_tidy %>% 45 | inner_join(child_mortality_tidy, by = c("country", "year")) %>% 46 | inner_join(income_tidy, by = c("country", "year")) 47 | 48 | head(example2_data) 49 | ``` 50 | 51 | # Exercise 52 | Graph income (in log10 scale) on x axis, child mortality on y axis, and color with children/woman in year 2010. 53 | Was the trend similar to that of year 1945? 54 | Save the graph using `ggsave()`. 
55 | 56 | 57 | ```{r} 58 | example2_data %>% 59 | filter(year == 2010) %>% 60 | ggplot(aes(x = log10(income), y = death_per_1000_born)) + 61 | geom_point(aes(color = birth)) + 62 | scale_color_gradientn(colours = brewer.pal(9, "YlGnBu")) + 63 | labs(x = "log10 income", 64 | y = "death per 1000 born", 65 | title = "2010") + 66 | theme_classic() 67 | 68 | ggsave("Lesson3_answer1.png", width = 3, height = 3) 69 | ``` 70 | ```{r} 71 | example2_data %>% 72 | filter(year == 1945) %>% 73 | ggplot(aes(x = log10(income), y = death_per_1000_born)) + 74 | geom_point(aes(color = birth)) + 75 | scale_color_gradientn(colours = brewer.pal(9, "YlGnBu")) + 76 | labs(x = "log10 income", 77 | y = "death per 1000 born", 78 | title = "1945") + 79 | theme_classic() 80 | 81 | ggsave("Lesson3_answer2.png", width = 3, height = 3) 82 | ``` 83 | 84 | -------------------------------------------------------------------------------- /Data/USDA_2022_matrix.csv: -------------------------------------------------------------------------------- 1 | "Crop","2022","2021","2020","2019","2018","2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007","2006","2005","2004","2003","2002","2001","2000","1999","1998","1997","1996","1995","1994","1993","1992","1991","1990","1989","1988","1987","1986","1985","1984","1983","1982","1981","1980","1979","1978","1977","1976","1975","1974","1973","1972","1971","1970","1969","1968","1967","1966","1965","1964","1963","1962","1961","1960","1959","1958","1957","1956","1955","1954","1953","1952","1951","1950","1949","1948","1947","1946","1945" 2 | "barley",71.7,60.3,77.2,77.7,77.5,73,77.9,69.1,72.7,71.3,66.9,69.1,73.1,72.8,63.3,60,61.1,64.8,69.6,58.9,55,58.1,61.1,59.5,60.1,58.1,58.5,57.2,56.2,58.9,62.5,55.2,56.1,48.6,38,52.4,50.8,50.9,53.3,52.3,57.2,52.4,49.7,50.9,49.2,44,45.4,44,37.7,40.5,43.7,45.8,42.8,44.7,43.8,40.5,38.3,42.9,37.6,35,35,30.6,31,28.3,32.3,29.8,29.3,27.8,28.4,28.4,27.7,27.3,27.2,24,26.5,25.7,25.5,25.5 3 | 
"canola",1762,1302,1931,1781,1861,1526,1825,1679,1610,1742,1392,1479,1711,1813,1461,1238,1366,1419,1618,1416,1197,1374,1334,1306,1448,1237,1385,1278,1316,1350,1286,1300,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA 4 | "corn",173.3,176.7,171.4,167.5,176.4,176.6,174.6,168.4,171,158.1,123.1,146.8,152.6,164.4,153.3,150.7,149.1,147.9,160.3,142.2,129.3,138.2,136.9,133.8,134.4,126.7,127.1,113.5,138.6,100.7,131.5,108.6,118.5,116.3,84.6,119.8,119.4,118,106.7,81.1,113.2,108.9,91,109.5,101,90.8,88,86.4,71.9,91.3,97,88.1,72.4,85.9,79.5,80.1,73.1,74.1,62.9,67.9,64.7,62.4,54.7,53.1,52.8,48.3,47.4,42,39.4,40.7,41.8,36.9,38.2,38.2,43,28.6,37.2,33.1 5 | "cotton",950,819,853,831,882,905,867,766,838,822,892,790,812,776,813,879,814,831,855,730,665,705,632,607,625,673,705,537,708,606,700,652,634,614,619,706,552,630,600,508,590,542,404,547,420,520,465,453,442,520,507,438,438,434,516,447,480,527,517,517,457,438,446,461,466,388,409,417,341,324,280,269,269,282,311,267,236,254 6 | "rice",7383,7709,7619,7473,7692,7507,7237,7472,7576,7694,7463,7067,6725,7085,6846,7219,6898,6624,6988,6670,6578,6496,6281,5866,5663,5897,6120,5621,5964,5510,5736,5731,5529,5749,5514,5555,5651,5414,4954,4598,4710,4819,4413,4599,4484,4412,4663,4558,4440,4274,4700,4718,4618,4318,4425,4537,4322,4255,4098,3968,3726,3411,3423,3382,3164,3204,3151,3061,2517,2447,2413,2309,2371,2194,2122,2062,2054,2046 7 | "sorghum",41.1,69,73.2,73,72.1,71.7,77.9,76,67.6,59.6,49.6,54,71.9,69.4,65.1,73.2,56.1,68.5,69.6,52.7,50.6,59.9,60.9,69.7,67.3,69.2,67.3,55.6,72.7,59.9,72.6,59.3,63.1,55.4,63.8,69.4,67.7,66.8,56.4,48.7,59.1,64,46.3,62.6,54.5,56.6,49.1,49,45.1,58.8,60.7,53.8,50.4,54.3,52.6,50.4,55.8,51.6,41.7,43.9,44.1,43.7,39.7,36.1,35.2,28.8,22.2,18.8,20.1,18.4,17,19.1,22.6,22.5,18,17,15.9,15.2 8 | 
"soybeans",49.5,51.7,51,47.4,50.6,49.3,51.9,48,47.5,44,40,42,43.5,44,39.7,41.7,42.9,43.1,42.2,33.9,38,39.6,38.1,36.6,38.9,38.9,37.6,35.3,41.4,32.6,37.6,34.2,34.1,32.3,27,33.9,33.3,34.1,28.1,26.2,31.5,30.1,26.5,32.1,29.4,30.6,26.1,28.9,23.7,27.8,27.8,27.5,26.7,27.4,26.7,24.5,25.4,24.5,22.8,24.4,24.2,25.1,23.5,23.5,24.2,23.2,21.8,20.1,20,18.2,20.7,20.8,21.7,22.3,21.3,16.3,20.5,18 9 | "wheat",46.5,44.3,49.7,51.7,47.6,46.4,52.7,43.6,43.7,47.1,46.2,43.6,46.1,44.3,44.8,40.2,38.6,42,43.2,44.2,35,40.2,42,42.7,43.2,39.5,36.3,35.8,37.6,38.2,39.3,34.3,39.5,32.7,34.1,37.7,34.4,37.5,38.8,39.4,35.5,34.5,33.5,34.2,31.4,30.7,30.3,30.6,27.3,31.6,32.7,33.9,31,30.6,28.4,25.8,26.3,26.5,25.8,25.2,25,23.9,26.1,21.6,27.5,21.8,20.2,19.8,18.1,17.3,18.4,16,16.5,14.5,17.9,18.2,17.2,17 10 | -------------------------------------------------------------------------------- /Answer_key/07_Intro_relationship_data_answer_key.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Intro_to_network ANSWER KEY" 3 | author: "Chenxin Li" 4 | date: "2023-01-15" 5 | output: html_notebook 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | # Packages 13 | ```{r} 14 | library(tidyverse) 15 | library(igraph) 16 | library(ggraph) 17 | library(readxl) 18 | library(RColorBrewer) 19 | ``` 20 | # Homework 21 | ## Q1 22 | Metabolic pathways can be modeled by networks, where each node is a metabolite and each edge is a metabolic enzyme. 23 | You can read more about how to represent metabolic pathways [here](https://github.com/cxli233/ggpathway). 24 | 25 | Here is an example: 26 | ```{r} 27 | calvin_cycle_edges <- read_excel("../Data/Calvin_cycle_edges.xlsx") 28 | calvin_cycle_nodes <- read_excel("../Data/Calvin_cycle_nodes.xlsx") 29 | 30 | head(calvin_cycle_edges) 31 | head(calvin_cycle_nodes) 32 | ``` 33 | Make a network diagram for the Calvin cycle. 
34 | Color each node by how many carbons it has (the `carbon` column in the node table). 35 | Label each metabolite using `geom_node_text(aes(label = name), hjust = 0.5, repel = T)`. 36 | Save the figure using `ggsave()`. 37 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file. 38 | 39 | ```{r} 40 | network <- graph_from_data_frame( 41 | d = calvin_cycle_edges, 42 | vertices = calvin_cycle_nodes, 43 | directed = F) 44 | ``` 45 | 46 | ```{r} 47 | ggraph(network, layout = "kk") + 48 | geom_edge_link(color = "grey70", arrow = arrow(length = unit(3, "mm")))+ 49 | geom_node_point(size = 3, shape = 21, color = "white", aes(fill = as.factor(carbon))) + 50 | geom_node_text(aes(label = name), repel = T, hjust = 0.5) + 51 | scale_fill_manual(values = brewer.pal(8, "Set2")) + 52 | labs(fill = "carbon") + 53 | theme_void() 54 | 55 | ggsave("Lesson7_answer1.png", height = 4, width = 6, bg = "white") 56 | ``` 57 | 58 | ## Q2 59 | [Phylogenetic trees](https://en.wikipedia.org/wiki/Phylogenetic_tree) and [dendrograms](https://en.wikipedia.org/wiki/Dendrogram) can be modeled as networks. 60 | Say we have a tree like this: 61 | 62 | ``` 63 | I 64 | | 65 | G - H 66 | | 67 | A - B - C 68 | | 69 | D - E 70 | | 71 | F 72 | ``` 73 | 74 | Make a network diagram and label each node. 75 | Save the diagram using `ggsave()`. 76 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file. 77 | Hint: You can write the edge table in Excel and import it into R if that's easier.
78 | ```{r} 79 | edgetable <- data.frame( 80 | from = c("G", "G", "A", "B", "B","B", "D", "D"), 81 | to = c("I", "H", "B", "C", "G","D", "E", "F") 82 | ) 83 | head(edgetable) 84 | ``` 85 | 86 | ```{r} 87 | nodes <- data.frame( 88 | nodes = unique(c(edgetable$from, edgetable$to)) 89 | ) 90 | 91 | nodes 92 | ``` 93 | 94 | ```{r} 95 | treenetwork <- graph_from_data_frame( 96 | d = edgetable, 97 | vertices = nodes, 98 | directed = F) 99 | ``` 100 | 101 | ```{r} 102 | ggraph(treenetwork, layout = "tree") + 103 | geom_edge_link(color = "grey70") + 104 | geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") + 105 | geom_node_text(repel = T, aes(label = nodes$nodes)) + 106 | theme_void() 107 | 108 | ggsave("Lesson7_answer2.png", height = 3, width = 3.5, bg = "white") 109 | ``` 110 | 111 | -------------------------------------------------------------------------------- /Answer_key/08_Intro_plot_assembly_answer_key.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "08_Plot_assembly_ANSWER_KEY" 3 | author: "Chenxin Li" 4 | date: "2023-07-09" 5 | output: html_notebook 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | 13 | ```{r} 14 | library(tidyverse) 15 | library(ggbeeswarm) 16 | library(patchwork) 17 | 18 | library(RColorBrewer) 19 | ``` 20 | 21 | # Exercise 22 | Use your own data from your research to generate a composite figure. 23 | Save it using `ggsave()`. 24 | 25 | For those of you who do not have your own data, please use these datasets from a hypothetical project. 26 | 27 | You measured tensile strength and resistance of a material at two temperatures (t1 and t2) and two pressures (p1 and p2) (dataset1). 28 | You also tested samples of this material in the above-mentioned temperature-pressure combinations. You recorded the percentages of samples that passed, failed, or were unclear (dataset2). 
29 | 30 | * What is the condition that resulted in the highest tensile strength? 31 | * What is the condition that resulted in the lowest resistance? 32 | * Is there any correlation between tensile strength and resistance across various conditions? 33 | * What is the condition that resulted in the lowest percentage of failure? What is the breakdown of test outcomes across treatments? 34 | * Based on all the data available, which treatment is the best? 35 | 36 | ## Datasets 37 | ```{r} 38 | dataset1 <- read.csv("../Data/dataset1.csv") 39 | dataset2 <- read.csv("../Data/dataset2.csv") 40 | 41 | head(dataset1) 42 | head(dataset2) 43 | ``` 44 | ### Panel A 45 | ```{r} 46 | panel_A <- dataset1 %>% 47 | ggplot(aes(x = temperature, y = tensile_strength)) + 48 | facet_grid(.~ pressure) + 49 | ggbeeswarm::geom_quasirandom(alpha = 0.8, aes(color = treatment)) + 50 | scale_color_manual(values = brewer.pal(8, "Set2")) + 51 | labs(x = "Temperature", 52 | y = "Tensile strength") + 53 | theme_classic() 54 | 55 | panel_A 56 | ``` 57 | ### Panel B 58 | ```{r} 59 | panel_B <- dataset1 %>% 60 | ggplot(aes(x = temperature, y = resistance)) + 61 | facet_grid(.~ pressure) + 62 | ggbeeswarm::geom_quasirandom(alpha = 0.8, aes(color = treatment)) + 63 | scale_color_manual(values = brewer.pal(8, "Set2")) + 64 | labs(x = "Temperature", 65 | y = "Resistance") + 66 | theme_classic() 67 | 68 | panel_B 69 | ``` 70 | ### Panel C 71 | ```{r} 72 | panel_C <- dataset1 %>% 73 | ggplot(aes(x = resistance, y = tensile_strength)) + 74 | facet_wrap(~ treatment) + 75 | geom_point(aes(color = treatment)) + 76 | labs(x = "Resistance", 77 | y = "Tensile strength") + 78 | scale_color_manual(values = brewer.pal(8, "Set2")) + 79 | theme_classic() 80 | 81 | panel_C 82 | ``` 83 | 84 | ### Panel D 85 | ```{r} 86 | panel_D <- dataset2 %>% 87 | pivot_longer(names_to = "treatment", values_to = "percent_samples", cols = !outcome) %>% 88 | ggplot(aes(x = treatment, y = percent_samples)) + 89 | geom_bar(stat = 
"identity", position = "stack", aes(fill = outcome)) + 90 | scale_fill_manual(values = c("grey20", "tomato1", "dodgerblue4")) + 91 | labs(y = "Percent samples", 92 | x = "Treatment") + 93 | theme_classic() 94 | 95 | panel_D 96 | ``` 97 | 98 | ### Assemble them 99 | ```{r} 100 | wrap_plots( 101 | panel_A + labs(tag = "A") + theme(legend.position = "none"), 102 | panel_B + labs(tag = "B") + theme(legend.position = "none"), 103 | panel_C + labs(tag = "C"), 104 | panel_D + labs(tag = "D"), 105 | guides = "collect", 106 | design = "AB 107 | CD") 108 | 109 | ggsave("Lesson8_answer1.svg", height = 6, width = 8) 110 | ggsave("Lesson8_answer1.png", height = 6, width = 8) 111 | ``` 112 | 113 | Based on all the data available, t2p2 appeared to be the best performer across all metrics studied. 114 | -------------------------------------------------------------------------------- /Answer_key/04_Intro_mean_separation_answer_key.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "mean_separation_ANSWER_KEY" 3 | author: "Chenxin Li" 4 | date: "2023-01-07" 5 | output: 6 | html_notebook: 7 | number_sections: yes 8 | toc: yes 9 | toc_float: yes 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | 17 | # Packages 18 | ```{r} 19 | library(tidyverse) 20 | library(readxl) 21 | library(RColorBrewer) 22 | library(ggbeeswarm) 23 | ``` 24 | 25 | 26 | # Data from lecture 27 | ```{r} 28 | tissue_culture <- read_excel("../Data/LU-Matand - Stem Data.xlsx") 29 | head(tissue_culture) 30 | ``` 31 | 32 | ```{r} 33 | tissue_culture %>% 34 | mutate(Treatment = factor(Treatment, 35 | levels = c("T1", "T5", "T10"))) %>% 36 | ggplot(aes(x = Treatment, y = Buds_Shoots)) + 37 | facet_wrap(~ Variety, scales = "free") + 38 | geom_quasirandom(aes(color = Explant), alpha = 0.8) + 39 | scale_colour_manual(values = brewer.pal(8, "Set2")) + 40 | labs(x = "Hormone Treatment", 41 | y = "Num buds and shoots", 42 | 
fill = NULL) + 43 | theme_classic() + 44 | theme(legend.position = "bottom") 45 | 46 | ggsave("../Results/04_multi_dots.png", height = 6, width = 9) 47 | ``` 48 | 49 | 50 | # Exercise 51 | ## Q1 52 | Notice the `scales = "free"` inside `facet_wrap()`. 53 | Make a new graph but remove the `scales = "free"` inside `facet_wrap()`. 54 | 55 | 56 | ```{r} 57 | tissue_culture %>% 58 | mutate(Treatment = factor(Treatment, 59 | levels = c("T1", "T5", "T10"))) %>% 60 | ggplot(aes(x = Treatment, y = Buds_Shoots)) + 61 | facet_wrap(~ Variety) + 62 | geom_quasirandom(aes(color = Explant), alpha = 0.8) + 63 | scale_colour_manual(values = brewer.pal(8, "Set2")) + 64 | labs(x = "Hormone Treatment", 65 | y = "Num buds and shoots", 66 | fill = NULL) + 67 | theme_classic() + 68 | theme(legend.position = "bottom") 69 | 70 | ggsave("Lesson4_answer1.png", height = 6, width = 9) 71 | ``` 72 | 73 | Can you explain what happened? 74 | Now all the varieties have the same y axis. 75 | 76 | Can you explain in which scenarios it is more appropriate to have `scales = "free"` turned on vs. off? 77 | When we are only comparing within subplots, it's better to use `scales = "free"`. 78 | When we are comparing across subplots, it's better to not use `scales = "free"`.
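
The chunk below is a minimal sketch of the fixed vs. free scales trade-off described above. It is not part of the original answer key: the `toy` data frame and its values are made up for illustration, and only `ggplot2` is assumed to be installed.

```{r}
library(ggplot2)

# Hypothetical toy data: two panels whose values differ by about 100-fold
toy <- data.frame(
  panel = rep(c("small", "large"), each = 3),
  x = rep(c("a", "b", "c"), times = 2),
  y = c(1, 2, 3, 100, 200, 300)
)

base_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  theme_classic()

# Shared y axis: values can be compared across panels,
# but differences within the "small" panel are hard to see
fixed_scales <- base_plot + facet_wrap(~ panel)

# Independent y axes: differences within each panel are easy to see,
# but values should not be compared across panels
free_scales <- base_plot + facet_wrap(~ panel, scales = "free")

fixed_scales
free_scales
```

With the shared axis, the three points in the "small" panel should be squeezed near the bottom of the plot, whereas with free scales each panel uses its own range.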
79 | 80 | ## Q2 81 | Here is an example dataset for you to practice with: 82 | ```{r} 83 | M1 <- data.frame( 84 | conc = rnorm(n = 8, mean = 0.03, sd = 0.01) 85 | ) %>% 86 | mutate(group = "ctrl") %>% 87 | rbind( 88 | data.frame( 89 | conc = rnorm(n = 6, mean = 0.25, sd = 0.02) 90 | ) %>% 91 | mutate(group = "trt") 92 | ) %>% 93 | mutate(pest = "Pest 1") 94 | 95 | M2 <- data.frame( 96 | conc = rnorm(n = 8, mean = 6, sd = 1) 97 | ) %>% 98 | mutate(group = "ctrl") %>% 99 | rbind( 100 | data.frame( 101 | conc = rnorm(n = 6, mean = 5.5, sd = 1.1) 102 | ) %>% 103 | mutate(group = "trt") 104 | ) %>% 105 | mutate(pest = "Pest 2") 106 | 107 | M3 <- data.frame( 108 | conc = rnorm(n = 8, mean = 20, sd = 0.5) 109 | ) %>% 110 | mutate(group = "ctrl") %>% 111 | rbind( 112 | data.frame( 113 | conc = rnorm(n = 6, mean = 19.5, sd = 1.2) 114 | ) %>% 115 | mutate(group = "trt") 116 | ) %>% 117 | mutate(pest = "Pest 3") 118 | 119 | Spray <- rbind( 120 | M1, M2, M3 121 | ) 122 | 123 | head(Spray) 124 | ``` 125 | 126 | In this hypothetical experiment, I have two treatments: ctrl vs. trt. 127 | And I measured the occurrence of three different pests after spraying either treatment. 128 | Make a mean separation plot for this experiment. 129 | 130 | Hints: 131 | 132 | * Is this a multifactorial experiment? Yes! 133 | * How many observations are there in each group? 134 | 135 | ```{r} 136 | Spray %>% 137 | group_by(group, pest) %>% 138 | count() 139 | ``` 140 | Each ctrl group has 8 observations, while each trt group has 6 observations. 141 | 142 | Was the treatment effective in controlling any of the three pests?
143 | ```{r} 144 | Spray %>% 145 | ggplot(aes(x = group, y = conc)) + 146 | facet_wrap(~ pest, scales = "free") + 147 | geom_quasirandom(aes(color = group), alpha = 0.8) + 148 | scale_colour_manual(values = brewer.pal(8, "Set2")) + 149 | labs(x = "Treatment", 150 | y = "concentration", 151 | fill = NULL) + 152 | theme_classic() + 153 | theme(legend.position = "bottom") 154 | 155 | ggsave("Lesson4_answer2.png", width = 4, height = 3) 156 | ``` 157 | 158 | Save the graph using `ggsave()`. 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | -------------------------------------------------------------------------------- /Answer_key/02_Intro_to_tidy_data_answer_key.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Arrangement ANSWER KEY" 3 | author: "Chenxin Li" 4 | date: "01/06/2023" 5 | output: 6 | html_notebook: 7 | number_sections: yes 8 | toc: yes 9 | toc_float: yes 10 | 11 | --- 12 | 13 | ```{r setup, include=FALSE} 14 | knitr::opts_chunk$set(echo = TRUE) 15 | ``` 16 | 17 | # Load packages 18 | ```{r} 19 | library(tidyverse) 20 | library(readxl) 21 | ``` 22 | ## Data from lecture 23 | ```{r} 24 | child_mortality <- read_csv("../Data/child_mortality_0_5_year_olds_dying_per_1000_born.csv", col_types = cols()) 25 | babies_per_woman <- read_csv("../Data/children_per_woman_total_fertility.csv", col_types = cols()) 26 | ``` 27 | 28 | These are two datasets downloaded from the [Gapminder foundation](https://www.gapminder.org/data/). 29 | The Gapminder foundation has datasets on life expectancy, economy, education, and population across countries and years. 30 | The goal is to remind us not only of the "gaps" between developed and developing worlds, but also of the amazing continuous improvements in quality of life through time. 31 | 32 | 1. Child mortality (0 - 5 year olds dying per 1000 born). 33 | 2. Births per woman.
34 | 35 | These were recorded from year 1800 and projected all the way to 2100. 36 | 37 | Let's look at them. 38 | 39 | ```{r} 40 | head(child_mortality) 41 | head(babies_per_woman) 42 | ``` 43 | 44 | 45 | ```{r} 46 | babies_per_woman_tidy <- babies_per_woman %>% 47 | pivot_longer(names_to = "year", values_to = "birth", cols = c(2:302)) 48 | 49 | head(babies_per_woman_tidy) 50 | 51 | child_mortality_tidy <- child_mortality %>% 52 | pivot_longer(names_to = "year", values_to = "death_per_1000_born", cols = c(2:302)) 53 | 54 | head(child_mortality_tidy) 55 | ``` 56 | 57 | ```{r} 58 | birth_and_mortality <- babies_per_woman_tidy %>% 59 | inner_join(child_mortality_tidy, by = c("country", "year")) 60 | 61 | head(birth_and_mortality) 62 | ``` 63 | 64 | # Exercise 65 | 66 | You have learned data arrangement! Let's do an exercise to practice what 67 | you have learned today. 68 | As in the example, this time we will use the income per person dataset from the Gapminder foundation. 69 | 70 | ```{r} 71 | income <- read_csv("../Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", col_types = cols()) 72 | head(income) 73 | ``` 74 | 75 | ## Tidy data 76 | Is this a tidy data frame? 77 | NO! 78 | 79 | Make it a tidy data frame using this code chunk. 80 | ```{r} 81 | income_tidy <- income %>% 82 | pivot_longer(names_to = "year", values_to = "income", cols = !country) 83 | 84 | head(income_tidy) 85 | ``` 86 | 87 | ## Joining data 88 | 89 | Combine the income data with the births per woman and child mortality data using this code chunk. 90 | Name the new data frame "birth_and_mortality_and_income".
91 | 92 | ```{r} 93 | birth_and_mortality_and_income <- income_tidy %>% 94 | inner_join(babies_per_woman_tidy, by = c("country", "year")) %>% 95 | inner_join(child_mortality_tidy, by = c("country", "year")) 96 | 97 | head(birth_and_mortality_and_income) 98 | ``` 99 | 100 | 101 | ## Filtering data 102 | 103 | Filter the data to keep only Bangladesh and Sweden, in years 1945 (when WWII ended) and 2010. 104 | Name the new data frame BS_1945_2010. 105 | How have income, births per woman, and child mortality rate changed during this 65-year period? 106 | 107 | ```{r} 108 | BS_1945_2010 <- birth_and_mortality_and_income %>% 109 | filter(country == "Bangladesh" | 110 | country == "Sweden") %>% 111 | filter(year == 1945 | 112 | year == 2010) 113 | 114 | 115 | head(BS_1945_2010) 116 | ``` 117 | 118 | 119 | ## Mutate data 120 | 121 | Let's say countries with income between 1000 and 10,000 dollars per year are called "fed". 122 | Countries with income above 10,000 dollars per year are called "wealthy". 123 | Below 1000, they are called "poor". 124 | 125 | Use this info to make a new column called "status". 126 | Hint: you will have to use case_when() and the "&" logic somewhere in this chunk. 127 | 128 | ```{r} 129 | birth_and_mortality_and_income <- birth_and_mortality_and_income %>% 130 | mutate(status = case_when( 131 | income >= 1000 & income <= 10000 ~ "fed", 132 | income > 10000 ~ "wealthy", 133 | income < 1000 ~ "poor" 134 | )) 135 | 136 | head(birth_and_mortality_and_income) 137 | ``` 138 | 139 | ## Summarise the data 140 | 141 | Let's look at the average child mortality and its sd in year 2010 142 | across countries of the different statuses that we just defined. 143 | Name the new data frame "child_mortality_summary_2010".
144 | 145 | ```{r} 146 | child_mortality_summary_2010 <- birth_and_mortality_and_income %>% 147 | filter(year == 2010) %>% 148 | group_by(status) %>% 149 | summarize( 150 | avg = mean(death_per_1000_born), 151 | sd = sd(death_per_1000_born)) 152 | 153 | head(child_mortality_summary_2010) 154 | ``` 155 | 156 | How does child mortality compare across income groups in year 2010? 157 | Child mortality is higher for lower income groups. 158 | -------------------------------------------------------------------------------- /Scripts/05_Intro_proportions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "05_Intro_proportions.Rmd" 3 | author: "Chenxin Li" 4 | date: "2023-01-14" 5 | output: html_notebook 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | # Introduction 13 | In this lesson we will explore the best practices for presenting proportional data. 14 | The presentation of proportional data is very common, and thus warrants a lesson of its own. 15 | Proportional data are data that range from 0 to 1. 16 | More importantly, proportional data should add up to 1 if they are derived from the same entity. 17 | For example, let's say on a corn ear, there are yellow and white kernels (and no other colors). 18 | The proportion of yellow and white kernels should add up to 1. 19 | If 75% of the kernels are yellow, it implies the proportion of white ones is 25%. 20 | 21 | Goals might differ when we are presenting proportional data. Here are two common objectives: 22 | 23 | 1. Showing different proportions of an entity add up to 1. 24 | 2. Comparing proportions of different entities. 25 | 26 | These two objectives may not be mutually exclusive. 27 | We will explore the best practices for achieving these objectives.
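
As a quick sanity check of the "adds up to 1" property, here is a minimal sketch using the two-color corn ear described above. The `kernels` data frame is a made-up illustration, not data from the course.

```{r}
# Hypothetical ear with only yellow and white kernels
kernels <- data.frame(
  colors = c("yellow", "white"),
  proportions = c(0.75, 0.25)
)

# Proportions derived from the same entity should sum to 1.
# all.equal() tolerates floating point rounding, unlike ==.
sums_to_one <- isTRUE(all.equal(sum(kernels$proportions), 1))
sums_to_one

# With only two categories, knowing one proportion fixes the other
white_proportion <- 1 - kernels$proportions[kernels$colors == "yellow"]
white_proportion
```

Running a check like this before plotting can catch data entry errors, such as percentages mixed with proportions.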
28 | 29 | # Packages 30 | ```{r} 31 | library(tidyverse) 32 | library(readxl) 33 | library(RColorBrewer) 34 | ``` 35 | 36 | # What really is a pie chart? 37 | The most common visualization for proportional data is a pie chart. 38 | You literally see them all the time. 39 | A pie chart is a common type of visualization for proportional data, where proportions add up to 100%. 40 | This is achieved by dividing a circle into sectors, and the sectors add up to a full circle. 41 | This all sounds fine, but how do we construct a pie chart using ggplot? 42 | The short answer is we make a stacked bar chart and wrap it into a circle. 43 | 44 | Let's make a pie chart. 45 | Let's say we have a corn ear that has been open pollinated. 46 | As a result, there are purple, yellow, and white kernels on the ear. 47 | Let's say the proportions are 15% purple, 70% yellow, and 15% white. 48 | 49 | ```{r} 50 | ear_1 <- data.frame( 51 | colors = c("purple", "yellow", "white"), 52 | proportions = c(0.15, 0.7, 0.15) 53 | ) 54 | 55 | head(ear_1) 56 | ``` 57 | 58 | ```{r} 59 | example_1_stacked <- ear_1 %>% 60 | ggplot(aes(x = "A Corn Ear", y = proportions)) + 61 | geom_bar(stat = "identity", aes(fill = colors), 62 | color = "black", width = 0.5) + 63 | scale_fill_identity() + 64 | labs(x = NULL) + 65 | theme_classic() 66 | 67 | example_1_stacked 68 | 69 | ggsave("../Results/05_stack1.png", width = 2, height = 2.5) 70 | ``` 71 | 72 | Note that there are two coloring options in `ggplot`. 73 | There is `color`, which is the color of the outline. 74 | In this case, the outline of the stacks is black. 75 | There is also `fill`, which is the color of the interior. 76 | In this case, the interior of the stacks is colored by the color of the corn kernel. 77 | 78 | So now we have a stacked bar plot. 79 | Technically, we are done. The proportions are represented by the heights of the stacks within the bar. 80 | And this is a perfectly fine visualization of our proportional data.
81 | But what if we want to make a pie chart? 82 | Well, we will use polar coordinates. 83 | 84 | ```{r} 85 | example_1_stacked + 86 | coord_polar(theta = "y") + 87 | theme_void() 88 | 89 | ggsave("../Results/05_pie1.png", width = 2, height = 2.5) 90 | ``` 91 | There are just two extra lines of code to convert a stacked bar into a pie chart. 92 | First, `coord_polar(theta = "y")` wraps the y axis into a circle. 93 | Second, `theme_void()` turns off the x and y axes. 94 | Axes are not all that informative for pie charts anyway. 95 | So if you want to make a pie chart, this is what you do. 96 | 97 | A final note about pie charts: what are the data represented by? 98 | In this case, the data are represented by the arc length. 99 | So a pie chart is a length-based visualization. 100 | Due to the properties of a circle, the data are also represented by the arc angle and sector area. 101 | 102 | # What is the best visualization for comparing proportional data? 103 | However, oftentimes we don't want to just present proportions of one entity. 104 | Instead, we want to compare the proportions of multiple entities. 105 | For this purpose, a pie chart is probably not the best option. 106 | Pie charts have been criticized, because we are much worse at reading angles, areas, and lengths of arcs than reading lengths of straight lines. 107 | 108 | The best way to visualize proportions from multiple entities is stacked bars. 109 | Here is an example. 110 | ![Just make a stacked bar chart](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/dont_pie_chart.svg) 111 | 112 | In this example, we have two groups, each containing 4 sub-categories. 113 | In classic pie charts, the angles (and thus arc lengths & sector area) represent the data. 114 | The problem is that it is very difficult to compare between entities. 115 | We can visually simplify the pie chart into donut charts, where the data are now represented by arc lengths.
116 | However, if we want to use lengths to represent the data, why don't we just unwrap the donut and make stacked bars? 117 | In stacked bar graphs, bars are shown side-by-side and are thus easier to compare across entities. 118 | 119 | Let's try an example on our own. 120 | Let's say we have another open pollinated corn ear. 121 | This ear has 20% purple kernels, 75% yellow kernels, and 5% white kernels. 122 | ```{r} 123 | ear_2 <- data.frame( 124 | colors = c("purple", "yellow", "white"), 125 | proportions = c(0.20, 0.75, 0.05) 126 | ) 127 | 128 | ears_1_and_2 <- rbind( 129 | ear_1 %>% 130 | mutate(ear = "ear1"), 131 | ear_2 %>% 132 | mutate(ear = "ear2") 133 | ) 134 | 135 | head(ears_1_and_2) 136 | ``` 137 | Say now we want to make a graph to compare the proportions of these two ears. 138 | The code is in fact quite straightforward. 139 | 140 | ```{r} 141 | ears_1_and_2 %>% 142 | ggplot(aes(x = ear, y = proportions)) + 143 | geom_bar(stat = "identity", aes(fill = colors), 144 | color = "black", width = 0.5) + 145 | scale_fill_identity() + 146 | theme_classic() 147 | 148 | ggsave("../Results/05_stack2.png", width = 2, height = 2.5) 149 | ``` 150 | That's it! 151 | I would say stacked bars are my go-to visualization for comparing proportions of multiple entities. 152 | A key advantage is that side-by-side stacked bars are more space efficient. 153 | Imagine you are comparing proportions of hundreds of entities. 154 | It is unrealistic to present hundreds of pie or donut charts, let alone compare across them. 155 | But stacked bars make the task easy. 156 | 157 | # Donut charts? 158 | Donut charts are good alternatives to pie charts for presenting proportions of a small number of entities. 159 | We won't cover how to make donut charts here. 160 | If you are interested, you can explore on your own. 161 | But what you should _never_ do is make concentric donuts. 162 | Here is an example.
163 | ![concentric donuts](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/dont_concentric_donuts.svg) 164 | For concentric donuts, you might be tempted to say the data are represented by the arc lengths, 165 | which is in fact inaccurate. 166 | The arc lengths on the outer rings are much longer than those in the inner rings. 167 | Group 2 and Group 3 have the same exact values, but the arc lengths of Group 3 are much longer. 168 | In fact the data are represented by the arc angles, which we are bad at reading. 169 | 170 | Since outer rings are longer, the ordering of the groups (which group goes to which ring) has a big impact on the impression of the plot. 171 | It can lead to the apparent paradox where larger values have shorter arcs. 172 | The better (and simpler!) alternative is to just unwrap the donuts and make a good old stacked bar plot. 173 | 174 | # Should you use log scale when presenting proportional data? 175 | This might be something you have not thought about. 176 | Let's look at an example. 177 | In this hypothetical example, we quantified biodiversity of a rain forest relative to year 1960. 178 | The data are normalized to the value of year 1960. 179 | 180 | ```{r} 181 | example_2 <- data.frame( 182 | year = c(1960, 1970, 1980, 1990, 2000, 2010, 2020), 183 | relative_biodiversity = c(1, 0.6, 0.3, 0.2, 0.15, 0.1, 0.01) 184 | ) 185 | 186 | head(example_2) 187 | ``` 188 | 189 | ```{r} 190 | example_2 %>% 191 | ggplot(aes(x = year, y = relative_biodiversity)) + 192 | geom_point(size = 3) + 193 | labs(y = "biodiversity\n(relative to 1960)") + 194 | theme_classic() 195 | 196 | ggsave("../Results/05_dot1.png", width = 3, height = 2.5) 197 | ``` 198 | Looking at this graph, you would probably say the loss of biodiversity has stabilized in the last 4 decades. 199 | But is that so? 200 | 201 | A concept related to proportion is odds. 202 | Odds = proportion / (1 - proportion). 203 | For example, if p = 0.5, then the odds is 1:1.
204 | If p = 0.1, then the odds is 1:9. 205 | If p = 0.01, then the odds is 1:99. 206 | A property of proportions is that they are bound between 0 and 1. 207 | Thus, any changes near 0 and 1 will appear small by definition. 208 | However, if you think about it, going from 0.1 to 0.01 is a 10-fold change in proportion and an 11-fold change in odds. 209 | We can capture the relative changes using the log scale. 210 | We can use the log10 or natural log scale. It doesn't matter. 211 | 212 | ```{r} 213 | example_2 %>% 214 | mutate(log_odds = log(relative_biodiversity/(1-relative_biodiversity))) %>% 215 | ggplot(aes(x = year, y = log_odds)) + 216 | geom_point(size = 3) + 217 | labs(y = "biodiversity\n(relative to 1960\nlog odds scale)") + 218 | theme_classic() 219 | 220 | ggsave("../Results/05_dot2.png", width = 3, height = 2.5) 221 | ``` 222 | In this case, presenting proportional data in the log odds scale paints a different picture. 223 | Biodiversity decreased sharply in the last decade relative to the previous one. 224 | It makes sense: 225 | From 2000 to 2010, the change was 0.15 to 0.1, or 0.1/0.15 = 0.67, or a 33% decrease from 2000. 226 | However, from 2010 to 2020, the change was 0.1 to 0.01, which is a 90% decrease from 2010. 227 | In practice, you will need to think very carefully about whether you need to present your proportional data in log scale. 228 | 229 | # Exercise 230 | As an exercise, let's visualize this example. 231 | We have two groups, each containing 4 categories. 232 | ```{r} 233 | group1 <- data.frame( 234 | "Type" = c("Type I", "Type II", "Type III", "Type IV"), 235 | "Percentage" = c(15, 35, 30, 20) 236 | ) 237 | group2 <- data.frame( 238 | "Type" = c("Type I", "Type II", "Type III", "Type IV"), 239 | "Percentage" = c(10, 25, 35, 30) 240 | ) 241 | ``` 242 | 243 | Use `ggsave` to save your visualization. 244 | 245 | Hint: look in this lesson to see what I did to combine two entities into one data frame while giving each a unique identifier.
-------------------------------------------------------------------------------- /Lessons/05_Intro_proportions.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | In this lesson we will explore the best practices for presenting proportional data. 3 | The presentation of proportional data is very common, and thus warrants a lesson of its own. 4 | Proportional data are data that range from 0 to 1. 5 | More importantly, proportional data should add up to 1 if they are derived from the same entity. 6 | For example, let's say on a corn ear, there are yellow and white kernels (and no other colors). 7 | The proportion of yellow and white kernels should add up to 1. 8 | If 75% of the kernels are yellow, it implies the proportion of white ones is 25%. 9 | 10 | Goals might differ when we are presenting proportional data. Here are two common objectives: 11 | 12 | 1. Showing different proportions of an entity add up to 1. 13 | 2. Comparing proportions of different entities. 14 | 15 | These two objectives may not be mutually exclusive. 16 | We will explore the best practices for achieving these objectives. 17 | 18 | # Packages 19 | ```{r} 20 | library(tidyverse) 21 | library(readxl) 22 | library(RColorBrewer) 23 | ``` 24 | 25 | # What really is a pie chart? 26 | The most common visualization for proportional data is a pie chart. 27 | You literally see them all the time. 28 | A pie chart is a common type of visualization for proportional data, where proportions add up to 100%. 29 | This is achieved by dividing a circle into sectors, and the sectors add up to a full circle. 30 | This all sounds fine, but how do we construct a pie chart using ggplot? 31 | The short answer is we make a stacked bar chart and wrap it into a circle. 32 | 33 | Let's make a pie chart. 34 | Let's say we have a corn ear that has been open pollinated. 35 | As a result, there are purple, yellow, and white kernels on the ear.
36 | Let's say the proportions are 15% purple, 70% yellow, and 15% white. 37 | 38 | ```{r} 39 | ear_1 <- data.frame( 40 | colors = c("purple", "yellow", "white"), 41 | proportions = c(0.15, 0.7, 0.15) 42 | ) 43 | 44 | head(ear_1) 45 | ``` 46 | 47 | ```{r} 48 | example_1_stacked <- ear_1 %>% 49 | ggplot(aes(x = "A Corn Ear", y = proportions)) + 50 | geom_bar(stat = "identity", aes(fill = colors), 51 | color = "black", width = 0.5) + 52 | scale_fill_identity() + 53 | labs(x = NULL) + 54 | theme_classic() 55 | 56 | example_1_stacked 57 | 58 | ggsave("../Results/05_stack1.png", width = 2, height = 2.5) 59 | ``` 60 | ![stacked_bar1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_stack1.png) 61 | 62 | Note that there are two coloring options in `ggplot`. 63 | There is `color`, which is the color of the outline. 64 | In this case, the outline of the stacks is black. 65 | There is also `fill`, which is the color of the interior. 66 | In this case, the interior of the stacks is colored by the color of the corn kernel. 67 | 68 | So now we have a stacked bar plot. 69 | Technically, we are done. The proportions are presented by the heights of the stacks within the bar. 70 | And this is a perfectly fine visualization of our proportional data. 71 | But what if we want to make a pie chart? 72 | Well, we will use polar coordinates. 73 | 74 | ```{r} 75 | example_1_stacked + 76 | coord_polar(theta = "y") + 77 | theme_void() 78 | 79 | ggsave("../Results/05_pie1.png", width = 2, height = 2.5) 80 | ``` 81 | ![pie1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_pie1.png) 82 | 83 | There are just two extra lines of code to convert a stacked bar into a pie chart. 84 | First, `coord_polar(theta = "y")` wraps the y axis into a circle. 85 | Second, `theme_void()` turns off the x and y axes. 86 | Axes are not all that informative for pie charts anyway. 87 | So if you want to make a pie chart, this is what you do. 
88 | 89 | A final note about pie charts: what are the data represented by? 90 | In this case, the data are represented by the arc length. 91 | So a pie chart is a length based visualization. 92 | Due to the properties of a circle, the data are also represented by the arc angle and sector area. 93 | 94 | # What is the best visualization for comparing proportional data? 95 | However, oftentimes we don't want to just present proportions of one entity. 96 | Instead, we want to compare the proportions of multiple entities. 97 | For this purpose, a pie chart is probably not the best option. 98 | Pie charts have been criticized, because we are much worse at reading angles, areas, and lengths of arcs than reading lengths of straight lines. 99 | 100 | The best way to visualize proportions from multiple entities is stacked bars. 101 | Here is an example. 102 | ![Just make a stacked bar chart](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/dont_pie_chart.svg) 103 | 104 | In this example, we have two groups, each containing 4 sub-categories. 105 | In classic pie charts, the angles (and thus arc lengths & sector area) represent the data. 106 | The problem is that it is very difficult to compare between entities. 107 | We can visually simplify the pie chart into donut charts, where the data are now represented by arc lengths. 108 | However, if we want to use lengths to represent the data, why don't we just unwrap the donut and make stacked bars? 109 | In stacked bar graphs, bars are shown side-by-side and are thus easier to compare across entities. 110 | 111 | Let's try an example on our own. 112 | Let's say we have another open pollinated corn ear. 113 | This ear has 20% purple kernels, 75% yellow kernels, and 5% white kernels. 
114 | ```{r} 115 | ear_2 <- data.frame( 116 | colors = c("purple", "yellow", "white"), 117 | proportions = c(0.20, 0.75, 0.05) 118 | ) 119 | 120 | ears_1_and_2 <- rbind( 121 | ear_1 %>% 122 | mutate(ear = "ear1"), 123 | ear_2 %>% 124 | mutate(ear = "ear2") 125 | ) 126 | 127 | head(ears_1_and_2) 128 | ``` 129 | Say now we want to make a graph to compare the proportions of these two ears. 130 | The code is in fact quite straightforward. 131 | 132 | ```{r} 133 | ears_1_and_2 %>% 134 | ggplot(aes(x = ear, y = proportions)) + 135 | geom_bar(stat = "identity", aes(fill = colors), 136 | color = "black", width = 0.5) + 137 | scale_fill_identity() + 138 | theme_classic() 139 | 140 | ggsave("../Results/05_stack2.png", width = 2, height = 2.5) 141 | ``` 142 | ![stack bar2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_stack2.png) 143 | 144 | That's it! 145 | I would say stacked bars are my go-to visualization for comparing proportions of multiple entities. 146 | A key advantage is that side-by-side stacked bars are more space efficient. 147 | Imagine you are comparing proportions of hundreds of entities. 148 | It is unrealistic to present hundreds of pie or donut charts, let alone compare across them. 149 | But stacked bars make the task easy. 150 | 151 | # Donut charts? 152 | Donut charts are good alternatives to pie charts for presenting proportions of a small number of entities. 153 | We won't cover how to make donut charts here. 154 | If you are interested, you can explore on your own. 155 | But what you should _never_ do is make concentric donuts. 156 | Here is an example. 157 | ![concentric donuts](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/dont_concentric_donuts.svg) 158 | 159 | For concentric donuts, you might be tempted to say the data are represented by the arc lengths, 160 | which is in fact inaccurate. 161 | The arc lengths on the outer rings are much longer than those in the inner rings. 
162 | Group 2 and Group 3 have the same exact values, but the arc lengths of Group 3 are much longer. 163 | In fact the data are represented by the arc angles, which we are bad at reading. 164 | 165 | Since outer rings are longer, the ordering of the groups (which group goes to which ring) has a big impact on the impression of the plot. 166 | It can lead to the apparent paradox where larger values have shorter arcs. 167 | The better (and simpler!) alternative is to just unwrap the donuts and make a good old stacked bar plot. 168 | 169 | # Should you use log scale when presenting proportional data? 170 | This might be something you have not thought about. 171 | Let's look at an example. 172 | In this hypothetical example, we quantified biodiversity of a rain forest relative to year 1960. 173 | The data are normalized to the value of year 1960. 174 | 175 | ```{r} 176 | example_2 <- data.frame( 177 | year = c(1960, 1970, 1980, 1990, 2000, 2010, 2020), 178 | relative_biodiversity = c(1, 0.6, 0.3, 0.2, 0.15, 0.1, 0.01) 179 | ) 180 | 181 | head(example_2) 182 | ``` 183 | 184 | ```{r} 185 | example_2 %>% 186 | ggplot(aes(x = year, y = relative_biodiversity)) + 187 | geom_point(size = 3) + 188 | labs(y = "biodiversity\n(relative to 1960)") + 189 | theme_classic() 190 | 191 | ggsave("../Results/05_dot1.png", width = 3, height = 2.5) 192 | ``` 193 | ![actual scale](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_dot1.png) 194 | 195 | Looking at this graph, you would probably say the loss of biodiversity has stabilized in the last 4 decades. 196 | But is that so? 197 | 198 | A concept related to proportion is odds. 199 | Odds = proportion / (1 - proportion). 200 | For example, if p = 0.5, then the odds are 1:1. 201 | If p = 0.1, then the odds are 1:9. 202 | If p = 0.01, then the odds are 1:99. 203 | A property of proportions is that they are bounded between 0 and 1. 204 | Thus, any changes near 0 and 1 will appear small by definition. 
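To make the odds arithmetic concrete, here is a minimal base R sketch. The helper `odds()` and the vector `p` are hypothetical names introduced only for this illustration; they are not part of the lesson's scripts.

```{r}
# Hypothetical helper (not from the lesson): convert a proportion to odds
odds <- function(p) p / (1 - p)

p <- c(0.5, 0.1, 0.01)

data.frame(
  proportion = p,
  odds = odds(p),           # 1/1, 1/9, and 1/99
  log_odds = log(odds(p))   # natural log of the odds
)

# Fold changes when a proportion drops from 0.1 to 0.01
0.1 / 0.01                  # fold change in proportion
odds(0.1) / odds(0.01)      # fold change in odds
```

Comparing the two fold changes shows why a drop that looks tiny on the 0 to 1 scale can still be a large relative change.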
205 | However, if you think about it, going from 0.1 to 0.01 is a 10-fold change in proportion and an 11-fold change in odds. 206 | We can capture the relative changes using the log scale. 207 | We can use the log10 or natural log scale; the choice of base only rescales the axis. 208 | 209 | ```{r} 210 | example_2 %>% 211 | mutate(log_odds = log(relative_biodiversity/(1-relative_biodiversity))) %>% 212 | ggplot(aes(x = year, y = log_odds)) + 213 | geom_point(size = 3) + 214 | labs(y = "biodiversity\n(relative to 1960\nlog odds scale)") + 215 | theme_classic() 216 | 217 | ggsave("../Results/05_dot2.png", width = 3, height = 2.5) 218 | ``` 219 | ![log odds scale](https://github.com/cxli233/Quick_data_vis/blob/main/Results/05_dot2.png) 220 | 221 | In this case, presenting proportional data in the log odds scale paints a different picture. 222 | In the most recent decade, biodiversity decreased sharply relative to the previous decade. 223 | It makes sense: 224 | From 2000 to 2010, the change was 0.15 to 0.1, or 0.1/0.15 = 0.67, or a 33% decrease from 2000. 225 | However, from 2010 to 2020, the change was 0.1 to 0.01, which is a 90% decrease from 2010. 226 | In practice, you will need to think very carefully about whether you need to present your proportional data in log scale. 227 | 228 | # Exercise 229 | As an exercise, let's visualize this example. 230 | We have two groups, each containing 4 categories. 231 | ```{r} 232 | group1 <- data.frame( 233 | "Type" = c("Type I", "Type II", "Type III", "Type IV"), 234 | "Percentage" = c(15, 35, 30, 20) 235 | ) 236 | group2 <- data.frame( 237 | "Type" = c("Type I", "Type II", "Type III", "Type IV"), 238 | "Percentage" = c(10, 25, 35, 30) 239 | ) 240 | ``` 241 | 242 | Use `ggsave` to save your visualization. 243 | 244 | Hint: look in this lesson to see what I did to combine two entities into one data frame while giving each a unique identifier. 
245 | -------------------------------------------------------------------------------- /Scripts/06_Intro_to_heatmaps.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Intro_to_heatmap.Rmd" 3 | author: "Chenxin Li" 4 | date: "2023-01-15" 5 | output: html_notebook 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | # Introduction 13 | Heatmaps are another very common type of visualization in scientific publications. 14 | In biology they are extremely common in omics papers. 15 | However, extra care must be taken to not produce an uninterpretable or misleading heatmap. 16 | 17 | We will cover: 18 | 19 | 1. What is a heat map? 20 | 2. Are you using an appropriate color scale? 21 | 3. Are you scaling the data correctly to best visualize your data? 22 | 4. How to handle outliers in heatmaps? 23 | 5. Should you reorder rows and columns of a heatmap? 24 | 25 | # Packages 26 | ```{r} 27 | library(tidyverse) 28 | library(RColorBrewer) 29 | library(viridis) 30 | ``` 31 | We have a new package, `viridis`. 32 | `viridis` is a collection of beautiful color scales for heatmaps. 33 | Read more [here](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). 34 | 35 | # What is a heat map? 36 | A heatmap is a representation of 3-dimensional data. 37 | Say we have 3 dimensions x, y, z. 38 | In a heatmap, the z dimension is visualized by different colors. 39 | Let's look at an example: 40 | 41 | ```{r} 42 | abc_1 <- expand.grid( 43 | a = c(1:10), 44 | b = c(1:10) 45 | ) %>% 46 | mutate(c = a + b) 47 | 48 | head(abc_1) 49 | ``` 50 | In this example, we have 3 dimensions: a, b, and c. 51 | a and b range from 1 to 10. 52 | And c is the sum of a and b. 
53 | 54 | ```{r} 55 | abc_1 %>% 56 | ggplot(aes(x = a, y = b)) + 57 | geom_tile(aes(fill = c)) + 58 | scale_fill_gradientn(colors = viridis(100)) + 59 | labs(title = "c = a + b") + 60 | theme_classic() + 61 | coord_fixed() 62 | 63 | ggsave("../Results/06_abc_1.png", height = 3, width = 3) 64 | ``` 65 | 66 | Now this is a heatmap. 67 | In ggplot, a heatmap can be made using `geom_tile()`. 68 | To map values to the tiles, use `aes(fill = ...)`. 69 | In this case, we want to color the interior of tiles using the values of `c`. 70 | We do `aes(fill = c)` inside `geom_tile()`. 71 | 72 | # Choosing the correct color scale 73 | Note the `scale_fill_gradientn(colors = viridis(100))` line in the previous code chunk. 74 | What it does is it maps values of z to the `viridis` color scale, a purple-blue-green-yellow color gradient, where smaller values are darker purple, and larger values are mapped to progressively lighter and yellower colors. 75 | 76 | Color scales are pretty, but we must be extra careful. 77 | There are a couple of important concepts regarding choosing a color scale. 78 | 79 | ## 1. Never use both red/green colors for heatmaps. 80 | Yes, I repeat, _NEVER_ use red and green in the same heat map. 81 | Red/green color blindness occurs in about 1 in 16 males and 1 in 256 females. 82 | People who are red/green colorblind cannot distinguish between red and green; they see brown instead. 83 | They will not be able to read a heatmap where numerical values are represented by a gradient of red and green. 84 | 85 | ## 2. Correctly use unidirectional and bidirectional color scales. 86 | There are two types of color scales: 87 | 88 | 1. Unidirectional or sequential: the color gradient goes from darkest to lightest. The viridis color gradient is a unidirectional color scale. 89 | 2. Bidirectional or divergent: the color gradient has a single lightest point at the center and two darkest points at the extremes. 
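The two kinds of color scales can be sketched as follows. This is only an illustration under stated assumptions: the toy data frame `toy`, its columns `s` and `d`, and the plot names are hypothetical and not part of the lesson. The symmetric `limits` are one way to keep the lightest color of a divergent palette at zero.

```{r}
library(tidyverse)
library(RColorBrewer)
library(viridis)

# Hypothetical toy data (not from the lesson)
toy <- expand.grid(x = 1:10, y = 1:10) %>%
  mutate(
    s = x + y,  # unidirectional: values only increase from a minimum
    d = x - y   # bidirectional: negative, zero, and positive values
  )

# Sequential (unidirectional) color scale for unidirectional data
sequential_plot <- toy %>%
  ggplot(aes(x = x, y = y)) +
  geom_tile(aes(fill = s)) +
  scale_fill_gradientn(colors = viridis(100)) +
  theme_classic() +
  coord_fixed()

# Divergent (bidirectional) color scale for bidirectional data.
# Symmetric limits place zero at the lightest color in the middle.
max_abs <- max(abs(toy$d))

diverging_plot <- toy %>%
  ggplot(aes(x = x, y = y)) +
  geom_tile(aes(fill = d)) +
  scale_fill_gradientn(colors = rev(brewer.pal(11, "RdBu")),
                       limits = c(-max_abs, max_abs)) +
  theme_classic() +
  coord_fixed()

sequential_plot
diverging_plot
```

In the second plot the lightest color carries a meaning (zero), and the two darkest colors mark the two extremes.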
90 | 91 | You should _NEVER_ use a bidirectional color scale for unidirectional data. 92 | ![Color scales](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/ColorScales.svg) 93 | You might not have thought about this one, and people make this mistake all the time. 94 | In this example, the two plots on the right use a bidirectional color scale. 95 | Why? Because the blue and red are darker than white. 96 | The first three plots are fine, but the last one is not okay. 97 | You might ask, what's wrong with it? 98 | 99 | When color scales (or color gradients) are used to represent numerical data, 100 | the darkest and lightest colors should have special meanings. 101 | You can decide what those special meanings are: e.g., max, min, mean, zero. 102 | But they should represent something meaningful. 103 | A data visualization sin for heat maps/color gradients is when the lightest or darkest colors are some arbitrary numbers. 104 | _This is as bad as the longest bar in a bar chart not being the largest value._ Can you imagine that? 105 | 106 | # Same heatmap, same scale 107 | The next issue with heatmaps is scaling. 108 | Let's use an example to illustrate this. 109 | In this hypothetical example, I have the expression levels of 3 genes across 5 stages of development. 110 | I want to visualize their expression pattern across development. 111 | You can ignore this chunk. 112 | ```{r} 113 | genes123 <- data.frame( 114 | stages = c(1:5), 115 | gene1 = c(15, 25, 35, 45, 55), 116 | gene2 = c(100, 200, 300, 400, 500), 117 | gene3 = c(100, 80, 70, 60, 50) 118 | ) %>% 119 | pivot_longer(cols = !stages, names_to = "gene", values_to = "expression") 120 | 121 | head(genes123) 122 | ``` 123 | 124 | If I just make a heatmap, what will happen? 
125 | ```{r} 126 | genes123 %>% 127 | ggplot(aes(x = stages, y = gene)) + 128 | geom_tile(aes(fill = expression), color = "grey90") + 129 | scale_fill_gradientn(colors = brewer.pal(9, "YlGnBu")) + 130 | theme_classic() + 131 | theme( 132 | legend.position = "top" 133 | ) 134 | 135 | ggsave("../Results/06_genes_1.png", height = 2.5, width = 3) 136 | ``` 137 | We have stages on x axis, genes on y axis, and tiles colored by expression. 138 | If you look at the heatmap, you might be tempted to say, gene2 is going up during development, 139 | and genes 1 and 3 are not changing that much. 140 | But is that so? 141 | 142 | In this example, gene2 ranges from 100 to 500, whereas the other two genes range from 15 to 100. 143 | The large difference in range implies that the lower range values are not visualized well. 144 | 145 | A good way to scale the data is using z scores. 146 | A z score is the difference between a value and the mean, divided by the standard deviation. 147 | In this example, we will calculate z scores for each gene. 148 | Remember our good friend `group_by()`? 149 | 150 | ```{r} 151 | genes123 <- genes123 %>% 152 | group_by(gene) %>% 153 | mutate(z = (expression - mean(expression))/sd(expression)) %>% 154 | ungroup() 155 | 156 | head(genes123) 157 | ``` 158 | 159 | What happened here is we `group_by(gene)` first, 160 | such that the z score calculation is done relative to each gene, not the entire data. 161 | Then `mutate(z = (expression - mean(expression))/sd(expression))` quite literally calculates the z score. 162 | Lastly, we `ungroup()`, and we are done. 163 | 164 | Now we make a new heatmap, but color tiles with z scores. 
165 | ```{r} 166 | genes123 %>% 167 | ggplot(aes(x = stages, y = gene)) + 168 | geom_tile(aes(fill = z), color = "grey90") + 169 | scale_fill_gradientn(colors = brewer.pal(9, "YlGnBu")) + 170 | theme_classic() + 171 | theme(legend.position = "top") 172 | 173 | ggsave("../Results/06_genes_2.png", height = 2.5, width = 3) 174 | ``` 175 | In this version of the heatmap, it is obvious that genes 1 and 2 have the same expression pattern. 176 | They both go up during development, while gene3 has the opposite pattern. 177 | 178 | # Be aware of outliers 179 | This is related to the previous point. 180 | Outliers can drastically change the appearance of a heatmap. 181 | Here is an example: 182 | ![check outliers for heatmap](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Check_outliers_for_heatmap.svg) 183 | 184 | In this example, I have 2 observations. 185 | For each observation, I measured 20 features. 186 | Without checking for outliers, it may appear that the 2 observations are overall similar, except at 2 features. 187 | However, after maxing out the color scale around the 95th percentile of the data, it reveals that the two observations are distinct across all features. 188 | 189 | The point here is to always check for outliers when making heatmaps. 190 | If there are outliers, it may be a good idea to clip outliers at the extremes. 191 | 192 | # Reordering rows and columns 193 | In most cases, the ordering of rows and columns is arbitrary. 194 | For example, nothing says the y axis has to be in order of gene1, followed by gene2, then gene3. 195 | However, if the heatmap reflects a physical map (x and y axes map to physical locations in space), 196 | then you can't reorder rows and columns. 197 | 198 | The ordering of the x and y axes has a _HUGE_ impact on how useful a heatmap is, especially a large heatmap. 
199 | Let's look at an example: 200 | ![reorder rows and columns](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Reorder_rows_and_columns_for_heatmap.png) 201 | 202 | In this example, I have cells as columns and features as rows. 203 | Each tile is showing z scores. 204 | It is impossible to get anything useful out of the heatmap without reordering rows and columns. 205 | We can reorder rows and columns, such that features that are enriched in cell type 1 appear first. 206 | Then features that are enriched in cell type 2 appear second. Data from: [Li et al., 2022, BioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.04.498697v1.abstract) 207 | 208 | When rows and columns of a heatmap are properly ordered, you can see a clear signal across the diagonal. For example: 209 | 210 | ![Beautiful heatmap](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Abstract_R_2022_11_24.svg) 211 | 212 | The code underlying reordering rows and columns of a heatmap is actually quite complex. 213 | I would say this is a bit outside the scope of this introductory lesson. 214 | That being said, there are packages that produce heatmaps, and they also handle reordering rows and columns automatically. 215 | For example, [tidyHeatmap](https://github.com/stemangiola/tidyHeatmap) is a good one. 216 | If you ever need to make heatmaps, you should explore how to reorder rows and columns, or explore how to use a package like `tidyHeatmap`. 217 | 218 | # Exercise 219 | ## Q1 220 | Here is the rainbow color scale: 221 | ```{r} 222 | abc_1 %>% 223 | ggplot(aes(x = a, y = b)) + 224 | geom_tile(aes(fill = c)) + 225 | scale_fill_gradientn(colors = rainbow(20)) + 226 | labs(title = "c = a + b") + 227 | theme_classic() + 228 | coord_fixed() 229 | 230 | ggsave("../Results/06_abc_2.png", height = 3, width = 3) 231 | ``` 232 | Can you explain why the rainbow color scale is inappropriate for heatmaps? 233 | 234 | ## Q2 235 | You can ignore the following chunk. 
It generates the data. 236 | ```{r} 237 | Q2_data <- rbind( 238 | c(10, 20, 30, 40, 50, 60), # 6 239 | c(0, 10, 30, 10, 5, 0), # 3 240 | c(100, 200, 300, 400, 500, 550), # 6 241 | c(50, 45, 40, 30, 20, 0), # 1 242 | c(0, 200, 300, 500, 200, 100), # 4 243 | c(10, 500, 200, 100, 10, 0), # 2 244 | c(0, 0, 0, 10, 500, 0) # 5 245 | ) %>% 246 | as.data.frame() %>% 247 | cbind(gene = c("a", "b", "c", "d", "e", "f", "g")) %>% 248 | pivot_longer(cols = !gene, names_to = "V", values_to = "expression") %>% 249 | mutate(stage = str_remove(V, "V")) 250 | 251 | Q2_data %>% 252 | ggplot(aes(x = stage, y = gene)) + 253 | geom_tile(aes(fill = expression)) + 254 | geom_text(aes(label = expression)) + 255 | scale_fill_gradientn(colors = rev(brewer.pal(11, "RdBu")))+ 256 | theme_classic() 257 | 258 | ggsave("../Results/06_genes_3.png", height = 3.5, width = 4.2) 259 | ``` 260 | What is wrong with this heatmap? 261 | Hint: there is a lot going wrong with this heatmap. 262 | -------------------------------------------------------------------------------- /Lessons/08_Intro_plot_assembly.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | In this lesson, we will create publication ready composite plots. 3 | I.e., each composite plot (or figure) is assembled from multiple panels. 4 | And the panels themselves could also be a composite of multiple plots. 5 | 6 | You will learn: 7 | 8 | 1. How to arrange and assemble composite plots. 9 | 2. How to control relative heights and widths of panels. 10 | 3. How to generate hierarchical composite plots. 11 | 4. How to control the panel labels. 12 | 13 | As an example, we will be using the "mtcars" dataset from R. 14 | Don't worry about what the data or these plots mean. 15 | This is just an example. 16 | 17 | # Package 18 | We will be using the `patchwork` package. 19 | Documentation and tutorials can be found [here](https://github.com/thomasp85/patchwork). 
20 | While it is not the only package for plot assembly, it is my go-to. 21 | 22 | ```{r} 23 | library(tidyverse) 24 | library(ggbeeswarm) 25 | library(patchwork) 26 | 27 | library(RColorBrewer) 28 | ``` 29 | 30 | # Making the panels 31 | You can ignore this entire section. 32 | This section generates the parts that we will use to assemble our final figure. 33 | 34 | ## A 35 | ```{r} 36 | A2 <- mtcars %>% 37 | mutate(AM = case_when( 38 | am == 1 ~ "manual", 39 | am == 0 ~ "automatic" 40 | )) %>% 41 | ggplot(aes(x = mpg, y = disp)) + 42 | geom_point(size = 3, alpha = 0.8, color = "white", 43 | shape = 21, aes(fill = AM)) + 44 | scale_fill_manual(values = c("tomato1", "grey20")) + 45 | labs(x = "Miles/(US) gallon", 46 | y = "Displacement (cu.in.)", 47 | fill = "Automatic") + 48 | theme_classic() + 49 | theme( 50 | legend.position = c(0.8, 0.8) 51 | ) 52 | 53 | A1 <- mtcars %>% 54 | mutate(AM = case_when( 55 | am == 1 ~ "manual", 56 | am == 0 ~ "automatic" 57 | )) %>% 58 | ggplot(aes(y = mpg, x = "")) + 59 | geom_quasirandom(size = 3, alpha = 0.8, color = "white", 60 | shape = 21, aes(fill = AM)) + 61 | scale_fill_manual(values = c("tomato1", "grey20")) + 62 | labs(y = "Miles/(US) gallon", 63 | x = NULL) + 64 | theme_classic() + 65 | theme( 66 | legend.position = "none" 67 | ) + 68 | coord_flip() 69 | 70 | A3 <- mtcars %>% 71 | mutate(AM = case_when( 72 | am == 1 ~ "manual", 73 | am == 0 ~ "automatic" 74 | )) %>% 75 | ggplot(aes(y = disp, x = "" )) + 76 | geom_quasirandom(size = 3, alpha = 0.8, color = "white", 77 | shape = 21, aes(fill = AM)) + 78 | scale_fill_manual(values = c("tomato1", "grey20")) + 79 | labs(y = "Displacement (cu.in.)", 80 | x = NULL) + 81 | theme_classic() + 82 | theme( 83 | legend.position = "none" 84 | ) 85 | 86 | # A_scatter 87 | ggsave(plot = A2, filename = "../Results/08_A2.png", height = 5, width = 5) 88 | 89 | # A_hist_mpg 90 | ggsave(plot = A1, filename = "../Results/08_A1.png", height = 3, width = 5) 91 | 92 | # A_hist_disp 93 | 
ggsave(plot = A3, filename = "../Results/08_A3.png", height = 3, width = 5) 94 | ``` 95 | 96 | ## B 97 | ```{r} 98 | Panel_B <- mtcars %>% 99 | mutate(VS = case_when( 100 | vs == 0 ~ "V shaped", 101 | vs == 1 ~ "Straight" 102 | )) %>% 103 | ggplot(aes(x = VS, y = hp)) + 104 | geom_quasirandom(size = 3, alpha = 0.8, 105 | aes(color = VS)) + 106 | scale_color_manual(values = brewer.pal(8, "Accent")[5:6]) + 107 | labs(x = "Engine", 108 | y = "Horsepower") + 109 | theme_classic() + 110 | theme(legend.position = "none") + 111 | coord_flip() 112 | 113 | # Panel_B 114 | ggsave("../Results/08_B.png", height = 2, width = 5) 115 | ``` 116 | 117 | ## C 118 | ```{r} 119 | Panel_C <- mtcars %>% 120 | mutate(rank_wt = rank(wt, ties.method = "first")) %>% 121 | ggplot(aes(x = "", y = rank_wt)) + 122 | geom_tile(aes(fill = mpg)) + 123 | scale_fill_gradientn(colours = brewer.pal(9, "YlGnBu")) + 124 | labs(x = NULL, 125 | y = "Rank of Weight", 126 | fill = "Miles/(US) gallon") + 127 | theme_classic() 128 | 129 | # Panel_C 130 | ggsave("../Results/08_C.png", height = 6, width = 3) 131 | ``` 132 | 133 | 134 | # Putting plots together 135 | Let's say we want to make a figure with three panels: A, B & C. 136 | And we want a layout like this: 137 | 138 | | Column 1 | Column 2 | 139 | | :---------: | :------: | 140 | | Panel A | Panel C | 141 | | Panel B | Panel C | 142 | 143 | The top left corner is panel A. 144 | The lower left corner is panel B. 145 | The right side is panel C. 146 | 147 | ## Panel A 148 | For the sake of this lesson, let's make panel A itself a composite of three sub-panels. 149 | The 3 sub-panels are called "A1", "A2", and "A3". 150 | 151 | Let's say I want panel A to look like this: 152 | 153 | | Column 1 | Column 2 | 154 | | :---------: | :-------: | 155 | | A1 | Blank | 156 | | A2 | A3 | 157 | 158 | Notice that we only have 3 sub-panels to fill this 2x2 layout. 159 | In this case the upper right corner is left blank. 160 | What should we do? 
161 | 162 | Let's look at each sub-panel for panel A first. 163 | I already saved them as R objects in memory. 164 | ```{r} 165 | A1 166 | A2 167 | A3 168 | ``` 169 | 170 | ![A1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A1.png) 171 | 172 | ![A2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A2.png) 173 | 174 | ![A3](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A3.png) 175 | 176 | ### The design of the composite 177 | To create a composite, we use the `wrap_plots()` function from `patchwork`. 178 | The name is somewhat self-explanatory: it wraps plots into a larger one. 179 | `patchwork` has an intuitive way to arrange plots using the `design = ` argument. 180 | In this case, the design looks like this: 181 | 182 | ```{r} 183 | design <- c( 184 | "A# 185 | BC" 186 | ) 187 | ``` 188 | 189 | The `design` argument is a character vector of uppercase letters (A, B, C, and so on). 190 | Within this vector we start new lines to mean different rows in the composite. 191 | In this case, the `design` has two rows, so the composite will have two rows. 192 | It specifies that "part A" will be at the top left, and that "part B" will be at the lower left. 193 | Finally, "part C" will be at the lower right. 194 | The "#" sign indicates a blank space. 195 | 196 | To set it up, we simply call `wrap_plots()` and list the sub-panels in order. 197 | ```{r} 198 | wrap_plots(A1, A2, A3, 199 | design = c("A# 200 | BC")) 201 | 202 | ggsave("../Results/08_A_1.png", height = 6, width = 6) 203 | ``` 204 | 205 | ![A_1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A_1.png) 206 | 207 | The way `wrap_plots()` matches sub-panel names to parts A, B, and C is by the order. 208 | In this case, the first sub-panel name (A1) provided is part A, and so on. 209 | 210 | ### Controlling relative heights and widths of sub-panels. 211 | `patchwork` allows you to control the relative heights and the widths of sub-panels. 
212 | In this example, we have a 2x2 design. 213 | Let's say I want the 1st row to be 30% of the height of the 2nd row. 214 | And I want the 2nd column to be 30% of the width of the 1st. 215 | This can be specified using the `heights = ` and `widths =` arguments within `wrap_plots()`. 216 | 217 | ```{r} 218 | wrap_plots(A1, A2, A3, 219 | design = c("A# 220 | BC"), 221 | heights = c(0.3, 1), 222 | widths = c(1, 0.3)) 223 | 224 | ggsave("../Results/08_A_2.png", height = 6, width = 6) 225 | ``` 226 | 227 | ![A_2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A_2.png) 228 | 229 | For example, the vector `c(0.3, 1)` passed to the `heights` option specifies that the 1st row should be 0.3 of the height of the 2nd row. 230 | Notice we only have 2 rows and 2 columns, so the heights and widths vectors each have 2 numbers. 231 | However, if the dimension of the design is different, the heights & widths vectors must have matching dimensions too. 232 | For example, if your design has 3 rows and 4 columns, the heights vector must have 3 numbers, and the widths vector must have 4 numbers. 233 | 234 | Finally, we can control where we put panel labels. 235 | In publication ready figures, each panel is labeled by a letter (e.g., A, B, C and so on). 236 | In this example, panel A has 3 sub-panels, so how do we control which sub-panel gets labeled? 237 | We can control this by appending `+ labs(tag = "...")` to a sub-panel inside `wrap_plots()`. 238 | Let's say we want the letter "A" to appear at the upper left of panel A. 239 | The best way to do this is appending `+ labs(tag = "A")` to the first sub-panel, in this case A1. 
240 | 241 | 242 | ```{r} 243 | Panel_A <- wrap_plots(A1 + 244 | labs(tag = "A"), 245 | A2, A3, 246 | design = c("A# 247 | BC"), 248 | heights = c(0.3, 1), 249 | widths = c(1, 0.3)) 250 | 251 | Panel_A 252 | ggsave("../Results/08_A.png", height = 6, width = 6) 253 | ``` 254 | 255 | ![A](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_A.png) 256 | 257 | Now we have panel A! 258 | 259 | ## Hierarchical assembly 260 | We made panel A, but we still need to assemble panel A with panels B and C. 261 | Well, we don't have to stop here. 262 | We can call `wrap_plots()` again, now using "Panel_A" as one of the parts. 263 | 264 | Here are panels B and C: 265 | ```{r} 266 | Panel_B 267 | Panel_C 268 | ``` 269 | 270 | ![B](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_B.png) 271 | 272 | ![C](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_C.png) 273 | 274 | Let's say we want panel A and panel B to be in the left column, and panel C alone to take up the entire right column. What should the design argument be? 275 | ```{r} 276 | c("AC 277 | BC") 278 | ``` 279 | In this case, the design vector has two rows. 280 | Because the letter C appears in the 2nd column of both the 1st and 2nd rows, part C takes up the entire 2nd column. 281 | 282 | And let's say we want the 2nd column to be 1/10 of the width of the 1st. 283 | And that the 2nd row to be 30% of the height of the 1st. 284 | Again, we can control that by setting `widths = ` and `heights = ` within `wrap_plots()`. 285 | 286 | ```{r} 287 | wrap_plots(Panel_A, 288 | Panel_B, 289 | Panel_C, 290 | design = "AC 291 | BC", 292 | widths = c(1, 0.1), 293 | heights = c(1, 0.3)) 294 | 295 | ggsave("../Results/08_assembled.png", height = 7, width = 8) 296 | ``` 297 | 298 | ![assembled](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_assembled.png) 299 | 300 | Now we are really close. 301 | Notice that only panel A is labeled? 
302 | We want panel B to be labeled with a "B" and panel C to be labeled with a "C". 303 | We just need to append `+ labs(tag = "B")` and `+ labs(tag = "C")` to the respective panels. 304 | 305 | ## Adding panel label 306 | ```{r} 307 | wrap_plots(Panel_A, 308 | Panel_B + 309 | labs(tag = "B"), 310 | Panel_C + 311 | labs(tag = "C"), 312 | design = "AC 313 | BC", 314 | widths = c(1, 0.1), 315 | heights = c(1, 0.3)) 316 | 317 | ggsave("../Results/08_assembled_label.png", height = 7, width = 8) 318 | ``` 319 | 320 | ![assembled_labeled](https://github.com/cxli233/Quick_data_vis/blob/main/Results/08_assembled_label.png) 321 | 322 | Done! 323 | Great, we have a publication ready figure! 324 | 325 | # Exercise 326 | Use your own data from your research to generate a composite figure. 327 | Save it using `ggsave()`. 328 | 329 | -------------------------------------------------------------------------------- /Lessons/06_Intro_to_heatmaps.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | Heatmap is another very common type of visualization in scientific publications. 3 | In biology they are extremely common in omics papers. 4 | However, extra care must be taken to not produce an uninterpretable or misleading heatmap. 5 | 6 | We will cover: 7 | 8 | 1. What is a heat map? 9 | 2. Are you using an appropriate color scale? 10 | 3. Are you scaling the data correctly to best visualize your data? 11 | 4. How to handle outliers in heatmaps? 12 | 5. Should you reorder rows and columns of a heatmap? 13 | 14 | # Packages 15 | ```{r} 16 | library(tidyverse) 17 | library(RColorBrewer) 18 | library(viridis) 19 | ``` 20 | We have new package `viridis`. 21 | `viridis` is a collection of beautiful color scales for heatmaps. 22 | Read more [here](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). 23 | 24 | # What is a heat map? 25 | A heatmap is a representation of 3 dimension data. 
26 | Say we have 3 dimensions x, y, z. 27 | In a heatmap, the z dimension is visualized by different colors. 28 | Let's look at an example: 29 | 30 | ```{r} 31 | abc_1 <- expand.grid( 32 | a = c(1:10), 33 | b = c(1:10) 34 | ) %>% 35 | mutate(c = a + b) 36 | 37 | head(abc_1) 38 | ``` 39 | In this example, we have 3 dimensions: a, b, and c. 40 | Both a and b range from 1 to 10. 41 | And c is the sum of a and b. 42 | 43 | ```{r} 44 | abc_1 %>% 45 | ggplot(aes(x = a, y = b)) + 46 | geom_tile(aes(fill = c)) + 47 | scale_fill_gradientn(colors = viridis(100)) + 48 | labs(title = "c = a + b") + 49 | theme_classic() + 50 | coord_fixed() 51 | 52 | ggsave("../Results/06_abc_1.png", height = 3, width = 3) 53 | ``` 54 | ![heatmap1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_abc_1.png) 55 | 56 | Now this is a heatmap. 57 | In ggplot, a heatmap can be made using `geom_tile()`. 58 | To map values to the tiles, use `aes(fill = ...)`. 59 | In this case, we want to color the interior of tiles using the values of `c`. 60 | We do `aes(fill = c)` inside `geom_tile()`. 61 | 62 | # Choosing the correct color scale 63 | Note the `scale_fill_gradientn(colors = viridis(100))` line in the previous code chunk. 64 | It maps the values of the third dimension (here `c`) to the `viridis` color scale, a purple-blue-green-yellow color gradient, where smaller values are darker purple, and larger values are mapped to progressively lighter and yellower colors. 65 | 66 | Color scales are pretty, but we must be extra careful. 67 | There are a couple of important concepts regarding choosing a color scale. 68 | 69 | ## 1. Never use both red/green colors for heatmaps. 70 | Yes, I repeat, _NEVER_ use red and green in the same heat map. 71 | Red/green color blindness occurs in about 1 in 16 males and 1 in 256 females. 72 | People who are red/green colorblind cannot distinguish between red and green; they see brown instead.
73 | They will not be able to read a heatmap where numerical values are represented by a gradient of red and green. 74 | 75 | ## 2. Correctly use unidirectional and bidirectional color scales. 76 | There are two types of color scales: 77 | 78 | 1. Unidirectional or sequential: the color gradient goes from darkest to lightest. The viridis color gradient is a unidirectional color scale. 79 | 2. Bidirectional or divergent: the color gradient has a single lightest point at the center and two darkest points at the extremes. 80 | 81 | You should _NEVER_ use a bidirectional color scale for unidirectional data. 82 | ![Color scales](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/ColorScales.svg) 83 | 84 | You might not have thought about this one, and people make this mistake all the time. 85 | In this example, the two plots on the right use a bidirectional color scale. 86 | Why? Because the blue and red are darker than white. 87 | The first three plots are fine, but the last one is not okay. 88 | You might ask, what's wrong with it? 89 | 90 | When color scales (or color gradients) are used to represent numerical data, 91 | the darkest and lightest colors should have special meanings. 92 | You can decide what those special meanings are: e.g., max, min, mean, zero. 93 | But they should represent something meaningful. 94 | A data visualization sin for heat maps/color gradients is when the lightest or darkest colors are some arbitrary numbers. 95 | _This is as bad as the longest bar in a bar chart not being the largest value._ Can you imagine that? 96 | 97 | # Same heatmap, same scale 98 | The next issue with heatmaps is scaling. 99 | Let's use an example to illustrate this. 100 | In this hypothetical example, I have the expression level of 3 genes across 5 stages of development. 101 | I want to visualize their expression pattern across development. 102 | You can ignore this chunk.
103 | ```{r} 104 | genes123 <- data.frame( 105 | stages = c(1:5), 106 | gene1 = c(15, 25, 35, 45, 55), 107 | gene2 = c(100, 200, 300, 400, 500), 108 | gene3 = c(100, 80, 70, 60, 50) 109 | ) %>% 110 | pivot_longer(cols = !stages, names_to = "gene", values_to = "expression") 111 | 112 | head(genes123) 113 | ``` 114 | 115 | If I just make a heatmap, what will happen? 116 | ```{r} 117 | genes123 %>% 118 | ggplot(aes(x = stages, y = gene)) + 119 | geom_tile(aes(fill = expression), color = "grey90") + 120 | scale_fill_gradientn(colors = brewer.pal(9, "YlGnBu")) + 121 | theme_classic() + 122 | theme( 123 | legend.position = "top" 124 | ) 125 | 126 | ggsave("../Results/06_genes_1.png", height = 2.5, width = 3) 127 | ``` 128 | ![gene_heatmap1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_genes_1.png) 129 | 130 | We have stages on x axis, genes on y axis, and tiles colored by expression. 131 | If you look at the heatmap, you might be tempted to say, gene2 is going up during development, 132 | and genes 1 and 3 are not changing that much. 133 | But is that so? 134 | 135 | In this example, gene2 ranges from 100-500, whereas the other two genes range from about 10-100. 136 | The large difference in range means that changes among the lower values are not visualized well. 137 | 138 | A good way to scale the data is using z score. 139 | A z score is the difference between a value and the mean, divided by the standard deviation. 140 | In this example, we will calculate z score for each gene. 141 | Remember our good friend `group_by()`? 142 | 143 | ```{r} 144 | genes123 <- genes123 %>% 145 | group_by(gene) %>% 146 | mutate(z = (expression - mean(expression))/sd(expression)) %>% 147 | ungroup() 148 | 149 | head(genes123) 150 | ``` 151 | 152 | What happened here is we `group_by(gene)` first, 153 | such that z score calculation is done relative to each gene, not the entire data.
154 | Then `mutate(z = (expression - mean(expression))/sd(expression))` quite literally calculates the z score. 155 | Lastly, we `ungroup()`, and we are done. 156 | 157 | Now we make a new heatmap, but color tiles with z scores. 158 | ```{r} 159 | genes123 %>% 160 | ggplot(aes(x = stages, y = gene)) + 161 | geom_tile(aes(fill = z), color = "grey90") + 162 | scale_fill_gradientn(colors = brewer.pal(9, "YlGnBu")) + 163 | theme_classic() + 164 | theme(legend.position = "top") 165 | 166 | ggsave("../Results/06_genes_2.png", height = 2.5, width = 3) 167 | ``` 168 | ![gene_heatmap2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_genes_2.png) 169 | 170 | In this version of the heatmap, it is obvious that genes 1 and 2 have the same expression pattern. 171 | They both go up during development, while gene3 has the opposite pattern. 172 | 173 | # Be aware of outliers 174 | This is related to the previous point. 175 | Outliers can drastically change the appearance of a heatmap. 176 | Here is an example: 177 | ![check outliers for heatmap](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Check_outliers_for_heatmap.svg) 178 | 179 | In this example, I have 2 observations. 180 | For each observation, I measured 20 features. 181 | Without checking for outliers, it may appear that the 2 observations are overall similar, except at 2 features. 182 | However, after maxing out the color scale around the 95th percentile of the data, it reveals that the two observations are distinct across all features. 183 | 184 | The point here is to always check for outliers when making heatmaps. 185 | If there are outliers, it may be a good idea to clip outliers at the extremes. 186 | 187 | # Reordering rows and columns 188 | In most cases, the ordering of rows and columns is arbitrary. 189 | For example, nothing says the y axis has to be in order of gene1, followed by gene2, then gene3.
190 | However, if the heatmap reflects a physical map (the x and y axes map to physical locations in space), 191 | then you can't reorder rows and columns. 192 | 193 | The ordering of the x and y axes has a _HUGE_ impact on how useful a heatmap is, especially a large heatmap. 194 | Let's look at an example: 195 | ![reorder rows and columns](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Reorder_rows_and_columns_for_heatmap.png) 196 | 197 | In this example, I have cells as columns and features as rows. 198 | Each tile is showing z scores. 199 | It is impossible to get anything useful out of the heatmap without reordering rows and columns. 200 | We can reorder rows and columns, such that features that are enriched in cell type 1 appear first. 201 | Then features that are enriched in cell type 2 appear second. Data from: [Li et al., 2022, BioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.04.498697v1.abstract) 202 | 203 | When rows and columns of a heatmap are properly ordered, you can see a clear signal across the diagonal. For example: 204 | 205 | ![Beautiful heatmap](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Abstract_R_2022_11_24.svg) 206 | 207 | The code underlying reordering rows and columns of a heatmap is actually quite complex. 208 | I would say this is a bit outside the scope of this introductory lesson. 209 | That being said, there are packages that produce heatmaps and also handle reordering rows and columns automatically. 210 | For example, [tidyHeatmap](https://github.com/stemangiola/tidyHeatmap) is a good one. 211 | If you ever need to make heatmaps, you should explore how to reorder rows and columns, or explore how to use a package like `tidyHeatmap`.
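
As a starting point for reordering by hand, here is a minimal sketch that uses hierarchical clustering from base R. The matrix `z_mat` and its values are made up for illustration (they are not from the data shown above), and clustering is only one of several reasonable ways to order rows and columns; it is not necessarily how the figures above were made.

```r
# Hypothetical z-score matrix: rows are features, columns are cells.
z_mat <- rbind(
  feat1 = c( 1.2, -0.9,  1.1, -1.0),
  feat2 = c(-1.1,  1.0, -0.8,  1.2),
  feat3 = c( 1.0, -1.1,  0.9, -0.7),
  feat4 = c(-0.9,  1.1, -1.2,  0.8)
)
colnames(z_mat) <- c("cell1", "cell2", "cell3", "cell4")

# Cluster rows and columns separately, so that similar profiles end up adjacent.
row_order <- hclust(dist(z_mat))$order
col_order <- hclust(dist(t(z_mat)))$order

# Rearrange the matrix using the clustering order.
z_reordered <- z_mat[row_order, col_order]
z_reordered

# The same order can be stored as factor levels for plotting with ggplot.
feature_levels <- rownames(z_reordered)
cell_levels <- colnames(z_reordered)
```

With a tidy data frame, one option is to convert the row and column variables to factors using these levels (e.g., `factor(feature, levels = feature_levels)`) before piping into `ggplot()`, since `geom_tile()` draws discrete axes in factor-level order.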
212 | 213 | # Exercise 214 | ## Q1 215 | Here is the rainbow color scale: 216 | ```{r} 217 | abc_1 %>% 218 | ggplot(aes(x = a, y = b)) + 219 | geom_tile(aes(fill = c)) + 220 | scale_fill_gradientn(colors = rainbow(20)) + 221 | labs(title = "c = a + b") + 222 | theme_classic() + 223 | coord_fixed() 224 | 225 | ggsave("../Results/06_abc_2.png", height = 3, width = 3) 226 | ``` 227 | ![rainbow](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_abc_2.png) 228 | 229 | Can you explain why the rainbow color scale is inappropriate for heatmaps? 230 | 231 | ## Q2 232 | You can ignore the following chunk. It generates the data. 233 | ```{r} 234 | Q2_data <- rbind( 235 | c(10, 20, 30, 40, 50, 60), # 6 236 | c(0, 10, 30, 10, 5, 0), # 3 237 | c(100, 200, 300, 400, 500, 550), # 6 238 | c(50, 45, 40, 30, 20, 0), # 1 239 | c(0, 200, 300, 500, 200, 100), # 4 240 | c(10, 500, 200, 100, 10, 0), # 2 241 | c(0, 0, 0, 10, 500, 0) # 5 242 | ) %>% 243 | as.data.frame() %>% 244 | cbind(gene = c("a", "b", "c", "d", "e", "f", "g")) %>% 245 | pivot_longer(cols = !gene, names_to = "V", values_to = "expression") %>% 246 | mutate(stage = str_remove(V, "V")) 247 | 248 | Q2_data %>% 249 | ggplot(aes(x = stage, y = gene)) + 250 | geom_tile(aes(fill = expression)) + 251 | geom_text(aes(label = expression)) + 252 | scale_fill_gradientn(colors = rev(brewer.pal(11, "RdBu")))+ 253 | theme_classic() 254 | 255 | ggsave("../Results/06_genes_3.png", height = 3.5, width = 4.2) 256 | ``` 257 | ![bad heatmap](https://github.com/cxli233/Quick_data_vis/blob/main/Results/06_genes_3.png) 258 | 259 | What is (are) wrong with this heatmap? 260 | Hint: there is a lot going wrong with this heatmap. 
261 | -------------------------------------------------------------------------------- /Scripts/08_Intro_plot_assembly.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "08_Plot_assembly" 3 | author: "Chenxin Li" 4 | date: "2023-07-09" 5 | output: html_notebook 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | # Introduction 13 | In this lesson, we will create publication ready composite plots. 14 | I.e., each composite plot (or figure) is assembled from multiple panels. 15 | And the panels themselves could also be a composite of multiple plots. 16 | 17 | You will learn: 18 | 19 | 1. How to arrange and assemble composite plots. 20 | 2. How to control relative heights and widths of panels. 21 | 3. How to generate hierarchical composite plots. 22 | 4. How to control the panel labels. 23 | 24 | As an example, we will be using the "mtcars" dataset from R. 25 | Don't worry about what the data or these plots mean. 26 | This is just an example. 27 | 28 | # Package 29 | We will be using the `patchwork` package. 30 | Documentation and tutorials can be found [here](https://github.com/thomasp85/patchwork). 31 | While it is not the only package for plot assembly, it is my go-to. 32 | 33 | ```{r} 34 | library(tidyverse) 35 | library(ggbeeswarm) 36 | library(patchwork) 37 | 38 | library(RColorBrewer) 39 | ``` 40 | 41 | # Making the panels 42 | You can ignore this entire section. 43 | This section generates the parts that we will use to assemble our final figure.
44 | 45 | ## A 46 | ```{r} 47 | A2 <- mtcars %>% 48 | mutate(AM = case_when( 49 | am == 1 ~ "manual", 50 | am == 0 ~ "automatic" 51 | )) %>% 52 | ggplot(aes(x = mpg, y = disp)) + 53 | geom_point(size = 3, alpha = 0.8, color = "white", 54 | shape = 21, aes(fill = AM)) + 55 | scale_fill_manual(values = c("tomato1", "grey20")) + 56 | labs(x = "Miles/(US) gallon", 57 | y = "Displacement (cu.in.)", 58 | fill = "Automatic") + 59 | theme_classic() + 60 | theme( 61 | legend.position = c(0.8, 0.8) 62 | ) 63 | 64 | A1 <- mtcars %>% 65 | mutate(AM = case_when( 66 | am == 1 ~ "manual", 67 | am == 0 ~ "automatic" 68 | )) %>% 69 | ggplot(aes(y = mpg, x = "")) + 70 | geom_quasirandom(size = 3, alpha = 0.8, color = "white", 71 | shape = 21, aes(fill = AM)) + 72 | scale_fill_manual(values = c("tomato1", "grey20")) + 73 | labs(y = "Miles/(US) gallon", 74 | x = NULL) + 75 | theme_classic() + 76 | theme( 77 | legend.position = "none" 78 | ) + 79 | coord_flip() 80 | 81 | A3 <- mtcars %>% 82 | mutate(AM = case_when( 83 | am == 1 ~ "manual", 84 | am == 0 ~ "automatic" 85 | )) %>% 86 | ggplot(aes(y = disp, x = "" )) + 87 | geom_quasirandom(size = 3, alpha = 0.8, color = "white", 88 | shape = 21, aes(fill = AM)) + 89 | scale_fill_manual(values = c("tomato1", "grey20")) + 90 | labs(y = "Displacement (cu.in.)", 91 | x = NULL) + 92 | theme_classic() + 93 | theme( 94 | legend.position = "none" 95 | ) 96 | 97 | # A_scatter 98 | ggsave(plot = A2, filename = "../Results/08_A2.png", height = 5, width = 5) 99 | 100 | # A_hist_mpg 101 | ggsave(plot = A1, filename = "../Results/08_A1.png", height = 3, width = 5) 102 | 103 | # A_hist_disp 104 | ggsave(plot = A3, filename = "../Results/08_A3.png", height = 3, width = 5) 105 | ``` 106 | 107 | ## B 108 | ```{r} 109 | Panel_B <- mtcars %>% 110 | mutate(VS = case_when( 111 | vs == 0 ~ "V shaped", 112 | vs == 1 ~ "Straight" 113 | )) %>% 114 | ggplot(aes(x = VS, y = hp)) + 115 | geom_quasirandom(size = 3, alpha = 0.8, 116 | aes(color = VS)) + 117 | 
scale_color_manual(values = brewer.pal(8, "Accent")[5:6]) + 118 | labs(x = "Engine", 119 | y = "Horsepower") + 120 | theme_classic() + 121 | theme(legend.position = "none") + 122 | coord_flip() 123 | 124 | # Panel_B 125 | ggsave("../Results/08_B.png", height = 2, width = 5) 126 | ``` 127 | 128 | ## C 129 | ```{r} 130 | Panel_C <- mtcars %>% 131 | mutate(rank_wt = rank(wt, ties.method = "first")) %>% 132 | ggplot(aes(x = "", y = rank_wt)) + 133 | geom_tile(aes(fill = mpg)) + 134 | scale_fill_gradientn(colours = brewer.pal(9, "YlGnBu")) + 135 | labs(x = NULL, 136 | y = "Rank of Weight", 137 | fill = "Miles/(US) gallon") + 138 | theme_classic() 139 | 140 | # Panel_C 141 | ggsave("../Results/08_C.png", height = 6, width = 3) 142 | ``` 143 | 144 | 145 | # Putting plots together 146 | Let's say we want to make a figure with three panels: A, B & C. 147 | And we want a layout like this: 148 | 149 | | Column 1 | Column 2 | 150 | | :---------: | :------: | 151 | | Panel A | Panel C | 152 | | Panel B | Panel C | 153 | 154 | The top left corner is panel A. 155 | The lower left corner is panel B. 156 | The right side is panel C. 157 | 158 | ## Panel A 159 | For the sake of this lesson, let's make panel A itself a composite of three sub-panels. 160 | The 3 sub-panels are called "A1", "A2", and "A3". 161 | 162 | Let's say I want panel A to look like this: 163 | 164 | | Column 1 | Column 2 | 165 | | :---------: | :-------: | 166 | | A1 | Blank | 167 | | A2 | A3 | 168 | 169 | Notice that we only have 3 sub-panels to fill this 2x2 layout. 170 | In this case the upper right corner is left blank. 171 | What should we do? 172 | 173 | Let's look at each sub-panel for panel A first. 174 | I already saved them as an R item in the memory. 175 | ```{r} 176 | A1 177 | A2 178 | A3 179 | ``` 180 | 181 | ### The design of the composite 182 | To create a composite, we use the `wrap_plots()` function from `patchwork`. 
183 | The name is somewhat self-explanatory: it wraps plots into a larger one. 184 | `patchwork` has an intuitive way to arrange plots using the `design = ` argument. 185 | In this case, the design looks like this: 186 | 187 | ```{r} 188 | design <- c( 189 | "A# 190 | BC" 191 | ) 192 | ``` 193 | 194 | The `design` argument is a character vector of uppercase letters (A, B, C, and so on). 195 | Within this vector we start new lines to mean different rows in the composite. 196 | In this case, the `design` has two rows, so the composite will have two rows. 197 | It specifies that "part A" will be at the top left, and that "part B" will be at the lower left. 198 | Finally, "part C" will be at the lower right. 199 | The "#" sign indicates a blank space. 200 | 201 | To set it up, we simply call `wrap_plots()` and list the sub-panels in order. 202 | ```{r} 203 | wrap_plots(A1, A2, A3, 204 | design = c("A# 205 | BC")) 206 | 207 | ggsave("../Results/08_A_1.png", height = 6, width = 6) 208 | ``` 209 | The way `wrap_plots()` matches sub-panel names to parts A, B, and C is by the order. 210 | In this case, the first sub-panel name (A1) provided is part A, and so on. 211 | 212 | ### Controlling relative heights and widths of sub-panels. 213 | `patchwork` allows you to control the relative heights and widths of sub-panels. 214 | In this example, we have a 2x2 design. 215 | Let's say I want the 1st row to be 30% of the height of the 2nd row. 216 | And I want the 2nd column to be 30% of the width of the 1st. 217 | This can be specified using the `heights = ` and `widths = ` arguments within `wrap_plots()`. 218 | 219 | ```{r} 220 | wrap_plots(A1, A2, A3, 221 | design = c("A# 222 | BC"), 223 | heights = c(0.3, 1), 224 | widths = c(1, 0.3)) 225 | 226 | ggsave("../Results/08_A_2.png", height = 6, width = 6) 227 | ``` 228 | For example, the vector `c(0.3, 1)` passed to the `heights` argument specifies that the 1st row should be 0.3 of the height of the 2nd row.
229 | Notice we only have 2 rows and 2 columns, so the heights and widths vectors each have 2 numbers. 230 | However, if the dimensions of the design are different, the heights and widths vectors must have matching lengths too. 231 | For example, if your design has 3 rows and 4 columns, the heights vector must have 3 numbers, and the widths vector must have 4 numbers. 232 | 233 | Finally, we can control where we put panel labels. 234 | In publication ready figures, each panel is labeled by a letter (e.g., A, B, C and so on). 235 | In this example, panel A has 3 sub-panels, so how do we control which sub-panel gets labeled? 236 | We can control this by appending `+ labs(tag = "...")` to a sub-panel inside `wrap_plots()`. 237 | Let's say we want the letter "A" to appear at the upper left of panel A. 238 | The best way to do this is appending `+ labs(tag = "A")` to the first sub-panel, in this case A1. 239 | 240 | 241 | ```{r} 242 | Panel_A <- wrap_plots(A1 + 243 | labs(tag = "A"), 244 | A2, A3, 245 | design = c("A# 246 | BC"), 247 | heights = c(0.3, 1), 248 | widths = c(1, 0.3)) 249 | 250 | Panel_A 251 | ggsave("../Results/08_A.png", height = 6, width = 6) 252 | ``` 253 | Now we have panel A! 254 | 255 | ## Hierarchical assembly 256 | We made panel A, but we still need to assemble panel A with panels B and C. 257 | Well, we don't have to stop here. 258 | We can call `wrap_plots()` again, now using "Panel_A" as one of the parts. 259 | 260 | Here are panels B and C: 261 | ```{r} 262 | Panel_B 263 | Panel_C 264 | ``` 265 | 266 | Let's say we want panel A and panel B to be in the left column, and panel C alone to take up the entire right column. What should the design argument be? 267 | ```{r} 268 | c("AC 269 | BC") 270 | ``` 271 | In this case, the design vector has two rows. 272 | Part C takes up the entire 2nd column because we placed "C" in the 2nd position of both the 1st and 2nd rows.
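
To see how a design string maps onto a grid, here is a small base R sketch that splits the string into rows and cells. It is only an illustration of how to read a design; it is not how `patchwork` parses the string internally, and the object names `design_rows` and `design_grid` are made up for this example.

```r
design <- "AC
BC"

# Split the string into rows at each line break and drop any indentation.
design_rows <- trimws(strsplit(design, "\n")[[1]])

# One matrix cell per letter: rows of the matrix are rows of the composite.
design_grid <- do.call(rbind, strsplit(design_rows, ""))
design_grid

# Cells occupied by part C: both rows of the 2nd column.
which(design_grid == "C", arr.ind = TRUE)
```

Reading the grid column by column gives the layout: parts A and B stack in the 1st column, and part C fills the 2nd column.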
273 | 274 | And let's say we want the 2nd column to be 1/10 of the width of the 1st. 275 | And we want the 2nd row to be 30% of the height of the 1st. 276 | Again, we can control that by setting `widths = ` and `heights = ` within `wrap_plots()`. 277 | 278 | ```{r} 279 | wrap_plots(Panel_A, 280 | Panel_B, 281 | Panel_C, 282 | design = "AC 283 | BC", 284 | widths = c(1, 0.1), 285 | heights = c(1, 0.3)) 286 | 287 | ggsave("../Results/08_assembled.png", height = 7, width = 8) 288 | ``` 289 | Now we are really close. 290 | Notice that only panel A is labeled? 291 | We want panel B to be labeled with a "B" and panel C to be labeled with a "C". 292 | We just need to append `+ labs(tag = "B")` and `+ labs(tag = "C")` to the respective panels. 293 | 294 | ## Adding panel label 295 | ```{r} 296 | wrap_plots(Panel_A, 297 | Panel_B + 298 | labs(tag = "B"), 299 | Panel_C + 300 | labs(tag = "C"), 301 | design = "AC 302 | BC", 303 | widths = c(1, 0.1), 304 | heights = c(1, 0.3)) 305 | 306 | ggsave("../Results/08_assembled_label.png", height = 7, width = 8) 307 | ``` 308 | Done! 309 | Great, we have a publication ready figure! 310 | 311 | # Exercise 312 | Use your own data from your research to generate a composite figure. 313 | Save it using `ggsave()`. 314 | 315 | For those of you who do not have your own data, please use these datasets from a hypothetical project. 316 | 317 | You measured tensile strength and resistance of a material at two temperatures (t1 and t2) and two pressures (p1 and p2) (dataset1). 318 | You also tested samples of this material in the above-mentioned temperature-pressure combinations. You recorded the percentages of samples that passed, failed, or were unclear (dataset2). 319 | 320 | * What is the condition that resulted in the highest tensile strength? 321 | * What is the condition that resulted in the lowest resistance? 322 | * Is there any correlation between tensile strength and resistance across various conditions?
323 | * What is the condition that resulted in the lowest percentage of failure? What is the breakdown of test outcomes across treatment? 324 | * Based on all the data available, which treatment is the best? 325 | 326 | ## Datasets 327 | ```{r} 328 | dataset1 <- read.csv("../Data/dataset1.csv") 329 | dataset2 <- read.csv("../Data/dataset2.csv") 330 | 331 | head(dataset1) 332 | head(dataset2) 333 | ``` 334 | 335 | 336 | -------------------------------------------------------------------------------- /Scripts/03_Intro_to_data_vis.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Intro_to_data_vis" 3 | author: "Chenxin Li" 4 | date: "2023-01-06" 5 | output: 6 | html_notebook: 7 | number_sections: yes 8 | toc: yes 9 | toc_float: yes 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | # Introduction 17 | 18 | In this lesson, we will cover: 19 | 20 | 1. How are data represented in visualizations? 21 | 2. What is "the grammar of graphics"? 22 | 3. Introduction to ggplot: mapping variables to axis and aesthetics. 23 | 24 | # Required packages 25 | ```{r} 26 | library(tidyverse) 27 | library(RColorBrewer) 28 | ``` 29 | 30 | The `tidyverse` will be doing all the heavy lifting. 31 | The `RColorBrewer` will be providing some nice color sets for our graphs. 32 | 33 | # How are data represented in visualizations? 
34 | As an example, let's simulate some data: 35 | ```{r} 36 | set.seed(666) 37 | Example_data1 <- rbind( 38 | rnorm(3, mean = 5, sd = 0.5) %>% 39 | as.data.frame() %>% 40 | mutate(time = 1), 41 | rnorm(3, mean = 8, sd = 0.5) %>% 42 | as.data.frame() %>% 43 | mutate(time = 2), 44 | rnorm(3, mean = 10, sd = 0.5) %>% 45 | as.data.frame() %>% 46 | mutate(time = 3) 47 | ) %>% 48 | rename(response = ".") 49 | 50 | Example_data1 51 | ``` 52 | |response|time| 53 | |--------|----| 54 | |5.376656| 1| 55 | |6.007177| 1| 56 | |4.822433| 1| 57 | |9.014084| 2| 58 | |6.891563| 2| 59 | |8.379198| 2| 60 | |9.346907| 3| 61 | |9.598740| 3| 62 | |9.103880| 3| 63 | 64 | As you can see, the data are stored in this table. This is already a tidy data frame. 65 | It has two columns, "response" and "time". It appears each time point has 3 observations. 66 | 67 | If you were to just print this table, it will be a perfectly correct representation of the data. 68 | For example, if you read the numbers, you probably notice the response variable goes up as time progresses. 69 | 70 | But that is NOT the point of data visualization. 71 | The purpose of data visualization is to create a visual and intuitive representation of the data. 72 | A good data visualization allows us to gain insight into the data at a glance. 73 | That being said, what are the ways to visually represent data? 74 | 75 | ## Represent data by position 76 | The first way to represent data is by position, such that the data is represented by position along the axis. 
77 | ```{r} 78 | Example_data1 %>% 79 | ggplot(aes(x = time, y = response)) + 80 | stat_summary(geom = "line", fun.data = mean_se, 81 | size = 1.1, alpha = 0.8, color = "grey60") + 82 | geom_point(aes(fill = as.factor(time)), shape = 21, color = "grey20", 83 | alpha = 0.8, size = 3, position = position_jitter(0.1, seed = 666)) + 84 | stat_summary(geom = "errorbar", fun.data = mean_se, 85 | width = 0.05, alpha = 0.8) + 86 | scale_fill_manual(values = brewer.pal(8, "Accent")) + 87 | scale_x_continuous(breaks = c(1, 2, 3)) + 88 | labs(x = "Time point", 89 | y = "Response", 90 | title = "Dot/line graphs are\nposition based") + 91 | theme_classic() + 92 | theme( 93 | legend.position = "none", 94 | text = element_text(size = 14, color = "black"), 95 | axis.text = element_text(color = "black"), 96 | plot.title = element_text(size = 12) 97 | ) 98 | 99 | ggsave("../Results/03_line_plot.png", height = 2.5, width = 2) 100 | ``` 101 | In this case, I graphed time on x-axis, and response on y-axis. 102 | This visualization is position based. 
103 | 104 | ## Represent data by length/size 105 | ```{r} 106 | Example_data1 %>% 107 | ggplot(aes(x = time, y = response)) + 108 | stat_summary(geom = "bar", fun.data = mean_se, aes(color = as.factor(time)), 109 | size = 1.1, alpha = 0.8, fill = NA, width = 0.5) + 110 | geom_point(aes(fill = as.factor(time)), shape = 21, color = "grey20", 111 | alpha = 0.8, size = 3, position = position_jitter(0.1, seed = 666)) + 112 | stat_summary(geom = "errorbar", fun.data = mean_se, 113 | width = 0.05, alpha = 0.8) + 114 | scale_fill_manual(values = brewer.pal(8, "Accent")) + 115 | scale_color_manual(values = brewer.pal(8, "Accent")) + 116 | scale_x_continuous(breaks = c(1, 2, 3)) + 117 | labs(x = "Time point", 118 | y = "\nResponse", 119 | title = "Bar graphs are\nlength based") + 120 | theme_classic() + 121 | theme( 122 | legend.position = "none", 123 | text = element_text(size = 14, color = "black"), 124 | axis.text = element_text(color = "black"), 125 | plot.title = element_text(size = 12) 126 | ) 127 | 128 | ggsave("../Results/03_bar.png", height = 2.5, width = 2) 129 | ``` 130 | In this case, I graphed time on x-axis, and response on y-axis. 131 | This visualization is length based. Why? 132 | In bar plots, values are represented by the distance from the x axis, and thus the length of the bar. 133 | 134 | Never confuse position and length based visualizations! 135 | Can you tell what is wrong with the graph on the right? 136 | 137 | ![What is wrong?](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Position_and_length_based_visualizations.svg) 138 | 139 | ## Represent data by color 140 | Finally, data can be represented by color. 141 | We will explore this more when we talk about heatmaps and whatnot. 142 | 143 | # ggplot: data visualization using the grammar of graphics 144 | As you can see in the examples, I've been using `ggplot` to produce graphs. 145 | The "gg" in `ggplot` stands for grammar of graphics, a theoretical framework to visualize data. 
146 | In grammar of graphics, each visualization has the following layers: 147 | 148 | 1. Coordinates: what is on the x and y axes? 149 | 2. Geometric objects ("geom"): e.g., bars, points, lines... 150 | 3. Mapping: mapping colors, size, and transparency to geometric objects. 151 | 152 | For more resources, you can look at the [ggplot cheat sheet](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf). 153 | 154 | Let's build a scatter plot for our example data. 155 | Usually I start with the data table and pipe it into `ggplot()` using the pipe `%>%` syntax. 156 | Inside the `ggplot()` function, there is this `aes()` argument. 157 | Inside the `aes()` argument, it takes the x and y variables. 158 | Not intuitive, but it is what it is. 159 | 160 | ```{r} 161 | Example_data1 %>% 162 | ggplot(aes(x = time, y = response)) + 163 | theme_classic() 164 | 165 | ggsave("../Results/03_scatter_1.png", height = 2.5, width = 2) 166 | ``` 167 | If we just do that, we get a blank graph with time on x axis and response on y axis. 168 | The `theme_classic()` only changes the appearance of the background, read more [here](https://ggplot2.tidyverse.org/reference/ggtheme.html). 169 | ggplot comes with a collection of themes. My go-to is "classic". It has white background and axis lines, as you can see in this example. 170 | 171 | The graph is blank because we haven't added any geometric objects (geom) yet. 172 | For the sake of this example only, let's make a scatter plot. So we need some points. 173 | The command to add points in `ggplot` is `geom_point()`. 174 | 175 | ```{r} 176 | Example_data1 %>% 177 | ggplot(aes(x = time, y = response)) + 178 | geom_point() + 179 | theme_classic() 180 | 181 | ggsave("../Results/03_scatter_2.png", height = 2.5, width = 2) 182 | ``` 183 | This is what the graph looks like after we added points. 184 | To include more layers, you literally add (using the `+` syntax) layers to the existing code. 
185 | After the `ggplot()` line which initiates the graph, I usually add the geom layers first, then other layers, then lastly the `theme` layer. 186 | 187 | For a minimal plot, we are done! 188 | 189 | # Mapping variables to ggplot geom 190 | The example data we are using only have 2 variables: time and response. What if we have more than 2? 191 | To demonstrate that, let's use another example. 192 | 193 | I am going to read in child mortality, children/women, and income data. 194 | We will use those data for another example. For the sake of this lesson, we will just use year 1945. 195 | 196 | First, read in data. 197 | ```{r} 198 | child_mortality <- read_csv("../Data/child_mortality_0_5_year_olds_dying_per_1000_born.csv", col_types = cols()) 199 | babies_per_woman <- read_csv("../Data/children_per_woman_total_fertility.csv", col_types = cols()) 200 | income <- read_csv("../Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", col_types = cols()) 201 | ``` 202 | 203 | Second, re-shape data into tidy format. 204 | ```{r} 205 | babies_per_woman_tidy <- babies_per_woman %>% 206 | pivot_longer(names_to = "year", values_to = "birth", cols = c(2:302)) 207 | 208 | child_mortality_tidy <- child_mortality %>% 209 | pivot_longer(names_to = "year", values_to = "death_per_1000_born", cols = c(2:302)) 210 | 211 | income_tidy <- income %>% 212 | pivot_longer(names_to = "year", values_to = "income", cols = c(2:242)) 213 | ``` 214 | 215 | Third, join them together. 216 | ```{r} 217 | example2_data <- babies_per_woman_tidy %>% 218 | inner_join(child_mortality_tidy, by = c("country", "year")) %>% 219 | inner_join(income_tidy, by = c("country", "year")) 220 | 221 | head(example2_data) 222 | ``` 223 | 224 | Fourth, filter for year 1945. 
225 | ```{r} 226 | example2_data_1945 <- example2_data %>% 227 | filter(year == "1945") 228 | 229 | head(example2_data_1945) 230 | ``` 231 | 232 | 233 | You should be able to do all that relatively easily after completing [Intro_to_tidy_data](https://github.com/cxli233/Online_R_learning/blob/master/Quick_data_vis/Lessons/02_Intro_to_tidy_data.md). 234 | If not, practice more! 235 | 236 | Say we want birth rate on x axis, mortality on y axis, make a scatter plot, and color the dots by income, what should we do? 237 | Let's make a plain scatter plot first. 238 | 239 | ```{r} 240 | example2_data_1945 %>% 241 | ggplot(aes(x = birth, y = death_per_1000_born)) + 242 | geom_point() + 243 | labs(x = "No. children per woman", 244 | y = "Child mortality/1000 born", 245 | title = "Year 1945") + 246 | theme_classic() 247 | 248 | ggsave("../Results/03_scatter_3.png", width = 2.5, height = 2.5) 249 | ``` 250 | That looks fine. 251 | To color dots (or any geom) based on a variable in the data frame, you open an `aes()` argument inside the `geom()`. 252 | 253 | ```{r} 254 | example2_data_1945 %>% 255 | ggplot(aes(x = birth, y = death_per_1000_born)) + 256 | geom_point(aes(color = log10(income))) + 257 | labs(x = "No. children per woman", 258 | y = "Child mortality/1000 born", 259 | title = "Year 1945") + 260 | theme_classic() 261 | 262 | ggsave("../Results/03_scatter_4.png", width = 3, height = 2.5) 263 | ``` 264 | Note: income is quite unevenly distributed, so I log-transformed it. 265 | Now we have dots colored by income (in log10 scale). 266 | The problem is the default color scale in ggplot looks bad. 267 | We can use a different set of colors from the `RColorBrewer` package. 268 | You can look at all the available color scales in `RColorBrewer` [here](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). 269 | Let's use something easier to see than the default ggplot blue. 
270 | 271 | ```{r} 272 | example2_data_1945 %>% 273 | ggplot(aes(x = birth, y = death_per_1000_born)) + 274 | geom_point(aes(color = log10(income))) + 275 | scale_color_gradientn(colours = brewer.pal(9, "YlGnBu")) + 276 | labs(x = "No. children per woman", 277 | y = "Child mortality/1000 born", 278 | title = "Year 1945") + 279 | theme_classic() 280 | 281 | ggsave("../Results/03_scatter_5.png", width = 3, height = 2.5) 282 | ``` 283 | To provide a custom color scale that produces a color gradient, use `scale_color_gradientn()`. 284 | There is a whole family of `scale` functions in ggplot, which allow you to control the scales for colors and axes. 285 | You can read more about scales on the [ggplot cheat sheet](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf). 286 | I quite like the yellow-green-blue color gradient from `RColorBrewer`, which is what I used here. 287 | As you can see, low income countries have more children/woman and also higher child mortality. 288 | 289 | # Exercise 290 | Graph income (in log10 scale) on x axis, child mortality on y axis, and color with children/woman in year 2010. 291 | Was the trend similar to year 1945? 292 | Save the graph using `ggsave()`. 293 | 294 | 295 | 296 | -------------------------------------------------------------------------------- /Lessons/01_Intro_to_R.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This is a quick introduction to R! 4 | 5 | As a resource, here's the link to the [base R cheat sheet](https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf) 6 | 7 | # R, RStudio, packages and R Markdown 8 | ## R packages 9 | 10 | A key feature of R is that it has a lot of packages. 11 | A package is a collection of R functions that other people wrote and published. 12 | We can use pre-existing packages, so that we don't have to re-invent the wheel.
13 | A difference between R and many other coding languages is that in R you learn more about how to use other packages, and less about coding stuff yourself. 14 | This is very advantageous for those who do not have a traditional computer science background. 15 | 16 | When you need to install packages, you use the `install.packages()` command. 17 | For example, if you need to install a package called `dplyr`, 18 | which is a package that we'll use, 19 | you will type `install.packages("dplyr")` at the console. 20 | 21 | Alternatively, you can click the Packages tab at the lower right. 22 | Then you select Install. 23 | In the pop-up window, type in the package you need to install and click Install. 24 | We will talk about how to load an installed package later. 25 | 26 | ## The R Markdown (.Rmd) file 27 | 28 | R Markdown is an enhanced R script file. 29 | You will need the `rmarkdown` package. 30 | R Markdown files are more user-friendly than a common R script. 31 | 32 | In a .Rmd file, there are two kinds of text. 33 | One is plain text, which is what you are reading right now. 34 | 35 | The other is the code chunk, which R will actually run. 36 | 37 | To insert a code chunk, you can click the Code drop-down menu from the top. 38 | Then select Insert Chunk. 39 | 40 | Alternatively, you can do Ctrl+Alt+I on a PC, or Cmd+Opt+I on a Mac. 41 | When you do that, this will show up: 42 | 43 | ```{r} 44 | 45 | ``` 46 | 47 | The above is a code chunk. 48 | If you click the green triangle at the upper right, it will run the entire chunk. 49 | The advantage of having chunks is that you can run code chunk by chunk, and debug chunk by chunk. 50 | This is very user-friendly. 51 | If you want to run a particular line in the chunk, you do Ctrl+Enter on a PC or Cmd+Enter on a Mac. 52 | I think Opt+Enter also works on a Mac. 53 | To run the whole chunk, do Ctrl+Shift+Enter on PC or Cmd+Shift+Enter on a Mac.
54 | 55 | When you start a new script, the first thing you want to do is to load your packages. 56 | Say we want to load the dplyr that we just installed, we will do: 57 | 58 | ```{r} 59 | library(dplyr) 60 | ``` 61 | 62 | The command is `library()`. 63 | It's not very intuitive, but this is the command to load packages that we already installed. 64 | The dplyr package is very useful in data arrangement, which we will cover in depth in the next lesson. 65 | 66 | ## Sections in a .Rmd file 67 | 68 | You can see the section headers are in blue color. 69 | To make a new section header, you can type `#` followed by a space, and then any text at the first position of any line in the plain text area. 70 | In addition, `##` will be a sub-header, and `###` will be a sub-sub-header, and so on. 71 | 72 | You can view all the sections at the file outline. 73 | The shortcut is Ctrl+Shift+O for PC and Cmd+Shift+O for Mac. 74 | You'll see that section titles will show up at the right to the script area. 75 | If you click on any section titles, RStudio will jump to that section. 76 | 77 | # Basic syntax in R 78 | 79 | ## Single value items 80 | 81 | In R, there are 3 types of single value items: logical, numeric and character. 82 | For example, 83 | 84 | * TRUE or FALSE are logical 85 | * 1, 2, 3 ... are numeric 86 | * a, b, c ... are character 87 | 88 | This is pretty self-explanatory. 89 | 90 | ## Vectors 91 | 92 | The term vector in R is a bit different from the traditional sense of vector in math. 93 | It stands for a list of items that are the same type, 94 | i.e., a list that is all numeric, or all character, but not mixed. 95 | 96 | To create a vector, you use the `c()` command. 97 | For example, I want to create a vector of 1, 2, 3, the code will be 98 | 99 | ```{r} 100 | c(1, 2, 3) 101 | ``` 102 | 103 | Now, to save my vector (or anything) as an R item, you will use the `<-` syntax. 104 | This is called the left arrow. 
105 | Let's say we are saving this vector of c(1, 2, 3) as an item called `a` then 106 | 107 | ```{r} 108 | a <- c(1, 2, 3) 109 | ``` 110 | 111 | When you run this, something happens. 112 | If you look at the environment at the upper right, you should see 'a' shows up as a saved item. 113 | Remember to use a different name for different items. 114 | If you save them as the same name, the later item will just overwrite the previous. 115 | 116 | Say I want a vector consists of apple, banana, and orange, and save it as an item named `b`, what do I do? 117 | 118 | ```{r} 119 | b <- c("apple", "banana", "orange") 120 | b 121 | ``` 122 | 123 | Note that apple, banana and orange all have to be in quotes `" "`. 124 | The quotes `" "` specify the terms apple, banana or oranges are characters, 125 | not items in the environment. 126 | Forgetting to use quotes is a very common source of bugs. 127 | 128 | ### Subsetting vectors 129 | 130 | Say if you want to select a member of a vector by position, 131 | you will use the square bracket `[ ]` syntax. 132 | Say I want the 2nd member in vector `a`, the code will be 133 | 134 | ```{r} 135 | a[2] 136 | ``` 137 | 138 | If I want anything but the 2nd member, it will be 139 | 140 | ```{r} 141 | a[-2] 142 | ``` 143 | 144 | If I want the 1st and 2nd member, it will be 145 | 146 | ```{r} 147 | a[c(1, 2)] 148 | ``` 149 | 150 | If I want anything but the 1st and 2nd member, it will be 151 | 152 | ```{r} 153 | a[-c(1, 2)] 154 | ``` 155 | 156 | You don't have to memorize how to do anything. 157 | If you ever get confused, you can always go look at the base R cheat sheet. 158 | 159 | ## Matrix 160 | 161 | Matrix is rectangular data that is all the same type, e.g., all numeric. 162 | For example, if I want a matrix like this: 163 | 164 | | 1 | 2 | 3 | 4 | 165 | | 1 | 2 | 1 | 2 | 166 | | 2 | 3 | 2 | 4 | 167 | 168 | What do I do? 169 | An easy way to do is combining different vectors. 
170 | 171 | ```{r} 172 | my_matrix <- rbind( 173 | c(1, 2, 3, 4), 174 | c(1, 2, 1, 2), 175 | c(2, 3, 2, 4) 176 | ) 177 | 178 | my_matrix 179 | ``` 180 | 181 | The `rbind()` command takes vectors of the same length, and bind them as rows. 182 | The matrix has 3 rows, so I bind 3 vectors together. 183 | (Note: there is a `cbind()` command, which binds vectors as columns.) 184 | 185 | ### Subsetting a matrix 186 | 187 | Now let's talk about selecting members of a matrix. 188 | We will use the `[ ]` syntax again. 189 | Since a matrix has two dimensions, now the `[ ]` takes two inputs, which is [rows, columns] 190 | 191 | Say I want the 1st and 2nd rows of this matrix, what do I do? 192 | 193 | ```{r} 194 | my_matrix[c(1,2), ] 195 | ``` 196 | 197 | In the rows area of the `[ ]`, I put in `c(1,2)`, which selects the 1st and 2nd rows. 198 | I want all the columns, so I left the column area blank. 199 | 200 | Now if I want the 2nd and 4th column of this matrix, what do I do? 201 | 202 | ```{r} 203 | my_matrix[, c(2,4)] 204 | ``` 205 | 206 | Now I left the rows area blank, and specify `c(2, 4)` in the columns area. 207 | 208 | So if I want the 1st and 2nd rows and 2nd and 4th columns, what do I do? 209 | 210 | ```{r} 211 | my_matrix[c(1,2), c(2,4)] 212 | ``` 213 | 214 | Pretty straightforward. 215 | 216 | ## Data frame 217 | 218 | The last type of items I will cover today is data frame. 219 | Data frame is the technical jargon for data tables in R. 220 | We will cover data frames more in depth in the next lesson. 221 | Today we will just have a brief introduction. 222 | 223 | Say I have 3 apples, 4 banana and 2 oranges, 224 | and I want to organize this info into a table, and called this table fruits. 225 | What do I do? 226 | 227 | ```{r} 228 | fruits <- data.frame( 229 | name = c("apple", "banana", "orange"), 230 | count = c(3, 4, 2) 231 | ) 232 | 233 | fruits 234 | ``` 235 | 236 | That will do it. 237 | 238 | To make a new table, you can use the `data.frame()` command. 
239 | Inside `data.frame()`, you then type out each columns as vectors. 240 | In this example, the columns are name and count. 241 | 242 | Notice that in a data frame, all members of a column are the same type. 243 | For example, the count column is all numeric. 244 | This is different from a matrix, 245 | because in a matrix, all members are the same type across rows and columns. 246 | Also notice data frames have column names. 247 | In this case, the column names are name and count. 248 | 249 | You can pull out a particular column using its name. 250 | For example, say I want to pull out the count column, what do I do? 251 | 252 | You will use the `$` syntax. 253 | 254 | ```{r} 255 | fruits$count 256 | ``` 257 | 258 | If I want the name column, it would be `fruits$name` instead. 259 | 260 | # Operations 261 | 262 | Now let's go over some basic operations. 263 | Say I want to add 0.5 to each member of my matrix, 264 | then take the log2 of each member then take the square root, what do I do? 265 | 266 | ```{r} 267 | sqrt(log2(my_matrix + 0.5)) 268 | ``` 269 | 270 | See, nothing complicated, just an interactive calculator. 271 | You see how the parentheses of `log2()` and `sqrt()` are nested. 272 | You can imagine if you do this for many steps, there will be too many parentheses to keep track of. 273 | To avoid the too many nested parentheses problem, you can use the pipe syntax. 274 | 275 | The pipe is `%>%` 276 | The pipe belongs to the dplyr package, 277 | so if you want to use the pipe, you have to load the dplyr package first. 278 | We already did that earlier when we did `library(dplyr) `. 279 | 280 | The shortcut is Ctrl+Shift+M on a PC and Cmd+Shift+M on a Mac. 281 | I think Ctrl+Shift+M also works on a Mac. 282 | When you use the pipe, it takes the output of the previous line and feed it as the input for the next line. 
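As a minimal sketch of that idea (the vector `x` below is made up for illustration and is not part of the lesson), the nested form and the piped form give the same result:

```{r}
library(dplyr)  # provides the %>% pipe

# Hypothetical toy vector, only for illustration
x <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- sum(sqrt(x))

# Piped form: read from top to bottom
piped <- x %>%
  sqrt() %>%
  sum()

nested
piped
```

Both versions take the square root of each member first and then add the results up; the pipe only changes the order in which you read the steps.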
283 | In our example, it should look like: 284 | 285 | ```{r} 286 | (my_matrix + 0.5) %>% 287 | log2() %>% 288 | sqrt() 289 | ``` 290 | 291 | See how this is much clearer and easier to read than nested parentheses? 292 | It's a good practice to start a new line after `%>%`, which makes your code easier to read. 293 | `%>%` is a good trick. 294 | Missing parentheses are another common source of bugs, and `%>%` really reduces that. 295 | 296 | ## Add comments or skip lines in the code chunk 297 | 298 | Sometimes it's nice to annotate your code in-line as well. 299 | The way to do it is using the `#` symbol in the code chunk. 300 | Any text on the same line after a `#` sign will be ignored by R in a code chunk. 301 | For example: 302 | 303 | ```{r} 304 | (my_matrix + 0.5) %>% #add 0.5 to every value first 305 | log2() %>% #then take log2() 306 | sqrt() #lastly take square root 307 | ``` 308 | 309 | If you add the `#` sign at the very start of the line, R will skip that entire line. 310 | 311 | ```{r} 312 | c(4, 4, 4) %>% 313 | # log2() %>% 314 | # mean() %>% 315 | sqrt() 316 | ``` 317 | 318 | This is helpful when you realize you want to skip a step (or steps), 319 | but you don't want to remove the code. 320 | You can simply add a `#` sign to the start of the lines that you want to skip, and R will ignore them. 321 | The shortcut is Ctrl+Shift+C on a PC and Cmd+Shift+C on a Mac. 322 | 323 | # Exercise 324 | 325 | Today you have learned some basic syntax of R. 326 | Now it's time for you to practice. 327 | 328 | ## Q1: 329 | 330 | 1. Insert a new code chunk 331 | 2. Make this matrix 332 | 333 | | 1 | 1 | 2 | 2 | 334 | | 2 | 2 | 1 | 2 | 335 | | 2 | 3 | 3 | 4 | 336 | | 1 | 2 | 3 | 4 | 337 | 338 | and save it as an item called `my_mat2`. 339 | 340 | 3. Select the 1st and 3rd rows and the 1st, 2nd and 4th columns, and save it as an item. 341 | 342 | 4. Take the square root of each member of `my_mat2`, then take log2(), and lastly find the maximum value.
343 | Use the pipe syntax. The command for maximum is `max()`. 344 | 345 | ## Q2: 346 | 347 | 1. Use the following info to make a data frame and save it as an item called "grade". 348 | Adel got 85 on the exam, Bren got 83, and Cecil got 93. 349 | Their letter grades are B, B, and A, respectively. 350 | (Hint: How many columns do you have to have?) 351 | 352 | 2. Pull out the column with the scores. 353 | Use the `$` syntax. 354 | -------------------------------------------------------------------------------- /Lessons/03_Intro_to_data_vis.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | In this lesson, we will cover: 4 | 5 | 1. How are data represented in visualizations? 6 | 2. What is "the grammar of graphics"? 7 | 3. Introduction to ggplot: mapping variables to axis and aesthetics. 8 | 9 | # Required packages 10 | ```{r} 11 | library(tidyverse) 12 | library(RColorBrewer) 13 | ``` 14 | 15 | The `tidyverse` will be doing all the heavy lifting. 16 | The `RColorBrewer` will be providing some nice color sets for our graphs. 17 | 18 | # How are data represented in visualizations? 19 | As an example, let's simulate some data: 20 | ```{r} 21 | set.seed(666) 22 | Example_data1 <- rbind( 23 | rnorm(3, mean = 5, sd = 0.5) %>% 24 | as.data.frame() %>% 25 | mutate(time = 1), 26 | rnorm(3, mean = 8, sd = 0.5) %>% 27 | as.data.frame() %>% 28 | mutate(time = 2), 29 | rnorm(3, mean = 10, sd = 0.5) %>% 30 | as.data.frame() %>% 31 | mutate(time = 3) 32 | ) %>% 33 | rename(response = ".") 34 | 35 | Example_data1 36 | ``` 37 | |response|time| 38 | |--------|----| 39 | |5.376656| 1| 40 | |6.007177| 1| 41 | |4.822433| 1| 42 | |9.014084| 2| 43 | |6.891563| 2| 44 | |8.379198| 2| 45 | |9.346907| 3| 46 | |9.598740| 3| 47 | |9.103880| 3| 48 | 49 | As you can see, the data are stored in this table. This is already a tidy data frame. 50 | It has two columns, "response" and "time". It appears each time point has 3 observations. 
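Rather than counting rows by eye, one way to check the number of observations per time point is `count()` from dplyr. The sketch below re-types the printed table by hand (the name `Example_data1_copy` is an assumption made here so the chunk runs on its own; the values are copied from the output above):

```{r}
library(dplyr)

# Values copied from the printed table above
Example_data1_copy <- data.frame(
  response = c(5.376656, 6.007177, 4.822433,
               9.014084, 6.891563, 8.379198,
               9.346907, 9.598740, 9.103880),
  time = rep(c(1, 2, 3), each = 3)
)

# Number of rows (observations) at each time point
obs_per_time <- Example_data1_copy %>%
  count(time)

obs_per_time
```

The `n` column of the result holds the number of observations for each value of `time`.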
51 | 52 | If you were to just print this table, it will be a perfectly correct representation of the data. 53 | For example, if you read the numbers, you probably notice the response variable goes up as time progresses. 54 | 55 | But that is NOT the point of data visualization. 56 | The purpose of data visualization is to create a visual and intuitive representation of the data. 57 | A good data visualization allows us to gain insight into the data at a glance. 58 | That being said, what are the ways to visually represent data? 59 | 60 | ## Represent data by position 61 | The first way to represent data is by position, such that the data is represented by position along the axis. 62 | ```{r} 63 | Example_data1 %>% 64 | ggplot(aes(x = time, y = response)) + 65 | stat_summary(geom = "line", fun.data = mean_se, 66 | size = 1.1, alpha = 0.8, color = "grey60") + 67 | geom_point(aes(fill = as.factor(time)), shape = 21, color = "grey20", 68 | alpha = 0.8, size = 3, position = position_jitter(0.1, seed = 666)) + 69 | stat_summary(geom = "errorbar", fun.data = mean_se, 70 | width = 0.05, alpha = 0.8) + 71 | scale_fill_manual(values = brewer.pal(8, "Accent")) + 72 | scale_x_continuous(breaks = c(1, 2, 3)) + 73 | labs(x = "Time point", 74 | y = "Response", 75 | title = "Dot/line graphs are\nposition based") + 76 | theme_classic() + 77 | theme( 78 | legend.position = "none", 79 | text = element_text(size = 14, color = "black"), 80 | axis.text = element_text(color = "black"), 81 | plot.title = element_text(size = 12) 82 | ) 83 | 84 | ggsave("../Results/03_line_plot.png", height = 2.5, width = 2) 85 | ``` 86 | ![line graph](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_line_plot.png) 87 | 88 | In this case, I graphed time on x-axis, and response on y-axis. 89 | This visualization is position based. 
90 | 91 | ## Represent data by length/size 92 | ```{r} 93 | Example_data1 %>% 94 | ggplot(aes(x = time, y = response)) + 95 | stat_summary(geom = "bar", fun.data = mean_se, aes(color = as.factor(time)), 96 | size = 1.1, alpha = 0.8, fill = NA, width = 0.5) + 97 | geom_point(aes(fill = as.factor(time)), shape = 21, color = "grey20", 98 | alpha = 0.8, size = 3, position = position_jitter(0.1, seed = 666)) + 99 | stat_summary(geom = "errorbar", fun.data = mean_se, 100 | width = 0.05, alpha = 0.8) + 101 | scale_fill_manual(values = brewer.pal(8, "Accent")) + 102 | scale_color_manual(values = brewer.pal(8, "Accent")) + 103 | scale_x_continuous(breaks = c(1, 2, 3)) + 104 | labs(x = "Time point", 105 | y = "\nResponse", 106 | title = "Bar graphs are\nlength based") + 107 | theme_classic() + 108 | theme( 109 | legend.position = "none", 110 | text = element_text(size = 14, color = "black"), 111 | axis.text = element_text(color = "black"), 112 | plot.title = element_text(size = 12) 113 | ) 114 | 115 | ggsave("../Results/03_bar.png", height = 2.5, width = 2) 116 | ``` 117 | ![bar graph](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_bar.png) 118 | 119 | In this case, I graphed time on x-axis, and response on y-axis. 120 | This visualization is length based. Why? 121 | In bar plots, values are represented by the distance from the x axis, and thus the length of the bar. 122 | 123 | Never confuse position and length based visualizations! 124 | Can you tell what is wrong with the graph on the right? 125 | 126 | ![What is wrong?](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Position_and_length_based_visualizations.svg) 127 | 128 | ## Represent data by color 129 | Finally, data can be represented by color. 130 | We will explore this more when we talk about heatmaps and whatnot. 131 | 132 | # ggplot: data visualization using the grammar of graphics 133 | As you can see in the examples, I've been using `ggplot` to produce graphs. 
134 | The "gg" in `ggplot` stands for grammar of graphics, a theoretical framework to visualize data. 135 | In grammar of graphics, each visualization has the following layers: 136 | 137 | 1. Coordinates: what is on the x and y axes? 138 | 2. Geometric objects ("geom"): e.g., bars, points, lines... 139 | 3. Mapping: mapping colors, size, and transparency to geometric objects. 140 | 141 | For more resources, you can look at the [ggplot cheat sheet](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf). 142 | 143 | Let's build a scatter plot for our example data. 144 | Usually I start with the data table and pipe it into `ggplot()` using the pipe `%>%` syntax. 145 | Inside the `ggplot()` function, there is this `aes()` call (short for aesthetics). 146 | Inside `aes()`, you specify the x and y variables. 147 | Not intuitive, but it is what it is. 148 | 149 | ```{r} 150 | Example_data1 %>% 151 | ggplot(aes(x = time, y = response)) + 152 | theme_classic() 153 | 154 | ggsave("../Results/03_scatter_1.png", height = 2.5, width = 2) 155 | ``` 156 | ![step1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_1.png) 157 | 158 | If we just do that, we get a blank graph with time on x axis and response on y axis. 159 | The `theme_classic()` only changes the appearance of the background, read more [here](https://ggplot2.tidyverse.org/reference/ggtheme.html). 160 | ggplot comes with a collection of themes. My go-to is "classic". It has white background and axis lines, as you can see in this example. 161 | 162 | The graph is blank because we haven't added any geometric objects (geom) yet. 163 | For the sake of this example only, let's make a scatter plot. So we need some points. 164 | The command to add points in `ggplot` is `geom_point()`.
165 | 166 | ```{r} 167 | Example_data1 %>% 168 | ggplot(aes(x = time, y = response)) + 169 | geom_point() + 170 | theme_classic() 171 | 172 | ggsave("../Results/03_scatter_2.png", height = 2.5, width = 2) 173 | ``` 174 | ![step2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_2.png) 175 | 176 | This is how the graph looks like after we added points. 177 | To include more layers, you literally add (using the `+` syntax) layers to the existing code. 178 | After the `ggplot()` line which initiates the graph, I usually add the geom layers first, then other layers, then lastly the `theme` layer. 179 | 180 | For a minimal plot, we are done! 181 | 182 | # Mapping variables to ggplot geom 183 | The example data we are using only have 2 variables: time and response. What if we have more than 2? 184 | To demonstrate that, let's use another example. 185 | 186 | I am going to read in child mortality, children/women, and income data. 187 | We will use those data for another example. For the sake of this lesson, we will just use year 1945. 188 | 189 | First, read in data. 190 | ```{r} 191 | child_mortality <- read_csv("../Data/child_mortality_0_5_year_olds_dying_per_1000_born.csv", col_types = cols()) 192 | babies_per_woman <- read_csv("../Data/children_per_woman_total_fertility.csv", col_types = cols()) 193 | income <- read_csv("../Data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv", col_types = cols()) 194 | ``` 195 | 196 | Second, re-shape data into tidy format. 197 | ```{r} 198 | babies_per_woman_tidy <- babies_per_woman %>% 199 | pivot_longer(names_to = "year", values_to = "birth", cols = c(2:302)) 200 | 201 | child_mortality_tidy <- child_mortality %>% 202 | pivot_longer(names_to = "year", values_to = "death_per_1000_born", cols = c(2:302)) 203 | 204 | income_tidy <- income %>% 205 | pivot_longer(names_to = "year", values_to = "income", cols = c(2:242)) 206 | ``` 207 | 208 | Third, join them together. 
209 | ```{r} 210 | example2_data <- babies_per_woman_tidy %>% 211 | inner_join(child_mortality_tidy, by = c("country", "year")) %>% 212 | inner_join(income_tidy, by = c("country", "year")) 213 | 214 | head(example2_data) 215 | ``` 216 | 217 | Fourth, filter for year 1945. 218 | ```{r} 219 | example2_data_1945 <- example2_data %>% 220 | filter(year == "1945") 221 | 222 | head(example2_data_1945) 223 | ``` 224 | 225 | 226 | You should be able to do all that relatively easily after completing [Intro_to_tidy_data](https://github.com/cxli233/Quick_data_vis/blob/main/Lessons/02_Intro_to_tidy_data.md). 227 | If not, practice more! 228 | 229 | Say we want birth rate on x axis, mortality on y axis, make a scatter plot, and color the dots by income, what should we do? 230 | Let's make a plain scatter plot first. 231 | 232 | ```{r} 233 | example2_data_1945 %>% 234 | ggplot(aes(x = birth, y = death_per_1000_born)) + 235 | geom_point() + 236 | labs(x = "No. children per woman", 237 | y = "Child mortality/1000 born", 238 | title = "Year 1945") + 239 | theme_classic() 240 | 241 | ggsave("../Results/03_scatter_3.png", width = 2.5, height = 2.5) 242 | ``` 243 | ![basic scatter](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_3.png) 244 | 245 | That looks fine. 246 | To color dots (or any geom) based on a variable in the data frame, you open an `aes()` argument inside the `geom()`. 247 | 248 | ```{r} 249 | example2_data_1945 %>% 250 | ggplot(aes(x = birth, y = death_per_1000_born)) + 251 | geom_point(aes(color = log10(income))) + 252 | labs(x = "No. children per woman", 253 | y = "Child mortality/1000 born", 254 | title = "Year 1945") + 255 | theme_classic() 256 | 257 | ggsave("../Results/03_scatter_4.png", width = 3, height = 2.5) 258 | ``` 259 | ![scatter, colored](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_4.png) 260 | 261 | Note: income is quite unevenly distributed, so I log-transformed it. 
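To see what the log10 transformation does, here is a small sketch with made-up income values (hypothetical numbers, not taken from the data set). Values that span several orders of magnitude become evenly spaced on the log10 scale:

```{r}
# Hypothetical incomes spanning three orders of magnitude
toy_income <- c(100, 1000, 10000, 100000)

# On the raw scale, the gap between neighbors grows 10-fold at each step
raw_gaps <- diff(toy_income)

# On the log10 scale, each 10-fold increase is exactly one unit
log_income <- log10(toy_income)
log_gaps <- diff(log_income)

raw_gaps
log_gaps
```

With a color gradient, this means a few very high-income countries no longer use up most of the color range.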
262 | Now we have dots colored by income (in log10 scale). 263 | The problem is that the default color scale in ggplot looks bad. 264 | We can use a different set of colors from the `RColorBrewer` package. 265 | You can look at all the available color scales in `RColorBrewer` [here](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). 266 | Let's use something easier to see than the default ggplot blue. 267 | 268 | ```{r} 269 | example2_data_1945 %>% 270 | ggplot(aes(x = birth, y = death_per_1000_born)) + 271 | geom_point(aes(color = log10(income))) + 272 | scale_color_gradientn(colours = brewer.pal(9, "YlGnBu")) + 273 | labs(x = "No. children per woman", 274 | y = "Child mortality/1000 born", 275 | title = "Year 1945") + 276 | theme_classic() 277 | 278 | ggsave("../Results/03_scatter_5.png", width = 3, height = 2.5) 279 | ``` 280 | ![scatter, colored, better](https://github.com/cxli233/Quick_data_vis/blob/main/Results/03_scatter_5.png) 281 | 282 | To provide a custom color scale that produces a color gradient, use `scale_color_gradientn()`. 283 | There is a whole family of `scale` functions in ggplot, which allow you to control the scales for colors and axes. 284 | You can read more about scales on the [ggplot cheat sheet](https://statsandr.com/blog/files/ggplot2-cheatsheet.pdf). 285 | I quite like the yellow-green-blue color gradient from `RColorBrewer`, which is what I used here. 286 | As you can see, low income countries have more children/woman and also higher child mortality. 287 | 288 | # Exercise 289 | Graph income (in log10 scale) on x axis, child mortality on y axis, and color with children/woman in year 2010. 290 | Was the trend similar to year 1945? 291 | Save the graph using `ggsave()`.
292 | 293 | 294 | 295 | -------------------------------------------------------------------------------- /Scripts/01_Intro_to_R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Very basics of R coding" 3 | author: "Chenxin Li" 4 | date: "01/06/2023" 5 | output: 6 | html_notebook: 7 | number_sections: yes 8 | toc: yes 9 | toc_float: yes 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | # Introduction 17 | 18 | This is a quick introduction to R! 19 | 20 | As a resource, here's the link to the [base R cheat sheet](https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf) 21 | 22 | # R, RStudio, packages and R Markdown 23 | ## R packages 24 | 25 | A key feature of R is that it has a lot of packages. 26 | A package is a collection of R functions that other people wrote and published. 27 | We can use pre-existing packages, so that we don't have to re-invent the wheel. 28 | A difference between R and many other coding languages is that in R you learn more about how to use other packages, and less about coding stuff yourself. 29 | This is very advantageous for those who do not have a traditional computer science background. 30 | 31 | When you need to install packages, you use the `install.packages()` command. 32 | For example, if you need to install a package called `dplyr`, 33 | which is a package that we'll use, 34 | you will type `install.packages("dplyr")` at the console. 35 | 36 | Alternatively, you can click the Packages tab at the lower right. 37 | Then you select Install. 38 | In the pop-up window, type in the package you need to install and click Install. 39 | We will talk about how to load an installed package later. 40 | 41 | ## The R Markdown (.Rmd) file 42 | 43 | R Markdown is an enhanced R script file. 44 | You will need the `rmarkdown` package. 45 | R Markdown files are more user-friendly than a common R script.
46 | 47 | In a .Rmd file, there are two kinds of text. 48 | One is plain text, which is what you are reading right now. 49 | 50 | The other is code chunk, which R will actually run. 51 | 52 | To insert a code chunk, you can click the Code drop-down menu from the top. 53 | Then select Insert Chunk. 54 | 55 | Alternatively, you can do Ctrl+Alt+I on a PC, or Cmd+Opt+I on a Mac. 56 | When you do that, this will show up: 57 | 58 | ```{r} 59 | 60 | ``` 61 | 62 | The above is a code chunk. 63 | If you click the green triangle at the upper right, it will run the entire chunk. 64 | The advantage of having chunks is that you can run codes chunk-by-chunk, and de-bug chunk by chunk. 65 | This is very user-friendly. 66 | If you want to run a particular line in the chunk, you do Ctrl+Enter on a PC or Cmd+Enter on a Mac. 67 | I think Opt+Enter also works on a Mac. 68 | To run the whole chunk, do Ctrl+Shift+Enter on PC or Cmd+Shift+Enter on a Mac. 69 | 70 | When you start a new script, the first thing you want to do is to load your packages. 71 | Say we want to load the dplyr that we just installed, we will do: 72 | 73 | ```{r} 74 | library(dplyr) 75 | ``` 76 | 77 | The command is `library()`. 78 | It's not very intuitive, but this is the command to load packages that we already installed. 79 | The dplyr package is very useful in data arrangement, which we will cover in depth in the next lesson. 80 | 81 | ## Sections in a .Rmd file 82 | 83 | You can see the section headers are in blue color. 84 | To make a new section header, you can type `#` followed by a space, and then any text at the first position of any line in the plain text area. 85 | In addition, `##` will be a sub-header, and `###` will be a sub-sub-header, and so on. 86 | 87 | You can view all the sections at the file outline. 88 | The shortcut is Ctrl+Shift+O for PC and Cmd+Shift+O for Mac. 89 | You'll see that section titles will show up at the right to the script area. 
90 | If you click on any section title, RStudio will jump to that section. 91 | 92 | # Basic syntax in R 93 | 94 | ## Single value items 95 | 96 | In R, there are 3 types of single value items: logical, numeric and character. 97 | For example, 98 | 99 | * TRUE or FALSE are logical 100 | * 1, 2, 3 ... are numeric 101 | * "a", "b", "c" ... are character 102 | 103 | This is pretty self-explanatory. 104 | 105 | ## Vectors 106 | 107 | The term vector in R is a bit different from the traditional sense of vector in math. 108 | It stands for a list of items that are the same type, 109 | i.e., a list that is all numeric, or all character, but not mixed. 110 | 111 | To create a vector, you use the `c()` command. 112 | For example, if I want to create a vector of 1, 2, 3, the code will be 113 | 114 | ```{r} 115 | c(1, 2, 3) 116 | ``` 117 | 118 | Now, to save my vector (or anything) as an R item, you will use the `<-` syntax. 119 | This is called the left arrow. 120 | Let's say we are saving this vector of c(1, 2, 3) as an item called `a` then 121 | 122 | ```{r} 123 | a <- c(1, 2, 3) 124 | ``` 125 | 126 | When you run this, something happens. 127 | If you look at the environment at the upper right, you should see 'a' show up as a saved item. 128 | Remember to use a different name for different items. 129 | If you save them under the same name, the later item will just overwrite the previous. 130 | 131 | Say I want a vector consisting of apple, banana, and orange, and save it as an item named `b`, what do I do? 132 | 133 | ```{r} 134 | b <- c("apple", "banana", "orange") 135 | b 136 | ``` 137 | 138 | Note that apple, banana and orange all have to be in quotes `" "`. 139 | The quotes `" "` specify the terms apple, banana or orange are characters, 140 | not items in the environment. 141 | Forgetting to use quotes is a very common source of bugs. 
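To see why the quotes matter, here is a minimal sketch. The item `apple` and the vector names `with_quotes` and `without_quotes` are made up for this illustration and are not used elsewhere in the lesson.

```{r}
# Save a number as an item called apple (a hypothetical item for this sketch)
apple <- 5

# With quotes: "apple" is a piece of text (character)
with_quotes <- c("apple", "banana")

# Without quotes: R looks up the item called apple in the environment
without_quotes <- c(apple, 10)

class(with_quotes)
class(without_quotes)
```

The first vector is character and the second is numeric, because the unquoted `apple` was replaced by the value saved under that name. If no item called `apple` existed, R would stop with an "object not found" error.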
142 | 143 | ### Subsetting vectors 144 | 145 | Say if you want to select a member of a vector by position, 146 | you will use the square bracket `[ ]` syntax. 147 | Say I want the 2nd member in vector `a`, the code will be 148 | 149 | ```{r} 150 | a[2] 151 | ``` 152 | 153 | If I want anything but the 2nd member, it will be 154 | 155 | ```{r} 156 | a[-2] 157 | ``` 158 | 159 | If I want the 1st and 2nd member, it will be 160 | 161 | ```{r} 162 | a[c(1, 2)] 163 | ``` 164 | 165 | If I want anything but the 1st and 2nd member, it will be 166 | 167 | ```{r} 168 | a[-c(1, 2)] 169 | ``` 170 | 171 | You don't have to memorize how to do anything. 172 | If you ever get confused, you can always go look at the base R cheat sheet. 173 | 174 | ## Matrix 175 | 176 | Matrix is rectangular data that is all the same type, e.g., all numeric. 177 | For example, if I want a matrix like this: 178 | 179 | | 1 | 2 | 3 | 4 | 180 | | 1 | 2 | 1 | 2 | 181 | | 2 | 3 | 2 | 4 | 182 | 183 | What do I do? 184 | An easy way to do is combining different vectors. 185 | 186 | ```{r} 187 | my_matrix <- rbind( 188 | c(1, 2, 3, 4), 189 | c(1, 2, 1, 2), 190 | c(2, 3, 2, 4) 191 | ) 192 | 193 | my_matrix 194 | ``` 195 | 196 | The `rbind()` command takes vectors of the same length, and bind them as rows. 197 | The matrix has 3 rows, so I bind 3 vectors together. 198 | (Note: there is a `cbind()` command, which binds vectors as columns.) 199 | 200 | ### Subsetting a matrix 201 | 202 | Now let's talk about selecting members of a matrix. 203 | We will use the `[ ]` syntax again. 204 | Since a matrix has two dimensions, now the `[ ]` takes two inputs, which is [rows, columns] 205 | 206 | Say I want the 1st and 2nd rows of this matrix, what do I do? 207 | 208 | ```{r} 209 | my_matrix[c(1,2), ] 210 | ``` 211 | 212 | In the rows area of the `[ ]`, I put in `c(1,2)`, which selects the 1st and 2nd rows. 213 | I want all the columns, so I left the column area blank. 
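As a quick check of the [rows, columns] rule, the sketch below rebuilds `my_matrix` so that it runs on its own. `first_two_rows` is a name made up for this example.

```{r}
my_matrix <- rbind(
  c(1, 2, 3, 4),
  c(1, 2, 1, 2),
  c(2, 3, 2, 4)
)

# Rows 1 and 2, with the columns area left blank (all columns)
first_two_rows <- my_matrix[c(1, 2), ]

# dim() reports the number of rows, then the number of columns
dim(my_matrix)
dim(first_two_rows)
```

The subset keeps all 4 columns but only 2 of the 3 rows.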
214 | 215 | Now if I want the 2nd and 4th column of this matrix, what do I do? 216 | 217 | ```{r} 218 | my_matrix[, c(2,4)] 219 | ``` 220 | 221 | Now I left the rows area blank, and specify `c(2, 4)` in the columns area. 222 | 223 | So if I want the 1st and 2nd rows and 2nd and 4th columns, what do I do? 224 | 225 | ```{r} 226 | my_matrix[c(1,2), c(2,4)] 227 | ``` 228 | 229 | Pretty straightforward. 230 | 231 | ## Data frame 232 | 233 | The last type of items I will cover today is data frame. 234 | Data frame is the technical jargon for data tables in R. 235 | We will cover data frames more in depth in the next lesson. 236 | Today we will just have a brief introduction. 237 | 238 | Say I have 3 apples, 4 banana and 2 oranges, 239 | and I want to organize this info into a table, and called this table fruits. 240 | What do I do? 241 | 242 | ```{r} 243 | fruits <- data.frame( 244 | name = c("apple", "banana", "orange"), 245 | count = c(3, 4, 2) 246 | ) 247 | 248 | fruits 249 | ``` 250 | 251 | That will do it. 252 | 253 | To make a new table, you can use the `data.frame()` command. 254 | Inside `data.frame()`, you then type out each columns as vectors. 255 | In this example, the columns are name and count. 256 | 257 | Notice that in a data frame, all members of a column are the same type. 258 | For example, the count column is all numeric. 259 | This is different from a matrix, 260 | because in a matrix, all members are the same type across rows and columns. 261 | Also notice data frames have column names. 262 | In this case, the column names are name and count. 263 | 264 | You can pull out a particular column using its name. 265 | For example, say I want to pull out the count column, what do I do? 266 | 267 | You will use the `$` syntax. 268 | 269 | ```{r} 270 | fruits$count 271 | ``` 272 | 273 | If I want the name column, it would be `fruits$name` instead. 274 | 275 | # Operations 276 | 277 | Now let's go over some basic operations. 
278 | Say I want to add 0.5 to each member of my matrix, 279 | then take the log2 of each member then take the square root, what do I do? 280 | 281 | ```{r} 282 | sqrt(log2(my_matrix + 0.5)) 283 | ``` 284 | 285 | See, nothing complicated, just an interactive calculator. 286 | You see how the parentheses of `log2()` and `sqrt()` are nested. 287 | You can imagine if you do this for many steps, there will be too many parentheses to keep track of. 288 | To avoid the too many nested parentheses problem, you can use the pipe syntax. 289 | 290 | The pipe is `%>%`. 291 | The pipe is made available by the dplyr package (it originally comes from the `magrittr` package), 292 | so if you want to use the pipe, you have to load the dplyr package first. 293 | We already did that earlier when we did `library(dplyr)`. 294 | 295 | The shortcut is Ctrl+Shift+M on a PC and Cmd+Shift+M on a Mac. 296 | I think Ctrl+Shift+M also works on a Mac. 297 | When you use the pipe, it takes the output of the previous line and feeds it as the input for the next line. 298 | In our example, it should look like: 299 | 300 | ```{r} 301 | (my_matrix + 0.5) %>% 302 | log2() %>% 303 | sqrt() 304 | ``` 305 | 306 | See how this is much clearer and easier to read than nested parentheses? 307 | It's a good practice to start a new line after `%>%`, which makes your code easier to read. 308 | `%>%` is a good trick. 309 | Missing parentheses is another common source of bugs, and `%>%` really reduces that. 310 | 311 | ## Add comments or skip lines in the code chunk 312 | 313 | Sometimes it's nice to annotate your code in-line as well. 314 | The way to do it is using the `#` symbol in the code chunk. 315 | Any text on the same line after a `#` sign will be ignored in a code chunk by R. 316 | For example: 317 | 318 | ```{r} 319 | (my_matrix + 0.5) %>% #add 0.5 to every value first 320 | log2() %>% #then take log2() 321 | sqrt() #lastly take square root 322 | ``` 323 | 324 | If you add the `#` sign at the very start of the line, R will skip that entire line. 
325 | 326 | ```{r} 327 | c(4, 4, 4) %>% 328 | # log2() %>% 329 | # mean() %>% 330 | sqrt() 331 | ``` 332 | 333 | This is helpful when you realize you want to skip a step (or steps), 334 | but you don't want to remove the code. 335 | You can simply add a `#` sign to the start of the lines that you want to skip, and R will ignore them. 336 | The shortcut is Ctrl+Shift+C on a PC and Cmd+Shift+C on a Mac. 337 | 338 | # Exercise 339 | 340 | Today you have learned some basic syntax of R. 341 | Now it's time for you to practice. 342 | 343 | ## Q1: 344 | 345 | 1. Insert a new code chunk 346 | 2. Make this matrix 347 | 348 | | 1 | 1 | 2 | 2 | 349 | | 2 | 2 | 1 | 2 | 350 | | 2 | 3 | 3 | 4 | 351 | | 1 | 2 | 3 | 4 | 352 | 353 | and save it as an item called `my_mat2`. 354 | 355 | 3. Select the 1st and 3rd rows and the 1st, 2nd and 4th columns, and save it as an item. 356 | 357 | 4. Take the square root of each member of `my_mat2`, then take log2(), and lastly find the maximum value. 358 | Use the pipe syntax. The command for maximum is `max()`. 359 | 360 | ## Q2: 361 | 362 | 1. Use the following info to make a data frame and save it as an item called "grade". 363 | Adel got 85 on the exam, Bren got 83, and Cecil got 93. 364 | Their letter grades are B, B, and A, respectively. 365 | (Hint: How many columns do you have to have?) 366 | 367 | 2. Pull out the column with the scores. 368 | Use the `$` syntax. 
369 | -------------------------------------------------------------------------------- /Scripts/04_Intro_mean_separation.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "mean_separation" 3 | author: "Chenxin Li" 4 | date: "2023-01-07" 5 | output: 6 | html_notebook: 7 | number_sections: yes 8 | toc: yes 9 | toc_float: yes 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | # Introduction 17 | In this lesson we will explore the best practice for mean separation visualizations. 18 | Mean separation is one of the most common graph types in scientific publications. 19 | In these graphs, you have two or more groups. 20 | These groups can be experimental treatments, genotypes, conditions, and so on. 21 | Within each group, there are multiple observations. 22 | The goal of the visualization is to represent the mean (the central tendency) and spread of the data. 23 | 24 | We will explore: 25 | 26 | 1. Is your bar plot hiding data from you? 27 | 2. What are the optimal graph type given the number of observations? 28 | 3. How to handle mean separation plots in multifactorial experiments? 29 | 30 | # Packages 31 | ```{r} 32 | library(tidyverse) 33 | library(readxl) 34 | library(RColorBrewer) 35 | library(ggbeeswarm) 36 | ``` 37 | 38 | We have a new package `ggbeeswarm`. 39 | `ggbeeswarm` is useful for mean separation graphs with many overlapping dots. 40 | You will see that in a moment. 41 | 42 | # Is your bar plot hiding data from you? 43 | For a first example, let's simulate some data, then make a plain old bar plot. 44 | You can ignore this code chunk. 
45 | ```{r} 46 | set.seed(666) 47 | group1 <- rnorm(n = 100, mean = 1, sd = 1) 48 | group2 <- rlnorm(n = 100, 49 | meanlog = log(1^2/sqrt(1^2 + 1^2)), 50 | sdlog = sqrt(log(1+(1^2/1^2)))) 51 | 52 | groups_long <- cbind( 53 | group1, 54 | group2 55 | ) %>% 56 | as.data.frame() %>% 57 | pivot_longer(cols = 1:2, names_to = "group", values_to = "response") 58 | 59 | head(groups_long) 60 | ``` 61 | 62 | I simulated two groups, each has 100 observations. Now let's make a bar plot. 63 | To make a bar plot, we will use `geom_bar()`. 64 | 65 | ```{r} 66 | groups_long %>% 67 | ggplot(aes(x = group, y = response)) + 68 | geom_bar(stat = "summary", fun = mean, width = 0.5, 69 | aes(fill = group)) + 70 | stat_summary(geom = "errorbar", fun.data = mean_se, 71 | width = 0.1) + 72 | scale_fill_manual(values = brewer.pal(8, "Accent")) + 73 | theme_classic() + 74 | theme( 75 | legend.position = "none" 76 | ) 77 | 78 | ggsave("../Results/04_bar.png", height = 2.5, width = 2) 79 | ``` 80 | `stat = "summary", fun = mean` inside `geom_bar()` calculates the mean for each group, 81 | such that the bar lengths are the mean, not the individual observations. 82 | The `stat_summary()` layer adds the means +/- standard errors (SE) as error bars, 83 | which is specified by `fun.data = mean_se`. 84 | 85 | As you can see in this simple bar plot, the two groups are _very_ similar in terms of mean and SE. 86 | But is this the full story? Let's make another plot. 87 | 88 | ```{r} 89 | groups_long %>% 90 | ggplot(aes(x = group, y = response)) + 91 | geom_boxplot(aes(fill = group), width = 0.5) + 92 | scale_fill_manual(values = brewer.pal(8, "Accent")) + 93 | theme_classic() + 94 | theme( 95 | legend.position = "none" 96 | ) 97 | 98 | ggsave("../Results/04_box.png", height = 2.5, width = 2) 99 | ``` 100 | Now I've made a box plot. 101 | In a box plot, the center line is median. 102 | The box spans interquartile range (from 25th percentile to 75th percentile). 
103 | The whiskers extend up to 1.5 times the IQR beyond the box, or to the min and max, whichever is less extreme. 104 | 105 | Looking at the box plots, we can notice that group2 is more skewed (more outliers on the upper end). 106 | In this case, our simple bar plot was hiding data from us. 107 | 108 | For mean separation plots, my go-to is plotting all dots with lateral offsets based on density. 109 | What does that even mean? Let's graph it first and I will explain. 110 | ```{r} 111 | groups_long %>% 112 | ggplot(aes(x = group, y = response)) + 113 | geom_quasirandom(aes(color = group), alpha = 0.8) + 114 | scale_color_manual(values = brewer.pal(8, "Accent")) + 115 | theme_classic() + 116 | theme( 117 | legend.position = "none" 118 | ) 119 | 120 | ggsave("../Results/04_dots.png", height = 2.5, width = 2) 121 | ``` 122 | 123 | The offset dots layer is provided by `geom_quasirandom()` of the `ggbeeswarm` package. 124 | The name comes from the way the spread of dots looks like a swarm of bees. 125 | As you can see, this is not just a dot plot. 126 | The dots are offset to the left or right when there are multiple observations around the same value. 127 | The term "quasirandom" means the dots are offset more if the density is higher. 128 | The result is that the dots lay themselves out into the distribution of the data. 129 | 130 | A final note is that we have 100 observations per group. 131 | Even with `geom_quasirandom()` and offsets, there are still overlapping dots. 132 | This is called "overplotting" in data visualization. 133 | I turned on `alpha = 0.8` inside `geom_quasirandom()` to make the dots a bit transparent. 134 | `alpha = 0.8` means 80% opacity, or 20% transparency. 135 | This way, when dots do overlap, we can at least see through them. 136 | 137 | Comparing the bar plot, the box plot, and the dot plot, 138 | we can see that while the mean and SE are similar between the groups, 139 | their underlying distributions are very different. 
140 | The bar plot itself cannot convey this aspect of our data. 141 | 142 | **Take home message**: I _highly_ recommend plotting all data points in mean separation graphs. 143 | 144 | # What are the optimal graph type given the number of observations? 145 | Be very careful with data with low number of observations (n < 30), and very large number of observations (n > 100). 146 | 147 | For data with large number of observations, as we just saw in the dot plot, 148 | even with lateral offset, you might still get overlapping dots. 149 | A good practice is turn on transparency of the dots. 150 | Another solution is to make a violin plot. 151 | 152 | ```{r} 153 | groups_long %>% 154 | ggplot(aes(x = group, y = response)) + 155 | geom_quasirandom(aes(color = group), alpha = 0.8) + 156 | geom_violin(fill = NA) + 157 | geom_boxplot(width = 0.1, fill = NA) + 158 | scale_color_manual(values = brewer.pal(8, "Accent")) + 159 | theme_classic() + 160 | theme( 161 | legend.position = "none" 162 | ) 163 | 164 | ggsave("../Results/04_dots_violin.png", height = 2.5, width = 2) 165 | ``` 166 | In this case, I overlaid both violin plot and box plot onto the dots. 167 | The width of the violin plot shows the distribution, 168 | which as you can see follows the offset of the dots. 169 | A quick note: I turned on `fill = NA` in `geom_violin()` and `geom_boxplot()` so that we can see through them and see the dots under them. 170 | The default has white (non-transparent) fill. 171 | 172 | You can read more about fancy mean separation plots (such as "raincloud plot") [here](https://z3tt.github.io/Rainclouds/). 173 | 174 | 175 | Violin and box plots are great for data with many observations. 176 | However, they can be misleading in small datasets. 
177 | Here is an example: 178 | 179 | ![what's wrong here?](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Beware_of_small_n_box_violin_plot.svg) 180 | 181 | Violin plots (or any sort of smoothed distribution curves) and box plots make no sense for small n. 182 | Distributions and quartiles can vary widely with small n, 183 | even if the underlying observations are similar. 184 | Distribution and quartiles are only meaningful with large n. 185 | I did an experiment before, where I sampled the same normal distribution several times and computed the quartiles for each sample. 186 | The quartiles only stabilize when n gets larger than 50. 187 | 188 | **Take home message**: pay attention to the data set size when designing a mean separation plot! 189 | 190 | # Mean separation in multifactorial experiments. 191 | Multifactorial experiments are very common in science. 192 | To explain this, let's use a real world example. 193 | Data from [Matand et al., 2020, BMC Plant Biology](https://link.springer.com/article/10.1186/s12870-020-2243-7). 194 | 195 | ```{r} 196 | tissue_culture <- read_excel("../Data/LU-Matand - Stem Data.xlsx") 197 | head(tissue_culture) 198 | ``` 199 | 200 | In this experiment, the authors looked at the ability of stem pieces of [daylily](https://en.wikipedia.org/wiki/Daylily) to regrow into a full plant. 201 | 202 | They have 19 cultivars of the plant, 3 explant type, and 3 hormone treatment. 203 | They want to look at the best regeneration condition for each cultivar that they have. 204 | The ability to regenerate is quantified by the number of buds or shoots. 205 | 206 | Multifactorial experiments like this are _very_ common. 207 | However, we need to pay extra attention to the design of the graphs. 
208 | 209 | ```{r} 210 | tissue_culture %>% 211 | ggplot(aes(x = Variety, y = Buds_Shoots)) + 212 | geom_bar(stat = "summary", 213 | aes(fill = interaction(Explant, Treatment)), 214 | fun = mean, 215 | position = position_dodge2()) + 216 | labs(x = NULL, 217 | y = "Num buds and shoots", 218 | fill = NULL) + 219 | theme_classic() + 220 | theme(axis.text.x = element_text(angle = 45, hjust = 1)) 221 | 222 | ggsave("../Results/04_multi_bar.png", height = 3, width = 6) 223 | ``` 224 | I have cultivar on x axis, buds/shoots on y axis, 225 | and color bars by the combination of explant and treatment. 226 | This is very horrendous! It is extremely difficult to interpret what is going on. 227 | 228 | To overcome this problem, I am introducing you to your new friend, facets! 229 | Facetting is a method to layout your plot into subplots and better direct the attention of your readers. 230 | Let me show you an example: 231 | 232 | ```{r} 233 | tissue_culture %>% 234 | mutate(Treatment = factor(Treatment, 235 | levels = c("T1", "T5", "T10"))) %>% 236 | ggplot(aes(x = Treatment, y = Buds_Shoots)) + 237 | facet_wrap(~ Variety, scales = "free") + 238 | geom_quasirandom(aes(color = Explant), alpha = 0.8) + 239 | scale_colour_manual(values = brewer.pal(8, "Set2")) + 240 | labs(x = "Hormone Treatment", 241 | y = "Num buds and shoots", 242 | fill = NULL) + 243 | theme_classic() + 244 | theme(legend.position = "bottom") 245 | 246 | ggsave("../Results/04_multi_dots.png", height = 6, width = 9) 247 | ``` 248 | Now this is a lot better. 249 | Since the authors want to find the best conditions for each cultivar, we make each cultivar a subplot. 250 | This is provided by `facet_wrap()`. 251 | The value after `~` is the column name to make subplot from, which is "Variety" in the table. 252 | From this new plot, it is obvious that the split stem explant and T10 treatment gave the best results for most cultivars, with some exceptions in a cultivar-specific manner. 
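If the daylily data are not at hand, the idea of facets can be tried on a small simulated table. This is only a sketch: the table `toy`, its columns, and the numbers in it are made up for illustration and have nothing to do with the published data.

```{r}
library(ggplot2)

set.seed(1)
# A made-up table: 3 cultivars x 2 treatments x 10 observations each
toy <- data.frame(
  cultivar = rep(c("A", "B", "C"), each = 20),
  treatment = rep(rep(c("T1", "T5"), each = 10), times = 3),
  response = rnorm(n = 60, mean = rep(c(2, 5, 10), each = 20), sd = 1)
)

toy_plot <- ggplot(toy, aes(x = treatment, y = response)) +
  facet_wrap(~ cultivar) +   # one subplot per value in the cultivar column
  geom_point(alpha = 0.8) +
  theme_classic()

toy_plot
```

Each value in the `cultivar` column gets its own panel, so comparisons between treatments are made within a cultivar rather than across all of them at once.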
253 | 254 | **Take home message**: facets are very powerful and you should learn to use them! Read more about facets [here](https://ggplot2.tidyverse.org/reference/facet_grid.html). 255 | 256 | # Exercise 257 | ## Q1 258 | Notice the `scales = "free"` inside `facet_wrap()`. 259 | Make a new graph but remove the `scales = "free"` inside `facet_wrap()`. 260 | 261 | Can you explain what happened? 262 | Can you explain in which scenarios it is more appropriate to have `scales = "free"` turned on vs off? 263 | 264 | ## Q2 265 | Here is an example dataset for you to practice with: 266 | ```{r} 267 | M1 <- data.frame( 268 | conc = rnorm(n = 8, mean = 0.03, sd = 0.01) 269 | ) %>% 270 | mutate(group = "ctrl") %>% 271 | rbind( 272 | data.frame( 273 | conc = rnorm(n = 6, mean = 0.25, sd = 0.02) 274 | ) %>% 275 | mutate(group = "trt") 276 | ) %>% 277 | mutate(pest = "Pest 1") 278 | 279 | M2 <- data.frame( 280 | conc = rnorm(n = 8, mean = 6, sd = 1) 281 | ) %>% 282 | mutate(group = "ctrl") %>% 283 | rbind( 284 | data.frame( 285 | conc = rnorm(n = 6, mean = 5.5, sd = 1.1) 286 | ) %>% 287 | mutate(group = "trt") 288 | ) %>% 289 | mutate(pest = "Pest 2") 290 | 291 | M3 <- data.frame( 292 | conc = rnorm(n = 8, mean = 20, sd = 0.5) 293 | ) %>% 294 | mutate(group = "ctrl") %>% 295 | rbind( 296 | data.frame( 297 | conc = rnorm(n = 6, mean = 19.5, sd = 1.2) 298 | ) %>% 299 | mutate(group = "trt") 300 | ) %>% 301 | mutate(pest = "Pest 3") 302 | 303 | Spray <- rbind( 304 | M1, M2, M3 305 | ) 306 | 307 | head(Spray) 308 | ``` 309 | 310 | In this hypothetical experiment, I have two treatments: ctrl vs. trt. 311 | And I measured the occurrence of three different pests after spraying either treatment. 312 | Make a mean separation plot for this experiment. 313 | 314 | Hints: 315 | 316 | * Is this a multifactorial experiment? 317 | * How many observations are there in each group? 318 | 319 | Was the treatment effective in controlling any of the three pests? 320 | Save the graph using `ggsave()`. 
321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | -------------------------------------------------------------------------------- /Lessons/04_Intro_mean_separation.md: -------------------------------------------------------------------------------- 1 | 2 | # Introduction 3 | In this lesson we will explore the best practice for mean separation visualizations. 4 | Mean separation is one of the most common graph types in scientific publications. 5 | In these graphs, you have two or more groups. 6 | These groups can be experimental treatments, genotypes, conditions, and so on. 7 | Within each group, there are multiple observations. 8 | The goal of the visualization is to represent the mean (the central tendency) and spread of the data. 9 | 10 | We will explore: 11 | 12 | 1. Is your bar plot hiding data from you? 13 | 2. What are the optimal graph type given the number of observations? 14 | 3. How to handle mean separation plots in multifactorial experiments? 15 | 16 | # Packages 17 | ```{r} 18 | library(tidyverse) 19 | library(readxl) 20 | library(RColorBrewer) 21 | library(ggbeeswarm) 22 | ``` 23 | 24 | We have a new package `ggbeeswarm`. 25 | `ggbeeswarm` is useful for mean separation graphs with many overlapping dots. 26 | You will see that in a moment. 27 | 28 | # Is your bar plot hiding data from you? 29 | For a first example, let's simulate some data, then make a plain old bar plot. 30 | You can ignore this code chunk. 31 | ```{r} 32 | set.seed(666) 33 | group1 <- rnorm(n = 100, mean = 1, sd = 1) 34 | group2 <- rlnorm(n = 100, 35 | meanlog = log(1^2/sqrt(1^2 + 1^2)), 36 | sdlog = sqrt(log(1+(1^2/1^2)))) 37 | 38 | groups_long <- cbind( 39 | group1, 40 | group2 41 | ) %>% 42 | as.data.frame() %>% 43 | pivot_longer(cols = 1:2, names_to = "group", values_to = "response") 44 | 45 | head(groups_long) 46 | ``` 47 | 48 | I simulated two groups, each has 100 observations. Now let's make a bar plot. 
49 | To make a bar plot, we will use `geom_bar()`. 50 | 51 | ```{r} 52 | groups_long %>% 53 | ggplot(aes(x = group, y = response)) + 54 | geom_bar(stat = "summary", fun = mean, width = 0.5, 55 | aes(fill = group)) + 56 | stat_summary(geom = "errorbar", fun.data = mean_se, 57 | width = 0.1) + 58 | scale_fill_manual(values = brewer.pal(8, "Accent")) + 59 | theme_classic() + 60 | theme( 61 | legend.position = "none" 62 | ) 63 | 64 | ggsave("../Results/04_bar.png", height = 2.5, width = 2) 65 | ``` 66 | 67 | ![bar plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_bar.png) 68 | 69 | `stat = "summary", fun = mean` inside `geom_bar()` calculates the mean for each group, 70 | such that the bar lengths are the mean, not the individual observations. 71 | The `stat_summary()` layer adds the means +/- standard errors (SE) as error bars, 72 | which is specified by `fun.data = mean_se`. 73 | 74 | As you can see in this simple bar plot, the two groups are _very_ similar in terms of mean and SE. 75 | But is this the full story? Let's make another plot. 76 | 77 | ```{r} 78 | groups_long %>% 79 | ggplot(aes(x = group, y = response)) + 80 | geom_boxplot(aes(fill = group), width = 0.5) + 81 | scale_fill_manual(values = brewer.pal(8, "Accent")) + 82 | theme_classic() + 83 | theme( 84 | legend.position = "none" 85 | ) 86 | 87 | ggsave("../Results/04_box.png", height = 2.5, width = 2) 88 | ``` 89 | 90 | ![box plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_box.png) 91 | 92 | Now I've made a box plot. 93 | In a box plot, the center line is the median. 94 | The box spans the interquartile range (from 25th percentile to 75th percentile). 95 | The whiskers extend up to 1.5 times the IQR beyond the box, or to the min and max, whichever is less extreme. 96 | 97 | Looking at the box plots, we can notice that group2 is more skewed (more outliers on the upper end). 98 | In this case, our simple bar plot was hiding data from us. 
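To put numbers on what the box plot shows, the sketch below repeats the simulation from above using base R only, so it runs without any packages. The helper `summarise_group()` and the table `group_summary` are names made up for this sketch.

```{r}
set.seed(666)
group1 <- rnorm(n = 100, mean = 1, sd = 1)
group2 <- rlnorm(n = 100,
                 meanlog = log(1^2/sqrt(1^2 + 1^2)),
                 sdlog = sqrt(log(1 + (1^2/1^2))))

# Hypothetical helper: a few summary numbers for one group
summarise_group <- function(x) {
  c(mean = mean(x),
    se = sd(x) / sqrt(length(x)),  # standard error of the mean
    median = median(x),
    min = min(x),
    max = max(x))
}

# One row per group, one column per summary number
group_summary <- rbind(
  group1 = summarise_group(group1),
  group2 = summarise_group(group2)
)

round(group_summary, 2)
```

The mean and SE columns are what the bar plot draws. The median, min and max columns are where a skewed group can differ from a symmetric one, and the bar plot shows none of them.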
99 | 100 | For mean separation plots, my go-to is plotting all dots with lateral offsets based on density. 101 | What does that even mean? Let's graph it first and I will explain. 102 | ```{r} 103 | groups_long %>% 104 | ggplot(aes(x = group, y = response)) + 105 | geom_quasirandom(aes(color = group), alpha = 0.8) + 106 | scale_color_manual(values = brewer.pal(8, "Accent")) + 107 | theme_classic() + 108 | theme( 109 | legend.position = "none" 110 | ) 111 | 112 | ggsave("../Results/04_dots.png", height = 2.5, width = 2) 113 | ``` 114 | 115 | ![dot plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_dots.png) 116 | 117 | The offset dots layer is provided by `geom_quasirandom()` of the `ggbeeswarm` package. 118 | The name comes from the way the spread of dots looks like a swarm of bees. 119 | As you can see, this is not just a dot plot. 120 | The dots are offset to the left or right when there are multiple observations around the same value. 121 | The term "quasirandom" means the dots are offset more if the density is higher. 122 | The result is that the dots lay themselves out into the distribution of the data. 123 | 124 | A final note is that we have 100 observations per group. 125 | Even with `geom_quasirandom()` and offsets, there are still overlapping dots. 126 | This is called "overplotting" in data visualization. 127 | I turned on `alpha = 0.8` inside `geom_quasirandom()` to make the dots a bit transparent. 128 | `alpha = 0.8` means 80% opacity, or 20% transparency. 129 | This way, when dots do overlap, we can at least see through them. 130 | 131 | Comparing the bar plot, the box plot, and the dot plot, 132 | we can see that while the mean and SE are similar between the groups, 133 | their underlying distributions are very different. 134 | The bar plot itself cannot convey this aspect of our data. 135 | 136 | **Take home message**: I _highly_ recommend plotting all data points in mean separation graphs. 
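One way to keep the mean visible while still showing every observation is to overlay a summary on top of the dots. The sketch below is one option, not the only one. The simulated table `sim` and the choice of a flat crossbar to mark the mean are assumptions made for illustration.

```{r}
library(ggplot2)
library(ggbeeswarm)

set.seed(666)
# Same kind of simulated groups as above, built directly in long format
sim <- data.frame(
  group = rep(c("group1", "group2"), each = 100),
  response = c(rnorm(n = 100, mean = 1, sd = 1),
               rlnorm(n = 100, meanlog = log(1/sqrt(2)), sdlog = sqrt(log(2))))
)

dots_with_mean <- ggplot(sim, aes(x = group, y = response)) +
  geom_quasirandom(aes(color = group), alpha = 0.8) +
  # A crossbar with min = max = mean collapses to a horizontal line at the mean
  stat_summary(fun = mean, fun.min = mean, fun.max = mean,
               geom = "crossbar", width = 0.4) +
  theme_classic() +
  theme(legend.position = "none")

dots_with_mean
```

The reader then sees the central tendency and the full distribution in the same panel.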
137 | 138 | # What are the optimal graph type given the number of observations? 139 | Be very careful with data with low number of observations (n < 30), and very large number of observations (n > 100). 140 | 141 | For data with large number of observations, as we just saw in the dot plot, 142 | even with lateral offset, you might still get overlapping dots. 143 | A good practice is turn on transparency of the dots. 144 | Another solution is to make a violin plot. 145 | 146 | ```{r} 147 | groups_long %>% 148 | ggplot(aes(x = group, y = response)) + 149 | geom_quasirandom(aes(color = group), alpha = 0.8) + 150 | geom_violin(fill = NA) + 151 | geom_boxplot(width = 0.1, fill = NA) + 152 | scale_color_manual(values = brewer.pal(8, "Accent")) + 153 | theme_classic() + 154 | theme( 155 | legend.position = "none" 156 | ) 157 | 158 | ggsave("../Results/04_dots_violin.png", height = 2.5, width = 2) 159 | ``` 160 | 161 | ![dot and violin plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_dots_violin.png) 162 | 163 | In this case, I overlaid both violin plot and box plot onto the dots. 164 | The width of the violin plot shows the distribution, 165 | which as you can see follows the offset of the dots. 166 | A quick note: I turned on `fill = NA` in `geom_violin()` and `geom_boxplot()` so that we can see through them and see the dots under them. 167 | The default has white (non-transparent) fill. 168 | 169 | You can read more about fancy mean separation plots (such as "raincloud plot") [here](https://z3tt.github.io/Rainclouds/). 170 | 171 | 172 | Violin and box plots are great for data with many observations. 173 | However, they can be misleading in small datasets. 174 | Here is an example: 175 | 176 | ![what's wrong here?](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/Beware_of_small_n_box_violin_plot.svg) 177 | 178 | Violin plots (or any sort of smoothed distribution curves) and box plots make no sense for small n. 
179 | Distributions and quartiles can vary widely with small n, 180 | even if the underlying observations are similar. 181 | Distribution and quartiles are only meaningful with large n. 182 | I did an experiment before, where I sampled the same normal distribution several times and computed the quartiles for each sample. 183 | The quartiles only stabilize when n gets larger than 50. 184 | 185 | **Take home message**: pay attention to the data set size when designing a mean separation plot! 186 | 187 | # Mean separation in multifactorial experiments. 188 | Multifactorial experiments are very common in science. 189 | To explain this, let's use a real world example. 190 | Data from [Matand et al., 2020, BMC Plant Biology](https://link.springer.com/article/10.1186/s12870-020-2243-7). 191 | 192 | ```{r} 193 | tissue_culture <- read_excel("../Data/LU-Matand - Stem Data.xlsx") 194 | head(tissue_culture) 195 | ``` 196 | 197 | In this experiment, the authors looked at the ability of stem pieces of [daylily](https://en.wikipedia.org/wiki/Daylily) to regrow into a full plant. 198 | 199 | They have 19 cultivars of the plant, 3 explant type, and 3 hormone treatment. 200 | They want to look at the best regeneration condition for each cultivar that they have. 201 | The ability to regenerate is quantified by the number of buds or shoots. 202 | 203 | Multifactorial experiments like this are _very_ common. 204 | However, we need to pay extra attention to the design of the graphs. 
205 | 206 | ```{r} 207 | tissue_culture %>% 208 | ggplot(aes(x = Variety, y = Buds_Shoots)) + 209 | geom_bar(stat = "summary", 210 | aes(fill = interaction(Explant, Treatment)), 211 | fun = mean, 212 | position = position_dodge2()) + 213 | labs(x = NULL, 214 | y = "Num buds and shoots", 215 | fill = NULL) + 216 | theme_classic() + 217 | theme(axis.text.x = element_text(angle = 45, hjust = 1)) 218 | 219 | ggsave("../Results/04_multi_bar.png", height = 3, width = 6) 220 | ``` 221 | ![very bad bar plot](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_multi_bar.png) 222 | 223 | I have cultivar on the x axis, buds/shoots on the y axis, 224 | and bars colored by the combination of explant and treatment. 225 | This is horrendous! It is extremely difficult to interpret what is going on. 226 | 227 | To overcome this problem, I am introducing you to your new friend, facets! 228 | Faceting is a method to lay out your plot into subplots and better direct the attention of your readers. 229 | Let me show you an example: 230 | 231 | ```{r} 232 | tissue_culture %>% 233 | mutate(Treatment = factor(Treatment, 234 | levels = c("T1", "T5", "T10"))) %>% 235 | ggplot(aes(x = Treatment, y = Buds_Shoots)) + 236 | facet_wrap(~ Variety, scales = "free") + 237 | geom_quasirandom(aes(color = Explant), alpha = 0.8) + 238 | scale_colour_manual(values = brewer.pal(8, "Set2")) + 239 | labs(x = "Hormone Treatment", 240 | y = "Num buds and shoots", 241 | fill = NULL) + 242 | theme_classic() + 243 | theme(legend.position = "bottom") 244 | 245 | ggsave("../Results/04_multi_dots.png", height = 6, width = 9) 246 | ``` 247 | 248 | ![much better](https://github.com/cxli233/Quick_data_vis/blob/main/Results/04_multi_dots.png) 249 | 250 | Now this is a lot better. 251 | Since the authors want to find the best conditions for each cultivar, we make each cultivar a subplot. 252 | This is done by `facet_wrap()`. 
253 | The value after `~` is the name of the column to make subplots from, which is "Variety" in the table. 254 | From this new plot, it is obvious that the split stem explant and T10 treatment gave the best results for most cultivars, with some cultivar-specific exceptions. 255 | 256 | **Take home message**: facets are very powerful and you should learn to use them! Read more about facets [here](https://ggplot2.tidyverse.org/reference/facet_grid.html). 257 | 258 | # Exercise 259 | ## Q1 260 | Notice the `scales = "free"` inside `facet_wrap()`. 261 | Make a new graph but remove the `scales = "free"` inside `facet_wrap()`. 262 | 263 | Can you explain what happened? 264 | Can you explain in what scenarios it is more appropriate to have `scales = "free"` turned on vs. off? 265 | 266 | ## Q2 267 | Here is an example dataset for you to practice with: 268 | ```{r} 269 | M1 <- data.frame( 270 | conc = rnorm(n = 8, mean = 0.03, sd = 0.01) 271 | ) %>% 272 | mutate(group = "ctrl") %>% 273 | rbind( 274 | data.frame( 275 | conc = rnorm(n = 6, mean = 0.25, sd = 0.02) 276 | ) %>% 277 | mutate(group = "trt") 278 | ) %>% 279 | mutate(pest = "Pest 1") 280 | 281 | M2 <- data.frame( 282 | conc = rnorm(n = 8, mean = 6, sd = 1) 283 | ) %>% 284 | mutate(group = "ctrl") %>% 285 | rbind( 286 | data.frame( 287 | conc = rnorm(n = 6, mean = 5.5, sd = 1.1) 288 | ) %>% 289 | mutate(group = "trt") 290 | ) %>% 291 | mutate(pest = "Pest 2") 292 | 293 | M3 <- data.frame( 294 | conc = rnorm(n = 8, mean = 20, sd = 0.5) 295 | ) %>% 296 | mutate(group = "ctrl") %>% 297 | rbind( 298 | data.frame( 299 | conc = rnorm(n = 6, mean = 19.5, sd = 1.2) 300 | ) %>% 301 | mutate(group = "trt") 302 | ) %>% 303 | mutate(pest = "Pest 3") 304 | 305 | Spray <- rbind( 306 | M1, M2, M3 307 | ) 308 | 309 | head(Spray) 310 | ``` 311 | 312 | In this hypothetical experiment, I have two treatments: ctrl vs. trt. 313 | And I measured the occurrence of three different pests after spraying either treatment. 
314 | Make a mean separation plot for this experiment. 315 | 316 | Hints: 317 | 318 | * Is this a multifactorial experiment? 319 | * How many observations are there in each group? 320 | 321 | Was the treatment effective in controlling any of the three pests? 322 | Save the graph using `ggsave()`. 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | -------------------------------------------------------------------------------- /Scripts/07_Intro_elationship_data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Intro_to_network.Rmd" 3 | author: "Chenxin Li" 4 | date: "2023-01-15" 5 | output: html_notebook 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | # Introduction 13 | Network analyses are useful in many fields. 14 | They are very powerful in modeling relationship data, which have a wide range of applications. 15 | We will explore some of the applications below. 16 | Given their prevalence and usefulness, networks (and the underlying relationship data) should have a lesson of their own. 17 | 18 | In this lesson, we will explore: 19 | 20 | 1. What underlies a network? 21 | 2. How to handle the visualization of a network? 22 | 23 | # Packages 24 | ```{r} 25 | library(tidyverse) 26 | library(igraph) 27 | library(ggraph) 28 | library(readxl) 29 | library(RColorBrewer) 30 | ``` 31 | 32 | We have two new packages, `igraph` and `ggraph`. 33 | `igraph` is a network analysis package that can do a lot of heavy lifting when it comes to dealing with networks. 34 | `ggraph` is a `ggplot` extension for network visualization. 35 | 36 | # What underlies a network? 37 | ![insert network here]() 38 | 39 | A network consists of nodes and edges. 40 | Nodes are points, and edges are lines connecting nodes. 41 | Nodes are in fact observations or entities. 42 | Edges are the relationships between nodes. 
43 | When nodes and edges are combined to make a network, 44 | the network becomes an object that contains both observational data (from the nodes) and relationship data (from the edges). 45 | (In mathematics, a network is also referred to as a "graph", thus the "graph" in `igraph`. 46 | Nodes are sometimes also referred to as "vertices".) 47 | 48 | Thus, a network is fundamentally two tables: 49 | 50 | 1. an edge table - a data frame that records how nodes are connected to each other. 51 | 2. a node table - data frame that records attributes of nodes. 52 | 53 | An additional requirement for the node table is that 54 | each node in the edge table must appear in the node table once and only once. 55 | 56 | An edge table can look like this: 57 | 58 | | From | To | Additional info | 59 | |:----:|:--:|:---------------:| 60 | | Robin| Li | Started 09/2021 | 61 | 62 | * The first column of the edge table must be `from`: where the edge starts. 63 | * The second column of the edge table must be `to`: where the edge ends. 64 | * Additional information can be added as additional columns. 65 | 66 | In this case, this table has a single edge going from `Robin` to `Li`. 67 | 68 | A node table can look like this: 69 | 70 | | Node | Additional info | 71 | |:----:|:---------------:| 72 | |Robin | PI | 73 | | Li | Postdoc | 74 | 75 | In a node table, the first column must be the names of the nodes. 76 | Each row is a node. In this case, we have a table of two nodes. 77 | 78 | # How to visualize a network? 79 | ## Make a network object from edges and nodes 80 | We will use an example. Here is a hypothetical network that I made up. 81 | ```{r} 82 | example_network_edges <- read_excel("../Data/Example_network_edges.xlsx") 83 | ``` 84 | Without a given node table, 85 | you can actually produce a node table from the edge table. 86 | The node table is just a non-redundant list of all members of `from` and `to` in the edge table. 
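The requirement that each node in the edge table appears in the node table once and only once can be checked before building a network object. The following is a minimal sketch, assuming toy tables that mirror the Robin/Li example above; `toy_edges`, `toy_nodes`, `all_present`, and `no_duplicates` are hypothetical names used only for illustration.

```r
# Toy edge and node tables (hypothetical), mirroring the Robin/Li example
toy_edges <- data.frame(from = "Robin", to = "Li", info = "Started 09/2021")
toy_nodes <- data.frame(node = c("Robin", "Li"), role = c("PI", "Postdoc"))

# Non-redundant set of every node mentioned in the edge table
edge_members <- unique(c(toy_edges$from, toy_edges$to))

# Every node named in the edge table should be present in the node table ...
all_present <- all(edge_members %in% toy_nodes$node)

# ... and no node should be listed more than once in the node table
no_duplicates <- !any(duplicated(toy_nodes$node))

all_present
no_duplicates
```

If either check comes back `FALSE`, it is probably worth fixing the tables first rather than debugging the network object afterwards.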
87 | 88 | ```{r} 89 | example_network_nodes <- data.frame( 90 | nodes = unique(c(example_network_edges$From, example_network_edges$To)) 91 | ) %>% 92 | mutate(nodes = as.character(nodes)) 93 | ``` 94 | 95 | `unique(c(example_network_edges$From, example_network_edges$To))` takes the members of `from` and `to` from `example_network_edges` and takes the non-redundant set. 96 | 97 | To make a network, use the `graph_from_data_frame()` function from the `igraph` package. 98 | ```{r} 99 | my_network <- graph_from_data_frame( 100 | d = example_network_edges, 101 | vertices = example_network_nodes, 102 | directed = F 103 | ) 104 | ``` 105 | 106 | The `d` in `graph_from_data_frame()` specifies the edge table. 107 | `vertices` specifies the node table. 108 | In this example I set `directed` to `FALSE`. 109 | This means an edge from node 1 to node 2 is the same as an edge from node 2 to node 1. 110 | 111 | ## Using ggraph 112 | To make a network diagram, we will use `ggraph`. 113 | ```{r} 114 | ggraph(my_network, layout = "kk") + 115 | geom_edge_diagonal(color = "grey70") + 116 | geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") + 117 | geom_node_text(repel = T, aes(label = example_network_nodes$nodes)) + 118 | theme_void() 119 | 120 | ggsave("../Results/07_my_network1.png", height = 3, width = 3.5, bg = "white") 121 | ``` 122 | There are 3 fundamental elements to a network diagram. 123 | 124 | 1. Layout, which controls how nodes are placed in the 2-D plane. In this diagram I used the "kk" or "Kamada-Kawai" layout algorithm. More about layout algorithms is discussed below. 125 | 2. Edges. In this example, I used `geom_edge_diagonal()`, which draws curves between nodes. 126 | 3. Nodes. In this example, I used `geom_node_point()`, which draws each node as a point. 127 | 128 | ## Try different network layouts 129 | There are multiple layout algorithms for network diagrams. 130 | Layouts can drastically change the appearance of networks, making them easier or harder to interpret. 
131 | Here are 3 network diagrams from the same data using different layout algorithms. 132 | They look very different from each other. Data from: [Li et al., 2022, BioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.04.498697v1.abstract) 133 | 134 | My go-to is "kk", which generally gives good results. 135 | But I encourage you to read more about different layouts [here](https://www.data-imaginist.com/2017/ggraph-introduction-layouts/) and [here](https://r-graph-gallery.com/247-network-chart-layouts.html). 136 | 137 | ## Mapping variables to networks 138 | ![insert network here]() 139 | 140 | If you look at the network that we made earlier, all the dots are just black. 141 | What if we want to color nodes based on additional attributes? 142 | This is when additional columns of the node table become useful. 143 | The method is the same as for any other `ggplot` layer: you add an `aes()` call inside `geom_node_point()`. 144 | 145 | As a simple example, let's use Robin and Li. 146 | ```{r} 147 | small_network_edges <- data.frame( 148 | from = "Robin", 149 | to = "Li", 150 | info = "started 09/2021" 151 | ) 152 | 153 | small_network_nodes <- data.frame( 154 | node = c("Robin", "Li"), 155 | role = c("PI", "Postdoc") 156 | ) 157 | 158 | small_network <- graph_from_data_frame( 159 | d = small_network_edges, 160 | vertices = small_network_nodes, 161 | directed = T 162 | ) 163 | ``` 164 | 165 | Here I wrote the edge and node tables and made a network object from them. 166 | Now let's make a network diagram, and color nodes by `role`. 
167 | ```{r} 168 | ggraph(small_network, 169 | layout = rbind(c(1, 1), 170 | c(1, 0))) + 171 | geom_edge_link(arrow = arrow(length = unit(4, "mm")), 172 | end_cap = circle(4, "mm"), 173 | start_cap = circle(4, "mm"), 174 | aes(label = info), 175 | angle_calc = "along", 176 | vjust = -0.5) + 177 | geom_node_point(size = 3, aes(color = role)) + 178 | geom_node_text(repel = T, aes(label = small_network_nodes$node)) + 179 | theme_void() 180 | 181 | ggsave("../Results/07_small_network1.png", height = 2.5, width = 2, bg = "white") 182 | ``` 183 | 184 | Now the network diagram is colored by our roles, PI or postdoc. 185 | `geom_edge_link()` makes straight lines between nodes. 186 | There are multiple additional parameters in `geom_edge_link()` to make the arrow and edge label. 187 | You can look into what these arguments do by removing them and seeing what happens to the diagram. 188 | 189 | Note that you can provide a manual layout for the network if you want to. 190 | In this example, I provided a matrix using `layout = rbind(c(1, 1),c(1, 0))`. 191 | This specifies the node `Robin` to be at coordinate x = 1, y = 1; 192 | and node `Li` to be at x = 1, y = 0. 193 | For a small network you can get away with this. 194 | 195 | # How to deconvolute a large network into smaller ones? 196 | Networks can be very large. To better understand them, we might have to deconstruct them into smaller sub-networks. 197 | These sub-networks are often referred to as modules. 198 | There are multiple ways to achieve that. I will introduce the two most common approaches below. 199 | 200 | ## Deconvolute using neighbors of nodes of interest 201 | The first approach is to pull out neighbors of nodes of interest. 202 | Let's look at the example network we made earlier. 203 | ![insert network here]() 204 | 205 | Say we are interested in nodes 1, 14, and 23. 206 | And we want to look at their direct network neighbors. 207 | We can do that using the `neighbors()` function in the `igraph` package. 
208 | 209 | ```{r} 210 | neighbors_of_interst <- c( 211 | names(neighbors(my_network, v = "1")), # pull neighbors of node 1 212 | names(neighbors(my_network, v = "14")), # pull neighbors of node 14 213 | names(neighbors(my_network, v = "23")), # pull neighbors of node 23 214 | c(1, 14, 23) # don't forget 1, 14, 23 themselves 215 | ) %>% 216 | unique() # remove redundant nodes, if any. 217 | 218 | neighbors_of_interst 219 | ``` 220 | We can make a sub-network by filtering nodes from the node table, 221 | as well as filtering for edges that connect them. 222 | 223 | ```{r} 224 | my_subnetwork_nodes <- example_network_nodes %>% 225 | filter(nodes %in% neighbors_of_interst) 226 | 227 | my_subnetwork_edges <- example_network_edges %>% 228 | filter(From %in% neighbors_of_interst & 229 | To %in% neighbors_of_interst) 230 | 231 | my_subnetwork <- graph_from_data_frame( 232 | d = my_subnetwork_edges, 233 | vertices = my_subnetwork_nodes, 234 | directed = F 235 | ) 236 | ``` 237 | 238 | We made the sub-network object. 239 | Now we can make a new diagram. 240 | ```{r} 241 | ggraph(my_subnetwork, layout = "kk") + 242 | geom_edge_diagonal(color = "grey70") + 243 | geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") + 244 | geom_node_text(repel = T, aes(label = my_subnetwork_nodes$nodes)) + 245 | theme_void() 246 | 247 | ggsave("../Results/07_my_network2.png", height = 3, width = 3, bg = "white") 248 | ``` 249 | Now we are left with nodes 1, 14, and 23 and their direct neighbors. 250 | 251 | ## Deconvolute using module detection 252 | A very common method of network deconvolution is clustering. 253 | Clustering looks for highly interconnected nodes and assigns modules accordingly. 254 | The `igraph` package comes with clustering algorithms, such as `leiden`. 255 | You can break a network into multiple highly interconnected parts using `cluster_leiden()`. 
256 | 257 | ```{r} 258 | my_netowrk_modules <- cluster_leiden(graph = my_network, 259 | objective_function = "modularity") 260 | 261 | node_table_clustered <- data.frame( 262 | nodes = my_netowrk_modules$names, # Pull node names 263 | Module = my_netowrk_modules$membership # Pull membership 264 | ) %>% 265 | arrange(as.numeric(nodes)) %>% 266 | inner_join(example_network_nodes, by = "nodes") # add membership info to original node table 267 | 268 | head(node_table_clustered) 269 | ``` 270 | In this chunk, I ran the Leiden clustering algorithm on my network. 271 | Then I made a new data frame that contains the nodes and which module they belong to. 272 | Finally, I joined the membership information to the original node table. 273 | 274 | Now we can make a new network object with the updated node table. 275 | ```{r} 276 | my_network_clustered <- graph_from_data_frame(d = example_network_edges, 277 | vertices = node_table_clustered, 278 | directed = F) 279 | ``` 280 | 281 | ```{r} 282 | ggraph(my_network_clustered, layout = "kk") + 283 | geom_edge_diagonal(color = "grey70") + 284 | geom_node_point(size = 3, shape = 21, color = "white", 285 | aes(fill = as.factor(Module))) + 286 | geom_node_text(repel = T, aes(label = node_table_clustered$nodes)) + 287 | scale_fill_manual(values = brewer.pal(8, "Set2")) + 288 | labs(fill = "Module") + 289 | theme_void() + 290 | theme( 291 | legend.position = c(0.2, 0.25) 292 | ) 293 | 294 | ggsave("../Results/07_my_network3.png", height = 3, width = 3.5, bg = "white") 295 | ``` 296 | The key addition in this code chunk is `aes(fill = as.factor(Module))` inside `geom_node_point()`. 297 | Now the nodes are colored by which module they belong to. 298 | You can see that the clustering actually makes sense: 299 | Nodes that are highly interconnected are assigned to the same cluster. 300 | Incidentally, this network has 3 modules. 
301 | While there are multiple connections _within_ each module, 302 | there is only one edge _between_ each module. 303 | 304 | One application of network analysis is in gene expression. 305 | In gene co-expression networks, each node is a gene, 306 | and an edge is drawn between two genes if they are co-expressed (having very similar expression patterns). 307 | Gene co-expression modules represent groups of genes that may function in the same biological process. 308 | 309 | # Homework 310 | ## Q1 311 | Metabolic pathways can be modeled by networks, where each node is a metabolite and each edge is a metabolic enzyme. 312 | You can read more about how to represent metabolic pathways [here](https://github.com/cxli233/ggpathway). 313 | 314 | Here is an example: 315 | ```{r} 316 | calvin_cycle_edges <- read_excel("../Data/Calvin_cycle_edges.xlsx") 317 | calvin_cycle_nodes <- read_excel("../Data/Calvin_cycle_nodes.xlsx") 318 | 319 | head(calvin_cycle_edges) 320 | head(calvin_cycle_nodes) 321 | ``` 322 | Make a network diagram for the Calvin cycle. 323 | Color each node by how many carbons it has (the `carbon` column in the node table). 324 | Label each metabolite using `geom_node_text(aes(label = name), hjust = 0.5, repel = T)`. 325 | Save the figure using `ggsave()`. 326 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file. 327 | 328 | 329 | ## Q2 330 | [Phylogenetic trees](https://en.wikipedia.org/wiki/Phylogenetic_tree) and [dendrograms](https://en.wikipedia.org/wiki/Dendrogram) can be modeled by a network. 331 | Say we have a tree like this: 332 | 333 | ``` 334 | I 335 | | 336 | G - H 337 | | 338 | A - B - C 339 | | 340 | D - E 341 | | 342 | F 343 | ``` 344 | 345 | Make a network diagram and label each node. 346 | Save the diagram using `ggsave()`. 347 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file. 
348 | Hint: You can write the edge table in Excel and import it into R if that's easier. 349 | -------------------------------------------------------------------------------- /Lessons/07_Intro_elationship_data.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | Network analyses are useful in many fields. 3 | They are very powerful in modeling relationship data, which have a wide range of applications. 4 | We will explore some of the applications below. 5 | Given their prevalence and usefulness, networks (and the underlying relationship data) should have a lesson of their own. 6 | 7 | In this lesson, we will explore: 8 | 9 | 1. What underlies a network? 10 | 2. How to handle the visualization of a network? 11 | 12 | # Packages 13 | ```{r} 14 | library(tidyverse) 15 | library(igraph) 16 | library(ggraph) 17 | library(readxl) 18 | library(RColorBrewer) 19 | ``` 20 | 21 | We have two new packages, `igraph` and `ggraph`. 22 | `igraph` is a network analysis package that can do a lot of heavy lifting when it comes to dealing with networks. 23 | `ggraph` is a `ggplot` extension for network visualization. 24 | 25 | # What underlies a network? 26 | ![insert network here](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network1.png) 27 | 28 | A network consists of nodes and edges. 29 | Nodes are points, and edges are lines connecting nodes. 30 | Nodes are in fact observations or entities. 31 | Edges are the relationships between nodes. 32 | When nodes and edges are combined to make a network, 33 | the network becomes an object that contains both observational data (from the nodes) and relationship data (from the edges). 34 | (In mathematics, a network is also referred to as a "graph", thus the "graph" in `igraph`. 35 | Nodes are sometimes also referred to as "vertices".) 36 | 37 | Thus, a network is fundamentally two tables: 38 | 39 | 1. 
an edge table - a data frame that records how nodes are connected to each other. 40 | 2. a node table - data frame that records attributes of nodes. 41 | 42 | An additional requirement for the node table is that 43 | each node in the edge table must appear in the node table once and only once. 44 | 45 | An edge table can look like this: 46 | 47 | | From | To | Additional info | 48 | |:----:|:--:|:---------------:| 49 | | Robin| Li | Started 09/2021 | 50 | 51 | * The first column of the edge table must be `from`: where the edge starts. 52 | * The second column of the edge table must be `to`: where the edge ends. 53 | * Additional information can be added as additional columns. 54 | 55 | In this case, this table has a single edge going from `Robin` to `Li`. 56 | 57 | A node table can look like this: 58 | 59 | | Node | Additional info | 60 | |:----:|:---------------:| 61 | |Robin | PI | 62 | | Li | Postdoc | 63 | 64 | In a node table, the first column must be the names of the nodes. 65 | Each row is a node. In this case, we have a table of two nodes. 66 | 67 | # How to visualize a network? 68 | ## Make a network object from edges and nodes 69 | We will use an example. Here is a hypothetical network that I made up. 70 | ```{r} 71 | example_network_edges <- read_excel("../Data/Example_network_edges.xlsx") 72 | ``` 73 | Without a given node table, 74 | you can actually produce a node table from the edge table. 75 | The node table is just a non-redundant list of all members of `from` and `to` in the edge table. 76 | 77 | ```{r} 78 | example_network_nodes <- data.frame( 79 | nodes = unique(c(example_network_edges$From, example_network_edges$To)) 80 | ) %>% 81 | mutate(nodes = as.character(nodes)) 82 | ``` 83 | 84 | `unique(c(example_network_edges$From, example_network_edges$To))` takes the members of `from` and `to` from `example_network_edges` and takes the non-redundant set. 85 | 86 | To make a network, use the `graph_from_data_frame()` function from the `igraph` package. 
87 | ```{r} 88 | my_network <- graph_from_data_frame( 89 | d = example_network_edges, 90 | vertices = example_network_nodes, 91 | directed = F 92 | ) 93 | ``` 94 | 95 | The `d` in `graph_from_data_frame()` specifies the edge table. 96 | `vertices` specifies the node table. 97 | In this example I set `directed` to `FALSE`. 98 | This means an edge from node 1 to node 2 is the same as an edge from node 2 to node 1. 99 | 100 | ## Using ggraph 101 | To make a network diagram, we will use `ggraph`. 102 | ```{r} 103 | ggraph(my_network, layout = "kk") + 104 | geom_edge_diagonal(color = "grey70") + 105 | geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") + 106 | geom_node_text(repel = T, aes(label = example_network_nodes$nodes)) + 107 | theme_void() 108 | 109 | ggsave("../Results/07_my_network1.png", height = 3, width = 3.5, bg = "white") 110 | ``` 111 | ![network 1](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network1.png) 112 | 113 | There are 3 fundamental elements to a network diagram. 114 | 115 | 1. Layout, which controls how nodes are placed in the 2-D plane. In this diagram I used the "kk" or "Kamada-Kawai" layout algorithm. More about layout algorithms is discussed below. 116 | 2. Edges. In this example, I used `geom_edge_diagonal()`, which draws curves between nodes. 117 | 3. Nodes. In this example, I used `geom_node_point()`, which draws each node as a point. 118 | 119 | ## Try different network layouts 120 | There are multiple layout algorithms for network diagrams. 121 | Layouts can drastically change the appearance of networks, making them easier or harder to interpret. 122 | Here are 3 network diagrams from the same data using different layout algorithms. 123 | ![Try different layouts](https://github.com/cxli233/FriendsDontLetFriends/blob/main/Results/TryDifferentLayouts.svg) 124 | 125 | They look very different from each other. 
Data from: [Li et al., 2022, BioRxiv](https://www.biorxiv.org/content/10.1101/2022.07.04.498697v1.abstract) 126 | 127 | My go-to is "kk", which generally gives good results. 128 | But I encourage you to read more about different layouts [here](https://www.data-imaginist.com/2017/ggraph-introduction-layouts/) and [here](https://r-graph-gallery.com/247-network-chart-layouts.html). 129 | 130 | ## Mapping variables to networks 131 | ![insert network here](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network1.png) 132 | 133 | If you look at the network that we made earlier, all the dots are just black. 134 | What if we want to color nodes based on additional attributes? 135 | This is when additional columns of the node table become useful. 136 | The method is the same as for any other `ggplot` layer: you add an `aes()` call inside `geom_node_point()`. 137 | 138 | As a simple example, let's use Robin and Li. 139 | ```{r} 140 | small_network_edges <- data.frame( 141 | from = "Robin", 142 | to = "Li", 143 | info = "started 09/2021" 144 | ) 145 | 146 | small_network_nodes <- data.frame( 147 | node = c("Robin", "Li"), 148 | role = c("PI", "Postdoc") 149 | ) 150 | 151 | small_network <- graph_from_data_frame( 152 | d = small_network_edges, 153 | vertices = small_network_nodes, 154 | directed = T 155 | ) 156 | ``` 157 | 158 | Here I wrote the edge and node tables and made a network object from them. 159 | Now let's make a network diagram, and color nodes by `role`. 
160 | ```{r} 161 | ggraph(small_network, 162 | layout = rbind(c(1, 1), 163 | c(1, 0))) + 164 | geom_edge_link(arrow = arrow(length = unit(4, "mm")), 165 | end_cap = circle(4, "mm"), 166 | start_cap = circle(4, "mm"), 167 | aes(label = info), 168 | angle_calc = "along", 169 | vjust = -0.5) + 170 | geom_node_point(size = 3, aes(color = role)) + 171 | geom_node_text(repel = T, aes(label = small_network_nodes$node)) + 172 | theme_void() 173 | 174 | ggsave("../Results/07_small_network1.png", height = 2.5, width = 2, bg = "white") 175 | ``` 176 | ![small_network](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_small_network1.png) 177 | 178 | Now the network diagram is colored by our roles, PI or postdoc. 179 | `geom_edge_link()` makes straight lines between nodes. 180 | There are multiple additional parameters in `geom_edge_link()` to make the arrow and edge label. 181 | You can look into what these arguments do by removing them and seeing what happens to the diagram. 182 | 183 | Note that you can provide a manual layout for the network if you want to. 184 | In this example, I provided a matrix using `layout = rbind(c(1, 1),c(1, 0))`. 185 | This specifies the node `Robin` to be at coordinate x = 1, y = 1; 186 | and node `Li` to be at x = 1, y = 0. 187 | For a small network you can get away with this. 188 | 189 | # How to deconvolute a large network into smaller ones? 190 | Networks can be very large. To better understand them, we might have to deconstruct them into smaller sub-networks. 191 | These sub-networks are often referred to as modules. 192 | There are multiple ways to achieve that. I will introduce the two most common approaches below. 193 | 194 | ## Deconvolute using neighbors of nodes of interest 195 | The first approach is to pull out neighbors of nodes of interest. 196 | Let's look at the example network we made earlier. 
197 | ![insert network here](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network1.png) 198 | 199 | Say we are interested in nodes 1, 14, and 23. 200 | And we want to look at what are their direct network neighbors. 201 | We can do that using the `neighbors()` function in the `igraph` package. 202 | 203 | ```{r} 204 | neighbors_of_interst <- c( 205 | names(neighbors(my_network, v = "1")), # pull neighbors of node 1 206 | names(neighbors(my_network, v = "14")), # pull neighbors of node 14 207 | names(neighbors(my_network, v = "23")), # pull neighbors of node 23 208 | c(1, 14, 23) # don't forget 1, 14, 23 themselves 209 | ) %>% 210 | unique() # remove redundant nodes, if any. 211 | 212 | neighbors_of_interst 213 | ``` 214 | We can make a sub-network by filtering nodes from the node table, 215 | as well as filtering for edges that connect them. 216 | 217 | ```{r} 218 | my_subnetwork_nodes <- example_network_nodes %>% 219 | filter(nodes %in% neighbors_of_interst) 220 | 221 | my_subnetwork_edges <- example_network_edges %>% 222 | filter(From %in% neighbors_of_interst & 223 | To %in% neighbors_of_interst) 224 | 225 | my_subnetwork <- graph_from_data_frame( 226 | d = my_subnetwork_edges, 227 | vertices = my_subnetwork_nodes, 228 | directed = F 229 | ) 230 | ``` 231 | 232 | We made the sub-network object. 233 | Now we can make a new diagram. 234 | ```{r} 235 | ggraph(my_subnetwork, layout = "kk") + 236 | geom_edge_diagonal(color = "grey70") + 237 | geom_node_point(size = 3, shape = 21, color = "white", fill = "grey20") + 238 | geom_node_text(repel = T, aes(label = my_subnetwork_nodes$nodes)) + 239 | theme_void() 240 | 241 | ggsave("../Results/07_my_network2.png", height = 3, width = 3, bg = "white") 242 | ``` 243 | ![network2](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network2.png) 244 | 245 | Now we are left with nodes 1, 14, and 23 and their direct neighbors. 
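As an alternative to filtering the edge and node tables by hand, a neighbor-based sub-network can probably be pulled directly from the network object. The sketch below is a minimal example under assumptions: it uses a made-up toy network rather than the lesson data (`toy_edges`, `toy_network`, `nodes_of_interest`, and `toy_subnetwork` are hypothetical names), together with the `ego()`, `as_ids()`, and `induced_subgraph()` functions from `igraph`.

```r
library(igraph)

# Toy undirected network (hypothetical), not the lesson's example network
toy_edges <- data.frame(
  from = c("1", "1", "2", "3", "4", "5"),
  to   = c("2", "3", "3", "4", "5", "6")
)
toy_network <- graph_from_data_frame(toy_edges, directed = FALSE)

nodes_of_interest <- c("1", "4")

# ego() returns, for each node of interest, the node itself
# plus all neighbors within `order` steps (here: direct neighbors)
ego_sets <- ego(toy_network, order = 1, nodes = nodes_of_interest)

# Collapse the per-node sets into one non-redundant vector of node names
keep <- unique(unlist(lapply(ego_sets, as_ids)))

# Keep only those nodes and the edges that connect them
toy_subnetwork <- induced_subgraph(toy_network, vids = keep)

sort(V(toy_subnetwork)$name)
```

In this toy case, node 6 is dropped because it is not a direct neighbor of node 1 or node 4. The table-filtering approach shown above has the advantage of keeping the node and edge tables around for later joins, so which route is better likely depends on what you plan to do next.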
246 | 247 | ## Deconvolute using module detection 248 | A very common method of network deconvolution is clustering. 249 | Clustering looks for highly interconnected nodes and assigns modules accordingly. 250 | The `igraph` package comes with clustering algorithms, such as `leiden`. 251 | You can break a network into multiple highly interconnected parts using `cluster_leiden()`. 252 | 253 | ```{r} 254 | my_netowrk_modules <- cluster_leiden(graph = my_network, 255 | objective_function = "modularity") 256 | 257 | node_table_clustered <- data.frame( 258 | nodes = my_netowrk_modules$names, # Pull node names 259 | Module = my_netowrk_modules$membership # Pull membership 260 | ) %>% 261 | arrange(as.numeric(nodes)) %>% 262 | inner_join(example_network_nodes, by = "nodes") # add membership info to original node table 263 | 264 | head(node_table_clustered) 265 | ``` 266 | What I did in this chunk is I ran the leiden clustering algorithm on my network. 267 | Then I made a new data frame that contains the nodes and which module they belong to. 268 | Finally, I joined the membership information to the original node table. 269 | 270 | Now we can make a new network object with the updated node table. 
271 | ```{r} 272 | my_network_clustered <- graph_from_data_frame(d = example_network_edges, 273 | vertices = node_table_clustered, 274 | directed = F) 275 | ``` 276 | 277 | ```{r} 278 | ggraph(my_network_clustered, layout = "kk") + 279 | geom_edge_diagonal(color = "grey70") + 280 | geom_node_point(size = 3, shape = 21, color = "white", 281 | aes(fill = as.factor(Module))) + 282 | geom_node_text(repel = T, aes(label = node_table_clustered$nodes)) + 283 | scale_fill_manual(values = brewer.pal(8, "Set2")) + 284 | labs(fill = "Module") + 285 | theme_void() + 286 | theme( 287 | legend.position = c(0.2, 0.25) 288 | ) 289 | 290 | ggsave("../Results/07_my_network3.png", height = 3, width = 3.5, bg = "white") 291 | ``` 292 | ![network3](https://github.com/cxli233/Quick_data_vis/blob/main/Results/07_my_network3.png) 293 | 294 | The key addition in this code chunk is `aes(fill = as.factor(Module))` inside `geom_node_point()`. 295 | Now the nodes are colored by which module they belong to. 296 | You can see that the clustering actually makes sense: 297 | Nodes that are highly interconnected are assigned to the same cluster. 298 | Incidentally, this network has 3 modules. 299 | While there are multiple connections _within_ each module, 300 | there is only one edge _between_ each module. 301 | 302 | One application of network analysis is in gene expression. 303 | In gene co-expression networks, each node is a gene, 304 | and an edge is drawn between two genes if they are co-expressed (having very similar expression patterns). 305 | Gene co-expression modules represent groups of genes that may function in the same biological process. 306 | 307 | # Homework 308 | ## Q1 309 | Metabolic pathways can be modeled by networks, where each node is a metabolite and each edge is a metabolic enzyme. 310 | You can read more about how to represent metabolic pathways [here](https://github.com/cxli233/ggpathway). 
311 | 312 | Here is an example: 313 | ```{r} 314 | calvin_cycle_edges <- read_excel("../Data/Calvin_cycle_edges.xlsx") 315 | calvin_cycle_nodes <- read_excel("../Data/Calvin_cycle_nodes.xlsx") 316 | 317 | head(calvin_cycle_edges) 318 | head(calvin_cycle_nodes) 319 | ``` 320 | Make a network diagram for the Calvin cycle. 321 | Color each node by how many carbons it has (the `carbon` column in the node table). 322 | Label each metabolite using `geom_node_text(aes(label = name), hjust = 0.5, repel = T)`. 323 | Save the figure using `ggsave()`. 324 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file. 325 | 326 | 327 | ## Q2 328 | [Phylogenetic trees](https://en.wikipedia.org/wiki/Phylogenetic_tree) and [dendrograms](https://en.wikipedia.org/wiki/Dendrogram) can be modeled by a network. 329 | Say we have a tree like this: 330 | 331 | ``` 332 | I 333 | | 334 | G - H 335 | | 336 | A - B - C 337 | | 338 | D - E 339 | | 340 | F 341 | ``` 342 | 343 | Make a network diagram and label each node. 344 | Save the diagram using `ggsave()`. 345 | Use `bg = "white"` inside `ggsave()` to add a white background when exporting the png file. 346 | Hint: You can write the edge table in Excel and import it into R if that's easier. 347 | --------------------------------------------------------------------------------