├── .gitignore
├── Homework 1
├── Homework1.md
├── ikea_data.csv
└── img
│ ├── ikea-logo.png
│ ├── task4.png
│ ├── task5.png
│ ├── task6.png
│ └── task7.png
├── Homework 2
├── Homework2.md
├── Viet Nam-1975.csv
├── Viet Nam-1980.csv
├── Viet Nam-1985.csv
├── Viet Nam-1990.csv
├── Viet Nam-1995.csv
├── Viet Nam-2000.csv
├── Viet Nam-2005.csv
├── Viet Nam-2010.csv
├── Viet Nam-2015.csv
├── Viet Nam-2020.csv
├── Viet Nam-2025.csv
├── college_data_normalized.csv
└── img
│ ├── banner.png
│ ├── task1.png
│ ├── task2.png
│ ├── task3.png
│ ├── task4.png
│ ├── task5.png
│ └── task6.png
├── Homework 3
├── Homework3.md
├── data
│ ├── dog_travel.csv
│ ├── median-housing 1.csv
│ └── scorecard-models.RData
└── img
│ ├── banner.png
│ ├── rcis.png
│ ├── task2.png
│ ├── task2b.png
│ └── task2c.png
├── LICENSE.md
├── README.md
├── Week 1
├── Week1-AE-Notes.md
├── Week1-AE-RMarkdown.Rmd
├── Week1-AE-RMarkdown.html
├── Week1-AE-RMarkdown.pdf
└── img
│ ├── Minard_Update.png
│ ├── Untitled 1.png
│ ├── Untitled 2.png
│ ├── Untitled 3.png
│ ├── Untitled 4.png
│ ├── Untitled 5.png
│ ├── Untitled 6.png
│ ├── Untitled 7.png
│ ├── Untitled 8.png
│ ├── Untitled 9.png
│ ├── Untitled.png
│ └── olympic_feathers_print.jpeg
├── Week 10
├── Week10-AE-Notes.md
├── Week10-AE-RMarkdown-Key.Rmd
├── Week10-AE-RMarkdown.Rmd
├── img
│ ├── banner.png
│ ├── choropleth.png
│ ├── geomsf.png
│ ├── shame.png
│ ├── task1.png
│ ├── task2.png
│ ├── task3.png
│ ├── task4-optional.png
│ ├── task4.png
│ └── task5.png
├── ny-counties.rds
├── ny-inc.rds
└── nyc-311.csv
├── Week 11
├── Week11-AE-Notes.md
├── Week11-AE-RMarkdown.Rmd
├── ad_aqi_tracker_data-2023.csv
├── freedom.csv
└── img
│ ├── syracuse.png
│ ├── task0.png
│ ├── task1.png
│ ├── task2.png
│ ├── task3.png
│ ├── task4.png
│ ├── task5.png
│ ├── task6-1.gif
│ └── task6-2.gif
├── Week 12
├── Week12-AE-RMarkdown.Rmd
└── Week12-Lecture-Examples.Rmd
├── Week 2
├── Week2-AE-Notes.md
├── Week2-AE-RMarkdown.Rmd
├── Week2-AE-RMarkdown.html
├── Week2-AE-RMarkdown.pdf
├── homesales.csv
└── img
│ ├── banner.png
│ ├── dataink.png
│ ├── dataink2.jpg
│ ├── housedata1.png
│ ├── housedata10.png
│ ├── housedata11.png
│ ├── housedata2.png
│ ├── housedata3.png
│ ├── housedata4.png
│ ├── housedata5.png
│ ├── housedata6.png
│ ├── housedata7.png
│ ├── housedata8.png
│ ├── housedata9.png
│ └── piechart.png
├── Week 4
├── Week4-AE-Notes.md
├── img
│ ├── Truncated-Y-Axis-Data-Visualizations-Designed-To-Mislead.jpg
│ ├── banner.png
│ ├── task1.jpg
│ ├── task2.jpg
│ ├── task3.jpg
│ ├── task4.jpg
│ ├── task5.jpg
│ ├── task6.jpg
│ ├── waffle1.jpg
│ ├── waffle2.png
│ ├── waffle3.jpg
│ ├── waffle4.jpg
│ └── waffle5.jpg
└── worldometer_data.csv
├── Week 5
├── Week5-AE-Notes.md
├── college_data_normalized.csv
├── demo.Rmd
└── img
│ ├── ae1.png
│ ├── ae2.png
│ ├── ae3.png
│ ├── ae4.png
│ ├── banner.png
│ └── shame.jpg
├── Week 6
├── Week6-AE-Notes.md
├── Week6-AE-RMarkdown.Rmd
├── fjc-judges.RData
├── img
│ ├── banner.jpg
│ ├── shame.jpg
│ └── staff-employment.png
└── instructional-staff.csv
├── Week 7
├── Week7-AE-Notes.md
├── Week7-AE-RMarkdown.Rmd
└── img
│ ├── banner.gif
│ ├── fame.jpg
│ ├── notes1.jpg
│ ├── task1.jpg
│ ├── task2.jpg
│ ├── task3.png
│ ├── task4.png
│ ├── task5.png
│ ├── task6.png
│ ├── task7.png
│ └── task8.png
└── Week 9
├── Week9-AE-Notes.md
├── Week9-AE-RMarkdown.Rmd
├── img
├── avoid1.jpg
├── avoid2.jpg
├── banner.png
├── colorblind-game.png
├── contrast.png
├── dont.jpg
├── task0.png
├── task1.png
├── task2.png
└── task3.png
└── nurses.csv
/.gitignore:
--------------------------------------------------------------------------------
1 | demo.Rmd
2 | demo.Rmd
3 | *Solution.Rmd
4 | Homework 2/.RData
5 | Homework 2/.Rhistory
6 | *demo.Rmd
7 |
--------------------------------------------------------------------------------
/Homework 1/Homework1.md:
--------------------------------------------------------------------------------
1 | ![](img/ikea-logo.png)
2 |
3 | # COMP4010/5120 Data Visualization - Homework 1 Question Set
4 |
5 | This assignment dives into the world of data manipulation and visualization using the `tidyverse` package in R. You'll work with [an IKEA furniture dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-11-03/), exploring various techniques to clean, transform, and visually represent the data.
6 |
7 | Throughout the assignment, you'll encounter ten tasks, each requiring you to write R code to achieve specific data manipulation and visualization goals.
8 |
9 | Feel free to experiment, explore different approaches, and consult online resources if needed. Remember to utilize clear and concise code with meaningful variable names and comments to enhance readability and understanding.
10 |
11 | Dataset: The dataset, [`ikea_data.csv`](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-11-03/), contains information about various IKEA furniture items, including their names, categories, prices, dimensions, and designers.
12 |
13 | ### Data dictionary
14 |
15 | |variable |class |description |
16 | |:-----------------|:---------|:-----------|
17 | |item_id |double | Item ID (Irrelevant to us) |
18 | |name |character | Commercial name of items |
19 | |category |character | The furniture category |
20 | |price |double | Current price in Saudi Riyals |
21 | |old_price |character | Price of item in Saudi Riyals before discount |
22 | |sellable_online |logical | Sellable online (boolean) |
23 | |link |character | Link to the item |
24 | |other_colors |character | Whether other colors are available for the item (boolean) |
25 | |short_description |character | Description of the item |
26 | |designer |character | Designer who designed the item |
27 | |depth |double | Depth of the item in Centimeter |
28 | |height |double | Height of the item in Centimeter |
29 | |width |double | Width of the item in Centimeter|
30 |
31 | # Submission requirements
32 | Similar to the weekly AEs, you should submit your work to Canvas in two formats: `Rmd` and `pdf`.
33 | Answers for non-coding questions should be added in the file as plain text.
34 |
35 |
36 | ## Task 1: Converting prices to USD
37 |
38 | Convert the price column from Saudi Riyals (SAR) to USD, considering an exchange rate of 1 SAR = 0.27 USD. Create a new column, `price_usd`, containing the prices in USD.
39 |
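If you are unsure where to start, here is a minimal sketch using `dplyr::mutate()`; the data frame name `ikea` is just an assumption for illustration:

```R
library(tidyverse)

# read the dataset (adjust the path if needed)
ikea <- read_csv("ikea_data.csv")

# add a column with prices converted from SAR to USD (1 SAR = 0.27 USD)
ikea <- ikea |>
  mutate(price_usd = price * 0.27)
```
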
40 | # Task 2: Splitting multiple designers into separate rows
41 |
42 | The designer column might contain multiple designers separated by "`/`". Split these entries into separate rows, creating a new data frame with one designer per row.
43 |
44 | For example, consider the following example dataframe `df`:
45 | | item_id |designer |price |
46 | |:-----------------|:---------|:-----------|
47 | |1| Designer A | 1 |
48 | |2| Designer B/Designer C/Designer D | 15 |
49 |
50 | Item 2 has 3 designers, so we need to create a new dataframe `df_split` by splitting the designer column:
51 | |item_id |designer |price |
52 | |:-----------------|:---------|:-----------|
53 | |1| Designer A | 1 |
54 | |2| Designer B | 15 |
55 | |2| Designer C | 15 |
56 | |2| Designer D | 15 |
57 |
58 |
59 | Hint: There are multiple ways to achieve this, but you can check out the [documentation for `separate_rows()`](https://tidyr.tidyverse.org/reference/separate_rows.html) if you need help.
60 |
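For reference, a minimal sketch of this approach (shown here on the toy `df` above) might look like the following; applying it to the real data works the same way:

```R
library(tidyverse)

# one row per designer: "B/C/D" becomes three rows sharing the other columns
df_split <- df |>
  separate_rows(designer, sep = "/")
```
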
61 | # Task 3: Get top 20 designers by number of items. Exclude NA and IKEA of Sweden.
62 |
63 | Based on `df_split` in Task 2, find the top 20 designers with the most products (excluding "`IKEA of Sweden`" and entries with missing values in the `designer` column) and store it in a new dataframe called `top_designers`. The dataframe should have 2 columns showing the designer name and the corresponding number of items made by that designer:
64 |
65 | | designer | num_items|
66 | |:-----------------|:---------|
67 | | Designer Z | 74 |
68 | | Designer X | 107 |
69 | | Designer Y | 67 |
70 | | ... | ... |
71 |
72 | Hint: You may find the following functions useful:
73 | - [`filter()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter)
74 | - [`is.na()`](https://www.statology.org/is-na/)
75 | - [`group_by()`](https://datacarpentry.org/R-genomics/04-dplyr.html#split-apply-combine_data_analysis_and_the_summarize()_function)
76 | - [`summarize()`](https://datacarpentry.org/R-genomics/04-dplyr.html#split-apply-combine_data_analysis_and_the_summarize()_function)
77 | - [`n()`](https://www.statology.org/n-function-in-r/)
78 | - [`top_n()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/top_n)
79 |
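As a rough sketch of how the hinted functions might fit together (assuming `df_split` from Task 2), though not the only possible approach:

```R
top_designers <- df_split |>
  filter(!is.na(designer), designer != "IKEA of Sweden") |>  # drop NA and IKEA of Sweden
  group_by(designer) |>
  summarize(num_items = n()) |>   # one row per designer with its item count
  top_n(20, num_items) |>         # note: top_n() keeps ties, so you may get a few extra rows
  arrange(desc(num_items))
```
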
80 | In your RMarkdown, answer the following question(s):
81 | - Who are the top 3 designers by number of items?
82 | - What is the number of items designed by IKEA of Sweden or by unknown designers? Why should we exclude them from the analysis?
83 |
84 | # Task 4. Visualizing price distribution per designer with box plot
85 | Create a boxplot to visualize the distribution of item prices (USD) for each designer identified in `top_designers`. Your plot should place designers on the x-axis and prices on the y-axis.
86 | Add appropriate axis labels and a plot title.
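
A minimal sketch of the plotting step, assuming `df_split` from Task 2 still carries the `price_usd` column from Task 1:

```R
# keep only items by the top 20 designers, then draw one box per designer
df_split |>
  semi_join(top_designers, by = "designer") |>
  ggplot(aes(x = designer, y = price_usd)) +
  geom_boxplot() +
  labs(title = "Price distribution per designer", x = "Designer", y = "Price (USD)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # keep long names readable
```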
87 |
88 | An example plot (yours may look different):
89 |
90 | ![](img/task4.png)
91 |
92 | In your RMarkdown, answer the following question(s):
93 | - In 3-4 sentences, briefly describe the key findings from this plot. (Open ended question - write down anything useful you can see from this plot)
94 | - In your opinion, is this an effective visualization? If not, what is your suggestion to improve it?
95 |
96 | # Task 5. Distribution of Items per Category with lollipop chart
97 | In the original dataset, count the number of items in each `category` and visualize the distribution using a lollipop chart (recall Weekly AE 2). The chart should have categories on the x-axis, item count on the y-axis, and informative labels and a title.
98 |
99 | **Optional**: Sort the categories in descending order of item count.
100 |
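One common way to build a lollipop chart is a `geom_segment()` "stick" plus a `geom_point()` "head". A hedged sketch, assuming the raw data sits in a data frame called `ikea` (an assumed name):

```R
ikea |>
  count(category) |>
  mutate(category = fct_reorder(category, n, .desc = TRUE)) |>  # optional: sort by count
  ggplot(aes(x = category, y = n)) +
  geom_segment(aes(xend = category, yend = 0)) +  # the "stick"
  geom_point(size = 3) +                          # the "head"
  labs(title = "Number of items per category", x = "Category", y = "Item count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
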
101 | An example plot (yours may look different):
102 |
103 | ![](img/task5.png)
104 |
105 | In your RMarkdown, answer the following question(s):
106 | - In 3-4 sentences, briefly describe the key findings from this plot. (Open ended question - write down anything useful you can see from this plot)
107 | - In your opinion, is this an effective visualization? If not, what is your suggestion to improve it?
108 |
109 | # Task 6. Average price per category with lollipop chart
110 | Calculate the average price (USD) for each category and visualize the distribution using a lollipop chart.
111 |
112 | **Optional**: Sort the categories by average price in descending order.
113 |
114 | An example plot (yours may look different):
115 |
116 | ![](img/task6.png)
117 |
118 | In your RMarkdown, answer the following question(s):
119 | - In 3-4 sentences, briefly describe the key findings from this plot. (Open ended question - write down anything useful you can see from this plot)
120 | - In your opinion, is this an effective visualization? If not, what is your suggestion to improve it?
121 |
122 | # Task 7. Price vs. Volume relationship with scatter plot
123 | The volume of each item is the product of its `height`, `width`, and `depth`. Plot a scatter plot with volume on the y-axis and price in USD on the x-axis. Color the points by `category`.
124 |
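A hedged sketch, again assuming the data frame `ikea` with the `price_usd` column from Task 1:

```R
ikea |>
  mutate(volume = height * width * depth) |>  # cubic centimeters; missing dimensions give NA
  ggplot(aes(x = price_usd, y = volume, color = category)) +
  geom_point(alpha = 0.6) +
  labs(title = "Item volume vs. price", x = "Price (USD)", y = "Volume (cm^3)", color = "Category")
```
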
125 | An example plot (yours may look different):
126 |
127 | ![](img/task7.png)
128 |
129 | In your RMarkdown, answer the following question(s):
130 | - In 3-4 sentences, briefly describe the key findings from this plot. (Open ended question - write down anything useful you can see from this plot)
131 | - In your opinion, is this an effective visualization? If not, what is your suggestion to improve it?
--------------------------------------------------------------------------------
/Homework 1/img/ikea-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 1/img/ikea-logo.png
--------------------------------------------------------------------------------
/Homework 1/img/task4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 1/img/task4.png
--------------------------------------------------------------------------------
/Homework 1/img/task5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 1/img/task5.png
--------------------------------------------------------------------------------
/Homework 1/img/task6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 1/img/task6.png
--------------------------------------------------------------------------------
/Homework 1/img/task7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 1/img/task7.png
--------------------------------------------------------------------------------
/Homework 2/Homework2.md:
--------------------------------------------------------------------------------
1 | ![](img/banner.png)
2 |
3 | # COMP4010/5120 Data Visualization - Homework 2 Question Set
4 |
5 | This assignment introduces you to a new type of plot called a [ridgeline plot](https://www.data-to-viz.com/graph/ridgeline.html), and also requires you to apply the skills you have learned so far in the course. We will be using the same fictional VinUni enrolment data from the Week 5 AE, plus some Vietnam population data, to construct a [population pyramid plot](https://education.nationalgeographic.org/resource/population-pyramid/).
6 |
7 | # Submission requirements
8 |
9 | Similar to the weekly AEs and prior homeworks, you should submit your work to Canvas in two formats: `Rmd` and `pdf`.
10 | Answers for non-coding questions should be added in the file as plain text.
11 |
12 | ## Task 1: Creating a ridgeline plot for VinUni enrolment data
13 |
14 | ### Summary of task requirements
15 |
16 | - Recreate the example plot to the best of your ability (minor differences like font size and figure size are accepted).
17 | - Provide a short comment about what information you are able to interpret from this ridgeline plot.
18 |
19 | ### Task description
20 |
21 | Install the package `ggridges` ([quick beginner's guide](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html)). Use `ggridges` to create a ridgeline plot of the distribution of percentage of enrolled students by college in the fictional VinUni dataset [`college_data_normalized.csv`](college_data_normalized.csv).
22 |
23 | Recreate the following plot to the best of your ability (minor differences like font size and figure size are accepted), and provide a short comment about what information you are able to interpret from this ridgeline plot:
24 |
25 | ![](img/task1.png)
26 |
27 | You should do this by creating a function called `theme_college_stats()` to store your theme for the following exercises. In particular, the theme should have the following properties:
28 |
29 | - Centered and bold plot title.
30 | - Only y-axis major grid lines are visible.
31 | - Axis titles and tick labels should be bold.
32 | - Font size should be readable and does not need to be identical to the example plot provided above.
33 |
34 | Additionally, the font family used in the above plot is [Open Sans from Google Font](https://fonts.google.com/specimen/Open+Sans).
35 | The color palette used for color coding the colleges is: `#78d6ff` for CAS, `#ffee54` for CBM, `#992212` for CECS, and `#117024` for CHS.
36 |
37 | Finally, please provide a short comment about what information you are able to interpret from this ridgeline plot.
38 |
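If you are unsure how to structure such a theme function, here is a minimal sketch; the base theme, font registration, and exact sizes are assumptions, not requirements:

```R
library(tidyverse)
library(ggridges)

theme_college_stats <- function() {
  # assumes Open Sans has been made available to R (e.g. registered via showtext/sysfonts)
  theme_minimal(base_family = "Open Sans") +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),  # centered, bold title
      panel.grid.major.x = element_blank(),                   # keep only y-axis major grid lines
      panel.grid.minor = element_blank(),
      axis.title = element_text(face = "bold"),               # bold axis titles
      axis.text = element_text(face = "bold")                 # bold tick labels
    )
}
```

You would then add `theme_college_stats()` to your `geom_density_ridges()` plot like any other theme layer.
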
39 | ## Task 2: Creating a bar chart for 2118 VinUni data
40 |
41 | ### Summary of task requirements
42 |
43 | - Filter the VinUni enrolment data to extract only data from the year 2118.
44 | - Recreate the example plot to the best of your ability (minor differences like font size and figure size are accepted).
45 |
46 | ### Task description
47 |
48 | Recreate the following plot to the best of your ability (minor differences like font size and figure size are accepted):
49 |
50 | ![](img/task2.png)
51 |
52 | ## Task 3: Creating a pie chart for 2118 VinUni data
53 |
54 | ### Summary of task requirements
55 |
56 | - Recreate the example plot to the best of your ability (minor differences like font size and figure size are accepted).
57 | - Provide a short comment comparing the effectiveness of the pie chart and the bar chart from Task 2 to represent the same data.
58 |
59 | ### Task description
60 |
61 | Recreate the following plot to the best of your ability (minor differences like font size and figure size are accepted):
62 |
63 | ![](img/task3.png)
64 |
65 | Hint: If you're unsure how to create this pie chart, check out [`coord_polar()` documentation](https://ggplot2.tidyverse.org/reference/coord_polar.html).
66 |
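As the hint suggests, a pie chart in `ggplot2` is essentially a single stacked bar wrapped onto polar coordinates. A minimal sketch, assuming the filtered 2118 data lives in a data frame `enrolment_2118` with columns `college` and `pct` (assumed names):

```R
ggplot(enrolment_2118, aes(x = "", y = pct, fill = college)) +
  geom_col(width = 1) +       # one stacked bar
  coord_polar(theta = "y") +  # wrap it into a circle
  theme_void()
```
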
67 | ## Task 4: Creating population pyramids for Vietnam data from 1975 to 2025
68 |
69 | ### Summary of task requirements
70 |
71 | - Create a series of 11 individual population pyramid plots for Vietnam from 1975 to 2025.
72 | - Create a theme to style the plots as close as possible to the example plot provided below (you can use other color schemes).
73 | - Provide a comment describing what information a population pyramid for a single year tells you.
74 | - Provide a comment, based on the 11 plots, describing what you think is going on with Vietnam's population. Optional: provide your hypothesis/explanation for why that is happening.
75 |
76 | ### Task description
77 |
78 | Create a series of 11 individual [population pyramid](https://education.nationalgeographic.org/resource/population-pyramid/) plots for Vietnam from 1975 to 2025, based on the data files [`Viet Nam-####.csv`](./). In particular, the x-axis should represent the percentage of total population while the y-axis should be the age group. Below is an example plot of the data for 2020:
79 |
80 | ![](img/task4.png)
81 |
82 | Please try to recreate the plot to the best of your ability. You can pick any color schemes you would like for your plot. In the example plot above, the colors `#109466` and `#112e80` are used for female and male respectively.
83 |
84 | To avoid copying and pasting a lot of code, you can create a function called `create_population_pyramid(file_name)` to generate a population pyramid with the defined style and theme for a given data file.
85 |
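A hedged sketch of what such a function might look like; the reshaping choices and styling below are illustrative only:

```R
library(tidyverse)

create_population_pyramid <- function(file_name) {
  read_csv(file_name) |>
    pivot_longer(c("M", "F"), names_to = "sex", values_to = "count") |>
    mutate(
      pct = count / sum(count) * 100,        # share of the total population
      pct = if_else(sex == "M", -pct, pct),  # plot males to the left of zero
      Age = fct_inorder(Age)                 # keep the age groups in file order
    ) |>
    ggplot(aes(x = pct, y = Age, fill = sex)) +
    geom_col() +
    scale_x_continuous(labels = abs) +       # show magnitudes instead of signs
    scale_fill_manual(values = c(M = "#112e80", F = "#109466")) +
    labs(x = "Percentage of total population", y = "Age group")
}

create_population_pyramid("Viet Nam-2020.csv")
```
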
86 | Please describe what information a population pyramid for a single year tells you. Based on the 11 plots you've just generated, describe what you think is going on with Vietnam's population. **Optional**: Provide your hypothesis/explanation for why that is happening.
87 |
88 | ## Task 5: Creating line graph for Vietnamese total population count from 1975 to 2025
89 |
90 | ### Summary of task requirements
91 |
92 | - Create a line graph for Vietnamese total population count from 1975 to 2025.
93 |
94 | ### Task description
95 |
96 | Using all provided data files for the Vietnamese population from 1975 to 2025, create a line graph of the total population count over that period. As the data for 2025 is a prediction, draw the segment from 2020 to 2025 as a dashed line to highlight that it is a prediction.
97 |
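A sketch of the dashed-prediction idea, assuming you have already combined the yearly files into a data frame `totals` with columns `year` and `population` (assumed names), where the 2025 value is the prediction:

```R
observed  <- filter(totals, year <= 2020)
predicted <- filter(totals, year >= 2020)  # 2020 and 2025, so the dashed segment connects them

ggplot(totals, aes(x = year, y = population)) +
  geom_line(data = observed, color = "#112e80") +
  geom_line(data = predicted, color = "#999999", linetype = "dashed") +
  geom_point(color = "#112e80") +
  labs(x = "Year", y = "Total population")
```
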
98 | Recreate the following plot to the best of your ability (minor differences like font size and figure size are accepted):
99 |
100 | ![](img/task5.png)
101 |
102 | You can pick any color schemes you would like for your plot. In the example plot above, the color `#112e80` is used for the lines and points and `#999999` is used for the dashed line.
103 |
104 | ## Task 6: Your own Hall of Fame/Shame
105 |
106 | Find an interesting data visualization online and provide a critique of it. It can be a great visualization or a terrible one, up to you!
107 |
108 | In 3-4 sentences:
109 |
110 | - Provide an introduction to the visualization and an image of it along with proper credit/citation. To insert an image in R Markdown, use:
111 | ```![caption](path/to/image.png)```
112 | - Describe the purpose of the visualization and the question it is attempting to answer.
113 |
114 | In bullet points: identify the strengths and weaknesses of the visualization.
115 |
116 | In two to three paragraphs (80-150 words): give a critique/analysis/constructive comments about the visualization. Apply your knowledge of effective visualization and what you have learned in the course to your analysis. Can you suggest some improvements that can be made to the visualization?
117 |
118 | Side note: inserting an image in R Markdown should look something like this:
119 |
120 | ![](img/task6.png)
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-1975.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,3694642,3533048
3 | 5-9,3121466,3028507
4 | 10-14,2989087,2922867
5 | 15-19,2714457,2693838
6 | 20-24,2076148,2135972
7 | 25-29,1403191,1533105
8 | 30-34,1027172,1154893
9 | 35-39,984171,1109235
10 | 40-44,1010992,1158747
11 | 45-49,848021,985453
12 | 50-54,772016,901463
13 | 55-59,646767,746972
14 | 60-64,570331,678210
15 | 65-69,428088,533747
16 | 70-74,332872,443930
17 | 75-79,183815,270358
18 | 80-84,85977,143035
19 | 85-89,28564,55873
20 | 90-94,5496,14371
21 | 95-99,522,2033
22 | 100+,21,131
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-1980.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,4086624,3896241
3 | 5-9,3601373,3485693
4 | 10-14,3020584,2990706
5 | 15-19,2838616,2875253
6 | 20-24,2608580,2663069
7 | 25-29,2021410,2110407
8 | 30-34,1359024,1506636
9 | 35-39,989298,1134796
10 | 40-44,951458,1094749
11 | 45-49,977491,1143917
12 | 50-54,811901,964290
13 | 55-59,727906,874578
14 | 60-64,590842,707845
15 | 65-69,495372,620876
16 | 70-74,345272,462280
17 | 75-79,236716,347983
18 | 80-84,107137,179331
19 | 85-89,36725,72510
20 | 90-94,7448,19518
21 | 95-99,732,2846
22 | 100+,33,196
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-1985.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,4526915,4311312
3 | 5-9,4006081,3859743
4 | 10-14,3537375,3461674
5 | 15-19,2907232,2951899
6 | 20-24,2729170,2841305
7 | 25-29,2542777,2635707
8 | 30-34,1971260,2082154
9 | 35-39,1317979,1484041
10 | 40-44,957636,1118555
11 | 45-49,921444,1080490
12 | 50-54,938433,1121204
13 | 55-59,767940,936689
14 | 60-64,670707,835130
15 | 65-69,517712,652171
16 | 70-74,404676,543594
17 | 75-79,249893,368246
18 | 80-84,140976,235917
19 | 85-89,47322,94088
20 | 90-94,10036,26384
21 | 95-99,1036,4061
22 | 100+,48,290
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-1990.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,4670777,4448662
3 | 5-9,4448635,4277100
4 | 10-14,3959308,3842936
5 | 15-19,3451933,3431582
6 | 20-24,2813886,2923031
7 | 25-29,2666815,2817462
8 | 30-34,2489071,2609294
9 | 35-39,1923492,2059038
10 | 40-44,1281928,1465924
11 | 45-49,929012,1104419
12 | 50-54,885849,1060693
13 | 55-59,888640,1090833
14 | 60-64,710108,898816
15 | 65-69,590784,775701
16 | 70-74,425791,577455
17 | 75-79,295573,440848
18 | 80-84,150868,256699
19 | 85-89,62777,128777
20 | 90-94,13669,36413
21 | 95-99,1525,5949
22 | 100+,69,459
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-1995.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,4480472,4260623
3 | 5-9,4595982,4415086
4 | 10-14,4414726,4268137
5 | 15-19,3928244,3833046
6 | 20-24,3405738,3418352
7 | 25-29,2769092,2908102
8 | 30-34,2621960,2801018
9 | 35-39,2443493,2592424
10 | 40-44,1882183,2041223
11 | 45-49,1247959,1448627
12 | 50-54,894373,1085734
13 | 55-59,839338,1034514
14 | 60-64,822764,1051791
15 | 65-69,628519,844696
16 | 70-74,488346,697378
17 | 75-79,313255,480197
18 | 80-84,180199,319823
19 | 85-89,68052,148908
20 | 90-94,18329,54131
21 | 95-99,2117,9266
22 | 100+,105,780
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-2000.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,3678636,3482725
3 | 5-9,4430164,4238865
4 | 10-14,4554958,4400237
5 | 15-19,4357328,4241607
6 | 20-24,3880755,3813519
7 | 25-29,3382369,3406481
8 | 30-34,2750254,2899375
9 | 35-39,2595111,2791177
10 | 40-44,2408404,2577420
11 | 45-49,1846777,2023776
12 | 50-54,1212804,1429674
13 | 55-59,854513,1062360
14 | 60-64,782876,1000858
15 | 65-69,732617,994023
16 | 70-74,522859,766157
17 | 75-79,362475,588576
18 | 80-84,193320,357017
19 | 85-89,82564,192744
20 | 90-94,20254,66003
21 | 95-99,2875,15115
22 | 100+,148,1362
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-2005.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,3673690,3421596
3 | 5-9,3645906,3460996
4 | 10-14,4357498,4198635
5 | 15-19,4280693,4236040
6 | 20-24,4061281,4043894
7 | 25-29,3806480,3688444
8 | 30-34,3393297,3340381
9 | 35-39,2745868,2870666
10 | 40-44,2584660,2775668
11 | 45-49,2399863,2566785
12 | 50-54,1830224,2011950
13 | 55-59,1181752,1403330
14 | 60-64,812941,1024571
15 | 65-69,706730,946525
16 | 70-74,611051,905409
17 | 75-79,389821,651462
18 | 80-84,224921,442767
19 | 85-89,89264,219561
20 | 90-94,24755,87404
21 | 95-99,3209,19441
22 | 100+,202,2451
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-2010.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,3741588,3387806
3 | 5-9,3643329,3404214
4 | 10-14,3610494,3443027
5 | 15-19,4202547,4109705
6 | 20-24,4073782,4111665
7 | 25-29,3981072,3959360
8 | 30-34,3788085,3639043
9 | 35-39,3363334,3312905
10 | 40-44,2716365,2851761
11 | 45-49,2551428,2756603
12 | 50-54,2347015,2541429
13 | 55-59,1759579,1973885
14 | 60-64,1112588,1359315
15 | 65-69,732028,971634
16 | 70-74,590187,864554
17 | 75-79,456549,773649
18 | 80-84,242899,494433
19 | 85-89,104331,276123
20 | 90-94,26942,101798
21 | 95-99,3935,26470
22 | 100+,229,3319
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-2015.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,3947110,3538149
3 | 5-9,3712754,3375324
4 | 10-14,3620222,3396949
5 | 15-19,3580980,3431778
6 | 20-24,4149889,4092957
7 | 25-29,4024289,4095666
8 | 30-34,3931499,3941614
9 | 35-39,3731891,3618417
10 | 40-44,3302109,3287568
11 | 45-49,2653235,2822345
12 | 50-54,2466112,2716899
13 | 55-59,2234321,2488514
14 | 60-64,1640149,1915359
15 | 65-69,995877,1292890
16 | 70-74,613443,889638
17 | 75-79,442039,740313
18 | 80-84,285623,589422
19 | 85-89,113576,310718
20 | 90-94,31763,129407
21 | 95-99,4340,31362
22 | 100+,284,4595
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-2020.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,3914415,3515182
3 | 5-9,3918151,3525629
4 | 10-14,3690690,3368547
5 | 15-19,3592890,3386916
6 | 20-24,3537280,3418117
7 | 25-29,4099736,4077268
8 | 30-34,3974712,4077902
9 | 35-39,3874635,3920484
10 | 40-44,3665987,3592153
11 | 45-49,3228325,3255484
12 | 50-54,2567440,2783503
13 | 55-59,2351065,2662918
14 | 60-64,2086301,2417969
15 | 65-69,1469944,1824485
16 | 70-74,839870,1189523
17 | 75-79,462156,766667
18 | 80-84,277636,568363
19 | 85-89,134475,375856
20 | 90-94,34972,148902
21 | 95-99,5157,40949
22 | 100+,316,5705
23 |
--------------------------------------------------------------------------------
/Homework 2/Viet Nam-2025.csv:
--------------------------------------------------------------------------------
1 | Age,M,F
2 | 0-4,3693730,3341723
3 | 5-9,3889316,3501707
4 | 10-14,3859798,3501012
5 | 15-19,3596688,3313053
6 | 20-24,3523399,3343511
7 | 25-29,3565336,3420389
8 | 30-34,4122372,4066148
9 | 35-39,3940209,4038239
10 | 40-44,3805911,3871350
11 | 45-49,3573473,3537612
12 | 50-54,3106736,3192372
13 | 55-59,2421136,2706316
14 | 60-64,2177102,2575229
15 | 65-69,1868542,2302988
16 | 70-74,1240881,1680082
17 | 75-79,638319,1030491
18 | 80-84,292072,590966
19 | 85-89,130530,361834
20 | 90-94,41247,180620
21 | 95-99,5720,47634
22 | 100+,399,7778
23 |
--------------------------------------------------------------------------------
/Homework 2/college_data_normalized.csv:
--------------------------------------------------------------------------------
1 | year,college,pct
2 | 2105,CBM,0.598425
3 | 2106,CBM,0.6615047779779688
4 | 2107,CBM,0.3971112533138313
5 | 2108,CBM,0.41129442916793973
6 | 2109,CBM,0.4251509105144849
7 | 2110,CBM,0.43469596320899334
8 | 2111,CBM,0.42034424444598883
9 | 2112,CBM,0.4281175536139794
10 | 2113,CBM,0.434933723243606
11 | 2114,CBM,0.4274814637646496
12 | 2115,CBM,0.4368950555476089
13 | 2116,CBM,0.45135
14 | 2117,CBM,0.4224771553436631
15 | 2118,CBM,0.4020581168683984
16 | 2119,CBM,0.3827852384477107
17 | 2120,CBM,0.3396724986988029
18 | 2121,CBM,0.3363424403658756
19 | 2122,CBM,0.3318401649092207
20 | 2123,CBM,0.3014243451343038
21 | 2124,CBM,0.2848104801538449
22 | 2125,CBM,0.2708468062389401
23 | 2105,CAS,0.0000000000000000
24 | 2106,CAS,0.0000000000000000
25 | 2107,CAS,0.2802358533686809
26 | 2108,CAS,0.2880130359507078
27 | 2109,CAS,0.2777006036420579
28 | 2110,CAS,0.2843893714869698
29 | 2111,CAS,0.2966386359950888
30 | 2112,CAS,0.3150025697332152
31 | 2113,CAS,0.3112730410604281
32 | 2114,CAS,0.2922028222913179
33 | 2115,CAS,0.2437896342917084
34 | 2116,CAS,0.2183250000000002
35 | 2117,CAS,0.2292163289630512
36 | 2118,CAS,0.2471780225758193
37 | 2119,CAS,0.2621973141588241
38 | 2120,CAS,0.2780758297633823
39 | 2121,CAS,0.2759211653813196
40 | 2122,CAS,0.2591968603821454
41 | 2123,CAS,0.2455672890421426
42 | 2124,CAS,0.2567813349718876
43 | 2125,CAS,0.2634276664873313
44 | 2105,CECS,0.221725000000000
45 | 2106,CECS,0.172349305539717
46 | 2107,CECS,0.106522534052472
47 | 2108,CECS,0.082951420714940
48 | 2109,CECS,0.081150708458565
49 | 2110,CECS,0.056642820643842
50 | 2111,CECS,0.063937730210577
51 | 2112,CECS,0.062374433490632
52 | 2113,CECS,0.066156906924902
53 | 2114,CECS,0.072518536235350
54 | 2115,CECS,0.081986363419634
55 | 2116,CECS,0.104075000000000
56 | 2117,CECS,0.139327572506952
57 | 2118,CECS,0.173330613355093
58 | 2119,CECS,0.189468118853759
59 | 2120,CECS,0.224646674940945
60 | 2121,CECS,0.247962376198162
61 | 2122,CECS,0.264568302544993
62 | 2123,CECS,0.300523108873258
63 | 2124,CECS,0.304196364150193
64 | 2125,CECS,0.315810182777259
65 | 2105,CHS,0.1776000000000000
66 | 2106,CHS,0.1620179259698497
67 | 2107,CHS,0.2161303592650151
68 | 2108,CHS,0.2177411141664120
69 | 2109,CHS,0.2159977773848913
70 | 2110,CHS,0.2242718446601941
71 | 2111,CHS,0.2190793893483448
72 | 2112,CHS,0.1945054431621735
73 | 2113,CHS,0.1876363287710635
74 | 2114,CHS,0.2077971777086821
75 | 2115,CHS,0.2373289467410482
76 | 2116,CHS,0.2262500000000000
77 | 2117,CHS,0.2089789431863329
78 | 2118,CHS,0.1774332472006890
79 | 2119,CHS,0.1655493285397060
80 | 2120,CHS,0.1576049965968691
81 | 2121,CHS,0.1397740180546422
82 | 2122,CHS,0.1443946721636406
83 | 2123,CHS,0.1524852569502948
84 | 2124,CHS,0.1542118207240741
85 | 2125,CHS,0.1499153444964689
86 |
--------------------------------------------------------------------------------
/Homework 2/img/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 2/img/banner.png
--------------------------------------------------------------------------------
/Homework 2/img/task1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 2/img/task1.png
--------------------------------------------------------------------------------
/Homework 2/img/task2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 2/img/task2.png
--------------------------------------------------------------------------------
/Homework 2/img/task3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 2/img/task3.png
--------------------------------------------------------------------------------
/Homework 2/img/task4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 2/img/task4.png
--------------------------------------------------------------------------------
/Homework 2/img/task5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 2/img/task5.png
--------------------------------------------------------------------------------
/Homework 2/img/task6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 2/img/task6.png
--------------------------------------------------------------------------------
/Homework 3/Homework3.md:
--------------------------------------------------------------------------------
1 | ![](img/banner.png)
2 |
3 | # COMP4010/5120 Data Visualization - Homework 3 Question Set
4 |
5 | This assignment requires you to demonstrate your knowledge of visualizing geospatial data (choropleth map) and creating and publishing an interactive Shiny dashboard. Additionally, you will be introduced to `DALEX`, a library for ML model interpretation. This was supposed to be covered during an AE session; however, due to scheduling, a brief introductory exercise is included here instead.
6 |
7 | The assignment is separated into 3 main sections, each using a different dataset.
8 |
9 | The maximum score for this assignment is **20 points**.
10 |
11 | # Submission requirements
12 |
13 | Similar to the weekly AEs and prior homeworks, you should submit your work to Canvas in two formats: `Rmd` and `pdf`.
14 | The entire assignment should be contained in 1 `Rmd` file.
15 |
16 | Answers for non-coding questions should be added in the file as plain text in the provided `ANSWER` field.
17 |
18 | # Section 1: Studying popularity of baby names
19 |
20 | For this set of exercises, we are going to use the [`babynames` dataset](https://hadley.github.io/babynames/).
21 |
22 | Install the package and load its datasets by running:
23 |
24 | ```R
25 | # Install the released version from CRAN
26 | install.packages("babynames")
27 |
28 | library(babynames)
29 |
30 | applicants_df <- applicants
31 | babynames_df <- babynames
32 | lifetables_df <- lifetables
33 | births_df <- births
34 | ```
35 |
36 | The four included datasets in this package are:
37 |
38 | - `babynames`: For each year from 1880 to 2017, the number of children of each sex given each name. All names with more than 5 uses are given.
39 |
40 | - `applicants`: The number of applicants for social security numbers (SSN) for each year for each sex.
41 |
42 | - `lifetables`: Cohort life tables data.
43 |
44 | It also includes the following data set from the US Census:
45 |
46 | - `births`: Number of live births by year, up to 2017.
47 | You may need to use more than one of these datasets, but the most important one is `babynames`. Feel free to bring in additional data sources as you wish.
48 |
49 | ## Task 1: Shiny Dashboard for Baby Naming Trends (10 pts)
50 |
51 | ### Summary of task requirements
52 |
53 | Design and implement a Shiny dashboard that reports on baby naming trends. The dashboard should appropriately use server-side interactivity.
54 |
55 | 1. **Originality**:
56 | - Avoid direct code copying from examples.
57 | - Create an original implementation.
58 |
59 | 2. **Inputs and Outputs**:
60 | - Use at least **two user inputs**.
61 | - Provide at least **three reactive outputs** (e.g., plots, tables, text).
62 |
63 | 3. **Components**:
64 | - Include at least:
65 | - One **plot**
66 | - One **table**
67 | - One **value box**
68 | - Utilize **plotly** for interactivity in at least one plot.
69 |
70 | 4. **Styling and Customization**:
71 | - Apply an **appropriate theme** for styling.
72 | - Optional: Customize further with **CSS/SCSS** if desired.
73 |
74 | 5. **Creativity and Beyond Basics**:
75 | - Extend beyond basic functionality.
76 | - Consider user experience, customization, interactivity, data sources, and insights.
77 |
78 | 6. **Publishing**:
79 | - Deploy the dashboard on **Shinyapps.io**.
80 | - Clearly print the **working app URL** in the rendered PDF for evaluation.
81 |
82 | Remember to showcase creativity and thoughtful design in your implementation! Good luck!
83 |
84 | ### Task description
85 |
86 | **Create a Shiny dashboard to report on baby naming trends.** Design and implement a Shiny dashboard that appropriately uses server-side interactivity.
87 |
88 | There are tons of examples online of Shiny apps made using the `babynames` package. You may use them for design inspiration but your implementation must be original. Don’t directly copy code from examples, and if you draw on specific features and/or code be sure to cite it.
89 |
90 | - You must use at least two user inputs and three reactive outputs (e.g. plots, tables, text).
91 |
92 | - The dashboard should include at least one plot, one table, and one value box.
93 | - You should use `plotly` for at least one plot and customize it to maximize its interactivity.
94 | - Use an appropriate theme for styling. Feel free to further customize with CSS/SCSS if you have background knowledge, but it’s not required.
95 |
96 | **If you create a basic dashboard, expect a basic grade.** We’re looking for extending beyond the basics and some originality/creativity. Things you might think to incorporate into the dashboard:
97 |
98 | - User experience: e.g. layout (pages, tabs, arrangement of cards, etc.), design, etc.
99 | - Customization: e.g. color palettes, themes, fonts, customized theme() to blend ggplot2 plots with the dashboard theme, etc.
100 | - Interactivity: e.g. hover effects, click events, etc.
101 | - Data: e.g. more than just babynames.csv, additional data sources, data wrangling, etc.
102 | - Insights: e.g. text boxes, value boxes, etc.
103 |
104 | The dashboard should be published using Shinyapps.io and the URL for the working app clearly printed in your rendered PDF so we can easily access it during the evaluation. You will need to create an account on Shinyapps.io but the free tier should be sufficient for this assignment.
105 |
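To make the moving parts concrete, here is a deliberately minimal sketch, not a complete (or grade-worthy) dashboard: it only illustrates two inputs, a reactive `plotly` plot, a table, and a `bslib` value box. All names, layout, and theme choices are assumptions:

```R
library(shiny)
library(bslib)
library(plotly)
library(babynames)
library(dplyr)
library(ggplot2)

ui <- page_sidebar(
  title = "Baby naming trends (sketch)",
  theme = bs_theme(bootswatch = "minty"),             # any built-in theme
  sidebar = sidebar(
    textInput("name", "Name", value = "Mary"),         # user input 1
    selectInput("sex", "Sex", choices = c("F", "M"))   # user input 2
  ),
  value_box(title = "Peak year", value = textOutput("peak_year")),  # value box
  card(plotlyOutput("trend_plot")),                    # interactive plot
  card(tableOutput("top_years"))                       # table
)

server <- function(input, output, session) {
  # one reactive subset shared by all three outputs
  name_data <- reactive({
    babynames |> filter(name == input$name, sex == input$sex)
  })

  output$trend_plot <- renderPlotly({
    p <- ggplot(name_data(), aes(year, prop)) + geom_line()
    ggplotly(p)   # plotly interactivity (hover, zoom, etc.)
  })

  output$top_years <- renderTable({
    name_data() |> arrange(desc(n)) |> head(10)
  })

  output$peak_year <- renderText({
    d <- name_data()
    if (nrow(d) == 0) "No data" else as.character(d$year[which.max(d$n)])
  })
}

shinyApp(ui, server)
```

A real submission would add more wrangling, pages/tabs, and polish beyond this skeleton, and would be deployed (e.g. with `rsconnect::deployApp()`) before pasting the Shinyapps.io URL into the report.
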
106 | # Section 2: Adopt, don’t shop
107 |
108 | The data for this exercise comes from [The Pudding](https://github.com/the-pudding/data/blob/master/dog-shelters/README.md) via [TidyTuesday](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-12-17).
109 |
110 | You will also need two packages: `ggmap` and `maps`. For your convenience, a dataframe with the US state names and their corresponding abbreviations is provided below.
111 |
112 | ```R
113 | #install.packages("ggmap")
114 | library(ggmap)
115 |
116 | #install.packages("maps")
117 | library(maps)
118 |
119 | state_abbreviations <- data.frame(
120 | state_name = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
121 | "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
122 | "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
123 | "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
124 | "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
125 | "New Hampshire", "New Jersey", "New Mexico", "New York",
126 | "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
127 | "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
128 | "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
129 | "West Virginia", "Wisconsin", "Wyoming"),
130 | state_abbr = c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID",
131 | "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN",
132 | "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND",
133 | "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
134 | "VA", "WA", "WV", "WI", "WY")
135 | )
136 | ```
137 |
138 | ## Task 2: Exploring dog adoption dataset in the US (8 pts)
139 |
140 | ### Summary of task requirements
141 |
142 | - Create a dataframe containing the number of dogs available to adopt per `contact_state`
143 | - Create a histogram of the number of dogs available to adopt and describe the distribution of this variable.
144 | - Create a choropleth map where each state is filled in with a color based on the number of dogs available to adopt in that state.
145 |
146 | ### Task description
147 |
148 | - Load the included `data/dog_travel.csv` dataset with `read_csv()`.
149 |
150 | - Calculate the number of dogs available to adopt per `contact_state`. Save the result as a new data frame with 2 columns `contact_state` and `n`. Your new dataframe should look similar to the following example:
151 |
152 | ![](img/task2.png)
153 |
154 | - Make a histogram of the number of dogs available to adopt and describe the distribution of this variable. Use `binwidth = 50`. Your plot may look like the following example:
155 |
156 | ![](img/task2b.png)
157 |
158 | - Use this dataset to make a map of the US states, where each state is filled in with a color based on the number of dogs available to adopt in that state.
159 |
160 | ![](img/task2c.png)
161 |
162 | *Hints*:
163 |
164 | - Use the `state_abbreviations` dataframe provided in the snippet above as a lookup table to match state names to abbreviations.
165 | - Use a gradient color scale and `log10` transformation.
166 |
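Putting the hints together, a hedged sketch of the choropleth step might look like the following; `dogs_per_state` is an assumed name for the per-state counts built earlier (columns `contact_state` and `n`):

```R
library(tidyverse)
library(maps)

# state polygons; `region` holds lowercase state names
us_states <- map_data("state")

# attach counts to polygons: abbreviation -> full name -> lowercase region
state_counts <- dogs_per_state |>
  left_join(state_abbreviations, by = c("contact_state" = "state_abbr")) |>
  mutate(region = tolower(state_name))

us_states |>
  left_join(state_counts, by = "region") |>
  ggplot(aes(x = long, y = lat, group = group, fill = n)) +
  geom_polygon(color = "white", linewidth = 0.2) +
  scale_fill_gradient(trans = "log10", na.value = "grey90", name = "Dogs available") +
  coord_quickmap() +
  theme_void()
```
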
167 | # Section 3. DALEX basics with explainer
168 |
169 | To start, make sure you have the following packages installed.
170 |
171 | ```R
172 | # packages for wrangling data and the original models
173 | library(tidyverse)
174 | #install.packages('tidymodels')
175 | library(tidymodels)
176 | #install.packages('rcis')
177 | library(rcis)
178 |
179 | # packages for model interpretation/explanation
180 | #install.packages('DALEX')
181 | library(DALEX)
182 | #install.packages('DALEXtra')
183 | library(DALEXtra)
184 | #install.packages('rattle')
185 | library(rattle) # fancy tree plots
186 |
187 | # set random number generator seed value for reproducibility
188 | set.seed(4010)
189 | ```
190 |
191 | In this task we will utilize the `scorecard` dataset. With this dataset, we will interpret a set of machine learning models predicting the median student debt load for students graduating in 2020-21 at four-year colleges and universities as a function of university-specific factors (e.g. public vs. private school, admissions rate, cost of attendance).
192 |
193 | To read about the dataset, run `help(scorecard)` after importing `rcis`.
194 |
195 | ![](img/rcis.png)
196 |
197 | We have estimated three distinct machine learning models to predict the median student debt load for students graduating in 2020-21 at four-year colleges and universities. Each model uses the same set of predictors, but the algorithms differ. Specifically, we have estimated
198 |
199 | - Random forest
200 | - Penalized regression
201 | - 10-nearest neighbors
202 | All models were estimated using tidymodels. We will load the training set, test set, and ML workflows from `data/scorecard-models.RData`.
203 |
204 | ```R
205 | # load Rdata file with all the data frames and pre-trained models
206 | load("data/scorecard-models.RData")
207 | ```
208 |
209 | In order to generate our interpretations, we will use the [DALEX package](https://dalex.drwhy.ai/). The first step in any DALEX operation is to create an **explainer object**. This object contains all the information needed to interpret the model’s predictions. We will create explainer objects for each of the three models.
210 |
211 | For example, we can create an explainer object for the Penalized regression model (GLMNet) with:
212 |
213 | ```R
214 | # use explain_*() to create explainer object
215 | # first step of a DALEX operation
216 | explainer_glmnet <- explain_tidymodels(
217 | model = glmnet_wf,
218 | # data should exclude the outcome feature
219 | data = scorecard_train |> select(-debt),
220 | # y should be a vector containing the outcome of interest for the training set
221 | y = scorecard_train$debt,
222 | # assign a label to clearly identify model in later plots
223 | label = "penalized regression"
224 | )
225 | ```
226 |
227 | ## Task 3: Create explainer objects (1 pt)
228 |
229 | ### Summary of task requirements
230 |
231 | Fill in the `TODO` fields in the given snippets.
232 |
233 | ### Task description
234 |
235 | Review the syntax below for creating explainer objects using the `explain_tidymodels()` function. Then, create explainer objects for the random forest and
236 | k-nearest neighbors models. Fill in the `TODO` fields in the given snippets provided below.
237 |
238 | ```R
239 | # explainer for random forest model
240 | explainer_rf <- explain_tidymodels(
241 | model = TODO,
242 | data = scorecard_train |> select(-debt),
243 | y = scorecard_train$debt,
244 | label = "TODO"
245 | )
246 |
247 | # explainer for nearest neighbors model
248 | explainer_kknn <- explain_tidymodels(
249 | model = TODO,
250 | data = scorecard_train |> select(-debt),
251 | y = scorecard_train$debt,
252 | label = "TODO"
253 | )
254 | ```
255 |
256 | ## Task 4: Feature importance (1 pt)
257 |
258 | ### Summary of task requirements
259 |
260 | Run the given snippets, and provide a **short answer (1 sentence)** in the `ANSWER` field included in the comment.
261 |
262 | ### Task description
263 |
264 | The DALEX package provides a variety of methods for interpreting machine learning models. One common method is to calculate feature importance. Feature importance measures the contribution of each predictor variable to the model’s predictions. We will use the `model_parts()` function to calculate feature importance for the random forest model. It includes a built-in `plot()` method using `ggplot2` to visualize the results.
265 |
266 | Here is an example, run this snippet and see what happens:
267 |
268 | ```R
269 | # generate feature importance measures
270 | vip_rf <- model_parts(explainer_rf)
271 | vip_rf
272 |
273 | # visualize feature importance
274 | plot(vip_rf)
275 | ```
276 |
277 | Next, we want to examine the difference when using the ratio of the raw change in the loss function instead. Run the following snippet, examine the output and compare with the plot above.
278 |
279 | ```R
280 | # Question: Calculate feature importance for the random forest model using the ratio of the raw change in the loss function. How does this differ from the raw change?
281 | # ANSWER: YOUR ANSWER HERE
282 |
283 | # calculate ratio rather than raw change
284 | model_parts(explainer_rf, type = "ratio") |>
285 | plot()
286 | ```
287 |
--------------------------------------------------------------------------------
/Homework 3/data/median-housing 1.csv:
--------------------------------------------------------------------------------
1 | DATE,MSPUS
2 | 1/1/1963,17800
3 | 4/1/1963,18000
4 | 7/1/1963,17900
5 | 10/1/1963,18500
6 | 1/1/1964,18500
7 | 4/1/1964,18900
8 | 7/1/1964,18900
9 | 10/1/1964,19400
10 | 1/1/1965,20200
11 | 4/1/1965,19800
12 | 7/1/1965,20200
13 | 10/1/1965,20300
14 | 1/1/1966,21000
15 | 4/1/1966,22100
16 | 7/1/1966,21500
17 | 10/1/1966,21400
18 | 1/1/1967,22300
19 | 4/1/1967,23300
20 | 7/1/1967,22500
21 | 10/1/1967,22900
22 | 1/1/1968,23900
23 | 4/1/1968,24900
24 | 7/1/1968,24800
25 | 10/1/1968,25600
26 | 1/1/1969,25700
27 | 4/1/1969,25900
28 | 7/1/1969,25900
29 | 10/1/1969,24900
30 | 1/1/1970,23900
31 | 4/1/1970,24400
32 | 7/1/1970,23000
33 | 10/1/1970,22600
34 | 1/1/1971,24300
35 | 4/1/1971,25800
36 | 7/1/1971,25300
37 | 10/1/1971,25500
38 | 1/1/1972,26200
39 | 4/1/1972,26800
40 | 7/1/1972,27900
41 | 10/1/1972,29200
42 | 1/1/1973,30200
43 | 4/1/1973,32700
44 | 7/1/1973,33500
45 | 10/1/1973,34000
46 | 1/1/1974,35200
47 | 4/1/1974,35600
48 | 7/1/1974,36200
49 | 10/1/1974,37200
50 | 1/1/1975,38100
51 | 4/1/1975,39000
52 | 7/1/1975,38800
53 | 10/1/1975,41200
54 | 1/1/1976,42800
55 | 4/1/1976,44200
56 | 7/1/1976,44400
57 | 10/1/1976,45500
58 | 1/1/1977,46300
59 | 4/1/1977,48900
60 | 7/1/1977,48800
61 | 10/1/1977,51600
62 | 1/1/1978,53000
63 | 4/1/1978,55300
64 | 7/1/1978,56100
65 | 10/1/1978,59000
66 | 1/1/1979,60600
67 | 4/1/1979,63100
68 | 7/1/1979,64700
69 | 10/1/1979,62600
70 | 1/1/1980,63700
71 | 4/1/1980,64000
72 | 7/1/1980,64900
73 | 10/1/1980,66400
74 | 1/1/1981,66800
75 | 4/1/1981,69400
76 | 7/1/1981,69200
77 | 10/1/1981,70400
78 | 1/1/1982,66400
79 | 4/1/1982,69600
80 | 7/1/1982,69300
81 | 10/1/1982,71600
82 | 1/1/1983,73300
83 | 4/1/1983,74900
84 | 7/1/1983,77400
85 | 10/1/1983,75900
86 | 1/1/1984,78200
87 | 4/1/1984,80700
88 | 7/1/1984,81000
89 | 10/1/1984,79900
90 | 1/1/1985,82800
91 | 4/1/1985,84300
92 | 7/1/1985,83200
93 | 10/1/1985,86800
94 | 1/1/1986,88000
95 | 4/1/1986,92100
96 | 7/1/1986,93000
97 | 10/1/1986,95000
98 | 1/1/1987,97900
99 | 4/1/1987,103400
100 | 7/1/1987,106000
101 | 10/1/1987,111500
102 | 1/1/1988,110000
103 | 4/1/1988,110000
104 | 7/1/1988,115000
105 | 10/1/1988,113900
106 | 1/1/1989,118000
107 | 4/1/1989,118900
108 | 7/1/1989,120000
109 | 10/1/1989,124800
110 | 1/1/1990,123900
111 | 4/1/1990,126800
112 | 7/1/1990,117000
113 | 10/1/1990,121500
114 | 1/1/1991,120000
115 | 4/1/1991,119900
116 | 7/1/1991,120000
117 | 10/1/1991,120000
118 | 1/1/1992,119500
119 | 4/1/1992,120000
120 | 7/1/1992,120000
121 | 10/1/1992,126000
122 | 1/1/1993,125000
123 | 4/1/1993,127000
124 | 7/1/1993,127000
125 | 10/1/1993,127000
126 | 1/1/1994,130000
127 | 4/1/1994,130000
128 | 7/1/1994,129700
129 | 10/1/1994,132000
130 | 1/1/1995,130000
131 | 4/1/1995,133900
132 | 7/1/1995,132000
133 | 10/1/1995,138000
134 | 1/1/1996,137000
135 | 4/1/1996,139900
136 | 7/1/1996,140000
137 | 10/1/1996,144100
138 | 1/1/1997,145000
139 | 4/1/1997,145800
140 | 7/1/1997,145000
141 | 10/1/1997,144200
142 | 1/1/1998,152200
143 | 4/1/1998,149500
144 | 7/1/1998,153000
145 | 10/1/1998,153000
146 | 1/1/1999,157400
147 | 4/1/1999,158700
148 | 7/1/1999,159100
149 | 10/1/1999,165300
150 | 1/1/2000,165300
151 | 4/1/2000,163200
152 | 7/1/2000,168800
153 | 10/1/2000,172900
154 | 1/1/2001,169800
155 | 4/1/2001,179000
156 | 7/1/2001,172500
157 | 10/1/2001,171100
158 | 1/1/2002,188700
159 | 4/1/2002,187200
160 | 7/1/2002,178100
161 | 10/1/2002,190100
162 | 1/1/2003,186000
163 | 4/1/2003,191800
164 | 7/1/2003,191900
165 | 10/1/2003,198800
166 | 1/1/2004,212700
167 | 4/1/2004,217600
168 | 7/1/2004,213500
169 | 10/1/2004,228800
170 | 1/1/2005,232500
171 | 4/1/2005,233700
172 | 7/1/2005,236400
173 | 10/1/2005,243600
174 | 1/1/2006,247700
175 | 4/1/2006,246300
176 | 7/1/2006,235600
177 | 10/1/2006,245400
178 | 1/1/2007,257400
179 | 4/1/2007,242200
180 | 7/1/2007,241800
181 | 10/1/2007,238400
182 | 1/1/2008,233900
183 | 4/1/2008,235300
184 | 7/1/2008,226500
185 | 10/1/2008,222500
186 | 1/1/2009,208400
187 | 4/1/2009,220900
188 | 7/1/2009,214300
189 | 10/1/2009,219000
190 | 1/1/2010,222900
191 | 4/1/2010,219500
192 | 7/1/2010,224100
193 | 10/1/2010,224300
194 | 1/1/2011,226900
195 | 4/1/2011,228100
196 | 7/1/2011,223500
197 | 10/1/2011,221100
198 | 1/1/2012,238400
199 | 4/1/2012,238700
200 | 7/1/2012,248800
201 | 10/1/2012,251700
202 | 1/1/2013,258400
203 | 4/1/2013,268100
204 | 7/1/2013,264800
205 | 10/1/2013,273600
206 | 1/1/2014,275200
207 | 4/1/2014,288000
208 | 7/1/2014,281000
209 | 10/1/2014,298900
210 | 1/1/2015,289200
211 | 4/1/2015,289100
212 | 7/1/2015,295800
213 | 10/1/2015,302500
214 | 1/1/2016,299800
215 | 4/1/2016,306000
216 | 7/1/2016,303800
217 | 10/1/2016,310900
218 | 1/1/2017,313100
219 | 4/1/2017,318200
220 | 7/1/2017,320500
221 | 10/1/2017,337900
222 | 1/1/2018,331800
223 | 4/1/2018,315600
224 | 7/1/2018,330900
225 | 10/1/2018,322800
226 | 1/1/2019,313000
227 | 4/1/2019,322500
228 | 7/1/2019,318400
229 | 10/1/2019,327100
230 | 1/1/2020,329000
231 | 4/1/2020,322600
232 | 7/1/2020,337500
233 | 10/1/2020,358700
234 | 1/1/2021,369800
235 | 4/1/2021,382600
236 | 7/1/2021,411200
237 | 10/1/2021,423600
238 | 1/1/2022,433100
239 | 4/1/2022,449300
240 | 7/1/2022,468000
241 | 10/1/2022,479500
242 | 1/1/2023,429000
243 | 4/1/2023,418500
244 | 7/1/2023,435400
245 | 10/1/2023,417700
--------------------------------------------------------------------------------
/Homework 3/data/scorecard-models.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 3/data/scorecard-models.RData
--------------------------------------------------------------------------------
/Homework 3/img/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 3/img/banner.png
--------------------------------------------------------------------------------
/Homework 3/img/rcis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 3/img/rcis.png
--------------------------------------------------------------------------------
/Homework 3/img/task2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 3/img/task2.png
--------------------------------------------------------------------------------
/Homework 3/img/task2b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 3/img/task2b.png
--------------------------------------------------------------------------------
/Homework 3/img/task2c.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Homework 3/img/task2c.png
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 VinUniversity
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # COMP4010/5120 - Data Visualization - Spring 2024
3 | This course teaches techniques and algorithms for creating effective visualizations of large datasets and their analytics, based on principles from graphic design, visual art, perceptual psychology and cognitive science. In addition to participating in class discussions, students will complete several short data analysis and visualization design assignments as well as a final project. Data visualization tools such as Tableau and R are introduced through lab exercises. Resources for future developments and continued learning are also presented.
4 |
5 | ## Prerequisites
6 |
7 | - Basic knowledge of programming (experience in Python is beneficial)
8 | - R and RStudio installed on your computer
9 |
10 | ## Installation
11 |
12 | To get started, you will need to install R and RStudio:
13 |
14 | - Download R from the [CRAN repository](https://cran.r-project.org/)
15 | - Download RStudio from the [RStudio Download page](https://www.rstudio.com/products/rstudio/download/)
16 |
17 | ## Canvas
18 |
19 | - [COMP4010 Canvas](https://vinuni.instructure.com/courses/1977)
20 | - [COMP5120 Canvas](https://vinuni.instructure.com/courses/1995)
21 |
22 | ## Homework Problem Sets
23 | | Weeks | Problem Set | Dataset | Deadline |
24 | | --- | --- | --- | --- |
25 | | Week 1-2 | [Problem Set](Homework%201/Homework1.md) | [Dataset](Homework%201/ikea_data.csv) | 11.59pm Monday Week 4 (March 11) |
26 | | Week 4-5 | [Problem Set](Homework%202/Homework2.md) | [Dataset](Homework%202/) | 11.59pm Monday Week 7 (April 1) |
27 | | The rest! | [Problem Set](Homework%203/Homework3.md) | [Dataset](Homework%203/data/) | 11.59pm Monday Week 15 (June 24) |
28 |
29 | ## Weekly Application Exercises (AE)
30 |
31 | For each week, read through the notes before attempting the weekly AE.
32 | The corresponding example R Markdown notebook is also provided in multiple formats for readability.
33 |
34 | | Week | Topic | Notes | R Markdown |
35 | | --- | --- | --- | --- |
36 | | 01 | Introduction to R and `ggplot2` | [Notes](Week%201/Week1-AE-Notes.md) | [Rmd](Week%201/Week1-AE-RMarkdown.Rmd) \ [HTML](Week%201/Week1-AE-RMarkdown.html) \ [PDF](Week%201/Week1-AE-RMarkdown.pdf) |
37 | | 02 | Practicing with Tabular Data and More Geoms | [Notes](Week%202/Week2-AE-Notes.md) | [Rmd](Week%202/Week2-AE-RMarkdown.Rmd) \ [HTML](Week%202/Week2-AE-RMarkdown.html) \ [PDF](Week%202/Week2-AE-RMarkdown.pdf) |
38 | | 03 | Guest lecture! | n/a | n/a |
39 | | 04 | Log scale & Waffle charts | [Notes](Week%204/Week4-AE-Notes.md) | n/a |
40 | | 05 | Customizing theme | [Notes](Week%205/Week5-AE-Notes.md) | n/a |
41 | | 06 | Data wrangling! | [Notes](Week%206/Week6-AE-Notes.md) | [Rmd](Week%206/Week6-AE-RMarkdown.Rmd) |
42 | | 07 | Annotations | [Notes](Week%207/Week7-AE-Notes.md) | [Rmd](Week%207/Week7-AE-RMarkdown.Rmd) |
43 | | 08 | Presentation week! | n/a | n/a |
44 | | 09 | Accessibility | [Notes](Week%209/Week9-AE-Notes.md) | [Rmd](Week%209/Week9-AE-RMarkdown.Rmd) |
45 | | 10 | Spatial (geo) data visualization | [Notes](Week%2010/Week10-AE-Notes.md) | [Rmd](Week%2010/Week10-AE-RMarkdown.Rmd) |
46 | | 11 | Time series and Animation | [Notes](Week%2011/Week11-AE-Notes.md) | [Rmd](Week%2011/Week11-AE-RMarkdown.Rmd) |
47 | | 12 | Interactive visualization with RShiny | No notes, follow the lecture! \ [MIT 6.859 Interaction Zoo](https://vis.csail.mit.edu/classes/6.859/lectures/09-InteractionZoo.pdf) \ [MIT 6.859 Narrative Visualization](https://vis.csail.mit.edu/classes/6.859/lectures/12-Narrative.pdf) \ [[VIS 20 Talk] Tilt Map](https://www.youtube.com/watch?v=wa51nQzv2Ac) \ [Immersive Storytelling](https://www.youtube.com/watch?v=8VQ1twU3RRI) | [Lecture Rmd](Week%2012/Week12-Lecture-Examples.Rmd) \ [AE Rmd](Week%2012/Week12-AE-RMarkdown.Rmd) |
48 | | 13 | ... | | |
49 | | 14 | ... | | |
50 | | 15 | ... | | |
51 |
52 | ## Helpful resources
53 |
54 | - [Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four](https://youtu.be/jbkSRLYSojo?si=yENI1BZSAPYKcjd7): One of the most popular talks about data visualization
55 | - [from Data to Viz](https://www.data-to-viz.com/)
56 |
57 | ## Hall of Fame - Learn from good examples
58 |
59 | - [Beautiful tutorial on plotting Australian heatwave data with R](https://github.com/njtierney/ozviridis)
60 | - [An interactive visualization showing how the names of makeup products reveal bias in beauty](https://pudding.cool/2021/03/foundation-names/)
61 | - [adumb - I Made a Graph of Wikipedia... This Is What I Found](https://www.youtube.com/watch?v=JheGL6uSF-4&ab_channel=adumb)
62 |
63 | ## Hall of Shame - Learn from bad examples
64 |
65 | - [Don't be counter intuitive](https://www.data-to-viz.com/caveat/counter_intuitive.html)
66 | - [5 examples of bad data visualization](https://www.jotform.com/blog/bad-data-visualization/)
67 | - [10 Good and Bad Examples of Data Visualization in 2024](https://www.polymersearch.com/blog/10-good-and-bad-examples-of-data-visualization)
68 | - [12 Bad Data Visualization Examples Explained](https://www.codeconquest.com/blog/12-bad-data-visualization-examples-explained/)
69 |
70 | ## Contributing
71 |
72 | We encourage students to contribute to this repository by providing feedback, suggesting improvements, or adding new resources.
73 |
74 | ## License
75 |
76 | This course material is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.
77 |
78 | ## Acknowledgments
79 |
80 | - Banner image from the amazing [Visual Cinnamon](https://www.visualcinnamon.com/resources/learning-data-visualization/)
81 |
82 | ---
83 |
84 | Happy Learning!
85 |
--------------------------------------------------------------------------------
/Week 1/Week1-AE-RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "COMP4010 - Week 1"
3 | output:
4 | html_document: default
5 | pdf_document: default
6 | date: "2024-02-20"
7 | ---
8 |
9 | # Week 1
10 |
11 | # Application Exercises
12 |
13 | ## Task 1.
14 |
15 | - Data:
16 | - Mapping:
17 | - Statistical transformation:
18 | - Geometric object:
19 | - Position adjustment:
20 | - Scale:
21 | - Coordinate system:
22 | - Faceting:
23 |
24 | ## Task 2.
25 |
26 | ```{r}
27 | # Your code here
28 |
29 | ```
30 |
31 | ## Task 3.
32 |
33 | ```{r}
34 | # Your code here
35 |
36 | ```
37 |
38 | # Reading Material
39 |
40 | ## Hello World! but in R
41 |
42 | Create a chunk below with a vector of numbers and calculate its mean. (This is akin to printing 'Hello World' in other languages; statisticians are not as fun :D)
43 |
44 | ```{r}
45 | myVector <- c(1,2,3,4)
46 | mean(myVector)
47 | ```
48 |
49 | ## Using CRAN to install ggplot2
50 |
51 | Alternatively, you can run the command below directly in the console.
52 |
53 | ```{r}
54 | #install.packages("ggplot2") # uncomment to install
55 | library(ggplot2) # Import
56 | ```
57 |
58 | ## Hello to ggplot2
59 |
60 | This uses the built-in `mtcars` dataset. To preview this dataset, you can use `summary(mtcars)` or `View(mtcars)`.
61 |
62 | ```{r}
63 | View(mtcars)
64 | ```
65 |
66 | We can see that the dataset has `mpg` (miles per gallon) and `wt` (weight) columns. Let's plot these two variables as a scatterplot using `ggplot2`.
67 |
68 | ```{r}
69 | ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()
70 | ```
71 |
72 | ## Bar chart with iris dataset
73 |
74 | We can also try making a bar chart with the built-in `iris` dataset.
75 |
76 | ```{r}
77 | ggplot(data = iris, aes(x = Species, fill = Species)) + geom_bar()
78 | ```
79 |
80 | ## Customizing plots
81 |
82 | Adding titles and labels:
83 |
84 | ```{r}
85 | ggplot(data = mtcars, aes(x = wt, y = mpg)) +
86 | geom_point() +
87 | ggtitle("Scatter Plot of mpg vs wt") +
88 | xlab("Weight") +
89 | ylab("Miles Per Gallon")
90 | ```
91 |
92 | Changing colors:
93 |
94 | ```{r}
95 | ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
96 | geom_point()
97 | ```
98 |
99 | ### Experimenting with the Iris Dataset
100 |
101 | - **Dataset**: Iris (available in R by default)
102 | - **Task**: Create a scatter plot showing the relationship between petal length and petal width, colored by species.
103 | - **Customization**: Add a smooth regression line for each species.
104 |
105 | ```{r}
106 | ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
107 | geom_point() +
108 | geom_smooth(method = "lm") +
109 | ggtitle("Petal Length vs Width by Species") +
110 | theme_minimal()
111 | ```
112 |
113 | ### Visualizing the mtcars Dataset
114 |
115 | - **Dataset**: mtcars (available in R by default)
116 | - **Task**: Create a bar plot showing the average miles per gallon (mpg) for cars with different numbers of cylinders.
117 | - **Customization**: Use a different fill color for each cylinder type and add labels for the average mpg.
118 |
119 | ```{r}
120 | ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
121 | geom_bar(stat = "summary", fun = mean) +
122 | geom_text(stat = "summary", fun = mean, aes(label = round(after_stat(y), 1)), vjust = -0.5) +
123 | labs(x = "Number of Cylinders", y = "Average Miles per Gallon", title = "Average MPG by Cylinder Count") +
124 | theme_bw()
125 | ```
126 |
127 | ### Exploring the gapminder Dataset
128 |
129 | - **Dataset**: gapminder (install using **`install.packages("gapminder")`** and then **`library(gapminder)`**)
130 | - **Task**: Create a line plot showing GDP per capita over time for select countries.
131 | - **Customization**: Use different line types and colors for each country.
132 |
133 | ```{r}
134 | # install.packages("gapminder")
135 | library(gapminder)
136 | ggplot(subset(gapminder, country %in% c("Japan", "United Kingdom", "United States")),
137 | aes(x = year, y = gdpPercap, color = country, linetype = country)) +
138 | geom_line() +
139 | scale_y_log10() +
140 | ggtitle("GDP Per Capita Over Time") +
141 | theme_light()
142 | ```
143 |
144 | ### Working with the diamonds Dataset
145 |
146 | - **Dataset**: diamonds (part of ggplot2 package)
147 | - **Task**: Create a histogram of diamond prices, faceted by cut quality.
148 | - **Customization**: Adjust the bin width and use a theme that enhances readability.
149 |
150 | ```{r}
151 | ggplot(diamonds, aes(x = price)) +
152 | geom_histogram(binwidth = 500) +
153 | facet_wrap(~cut) +
154 | labs(title = "Diamond Prices by Cut Quality", x = "Price", y = "Count") +
155 | theme_classic()
156 | ```
157 |
--------------------------------------------------------------------------------
/Week 1/Week1-AE-RMarkdown.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/Week1-AE-RMarkdown.pdf
--------------------------------------------------------------------------------
/Week 1/img/Minard_Update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Minard_Update.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 1.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 2.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 3.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 4.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 5.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 6.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 7.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 8.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled 9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled 9.png
--------------------------------------------------------------------------------
/Week 1/img/Untitled.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/Untitled.png
--------------------------------------------------------------------------------
/Week 1/img/olympic_feathers_print.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 1/img/olympic_feathers_print.jpeg
--------------------------------------------------------------------------------
/Week 10/Week10-AE-Notes.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # COMP4010/5120 - Week 10 Application Exercises
4 | ---
5 |
6 | # A. Application Exercises
7 |
8 | **Data:** [`nyc-311.csv`](./nyc-311.csv), [`ny-inc.rds`](./ny-inc.rds), [`ny-counties.rds`](./ny-counties.rds)
9 |
10 | ```R
11 | library(tidyverse)
12 | #install.packages('ggmap')
13 | library(ggmap)
14 | ```
15 |
16 | Let’s first load a subset of 311 service requests in New York City. This subset includes 311 service requests related to Food Poisoning in commercial establishments (e.g. restaurants, cafeterias, food carts).
17 |
18 | ```R
19 | nyc_311 <- read_csv(file = "nyc-311.csv", show_col_types = FALSE)
20 | nyc_311
21 | ```
22 |
23 | Store your Stadia Maps API key using the function:
24 |
25 | ```R
26 | register_stadiamaps(key = "YOUR KEY HERE", write = TRUE)
27 | ```
28 |
29 | ## Task 1. Obtain map tiles for New York City
30 |
31 | Use [bboxfinder.com](http://bboxfinder.com/) to find bounding box coordinates for New York City. Then, use `get_stadiamap()` to obtain map tiles for New York City and visualize the map.
32 |
33 | > Try using a `zoom` level of 11.
34 |
35 | 
36 |
37 | ## Task 2. Food poisoning rates
38 |
39 | The COVID-19 pandemic caused massive disruption in the restaurant industry. Due to social distancing measures and lockdowns, restaurant traffic decreased significantly.
40 |
41 | While this had significant financial ramifications, one potentially overlooked consequence is the impact on food poisoning rates. With fewer people eating out, the number of food poisoning complaints may have decreased.
42 |
43 | Visualize the geospatial distribution of complaints related to food poisoning in NYC in March, April, and May over a four-year time period (2018-21). Construct the chart in such a way that you can make valid comparisons over time and geographically. What impact did COVID-19 have on food poisoning cases in NYC? Did it vary geographically?
44 |
45 | 
46 |
47 | ## Task 3. Visualize food poisoning complaints on Roosevelt Island
48 |
49 | Now focus on food poisoning complaints on or around Roosevelt Island. Use `get_stadiamap()` to obtain map tiles for the Roosevelt Island region and overlay them with the food poisoning complaints. What type of chart is more effective for this task?
50 |
51 | > Consider adjusting your `zoom` for this geographic region.
52 |
53 | 
54 |
55 | ---
56 |
57 | ## For the next tasks: New York 2022 median household income
58 |
59 | We will use two data files for this analysis. The first contains median household incomes for each census tract in New York from 2022. The second contains the boundaries of each county in New York.
60 |
61 | ```R
62 | # useful on MacOS to speed up rendering of geom_sf() objects
63 | if (!identical(getOption("bitmapType"), "cairo") && isTRUE(capabilities()[["cairo"]])) {
64 | options(bitmapType = "cairo")
65 | }
66 |
67 | # create reusable labels for each plot
68 | map_labels <- labs(
69 | title = "Median household income in New York in 2022",
70 | subtitle = "By census tract",
71 | color = NULL,
72 | fill = NULL,
73 | caption = "Source: American Community Survey"
74 | )
75 |
76 | # load data
77 | ny_inc <- read_rds(file = "ny-inc.rds")
78 | ny_counties <- read_rds(file = "ny-counties.rds")
79 | ```
80 |
81 | ## Task 4. Draw a continuous choropleth of median household income
82 |
83 | Create a choropleth map of median household income in New York. Use a continuous color gradient to identify each tract’s median household income.
84 |
85 | > Use the stored `map_labels` to set the title, subtitle, and caption for this and the remaining plots.
86 |
87 | 
88 |
89 | **OPTIONAL:** Use the `viridis` color palette.
90 |
91 | 
92 |
93 | ## Task 5. Overlay county borders
94 |
95 | To provide better context, overlay the NY county borders on the choropleth map using the data from `ny_counties`.
96 |
97 | 
98 |
99 | # B. Reading Material
100 |
101 | ## 1. Hall of Shame (is back!)
102 |
103 | 
104 |
105 | "[Jon Schwabish](https://policyviz.com/about/) is an economist who trains people to be better at data visualization." This week we don't have just one example for the Hall of Shame but a rather interesting video going through numerous bad examples of data visualization used in real life, along with the comments of Professor Jon Schwabish.
106 |
107 | Check out the video here: [What Not to do in Data Visualization: A Walk through the Bad DataViz Hall of Shame](https://www.youtube.com/watch?v=KluzR75S6U0)
108 |
109 | > Talk abstract: Prepare to be amused and enlightened as we embark on a comical journey through the quirky world of bad data visualizations. In this light-hearted talk, I’ll showcase some of the most outrageous and baffling data visual blunders that have left audiences scratching their heads. From pie charts that vie you everything to bar charts that distort and mislead, you’ll see it all. I mix the comical with the serious to unveil visual missteps in the data world. Amidst the 3D exploding charts, you'll also glean valuable lessons on what not to do when crafting data visualizations. Join me for a rollicking exploration of data gone wrong and leave with a smile and a newfound appreciation for the importance of clarity and accuracy in our data-driven endeavors.
110 |
111 |
112 | ## 2. Getting a Stadia Maps API Key
113 |
114 | [Stadia Maps](https://stadiamaps.com/) is a readily available collection of APIs related to maps and geolocation. To get started with this week's exercises, you should sign up for a Stadia Maps API key following [this guide](https://search.r-project.org/CRAN/refmans/ggmap/html/register_stadiamaps.html).
115 |
116 | ## 3. Introduction to `ggmap` in R
117 |
118 | The `ggmap` package is a valuable tool for R users who work with spatial visualization. It serves as an extension to the popular `ggplot2` package, enabling the integration of map data with the customizable plotting capabilities of `ggplot2`. Whether you are a data scientist, a geospatial analyst, or just someone interested in mapping data, `ggmap` offers a flexible and powerful approach to creating detailed and aesthetically pleasing maps.
119 |
120 | The primary feature of `ggmap` is its ability to download and render map tiles from popular mapping services like Google Maps, OpenStreetMap, and Stadia/Stamen Maps. This allows users to overlay their data onto these maps, creating rich visualizations that convey complex geographical data in an intuitive way.
121 |
122 | `ggmap` simplifies the process of geocoding addresses into latitude and longitude coordinates and vice versa, making it easier to plot location-based data. Additionally, it supports the customization of base maps, enabling users to adjust the aesthetic elements such as color schemes, annotations, and layers to suit specific needs.
123 |
124 | In the following sections, we will delve into how to install `ggmap`, retrieve maps, geocode data, and customize your maps to tell stories with your data effectively. Whether you are plotting simple location data or conducting complex spatial analysis, `ggmap` provides the tools necessary to produce insightful and visually engaging maps.
125 |
126 | A great introduction to `ggmap` can be found here: [ggmap: Spatial Visualization with ggplot2](https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf)
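
For instance, here is a minimal sketch of the typical `ggmap` workflow (it assumes you have already registered a Stadia Maps API key as shown earlier in these notes; the bounding box is an approximate, purely illustrative one for Hanoi):

```R
library(ggmap)

# approximate bounding box for Hanoi (illustrative values only)
hanoi_bb <- c(left = 105.70, bottom = 20.95, right = 105.95, top = 21.10)

# download raster tiles from Stadia Maps and plot them as a base map
hanoi <- get_stadiamap(bbox = hanoi_bb, zoom = 12)
ggmap(hanoi)
```

Data layers such as `geom_point()` can then be added on top of the `ggmap()` call, just as with a regular `ggplot()` object.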
127 |
128 | ## 4. Choropleth map
129 |
130 | A choropleth map is a type of thematic map where areas are shaded or patterned in proportion to the measurement of a statistical variable being displayed on the map, such as population density or per-capita income. The purpose of a choropleth map is to provide an easy visual way to compare data across geographic regions without needing to understand precise values.
131 |
132 | Key characteristics of choropleth maps include:
133 |
134 | - **Data Classification**: Data is typically classified into different categories or ranges, such as income levels or temperature ranges. These categories are then assigned different colors or patterns.
135 | - **Color Schemes**: Choropleth maps use color gradients to represent different data ranges. Warmer colors (e.g., red, orange) might represent higher values, while cooler colors (e.g., blue, green) might represent lower values. The choice of colors can significantly affect the readability and interpretability of the data.
136 | - **Geographic Units**: The map is divided into spatial units such as countries, states, counties, or districts. Each unit is filled with a color corresponding to the data value for that area.
137 | - **Simplicity and Accessibility**: These maps make complex data more accessible and understandable to a broad audience, facilitating easier comparison and analysis of regional data.
138 |
139 | Choropleth maps are widely used in various fields such as meteorology, public health, economics, and political science to visualize and analyze spatial data distributions. However, they also have limitations, such as potentially misleading interpretations if the areas of the geographic units vary significantly in size, or if the data is not normalized (e.g., total values vs. per capita).
140 |
141 | > However, its downside is that regions with bigger sizes tend to have a bigger weight in the map interpretation, which includes a bias. - [from Data to Viz](https://www.data-to-viz.com/graph/choropleth.html)
142 |
143 | 
144 |
145 | ## 5. Geographic Information System (GIS) Data and the `sf` library
146 |
147 | A geographic information system (GIS) is a computer system for capturing, storing, checking, and displaying data related to positions on Earth’s surface. GIS can show many different kinds of data on one map, such as streets, buildings, and vegetation. This enables people to more easily see, analyze, and understand patterns and relationships. ([National Geographic](https://education.nationalgeographic.org/resource/geographic-information-system-gis/))
148 |
149 | The `sf` package in R stands for "simple features" and is a modern approach to handling spatial and geographic data within the R programming environment. The package provides classes and functions to manipulate, process, and visualize spatial data, integrating well with the tidyverse suite of packages for data analysis.
150 |
151 | We can combine GIS data and the `sf` library to draw vector shapes based on geographical data such as country or state borders. For example, to draw a vector border of a state from GIS data in R with the `geom_sf()` function (provided by `ggplot2` for plotting `sf` objects), you'll need to follow a few key steps. This involves obtaining spatial data in a vector format, preparing the data, and then plotting it.
152 |
153 | You can get the spatial data from various sources such as the `rnaturalearth` and `rnaturalearthdata` libraries and use it directly with the `sf` library:
154 |
155 | ```R
156 | #install.packages("rnaturalearth")
157 | #install.packages("rnaturalearthdata")
158 | library(rnaturalearth)
159 | library(rnaturalearthdata)
160 |
161 | world <- ne_countries(scale = "medium", returnclass = "sf")
162 | ggplot(data = world) +
163 | geom_sf()
164 | ```
165 |
166 | 
--------------------------------------------------------------------------------
/Week 10/Week10-AE-RMarkdown-Key.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Week10-AE-RMarkdown"
3 | output: html_document
4 | date: "2024-04-25"
5 | ---
6 |
7 | ```{r}
8 | library(tidyverse)
9 | #install.packages('ggmap')
10 | library(ggmap)
11 | #install.packages('sf')
12 | library(sf)
13 | library(colorspace)
14 | library(scales)
15 | ```
16 |
17 | Let’s first load a subset of 311 service requests in New York City. This subset includes 311 service requests related to Food Poisoning in commercial establishments (e.g. restaurants, cafeterias, food carts).
18 |
19 | ```{r}
20 | nyc_311 <- read_csv(file = "nyc-311.csv", show_col_types = FALSE)
21 | nyc_311
22 | ```
23 |
24 | # Store your Stadia Maps API key using the function
25 |
26 | ```{r}
27 | register_stadiamaps(key = "9ae33801-3948-44ed-af85-6b5081ae22bf", write = TRUE)
28 | ```
29 |
30 | # Task 1. Obtain map tiles for New York City
31 |
32 | Use [bboxfinder.com](http://bboxfinder.com/) to find bounding box coordinates for New York City. Then, use `get_stadiamap()` to obtain map tiles for New York City and visualize the map.
33 |
34 | > Try using a `zoom` level of 11.
35 |
36 | ```{r}
37 | # store bounding box coordinates
38 | nyc_bb <- c(
39 | left = -74.263045,
40 | bottom = 40.487652,
41 | right = -73.675963,
42 | top = 40.934743
43 | )
44 |
45 | nyc <- get_stadiamap(
46 | bbox = nyc_bb,
47 | zoom = 11
48 | )
49 |
50 | # plot the raster map
51 | ggmap(nyc)
52 | ```
53 |
54 | # Task 2. Food poisoning rates
55 |
56 | The COVID-19 pandemic caused massive disruption in the restaurant industry. Due to social distancing measures and lockdowns, restaurant traffic decreased significantly.
57 |
58 | While this had significant financial ramifications, one potentially overlooked consequence is the impact on food poisoning rates. With fewer people eating out, the number of food poisoning complaints may have decreased.
59 |
60 | Visualize the geospatial distribution of complaints related to food poisoning in NYC in March, April, and May over a four-year time period (2018-21). Construct the chart in such a way that you can make valid comparisons over time and geographically. What impact did COVID-19 have on food poisoning cases in NYC? Did it vary geographically?
61 |
62 | ```{r}
63 | nyc_covid_food_poison <- nyc_311 |>
64 | # generate a year variable
65 | mutate(year = year(created_date)) |>
66 | # only keep reports in March, April, and May from 2018-21
67 | filter(month(created_date) %in% 3:5, year %in% 2018:2021)
68 |
69 | ggmap(nyc) +
70 | # add the heatmap
71 | stat_density_2d(
72 | data = nyc_covid_food_poison,
73 | mapping = aes(
74 | x = longitude,
75 | y = latitude,
76 | fill = after_stat(level)
77 | ),
78 | alpha = .1,
79 | bins = 50,
80 | geom = "polygon"
81 | ) +
82 | scale_fill_viridis_c() +
83 | facet_wrap(facets = vars(year))
84 | ```
85 |
86 | # Task 3. Visualize food poisoning complaints on Roosevelt Island
87 |
88 | Now focus on food poisoning complaints on or around Roosevelt Island. Use `get_stadiamap()` to obtain map tiles for the Roosevelt Island region and overlay them with the food poisoning complaints. What type of chart is more effective for this task?
89 |
90 | > Consider adjusting your `zoom` for this geographic region.
91 |
92 | ```{r}
93 | # Obtain map tiles for Roosevelt Island
94 | roosevelt_bb <- c(
95 | left = -73.967121,
96 | bottom = 40.748700,
97 | right = -73.937080,
98 | top = 40.774704
99 | )
100 | roosevelt <- get_stadiamap(
101 | bbox = roosevelt_bb,
102 | zoom = 14
103 | )
104 |
105 | # Generate a scatterplot of food poisoning complaints
106 | ggmap(roosevelt) +
107 | # add a scatterplot layer
108 | geom_point(
109 | data = filter(nyc_311, complaint_type == "Food Poisoning"),
110 | mapping = aes(
111 | x = longitude,
112 | y = latitude
113 | ),
114 | alpha = 0.2
115 | )
116 | ```
117 |
118 | # New York 2022 median household income
119 |
120 | We will use two data files for this analysis. The first contains median household incomes for each census tract in New York from 2022. The second contains the boundaries of each county in New York.
121 |
122 | ```{r}
123 | # useful on MacOS to speed up rendering of geom_sf() objects
124 | if (!identical(getOption("bitmapType"), "cairo") && isTRUE(capabilities()[["cairo"]])) {
125 | options(bitmapType = "cairo")
126 | }
127 |
128 | # create reusable labels for each plot
129 | map_labels <- labs(
130 | title = "Median household income in New York in 2022",
131 | subtitle = "By census tract",
132 | color = NULL,
133 | fill = NULL,
134 | caption = "Source: American Community Survey"
135 | )
136 |
137 | # load data
138 | ny_inc <- read_rds(file = "ny-inc.rds")
139 | ny_counties <- read_rds(file = "ny-counties.rds")
140 |
141 | ny_inc
142 | ```
143 |
144 | ```{r}
145 | ny_counties
146 | ```
147 |
148 | # Task 4. Draw a continuous choropleth of median household income
149 |
150 | Create a choropleth map of median household income in New York. Use a continuous color gradient to identify each tract’s median household income.
151 |
152 | > Use the stored `map_labels` to set the title, subtitle, and caption for this and the remaining plots.
153 |
154 | ```{r}
155 | ggplot(data = ny_inc) +
156 | # use fill and color to avoid gray boundary lines
157 | geom_sf(aes(fill = medincomeE, color = medincomeE)) +
158 | # increase interpretability of graph
159 | scale_color_continuous(labels = label_dollar()) +
160 | scale_fill_continuous(labels = label_dollar()) +
161 | map_labels
162 | ```
163 |
164 | **OPTIONAL**: Use the `viridis` color palette.
165 | ```{r}
166 | ggplot(data = ny_inc) +
167 | # use fill and color to avoid gray boundary lines
168 | geom_sf(mapping = aes(fill = medincomeE, color = medincomeE)) +
169 | # increase interpretability of graph
170 | scale_fill_continuous_sequential(
171 | palette = "viridis",
172 | rev = FALSE,
173 | aesthetics = c("fill", "color"),
174 | labels = label_dollar(),
175 | name = NULL
176 | ) +
177 | map_labels
178 | ```
179 |
180 | # Task 5. Overlay county borders
181 |
182 | To provide better context, overlay the NY county borders on the choropleth map.
183 |
184 | ```{r}
185 | ggplot(data = ny_inc) +
186 | # use fill and color to avoid gray boundary lines
187 | geom_sf(mapping = aes(fill = medincomeE, color = medincomeE)) +
188 | # add county borders
189 | geom_sf(data = ny_counties, color = "white", fill = NA) +
190 | # increase interpretability of graph
191 | scale_fill_continuous_sequential(
192 | palette = "viridis",
193 | rev = FALSE,
194 | aesthetics = c("fill", "color"),
195 | labels = label_dollar(),
196 | name = NULL
197 | ) +
198 | map_labels
199 | ```
--------------------------------------------------------------------------------
/Week 10/Week10-AE-RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Week10-AE-RMarkdown"
3 | output: html_document
4 | date: "2024-04-25"
5 | ---
6 |
7 | ```{r}
8 | library(tidyverse)
9 | #install.packages('ggmap')
10 | library(ggmap)
11 | ```
12 |
13 | Let’s first load a subset of 311 service requests in New York City. This subset includes 311 service requests related to Food Poisoning in commercial establishments (e.g. restaurants, cafeterias, food carts).
14 |
15 | ```{r}
16 | nyc_311 <- read_csv(file = "nyc-311.csv", show_col_types = FALSE)
17 | nyc_311
18 | ```
19 |
20 | # Store your Stadia Maps API key using the function
21 |
22 | ```{r}
23 | register_stadiamaps(key = "YOUR KEY HERE", write = TRUE)
24 | ```
25 |
26 | # Task 1. Obtain map tiles for New York City
27 |
28 | Use [bboxfinder.com](http://bboxfinder.com/) to find bounding box coordinates for New York City. Then, use `get_stadiamap()` to obtain map tiles for New York City and visualize the map.
29 |
30 | > Try using a `zoom` level of 11.
31 |
32 | ```{r}
33 | # store bounding box coordinates
34 | nyc_bb <- c(
35 | left = YOUR_CODE_HERE,
36 | bottom = YOUR_CODE_HERE,
37 | right = YOUR_CODE_HERE,
38 | top = YOUR_CODE_HERE
39 | )
40 |
41 | nyc <- get_stadiamap(
42 | bbox = nyc_bb,
43 | zoom = YOUR_CODE_HERE
44 | )
45 |
46 | # plot the raster map
47 | ggmap(nyc)
48 | ```
49 |
50 | # Task 2. Food poisoning rates
51 |
52 | The COVID-19 pandemic caused massive disruption in the restaurant industry. Due to social distancing measures and lockdowns, restaurant traffic decreased significantly.
53 |
54 | While this had significant financial ramifications, one potentially overlooked consequence is the impact on food poisoning rates. With fewer people eating out, the number of food poisoning complaints may have decreased.
55 |
56 | Visualize the geospatial distribution of complaints related to food poisoning in NYC in March, April, and May over a four-year time period (2018-21). Construct the chart in such a way that you can make valid comparisons over time and geographically. What impact did COVID-19 have on food poisoning cases in NYC? Did it vary geographically?
57 |
58 | ```{r}
59 | nyc_covid_food_poison <- nyc_311 |>
60 | # generate a year variable
61 | mutate(year = year(created_date)) |>
62 | # only keep reports in March, April, and May from 2018-21
63 | filter(month(created_date) %in% 3:5, year %in% 2018:2021)
64 |
65 | # YOUR CODE HERE
66 | ```
67 |
68 | # Task 3. Visualize food poisoning complaints on Roosevelt Island
69 |
70 | Now focus on food poisoning complaints on or around Roosevelt Island. Use `get_stadiamap()` to obtain map tiles for the Roosevelt Island region and overlay them with the food poisoning complaints. What type of chart is more effective for this task?
71 |
72 | > Consider adjusting your `zoom` for this geographic region.
73 |
74 | ```{r}
75 | # Obtain map tiles for Roosevelt Island
76 | roosevelt_bb <- c(
77 | left = YOUR_CODE_HERE,
78 | bottom = YOUR_CODE_HERE,
79 | right = YOUR_CODE_HERE,
80 | top = YOUR_CODE_HERE
81 | )
82 | roosevelt <- get_stadiamap(
83 | bbox = roosevelt_bb,
84 | zoom = YOUR_CODE_HERE
85 | )
86 |
87 | # YOUR CODE HERE
88 |
89 | ```
90 |
91 | # New York 2022 median household income
92 |
93 | We will use two data files for this analysis. The first contains median household incomes for each census tract in New York from 2022. The second contains the boundaries of each county in New York.
94 |
95 | ```{r}
96 | # useful on MacOS to speed up rendering of geom_sf() objects
97 | if (!identical(getOption("bitmapType"), "cairo") && isTRUE(capabilities()[["cairo"]])) {
98 | options(bitmapType = "cairo")
99 | }
100 |
101 | # create reusable labels for each plot
102 | map_labels <- labs(
103 | title = "Median household income in New York in 2022",
104 | subtitle = "By census tract",
105 | color = NULL,
106 | fill = NULL,
107 | caption = "Source: American Community Survey"
108 | )
109 |
110 | # load data
111 | ny_inc <- read_rds(file = "ny-inc.rds")
112 | ny_counties <- read_rds(file = "ny-counties.rds")
113 |
114 | ny_inc
115 | ```
116 |
117 | ```{r}
118 | ny_counties
119 | ```
120 |
121 | # Task 4. Draw a continuous choropleth of median household income
122 |
123 | Create a choropleth map of median household income in New York. Use a continuous color gradient to identify each tract’s median household income.
124 |
125 | > Use the stored `map_labels` to set the title, subtitle, and caption for this and the remaining plots.
126 |
127 | ```{r}
128 | # YOUR CODE HERE
129 |
130 |
131 | ```
132 |
133 | **OPTIONAL**: Use the `viridis` color palette.
134 | ```{r}
135 | # YOUR CODE HERE
136 |
137 |
138 | ```
139 |
140 | # Task 5. Overlay county borders
141 | To provide better context, overlay the NY county borders on the choropleth map.
142 |
143 | ```{r}
144 | # YOUR CODE HERE
145 |
146 |
147 | ```
--------------------------------------------------------------------------------
/Week 10/img/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/banner.png
--------------------------------------------------------------------------------
/Week 10/img/choropleth.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/choropleth.png
--------------------------------------------------------------------------------
/Week 10/img/geomsf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/geomsf.png
--------------------------------------------------------------------------------
/Week 10/img/shame.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/shame.png
--------------------------------------------------------------------------------
/Week 10/img/task1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/task1.png
--------------------------------------------------------------------------------
/Week 10/img/task2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/task2.png
--------------------------------------------------------------------------------
/Week 10/img/task3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/task3.png
--------------------------------------------------------------------------------
/Week 10/img/task4-optional.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/task4-optional.png
--------------------------------------------------------------------------------
/Week 10/img/task4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/task4.png
--------------------------------------------------------------------------------
/Week 10/img/task5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/img/task5.png
--------------------------------------------------------------------------------
/Week 10/ny-counties.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/ny-counties.rds
--------------------------------------------------------------------------------
/Week 10/ny-inc.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 10/ny-inc.rds
--------------------------------------------------------------------------------
/Week 11/Week11-AE-Notes.md:
--------------------------------------------------------------------------------
1 | # COMP4010/5120 - Week 11 Application Exercises
2 |
3 | ---
4 |
5 | # A. Application Exercises
6 |
7 | **Data:** [`ad_aqi_tracker_data-2023.csv`](./ad_aqi_tracker_data-2023.csv), [`freedom.csv`](./freedom.csv)
8 |
9 | ```R
10 | library(tidyverse)
11 | library(scales)
12 | library(janitor)
13 | library(colorspace)
14 |
15 | #install.packages('gganimate')
16 | library(gganimate)
17 |
18 | aqi_levels <- tribble(
19 | ~aqi_min, ~aqi_max, ~color, ~level,
20 | 0, 50, "#D8EEDA", "Good",
21 | 51, 100, "#F1E7D4", "Moderate",
22 | 101, 150, "#F8E4D8", "Unhealthy for sensitive groups",
23 | 151, 200, "#FEE2E1", "Unhealthy",
24 | 201, 300, "#F4E3F7", "Very unhealthy",
25 | 301, 400, "#F9D0D4", "Hazardous"
26 | )
27 |
28 | freedom <- read_csv(
29 | file = "freedom.csv",
30 | na = "-"
31 | )
32 |
33 | syr_2023 <- read_csv(file = "ad_aqi_tracker_data-2023.csv")
34 |
35 | syr_2023 <- syr_2023 |>
36 | clean_names() |>
37 | mutate(date = mdy(date))
38 | ```
39 | ---
40 |
41 | First, we would like to recreate this plot:
42 |
43 | 
44 |
45 | Initially, our plot looks like this:
46 |
47 | ```R
48 | syr_2023 |>
49 | ggplot(aes(x = date, y = aqi_value, group = 1)) +
50 | # plot the AQI in Syracuse
51 | geom_line(linewidth = 1, alpha = 0.5)
52 | ```
53 |
54 | ## Task 1. Add color shading to the plot based on the AQI guide. The color palette does not need to match the specific colors in the table yet.
55 |
56 | 
57 |
58 | ## Task 2. Use the hexadecimal colors from the dataset for the color palette.
59 |
60 | 
61 |
62 | ## Task 3. Label each AQI category on the chart
63 |
64 | Incorporate text labels for each AQI category directly into the graph. To accomplish this, you need to:
65 |
66 | - Calculate the midpoint AQI value for each category
67 | - Add a `geom_text()` layer to the plot with the category labels positioned at the midpoint AQI value on the y-axis
68 |
69 | Extend the range of the x-axis to provide more horizontal space for the AQI category labels without interfering with the trend line.
70 |
71 | 
72 |
73 | ## Task 4. Clean up the plot.
74 |
75 | Add a meaningful title, axis labels, caption, etc.
76 |
77 | 
78 |
79 | ---
80 |
81 | From this task onwards, we will be using the `freedom.csv` data to practice animating a bar chart with `gganimate`.
82 |
83 | Here is the data we will be using:
84 |
85 | ```R
86 | freedom_to_plot <- freedom |>
87 | # calculate rowwise standard deviations (one row per country)
88 | rowwise() |>
89 | mutate(sd = sd(c_across(contains("cl_")), na.rm = TRUE)) |>
90 | ungroup() |>
91 | # find the 15 countries with the highest standard deviations
92 | relocate(country, sd) |>
93 | slice_max(order_by = sd, n = 15) |>
94 | # only keep countries with complete observations - necessary for future plotting
95 | drop_na()
96 | freedom_to_plot
97 | ```
98 |
99 | ```R
100 | # calculate position rankings rather than raw scores
101 | freedom_ranked <- freedom_to_plot |>
102 | # only keep columns with civil liberties scores
103 | select(country, contains("cl_")) |>
104 | # wrangle the data to a long format
105 | pivot_longer(
106 | cols = -country,
107 | names_to = "year",
108 | values_to = "civil_liberty",
109 | names_prefix = "cl_",
110 | names_transform = list(year = as.numeric)
111 | ) |>
112 | # calculate rank within year - larger is worse, so reverse in the ranking
113 | group_by(year) |>
114 | mutate(rank_in_year = rank(-civil_liberty, ties.method = "first")) |>
115 | ungroup() |>
116 | # highlight Venezuela
117 | mutate(is_venezuela = if_else(country == "Venezuela", TRUE, FALSE))
118 | freedom_ranked
119 | ```
120 | ## Task 5. Fill in the code given to create a faceted bar plot by year.
121 |
122 | 
123 |
124 | ## Task 6. Fill in the code given to turn the facet plot into an animation showing the data by year.
125 |
126 | 
127 |
128 | ```R
129 | # smoother transition - might take a while to render
130 | animate(freedom_bar_race, nframes = 300, fps = 100, start_pause = 10, end_pause = 10)
131 | ```
132 |
133 | 
--------------------------------------------------------------------------------
/Week 11/Week11-AE-RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Week11-AE-RMarkdown"
3 | output:
4 | pdf_document: default
5 | html_document: default
6 | date: "2024-04-25"
7 | ---
8 |
9 | ```{r}
10 | library(tidyverse)
11 | library(scales)
12 | library(janitor)
13 | library(colorspace)
14 |
15 | #install.packages('gganimate')
16 | library(gganimate)
17 |
18 | aqi_levels <- tribble(
19 | ~aqi_min, ~aqi_max, ~color, ~level,
20 | 0, 50, "#D8EEDA", "Good",
21 | 51, 100, "#F1E7D4", "Moderate",
22 | 101, 150, "#F8E4D8", "Unhealthy for sensitive groups",
23 | 151, 200, "#FEE2E1", "Unhealthy",
24 | 201, 300, "#F4E3F7", "Very unhealthy",
25 | 301, 400, "#F9D0D4", "Hazardous"
26 | )
27 |
28 |
29 | # Data is in a wide format
30 | # pr_* - political rights, scores 1-7 (1 highest degree of freedom, 7 the lowest)
31 | # cl_* - civil liberties, scores 1-7 (1 highest degree of freedom, 7 the lowest)
32 | # status_* - freedom status (Free, Partly Free, and Not Free)
33 |
34 | freedom <- read_csv(
35 | file = "freedom.csv",
36 | na = "-"
37 | )
38 |
39 | syr_2023 <- read_csv(file = "ad_aqi_tracker_data-2023.csv")
40 |
41 | syr_2023 <- syr_2023 |>
42 | clean_names() |>
43 | mutate(date = mdy(date))
44 | ```
45 |
46 | ```{r}
47 | syr_2023 |>
48 | ggplot(aes(x = date, y = aqi_value, group = 1)) +
49 | # plot the AQI in Syracuse
50 | geom_line(linewidth = 1, alpha = 0.5)
51 | ```
52 |
53 | # Task 1. Add color shading to the plot based on the AQI guide. The color palette does not need to match the specific colors in the table yet.
54 | ```{r}
55 | syr_plot <- syr_2023 |>
56 | ggplot(aes(x = date, y = aqi_value, group = 1)) +
57 | # shade in background with colors based on AQI guide
58 | geom_rect(
59 | data = ...,
60 | aes(
61 | ymin = ..., ymax = ...,
62 | xmin = ..., xmax = ...,
63 | x = ..., y = ..., fill = color
64 | )
65 | ) +
66 | # plot the AQI in Syracuse
67 | geom_line(linewidth = 1, alpha = 0.5)
68 |
69 | syr_plot
70 | ```
71 |
72 | # Task 2. Use the hexadecimal colors from the dataset for the color palette.
73 |
74 | ```{r}
75 | syr_plot_colored <- syr_plot +
76 | ...()
77 |
78 | syr_plot_colored
79 | ```
80 |
81 | # Task 3. Label each AQI category on the chart
82 | Incorporate text labels for each AQI category directly into the graph. To accomplish this, you need to:
83 |
84 | - Calculate the midpoint AQI value for each category
85 | - Add a `geom_text()` layer to the plot with the category labels positioned at the midpoint AQI value on the y-axis
86 |
87 | Extend the range of the x-axis to provide more horizontal space for the AQI category labels without interfering with the trend line.
88 |
89 | ```{r}
90 | aqi_levels <- aqi_levels |>
91 | mutate(aqi_mid = ((aqi_min + aqi_max) / 2))
92 |
93 | syr_plot_labelled <- syr_plot_colored +
94 | scale_x_date(
95 | name = NULL, date_labels = "%b %Y",
96 | limits = c(ymd("2023-01-01"), ymd("2024-03-01"))
97 | ) +
98 | # add text labels for each AQI category
99 | geom_text(
100 | data = ...,
101 | aes(x = ..., y = aqi_mid, label = level),
102 | hjust = 1, size = 6, fontface = "bold", color = "white",
103 | family = "Atkinson Hyperlegible"
104 | )
105 |
106 | syr_plot_labelled
107 | ```
108 |
109 | # Task 4. Clean up the plot.
110 |
111 | Add a meaningful title, axis labels, caption, etc.
112 | ```{r}
113 |
114 | syr_plot_labelled <- syr_plot_labelled +
115 | theme_minimal() +
116 | ...()
117 |
118 | syr_plot_labelled
119 | ```
120 |
121 | ---
122 |
123 | From this task onwards, we will be using the `freedom.csv` data.
124 |
125 | ```{r}
126 | freedom_to_plot <- freedom |>
127 | # calculate rowwise standard deviations (one row per country)
128 | rowwise() |>
129 | mutate(sd = sd(c_across(contains("cl_")), na.rm = TRUE)) |>
130 | ungroup() |>
131 | # find the 15 countries with the highest standard deviations
132 | relocate(country, sd) |>
133 | slice_max(order_by = sd, n = 15) |>
134 | # only keep countries with complete observations - necessary for future plotting
135 | drop_na()
136 | freedom_to_plot
137 | ```
138 |
139 | ```{r}
140 | # calculate position rankings rather than raw scores
141 | freedom_ranked <- freedom_to_plot |>
142 | # only keep columns with civil liberties scores
143 | select(country, contains("cl_")) |>
144 | # wrangle the data to a long format
145 | pivot_longer(
146 | cols = -country,
147 | names_to = "year",
148 | values_to = "civil_liberty",
149 | names_prefix = "cl_",
150 | names_transform = list(year = as.numeric)
151 | ) |>
152 | # calculate rank within year - larger is worse, so reverse in the ranking
153 | group_by(year) |>
154 | mutate(rank_in_year = rank(-civil_liberty, ties.method = "first")) |>
155 | ungroup() |>
156 | # highlight Venezuela
157 | mutate(is_venezuela = if_else(country == "Venezuela", TRUE, FALSE))
158 | freedom_ranked
159 | ```
160 |
161 | # Task 5. Fill in the code below to create a faceted bar plot by year.
162 |
163 | ```{r}
164 | freedom_faceted_plot <- freedom_ranked |>
165 | # civil liberty vs freedom rank
166 | ggplot(aes(x = ..., y = ..., fill = ...)) +
167 | geom_col(show.legend = FALSE) +
168 | # change the color palette for emphasis of Venezuela
169 | scale_fill_manual(values = c("gray", "red")) +
170 | # facet by year
171 | facet_wrap(vars(year)) +
172 | # create explicit labels for civil liberties score,
173 | # leaving room for country text labels
174 | scale_x_continuous(
175 | limits = ...,
176 | breaks = ...
177 | ) +
178 | geom_text(
179 | hjust = "...",
180 | aes(label = ...),
181 | x = ...
182 | ) +
183 | # remove extraneous theme/label components
184 | theme(
185 | panel.grid.major.y = element_blank(),
186 | panel.grid.minor = element_blank(),
187 | axis.text.y = element_blank()
188 | ) +
189 | labs(x = NULL, y = NULL)
190 |
191 | freedom_faceted_plot
192 | ```
193 | # Task 6. Fill in the code below to turn the facet plot into an animation showing the data by year.
194 |
195 | ```{r}
196 | # If your code generates a bunch of png's instead of showing an animated object:
197 | # Install this:
198 | #install.packages("magick")
199 | # And restart your R session
200 | ```
201 |
202 | ```{r}
203 | freedom_bar_race <- freedom_faceted_plot +
204 | # remove faceting
205 | facet_null() +
206 | # label the current year in the top corner of the plot
207 | geom_text(
208 | x = ..., y = ...,
209 | hjust = "...",
210 | aes(label = ...),
211 | size = ...
212 | ) +
213 | # define group structure for transitions
214 | aes(group = ...) +
215 | # temporal transition - ensure integer value for labeling
216 | transition_time(...) +
217 | labs(
218 | title = "Civil liberties rating, {frame_time}",
219 | subtitle = "1: Highest degree of freedom - 7: Lowest degree of freedom"
220 | )
221 |
222 | # basic transition
223 | animate(freedom_bar_race, nframes = 30, fps = 2)
224 | ```
225 |
226 | ```{r}
227 | # smoother transition - might take a while to render
228 | animate(freedom_bar_race, nframes = 300, fps = 100, start_pause = 10, end_pause = 10)
229 | ```
230 |
231 |
--------------------------------------------------------------------------------
/Week 11/img/syracuse.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/syracuse.png
--------------------------------------------------------------------------------
/Week 11/img/task0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/task0.png
--------------------------------------------------------------------------------
/Week 11/img/task1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/task1.png
--------------------------------------------------------------------------------
/Week 11/img/task2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/task2.png
--------------------------------------------------------------------------------
/Week 11/img/task3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/task3.png
--------------------------------------------------------------------------------
/Week 11/img/task4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/task4.png
--------------------------------------------------------------------------------
/Week 11/img/task5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/task5.png
--------------------------------------------------------------------------------
/Week 11/img/task6-1.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/task6-1.gif
--------------------------------------------------------------------------------
/Week 11/img/task6-2.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 11/img/task6-2.gif
--------------------------------------------------------------------------------
/Week 12/Week12-AE-RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Iris & Penguins Dataset Dashboard"
3 | author: "Your Name"
4 | output:
5 | html_document:
6 | runtime: shiny
7 | ---
8 |
9 | ```{r setup, include=FALSE}
10 | library(shiny)
11 | library(dplyr)
12 | library(ggplot2)
13 | library(plotly)
14 | library(palmerpenguins) #install.packages('palmerpenguins')
15 | library(bslib) #install.packages("bslib")
16 | library(bsicons)
17 | data(iris)
18 | penguins <- penguins_raw
19 | ```
20 |
21 | # Starting a Shiny app
22 |
23 | ```{r}
24 | # Define UI for the Shiny app
25 | ui <- fluidPage(
26 | titlePanel("Iris and Penguin Dataset Dashboard")
27 | )
28 |
29 | # Define server logic for the Shiny app
30 | server <- function(input, output) {}
31 |
32 | # Run the application
33 | shinyApp(ui = ui, server = server)
34 | ```
35 |
36 | # Adding tabs
37 |
38 | ```{r}
39 | # Define UI for the Shiny app
40 | ui <- fluidPage(
41 | titlePanel("Iris and Penguin Dataset Dashboard"),
42 | tabsetPanel(
43 | tabPanel("Iris Dataset"),
44 | tabPanel("Penguin Dataset")
45 | )
46 | )
47 |
48 | # Define server logic for the Shiny app
49 | server <- function(input, output) {}
50 |
51 | # Run the application
52 | shinyApp(ui = ui, server = server)
53 | ```
54 |
55 |
56 | # Adding layout and controls for Iris
57 |
58 | ```{r}
59 | # Define UI for the Shiny app
60 | ui <- fluidPage(
61 | titlePanel("Iris and Penguin Dataset Dashboard"),
62 | tabsetPanel(
63 | tabPanel("Iris Dataset",
64 | sidebarLayout(
65 | sidebarPanel(
66 | selectInput("iris_xvar", "X-axis variable", choices = names(iris)[1:4]),
67 | selectInput("iris_yvar", "Y-axis variable", choices = names(iris)[1:4], selected = names(iris)[[2]]),
68 | selectInput("iris_species", "Species", choices = unique(iris$Species), multiple = TRUE, selected = unique(iris$Species))
69 | ),
70 | mainPanel(
71 | plotlyOutput("iris_scatterPlot"),
72 | tableOutput("iris_dataTable")
73 | )
74 | )
75 | ),
76 | tabPanel("Penguin Dataset")
77 | )
78 | )
79 |
80 | # Define server logic for the Shiny app
81 | server <- function(input, output) {}
82 |
83 | # Run the application
84 | shinyApp(ui = ui, server = server)
85 | ```
86 |
87 | # Adding layout and controls for Penguins
88 |
89 | ```{r}
90 | # Define UI for the Shiny app
91 | ui <- fluidPage(
92 | titlePanel("Iris and Penguin Dataset Dashboard"),
93 | tabsetPanel(
94 | tabPanel("Iris Dataset",
95 | sidebarLayout(
96 | sidebarPanel(
97 | selectInput("iris_xvar", "X-axis variable", choices = names(iris)[1:4]),
98 | selectInput("iris_yvar", "Y-axis variable", choices = names(iris)[1:4], selected = names(iris)[[2]]),
99 | selectInput("iris_species", "Species", choices = unique(iris$Species), multiple = TRUE, selected = unique(iris$Species))
100 | ),
101 | mainPanel(
102 | plotlyOutput("iris_scatterPlot"),
103 | tableOutput("iris_dataTable")
104 | )
105 | )
106 | ),
107 | tabPanel("Penguin Dataset",
108 | sidebarLayout(
109 | sidebarPanel(
110 | selectInput("penguin_xvar", "X-axis variable", choices = names(penguins)[3:6]),
111 | selectInput("penguin_species", "Species", choices = unique(penguins$Species), multiple = TRUE, selected = unique(penguins$Species))
112 | ),
113 | mainPanel(
114 | plotlyOutput("penguin_barPlot"),
115 | tableOutput("penguin_dataTable")
116 | )
117 | )
118 | )
119 | )
120 | )
121 |
122 | # Define server logic for the Shiny app
123 | server <- function(input, output) {}
124 |
125 | # Run the application
126 | shinyApp(ui = ui, server = server)
127 |
128 | ```
129 |
130 | # Adding reactive filtering for Iris
131 |
132 | ```{r}
133 | # Define UI for the Shiny app
134 | ui <- fluidPage(
135 | titlePanel("Iris and Penguin Dataset Dashboard"),
136 | tabsetPanel(
137 | tabPanel("Iris Dataset",
138 | sidebarLayout(
139 | sidebarPanel(
140 | selectInput("iris_xvar", "X-axis variable", choices = names(iris)[1:4]),
141 | selectInput("iris_yvar", "Y-axis variable", choices = names(iris)[1:4], selected = names(iris)[[2]]),
142 | selectInput("iris_species", "Species", choices = unique(iris$Species), multiple = TRUE, selected = unique(iris$Species))
143 | ),
144 | mainPanel(
145 | plotlyOutput("iris_scatterPlot"),
146 | tableOutput("iris_dataTable")
147 | )
148 | )
149 | ),
150 | tabPanel("Penguin Dataset",
151 | sidebarLayout(
152 | sidebarPanel(
153 | selectInput("penguin_xvar", "X-axis variable", choices = names(penguins)[3:6]),
154 | selectInput("penguin_species", "Species", choices = unique(penguins$Species), multiple = TRUE, selected = unique(penguins$Species))
155 | ),
156 | mainPanel(
157 | plotlyOutput("penguin_barPlot"),
158 | tableOutput("penguin_dataTable")
159 | )
160 | )
161 | )
162 | )
163 | )
164 |
165 | # Define server logic for the Shiny app
166 | server <- function(input, output) {
167 | # Iris dataset
168 | filteredIrisData <- reactive({
169 | iris %>%
170 | filter(Species %in% input$iris_species)
171 | })
172 |
173 | output$iris_scatterPlot <- renderPlotly({
174 | p <- ggplot(filteredIrisData(), aes_string(x = input$iris_xvar, y = input$iris_yvar, color = "Species", text = "Species")) +
175 | geom_point() +
176 | theme_minimal() +
177 | labs(title = "Scatter Plot of Iris Dataset", x = input$iris_xvar, y = input$iris_yvar)
178 |
179 | ggplotly(p, tooltip = c("x", "y", "Species"))
180 | })
181 |
182 | output$iris_dataTable <- renderTable({
183 | filteredIrisData()
184 | })
185 | }
186 |
187 | # Run the application
188 | shinyApp(ui = ui, server = server)
189 |
190 | ```
191 |
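Note: `aes_string()` as used above is soft-deprecated in current ggplot2; the tidy-eval `.data` pronoun is the recommended replacement. Below is a minimal, non-evaluated sketch of the same scatter-plot mapping. It assumes the `filteredIrisData()` reactive and the input IDs defined above, and would replace the corresponding `output$iris_scatterPlot` block inside `server`:

```{r, eval=FALSE}
# Sketch only: drop-in replacement for the output$iris_scatterPlot block above,
# avoiding the soft-deprecated aes_string()
output$iris_scatterPlot <- renderPlotly({
  p <- ggplot(
    filteredIrisData(),
    aes(
      x = .data[[input$iris_xvar]],
      y = .data[[input$iris_yvar]],
      color = Species,
      text = Species
    )
  ) +
    geom_point() +
    theme_minimal() +
    labs(
      title = "Scatter Plot of Iris Dataset",
      x = input$iris_xvar, y = input$iris_yvar
    )

  ggplotly(p, tooltip = c("x", "y", "Species"))
})
```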
192 | # Adding reactive filtering for Penguins
193 |
194 | ```{r}
195 | # Define UI for the Shiny app
196 | ui <- fluidPage(
197 | titlePanel("Iris and Penguin Dataset Dashboard"),
198 | tabsetPanel(
199 | tabPanel("Iris Dataset",
200 | sidebarLayout(
201 | sidebarPanel(
202 | selectInput("iris_xvar", "X-axis variable", choices = names(iris)[1:4]),
203 | selectInput("iris_yvar", "Y-axis variable", choices = names(iris)[1:4], selected = names(iris)[[2]]),
204 | selectInput("iris_species", "Species", choices = unique(iris$Species), multiple = TRUE, selected = unique(iris$Species))
205 | ),
206 | mainPanel(
207 | plotlyOutput("iris_scatterPlot"),
208 | tableOutput("iris_dataTable")
209 | )
210 | )
211 | ),
212 | tabPanel("Penguin Dataset",
213 | sidebarLayout(
214 | sidebarPanel(
215 | selectInput("penguin_xvar", "X-axis variable", choices = names(penguins)[3:6]),
216 | selectInput("penguin_species", "Species", choices = unique(penguins$Species), multiple = TRUE, selected = unique(penguins$Species))
217 | ),
218 | mainPanel(
219 | plotlyOutput("penguin_barPlot"),
220 | tableOutput("penguin_dataTable")
221 | )
222 | )
223 | )
224 | )
225 | )
226 |
227 | # Define server logic for the Shiny app
228 | server <- function(input, output) {
229 | # Iris dataset
230 | filteredIrisData <- reactive({
231 | iris %>%
232 | filter(Species %in% input$iris_species)
233 | })
234 |
235 | output$iris_scatterPlot <- renderPlotly({
236 | p <- ggplot(filteredIrisData(), aes_string(x = input$iris_xvar, y = input$iris_yvar, color = "Species", text = "Species")) +
237 | geom_point() +
238 | theme_minimal() +
239 | labs(title = "Scatter Plot of Iris Dataset", x = input$iris_xvar, y = input$iris_yvar)
240 |
241 | ggplotly(p, tooltip = c("x", "y", "Species"))
242 | })
243 |
244 | output$iris_dataTable <- renderTable({
245 | filteredIrisData()
246 | })
247 |
248 | # Penguin dataset
249 | filteredPenguinData <- reactive({
250 | penguins %>%
251 | filter(Species %in% input$penguin_species)
252 | })
253 |
254 | output$penguin_barPlot <- renderPlotly({
255 | p <- ggplot(filteredPenguinData(), aes_string(x = input$penguin_xvar, fill = "Species", text = "Species")) +
256 | geom_bar(position = "dodge") +
257 | theme_minimal() +
258 | labs(title = "Bar Plot of Penguin Dataset", x = input$penguin_xvar)
259 |
260 | ggplotly(p, tooltip = c("x", "y", "Species"))
261 | })
262 |
263 | output$penguin_dataTable <- renderTable({
264 | filteredPenguinData()
265 | })
266 | }
267 |
268 | # Run the application
269 | shinyApp(ui = ui, server = server)
270 |
271 | ```
272 |
273 | # Final version using bslib and bsicon for decorations
274 |
275 | ```{r}
276 | # Define UI for the Shiny app
277 | ui <- fluidPage(
278 | theme = bs_theme(
279 | version = 5,
280 | bootswatch = "flatly", # You can choose different themes like "cosmo", "cerulean", etc.
281 | primary = "#2c3e50", # Customize colors as needed
282 | secondary = "#18bc9c"
283 | ),
284 | titlePanel("Iris and Penguin Dataset Dashboard"),
285 | tabsetPanel(
286 | tabPanel("Iris Dataset",
287 | sidebarLayout(
288 | sidebarPanel(
289 | selectInput("iris_xvar", "X-axis variable", choices = names(iris)[1:4]),
290 | selectInput("iris_yvar", "Y-axis variable", choices = names(iris)[1:4], selected = names(iris)[[2]]),
291 | selectInput("iris_species", "Species", choices = unique(iris$Species), multiple = TRUE, selected = unique(iris$Species)),
292 | value_box(
293 | title = "Iris Dataset",
294 | value = "Data about some flowers",
295 | showcase = bs_icon("database-fill-check"),
296 | p("Flowers", bs_icon("flower3")),
297 | p("and stuff", bs_icon("emoji-smile"))
298 | )
299 | ),
300 | mainPanel(
301 | plotlyOutput("iris_scatterPlot"),
302 | tableOutput("iris_dataTable")
303 |
304 | )
305 | )
306 | ),
307 | tabPanel("Penguin Dataset",
308 | sidebarLayout(
309 | sidebarPanel(
310 | selectInput("penguin_xvar", "X-axis variable", choices = names(penguins)[3:6]),
311 | selectInput("penguin_species", "Species", choices = unique(penguins$Species), multiple = TRUE, selected = unique(penguins$Species))
312 | ),
313 | mainPanel(
314 | plotlyOutput("penguin_barPlot"),
315 | tableOutput("penguin_dataTable")
316 | )
317 | )
318 | )
319 | )
320 | )
321 |
322 | # Define server logic for the Shiny app
323 | server <- function(input, output) {
324 | # Iris dataset
325 | filteredIrisData <- reactive({
326 | iris %>%
327 | filter(Species %in% input$iris_species)
328 | })
329 |
330 | output$iris_scatterPlot <- renderPlotly({
331 | p <- ggplot(filteredIrisData(), aes_string(x = input$iris_xvar, y = input$iris_yvar, color = "Species", text = "Species")) +
332 | geom_point() +
333 | theme_minimal() +
334 | labs(title = "Scatter Plot of Iris Dataset", x = input$iris_xvar, y = input$iris_yvar)
335 |
336 | ggplotly(p, tooltip = c("x", "y", "Species"))
337 | })
338 |
339 | output$iris_dataTable <- renderTable({
340 | filteredIrisData()
341 | })
342 |
343 | # Penguin dataset
344 | filteredPenguinData <- reactive({
345 | penguins %>%
346 | filter(Species %in% input$penguin_species)
347 | })
348 |
349 | output$penguin_barPlot <- renderPlotly({
350 | p <- ggplot(filteredPenguinData(), aes_string(x = input$penguin_xvar, fill = "Species", text = "Species")) +
351 | geom_bar(position = "dodge") +
352 | theme_minimal() +
353 | labs(title = "Bar Plot of Penguin Dataset", x = input$penguin_xvar)
354 |
355 | ggplotly(p, tooltip = c("x", "y", "Species"))
356 | })
357 |
358 | output$penguin_dataTable <- renderTable({
359 | filteredPenguinData()
360 | })
361 | }
362 |
363 | # Run the application
364 | shinyApp(ui = ui, server = server)
365 |
366 |
367 | ```
368 |
--------------------------------------------------------------------------------
/Week 12/Week12-Lecture-Examples.Rmd:
--------------------------------------------------------------------------------
1 | # Plotly
2 |
3 | ```{r}
4 | library(gapminder)
 5 | library(tidyverse) # provides ggplot2 and dplyr (filter) used below
6 | #install.packages('plotly')
7 | library(plotly)
8 |
9 | gapminder_2007 <- filter(
10 | gapminder,
11 | year == 2007
12 | )
13 |
14 | my_plot <- ggplot(
15 | data = gapminder_2007,
16 | mapping = aes(
17 | x = gdpPercap, y = lifeExp,
18 | color = continent
19 | )
20 | ) +
21 | geom_point() +
22 | scale_x_log10() +
23 | theme_minimal()
24 |
25 | ggplotly(my_plot)
26 | ```
27 |
28 | # Plotly tooltips
29 | ```{r}
30 | my_plot <- ggplot(
31 | data = gapminder_2007,
32 | mapping = aes(
33 | x = gdpPercap, y = lifeExp,
34 | color = continent
35 | )
36 | ) +
37 | geom_point(aes(text = country)) +
38 | scale_x_log10() +
39 | theme_minimal()
40 |
41 | ggplotly(
42 | my_plot,
43 | tooltip = "text"
44 | )
45 | ```
46 |
47 |
48 | # Plotly with bars
49 | ```{r}
50 | car_hist <- ggplot(
51 | data = mpg,
52 | mapping = aes(x = hwy)
53 | ) +
54 | geom_histogram(
55 | binwidth = 2,
56 | boundary = 0,
57 | color = "white"
58 | )
59 |
60 | ggplotly(car_hist)
61 | ```
62 |
63 | # Creating interactive World Bank plot
64 | ```{r}
65 | library(tidyverse)
66 | library(WDI)
67 | library(scales)
68 | library(plotly)
69 | library(colorspace)
70 | #install.packages('ggbeeswarm')
71 | library(ggbeeswarm)
72 |
73 | # get World Bank indicators
74 | indicators <- c(
75 | population = "SP.POP.TOTL",
76 | prop_women_parl = "SG.GEN.PARL.ZS",
77 | gdp_per_cap = "NY.GDP.PCAP.KD"
78 | )
79 |
80 | wdi_parl_raw <- WDI(
81 | country = "all", indicators, extra = TRUE,
82 | start = 2022, end = 2022
83 | )
84 |
85 | # keep actual "economies" (this may take a while to run)
86 | wdi_clean <- wdi_parl_raw |>
87 | filter(region != "Aggregates")
88 | glimpse(wdi_clean)
89 | ```
90 |
91 | ```{r}
92 | wdi_2022 <- wdi_clean |>
93 | filter(year == 2022) |>
94 | drop_na(prop_women_parl) |>
95 | # Scale this down from 0-100 to 0-1 so that scales::label_percent() can format
96 | # it as an actual percent
97 | mutate(prop_women_parl = prop_women_parl / 100)
98 |
99 | static_plot <- ggplot(
100 | data = wdi_2022,
101 | mapping = aes(y = fct_rev(region), x = prop_women_parl, color = region)
102 | ) +
103 | geom_quasirandom() +
104 | scale_x_continuous(labels = label_percent()) +
105 | scale_color_discrete_qualitative(guide = "none") +
106 | labs(x = "% women in parliament", y = NULL, caption = "Source: The World Bank") +
107 | theme_bw(base_size = 14)
108 | static_plot
109 | ```
110 | Make it interactive
111 |
112 | ```{r}
113 | ggplotly(static_plot)
114 | ```
115 |
116 | Modifying the tooltip
117 |
118 | ```{r}
119 | static_plot_tooltip <- ggplot(
120 | data = wdi_2022,
121 | mapping = aes(y = fct_rev(region), x = prop_women_parl, color = region)
122 | ) +
123 | geom_quasirandom(
124 | mapping = aes(text = country)
125 | ) +
126 | scale_x_continuous(labels = label_percent()) +
127 | scale_color_discrete_qualitative() +
128 | labs(x = "% women in parliament", y = NULL, caption = "Source: The World Bank") +
129 | theme_bw(base_size = 14) +
130 | theme(legend.position = "none")
131 |
132 | ggplotly(static_plot_tooltip,
133 | tooltip = "text"
134 | )
135 | ```
136 |
137 |
138 | Creating custom tooltips with the format
139 |
140 | ```
141 | Name of country
142 | X% women in parliament
143 | ```
144 |
145 | Generate a new column using `mutate()` with the required character string
146 |
147 | - `str_glue()` to combine character strings with data values
148 | - The `<br>` tag is HTML code for a line break
149 | - Use the `label_percent()` function to format numbers as percents
150 |
151 | ```{r}
152 | wdi_2022 <- wdi_clean |>
153 | filter(year == 2022) |>
154 | drop_na(prop_women_parl) |>
155 | mutate(
156 | prop_women_parl = prop_women_parl / 100,
157 | fancy_label = str_glue("{country}<br>{label_percent(accuracy = 0.1)(prop_women_parl)} women in parliament")
158 | )
159 | wdi_2022 |>
160 | select(country, prop_women_parl, fancy_label) |>
161 | head()
162 | ```
163 |
164 |
165 | ```{r}
166 | static_plot_tooltip_fancy <- ggplot(
167 | data = wdi_2022,
168 | mapping = aes(y = fct_rev(region), x = prop_women_parl, color = region)
169 | ) +
170 | geom_quasirandom(
171 | mapping = aes(text = fancy_label)
172 | ) +
173 | scale_x_continuous(labels = label_percent()) +
174 | scale_color_discrete_qualitative() +
175 | labs(x = "% women in parliament", y = NULL, caption = "Source: The World Bank") +
176 | theme_bw(base_size = 14) +
177 | theme(legend.position = "none")
178 |
179 | ggplotly(static_plot_tooltip_fancy,
180 | tooltip = "text"
181 | )
182 | ```
--------------------------------------------------------------------------------
/Week 2/Week2-AE-Notes.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # COMP4010/5120 - Week 2 Application Exercises
4 |
5 | ---
6 |
7 | # Submission requirements:
8 |
9 | - For each week, you will need to submit **TWO** items to Canvas: Your R Markdown file in `.Rmd` and `.pdf` formats. You can export `.Rmd` to `.pdf` by using the Knit functionality. If you’re not sure how to do that please check out this video: [Tutorial on how to Knit R Markdown](https://youtu.be/8eBBPVMwTLo?si=93Vo8OOApf0vAYYH).
10 | - In the R Markdown file please provide your answers to the provided exercises (both theory and programming questions).
11 | - Answers to theory questions should be included in markdown as plain text. Answers to programming questions should be included in executable R chunks (more information in **Reading Material Section A.2.4**).
12 |
13 | ---
14 |
15 | # A. Application Exercises
16 |
17 | ## Task 1. Visualizing the house sales data as a lollipop chart
18 |
19 | Following the content from Section B.3, plot the mean area of houses by decade built using a [lollipop chart](https://www.data-to-viz.com/graph/lollipop.html).
20 |
21 | **Hint:** Think about which **geometric primitives** you need to create a lollipop chart. Refer to [this handy cheatsheet](https://rstudio.github.io/cheatsheets/html/data-visualization.html) and find the geoms you need.
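If it helps to see the geoms in isolation, here is a minimal sketch using the built-in `mpg` dataset (not the homework data): a lollipop is just a `geom_segment()` for the stick plus a `geom_point()` for the head.

```R
library(tidyverse)

# Mean highway mileage per vehicle class (a stand-in for the homework summary)
mean_hwy_class <- mpg |>
  group_by(class) |>
  summarize(mean_hwy = mean(hwy))

ggplot(mean_hwy_class, aes(x = mean_hwy, y = class)) +
  geom_segment(aes(x = 0, xend = mean_hwy, yend = class)) +  # the "stick"
  geom_point(size = 4) +                                     # the "pop"
  labs(x = "Mean highway MPG", y = "Vehicle class")
```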
22 |
23 | Your plot should look something (may not need to be exactly) like this:
24 |
25 | 
26 |
27 | ## Task 2. Visualizing the distribution of the number of bedrooms
28 |
29 | Your task is to visualize the distribution of the number of bedrooms. To simplify the task, let’s collapse the variable beds into a smaller number of categories and drop rows with missing values for this variable.
30 | ```R
31 | df_bed <- df |>
32 | mutate(beds = factor(beds) |>
33 | fct_collapse(
34 | "5+" = c("5", "6", "7", "9")
35 | )) |>
36 | drop_na(beds)
37 | ```
38 | Since the number of bedrooms is effectively a categorical variable, we should select a geom appropriate for a single categorical variable.
39 | Create a bar chart visualizing the distribution of the number of bedrooms in the properties sold. Your plot should look something (may not need to be exactly) like this:
40 |
41 | 
42 |
43 | ## Task 3. Visualizing the distribution of the number of bedrooms by the decade in which the property was built
44 | Now let’s visualize the distribution of the number of bedrooms by the decade in which the property was built. We will still use a bar chart but also color-code the bar segments for each decade. Now we have a few variations to consider.
45 |
46 | - [**Stacked bar chart**](https://datavizproject.com/data-type/stacked-bar-chart/) - each bar segment represents a frequency count, and segments are stacked vertically on top of each other.
47 | - [**Dodged/Grouped bar chart**](https://chartio.com/learn/charts/grouped-bar-chart-complete-guide/) - each bar segment represents a frequency count, and segments are placed side by side within each decade. This gives every segment a common origin, or baseline value of 0.
48 | - [**Relative frequency bar chart**](https://www.researchgate.net/figure/Relative-frequency-bar-chart-of-settlement-classification-per-cluster-according-to-the_fig4_330763944) - each bar segment represents the relative frequency (proportion) of each category within each decade.
49 |
50 | Generate each form of the bar chart and compare the differences. Which one do you think is the most informative?
51 | *Hint:* Read the documentation for [geom_bar()](https://ggplot2.tidyverse.org/reference/geom_bar.html) to identify an appropriate argument for specifying each type of bar chart.
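To illustrate the mechanism with the built-in `mpg` dataset (not the house data), the three variants differ only in the `position` argument of `geom_bar()`:

```R
library(tidyverse)

# Stacked (default), dodged, and relative-frequency bar charts of drivetrain by class
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar(position = "stack")
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar(position = "dodge")
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar(position = "fill")  # proportions within each class
```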
52 |
53 | Your plot should look something (may not need to be exactly) like this:
54 |
55 | **Stacked bar chart**
56 |
57 | 
58 |
59 | **Dodged bar chart**
60 |
61 | 
62 |
63 | **Relative frequency bar chart**
64 |
65 | 
66 |
67 | ## Task 4. Visualizing the distribution of property size by decades
68 |
69 | Now let’s evaluate the typical property size (`area`) by the decade in which the property was built. We will start by summarizing the data and then visualize the results using a bar chart and a boxplot.
70 |
71 | ```R
72 | mean_area_decade <- df |>
73 | group_by(decade_built_cat) |>
74 | summarize(mean_area = mean(area))
75 | ```
76 |
77 | Visualize the property size by the decade in which the property was built. Construct a bar chart reporting the average property size, as well as a [boxplot](https://r-graph-gallery.com/boxplot), [violin plot](https://r-graph-gallery.com/violin), and [strip chart](https://diametrical.co.uk/products/quickchart/advanced-charts/strip-charts/) (e.g. jittered scatterplot). What does each graph tell you about the distribution of property size by decade built? Which ones do you find to be more or less effective?
78 |
79 | *Bonus*: [XKCD on Violin Plots](https://xkcd.com/1967/)
80 |
81 | Your plot should look something (may not need to be exactly) like this:
82 |
83 | **Bar chart (mean area by decade built)**
84 | 
85 |
86 | **Box plot (area by decade built)**
87 | 
88 |
89 | **Violin plot (area by decade built)**
90 | 
91 |
92 | **Strip chart (area by decade built)**
93 | 
94 |
95 |
96 | # B. Reading Material
97 |
98 | ## 1. Hall of ~~Fame~~ Shame
99 |
100 | Learning from others' mistakes is immensely beneficial, as it provides tangible examples of what not to do in your own visualizations.
101 | The "Hall of Shame" is a curated collection of poorly executed visualizations that serve as cautionary examples of what to avoid in your own work. By examining these visualizations, we'll gain valuable insights into common pitfalls, ineffective techniques, and misleading representations. The "Hall of Shame" is not intended to ridicule or criticize, but rather to foster critical thinking, inspire improvement, and deepen our understanding of effective data communication.
102 |
103 | For our first one, let's look at the most common sin of data viz: pie charts!
104 |
105 | 
106 |
107 | Here are some reasons why using a pie chart in this situation is not ideal:
108 |
109 | - Difficulty comparing slices: As mentioned earlier, the human eye is not good at accurately judging the relative sizes of pie chart slices, especially for more than a few slices. This makes it difficult to compare the percentage of each country or region in this chart. For instance, it’s hard to tell at a glance which is larger, the slice for China or the one for “Other”.
110 |
111 | - Limited data points: Pie charts are most effective when there are only a few data points to compare. This chart has six slices, which can make it visually complex and difficult to interpret.
112 |
113 | A better alternative for this data visualization would be a bar chart. A bar chart would display the percentages for each country or region along a horizontal or vertical axis, making it easier to compare the magnitudes and identify the highest or lowest percentages.
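As a sketch of that alternative, with made-up share values (the chart's exact numbers are not reproduced here):

```R
library(tidyverse)

# Hypothetical shares for illustration only -- not the values from the chart above
shares <- tibble(
  region = c("China", "United States", "India", "Japan", "Germany", "Other"),
  pct = c(28, 24, 7, 6, 4, 31)
)

ggplot(shares, aes(x = pct, y = fct_reorder(region, pct))) +
  geom_col() +
  labs(x = "Share (%)", y = NULL, title = "The same data as a horizontal bar chart")
```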
114 |
115 | ## 2. Data-Ink Ratio: Less is More in Data Visualization
116 |
117 | The data-ink ratio is a key principle in data visualization, emphasizing the importance of maximizing the proportion of ink used to represent the actual data compared to the total ink used in the entire visualization. This concept, championed by Edward Tufte, encourages creators to focus on essential elements that convey the data's message effectively.
118 |
119 | 
120 |
121 | Why is a high data-ink ratio crucial in modern data viz designs?
122 |
123 | - Clarity and Focus: By minimizing unnecessary visual elements like excessive decorations or overly complex chart structures, viewers can easily grasp the data's key points without being distracted by extraneous information. This leads to a clearer understanding of the message being conveyed.
124 | - Efficiency and Impact: A high data-ink ratio promotes efficient use of visual space, allowing for the presentation of more data within the same area. This maximizes the impact of the visualization, enabling viewers to quickly extract insights from the data.
125 | - Professionalism and Aesthetics: A focus on essential elements often translates to a cleaner and more professional visual design. This aesthetic simplicity fosters a sense of trust and reliability in the data presented.
126 |
127 | 
128 |
129 | While not a strict quantitative measure, the data-ink ratio serves as a valuable guiding principle for data visualization professionals. By striving for a high data-ink ratio, creators can ensure their visualizations are clear, concise, and impactful, allowing viewers to effectively understand and utilize the presented information.
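In `ggplot2` terms, raising the data-ink ratio usually means removing non-data elements through the theme system; a minimal sketch:

```R
library(tidyverse)

# Start from a default scatterplot, then strip non-data ink via the theme system
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_minimal() +                       # drops the grey panel and outer box
  theme(
    panel.grid.minor = element_blank(),   # remove minor grid lines
    panel.grid.major.x = element_blank()  # keep only the horizontal guides
  )
```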
130 |
131 | ## 3. House Sales Data
132 | For the following exercises we will work with data on houses that were sold in the state of New York (USA) in 2022 and 2023.
133 |
134 | The variables include:
135 |
136 | - `property_type` - type of property (e.g. single family residential, townhouse, condo)
137 | - `address` - street address of property
138 | - `city` - city of property
139 | - `state` - state of property (all are New York)
140 | - `zip_code` - ZIP code of property
141 | - `price` - sale price (in dollars)
142 | - `beds` - number of bedrooms
143 | - `baths` - number of bathrooms. Full bathrooms with shower/toilet count as 1, bathrooms with just a toilet count as 0.5.
144 | - `area` - living area of the home (in square feet)
145 | - `lot_size` - size of property’s lot (in acres)
146 | - `year_built` - year home was built
147 | - `hoa_month` - monthly HOA dues. If the property is not part of an HOA, then the value is `NA`
148 |
149 | The dataset can be found in the repo for this week. It is called `homesales.csv`. We will import the data and create a new variable, `decade_built_cat`, which identifies the decade in which the home was built. It will include catch-all categories for any homes pre-1940 and post-1990.
150 |
151 | First, you can read the data `csv` in R with:
152 |
153 | ```R
154 | df <- read_csv("homesales.csv")
155 | ```
156 |
157 | ### 3.1. Average home size by decade
158 |
159 | Let’s examine the average size of homes recently sold by their age. To simplify this task, we will split the homes by decade of construction. It will include catch-all categories for any homes pre-1940 and post-1990. Then we will calculate the average size of homes sold by decade.
160 |
161 | ```R
162 | # create decade variable
163 | df <- df |>
164 | mutate(
165 | decade_built = (year_built %/% 10) * 10,
166 | decade_built_cat = case_when(
167 | decade_built <= 1940 ~ "1940 or before",
168 | decade_built >= 1990 ~ "1990 or after",
169 | .default = as.character(decade_built)
170 | )
171 | )
172 |
173 | # calculate mean area by decade
174 | mean_area_decade <- df |>
175 | group_by(decade_built_cat) |>
176 | summarize(mean_area = mean(area))
177 | mean_area_decade
178 | ```
179 |
180 | This code snippet is using the pipe operator (`|>`) in R along with the `mutate()` function from the `dplyr` package to create two new variables (`decade_built` and `decade_built_cat`) based on an existing variable `year_built` in the dataframe `df`.
181 |
182 | - The pipe operator (`|>`) passes the dataframe `df` into the next function, which is `mutate()`, making the dataframe the first argument of the `mutate()` function.
183 | - `mutate()`: This function from the `dplyr` package is used to create new variables or modify existing ones in the dataframe.
184 | - `decade_built = (year_built %/% 10) * 10`: This line calculates the decade in which each observation's `year_built` falls. It uses integer division of `year_built` by 10 (`year_built %/% 10`), which drops the final digit, and then multiplies the result by 10 to recover the decade. For example, if `year_built` is 1995, `(1995 %/% 10) * 10` results in 1990.
185 | - `decade_built_cat = case_when()`: This line creates a categorical variable `decade_built_cat` based on the decade information calculated in the previous step. It uses the `case_when()` function, which allows for conditional assignment based on multiple conditions.
186 | - `decade_built <= 1940 ~ "1940 or before"`: This condition checks if the decade_built is less than or equal to 1940. If true, it assigns the value `"1940 or before"` to the `decade_built_cat`.
187 | - `decade_built >= 1990 ~ "1990 or after"`: This condition checks if the decade_built is greater than or equal to 1990. If true, it assigns the value `"1990 or after"` to the `decade_built_cat`.
188 | - `.default = as.character(decade_built)`: This is the default condition. If the `decade_built` does not fall into either of the specified ranges, it converts the decade into a character and assigns it to `decade_built_cat`.
189 |
190 | ### 3.2. Visualizing the data as a bar chart
191 |
192 | A conventional approach to visualizing this data is a **bar chart**. Since we already calculated the average area, we can use `geom_col()` to create the bar chart. We also graph it horizontally to avoid overlapping labels for the decades.
193 |
194 | ```R
195 | ggplot(
196 | data = mean_area_decade,
197 | mapping = aes(x = mean_area, y = decade_built_cat)
198 | ) +
199 | geom_col() +
200 | labs(
201 | x = "Mean area (square feet)", y = "Decade built",
202 | title = "Mean area of houses, by decade built"
203 | )
204 | ```
205 |
206 | 
207 |
208 | ### 3.3. Visualizing the data as a dot plot
209 |
210 | The bar chart violates the data-ink ratio principle. The bars are not necessary to convey the information. We can use a **dot plot** instead. The dot plot is a variation of the bar chart, where the bars are replaced by dots. The dot plot is a (potentially) better choice because it uses less ink to convey the same information.
211 |
212 | ```R
213 | ggplot(
214 | data = mean_area_decade,
215 | mapping = aes(x = mean_area, y = decade_built_cat)
216 | ) +
217 | geom_point(size = 4) +
218 | labs(
219 | x = "Mean area (square feet)", y = "Decade built",
220 | title = "Mean area of houses, by decade built"
221 | )
222 | ```
223 |
224 | 
--------------------------------------------------------------------------------
/Week 2/Week2-AE-RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | ---
3 | ---
4 |
5 | # Week 2
6 |
7 | # Application Exercises
8 |
9 | Include `tidyverse`:
10 |
11 | ```{r}
12 | #install.packages("tidyverse")
13 | library(tidyverse)
14 | theme_set(theme_minimal())
15 | ```
16 |
17 | Read the data:
18 |
19 | ```{r}
20 | df <- read_csv("homesales.csv")
21 | ```
22 |
23 | Average home size by decade:
24 |
25 | ```{r}
26 | # create decade variable
27 | df <- df |>
28 | mutate(
29 | decade_built = (year_built %/% 10) * 10,
30 | decade_built_cat = case_when(
31 | decade_built <= 1940 ~ "1940 or before",
32 | decade_built >= 1990 ~ "1990 or after",
33 | .default = as.character(decade_built)
34 | )
35 | )
36 |
37 | # calculate mean area by decade
38 | mean_area_decade <- df |>
39 | group_by(decade_built_cat) |>
40 | summarize(mean_area = mean(area))
41 | mean_area_decade
42 | ```
43 |
44 | Visualizing the data as a bar chart:
45 |
46 | ```{r}
47 | ggplot(
48 | data = mean_area_decade,
49 | mapping = aes(x = mean_area, y = decade_built_cat)
50 | ) +
51 | geom_col() +
52 | labs(
53 | x = "Mean area (square feet)", y = "Decade built",
54 | title = "Mean area of houses, by decade built"
55 | )
56 | ```
57 |
58 | Visualizing the data as a dot plot:
59 |
60 | ```{r}
61 | ggplot(
62 | data = mean_area_decade,
63 | mapping = aes(x = mean_area, y = decade_built_cat)
64 | ) +
65 | geom_point(size = 4) +
66 | labs(
67 | x = "Mean area (square feet)", y = "Decade built",
68 | title = "Mean area of houses, by decade built"
69 | )
70 | ```
71 |
72 | ## TASK 1. Visualizing the data as a lollipop chart
73 |
74 | ```{r}
75 | # YOUR CODE HERE
76 |
77 |
78 |
79 |
80 | ```
81 |
82 | ## TASK 2. Visualizing the distribution of the number of bedrooms
83 |
84 | Collapse the variable `beds` into a smaller number of categories and drop rows with missing values for this variable:
85 |
86 | ```{r}
87 | df_bed <- df |>
88 | mutate(beds = factor(beds) |>
89 | fct_collapse(
90 | "5+" = c("5", "6", "7", "9")
91 | )) |>
92 | drop_na(beds)
93 | ```
94 |
95 | ```{r}
96 | # YOUR CODE HERE
97 |
98 |
99 |
100 |
101 | ```
102 |
103 | ## TASK 3. Visualizing the distribution of the number of bedrooms by the decade in which the property was built
104 |
105 | Stacked bar chart (number of bedrooms by the decade built):
106 |
107 | ```{r}
108 | # YOUR CODE HERE
109 |
110 |
111 |
112 |
113 | ```
114 |
115 | Dodged bar chart (number of bedrooms by the decade built):
116 |
117 | ```{r}
118 | # YOUR CODE HERE
119 |
120 |
121 |
122 |
123 | ```
124 |
125 | Relative frequency bar chart (number of bedrooms by the decade built):
126 |
127 | ```{r}
128 | # YOUR CODE HERE
129 |
130 |
131 |
132 |
133 | ```
134 |
135 | ## Task 4. Visualizing the distribution of property size by decades
136 |
137 | Getting mean of area of each decade category:
138 |
139 | ```{r}
140 | mean_area_decade <- df |>
141 | group_by(decade_built_cat) |>
142 | summarize(mean_area = mean(area))
143 | ```
144 |
145 | Bar chart (mean area by decade built):
146 |
147 | ```{r}
148 | # YOUR CODE HERE
149 |
150 |
151 |
152 |
153 | ```
154 |
155 | Box plot (area by decade built):
156 |
157 | ```{r}
158 | # YOUR CODE HERE
159 |
160 |
161 |
162 |
163 | ```
164 |
165 | Violin plot (area by decade built):
166 |
167 | ```{r}
168 | # YOUR CODE HERE
169 |
170 |
171 |
172 |
173 | ```
174 |
175 | Strip chart (area by decade built):
176 |
177 | ```{r}
178 | set.seed(4010)
179 |
180 | # YOUR CODE HERE
181 |
182 |
183 |
184 |
185 | ```
186 |
--------------------------------------------------------------------------------
/Week 2/Week2-AE-RMarkdown.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/Week2-AE-RMarkdown.pdf
--------------------------------------------------------------------------------
/Week 2/img/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/banner.png
--------------------------------------------------------------------------------
/Week 2/img/dataink.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/dataink.png
--------------------------------------------------------------------------------
/Week 2/img/dataink2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/dataink2.jpg
--------------------------------------------------------------------------------
/Week 2/img/housedata1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata1.png
--------------------------------------------------------------------------------
/Week 2/img/housedata10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata10.png
--------------------------------------------------------------------------------
/Week 2/img/housedata11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata11.png
--------------------------------------------------------------------------------
/Week 2/img/housedata2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata2.png
--------------------------------------------------------------------------------
/Week 2/img/housedata3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata3.png
--------------------------------------------------------------------------------
/Week 2/img/housedata4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata4.png
--------------------------------------------------------------------------------
/Week 2/img/housedata5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata5.png
--------------------------------------------------------------------------------
/Week 2/img/housedata6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata6.png
--------------------------------------------------------------------------------
/Week 2/img/housedata7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata7.png
--------------------------------------------------------------------------------
/Week 2/img/housedata8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata8.png
--------------------------------------------------------------------------------
/Week 2/img/housedata9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/housedata9.png
--------------------------------------------------------------------------------
/Week 2/img/piechart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 2/img/piechart.png
--------------------------------------------------------------------------------
/Week 4/Week4-AE-Notes.md:
--------------------------------------------------------------------------------
1 | 
2 | # COMP4010/5120 - Week 4 Application Exercises
3 | ---
4 |
5 | # A. Application Exercises
6 | Data source: https://github.com/imdevskp/covid_19_jhu_data_web_scrap_and_cleaning
7 |
8 | ## Task 1. COVID-19 `TotalCases` vs. `TotalDeaths`
9 | From `worldometer_data.csv`, create a plot of `TotalCases` vs. `TotalDeaths`. Describe the relationship between the two variables.
10 |
11 | 
12 |
13 | ## Task 2. Log scale
14 | Apply log scales to the plot for better '*readability*'. For these two variables, is the log scale beneficial for visualization? Briefly describe your opinion of the plot before and after applying the log scale.
15 |
16 | 
17 |
18 | ## Task 3. Using `ggwaffle`
19 | Refer to Section B.3.2. for information on how to install and use `ggwaffle` to create waffle charts. Use `ggwaffle` to create a waffle chart to demonstrate the distribution of continents in `worldometer_data.csv` from the `Continent` column. What information does this plot tell you about the continents?
20 |
21 | 
22 |
23 | ## Task 4. Tidying the waffle plot
24 | Adjust the waffle chart to use a fixed aspect ratio so the symbols are squares.
25 |
26 | 
27 |
28 | ## Task 5. Coloring the plot
29 | Use [`scale_fill_viridis_d(option='scale_name')`](https://sjspielman.github.io/introverse/articles/color_fill_scales.html) to color your plot. Be sure to have `tidyverse` loaded.
30 |
32 |
33 | 
34 |
35 | ## Task 6. Using icons
36 | Install the `emojifont` package with `install.packages("emojifont")` and import it. Use [`fontawesome`](https://cran.r-project.org/web/packages/emojifont/vignettes/emojifont.html#font-awesome) to get an icon/emoji and apply it to the waffle chart similar to the last example from [`ggwaffle doc`](https://liamgilbey.github.io/ggwaffle/).
37 |
38 |
39 | 
40 |
41 | ---
42 |
43 | # B. Reading Material
44 |
45 | ## 1. Hall of Shame
46 |
47 | 
48 |
49 | Infamous for its overuse in politics, the truncated y-axis is a classic way to visually mislead. Take a look at the graph above, comparing people with jobs to people on welfare. At first glance, the visual proportions suggest that people on welfare outnumber people with jobs roughly four to one. The numbers themselves, however, tell a far less sensational story than the visualization implies.
50 |
51 | This type of misinformation occurs when the graph's producers ignore convention and manipulate the y-axis. The conventional approach is to start the y-axis at 0 and extend it to the highest data point in the set. When the origin is not set at zero, small differences appear exaggerated and play on people's prejudices rather than their rationality.
52 | Build your own visualizations with a zero-baseline y-axis and watch out for truncated axes elsewhere. Sometimes these distortions are deliberate attempts to mislead readers; other times they are simply the result of not realizing how a non-zero baseline can skew the data.
53 |
54 | ## 2. Understanding and Applying Logarithmic Scales in Data Visualization
55 |
56 | The 2 following articles provide a very nice and intuitive explanation for log scales and its benefits and use cases:
57 | - [data-viz-workshop-2021 - Plotting using logarithmic scales](https://badriadhikari.github.io/data-viz-workshop-2021/log/)
58 | - [The Power of Logarithmic Scale](https://dataclaritycorp.com/the-power-of-logarithmic-scale/)
59 |
60 | Data visualization is a crucial aspect of interpreting complex datasets, and understanding different types of scales, including logarithmic scales, is vital. This tutorial aims to provide a clear understanding of logarithmic scales and their application in data visualization, helping you to make informed choices when displaying data.
61 |
62 | ### 2.1. Grasping the Concept of Log Scales
63 |
64 | Consider the 'turn-off display after' setting in electronic devices, where options increase from minutes to hours unevenly. This intuitive understanding of varying increments is at the core of log scales.
65 |
66 | **Visualizing distances**: A linear scale might not effectively compare distances from a central point to various cities if one distance is significantly greater than the others. Using a log scale in this scenario makes the comparison more balanced and informative.
67 |
68 | **Transforming Data for Better Interpretation**: With exponential data like population growth or pandemic spread, linear scales can be misleading. Log scales provide a more nuanced view, revealing trends and patterns that might be obscured on a linear scale.
69 | Exponential variables, such as wildfire spread or population growth, demonstrate rapid increases. These can be more accurately represented on a log scale, making it easier to understand their progression.
70 |
71 | ### 2.2. When to Use Logarithmic Scales
72 |
73 | **Exponential Data**: In cases of exponential growth or decay, where data increases or decreases rapidly, log scales can transform skewed distributions into more interpretable forms.
74 | Understanding the exponential equation (f(x) = ab^x) helps in recognizing scenarios where log scales are appropriate. Remember, log scales are less suitable for data that includes negative or zero values.
75 |
76 | **Comparing Rates of Change**: In scenarios like tracking stock prices or disease spread, log scales can highlight rates of change more effectively than linear scales. They provide a clearer view of how variables are evolving over time, especially when the changes are multiplicative rather than additive.
77 |
78 | **Resolving Skewness in Data Visualization**: When certain data points dominate because of their large magnitude, log scales can prevent this skewness, allowing for a more balanced view of all data points.
79 |
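In `ggplot2`, switching an axis to a logarithmic scale is a one-line change; a minimal sketch using the built-in `diamonds` dataset, whose `price` variable is heavily right-skewed:

```R
library(tidyverse)

# diamonds$price is right-skewed, so a linear axis hides most of the structure
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 50)

p                    # linear x-axis: tall bars on the left, a long thin tail on the right
p + scale_x_log10()  # log10 x-axis: the multiplicative structure of prices becomes visible
```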
80 | ### 2.3. How to Read and Interpret Logarithmic Scales
81 |
82 | **Identifying Logarithmic Scales**: Determine whether the graph is a semi-log (one axis is logarithmic) or log-log (both axes are logarithmic). Unevenly spaced grid lines are a hallmark of logarithmic scales.
83 | Understand the representation of values on a log scale: consistent multiplicative increases rather than additive ones. For instance, each tick mark could represent a tenfold increase.
84 |
85 | **Understanding Exponential Trends on Log Scales**: On a log scale, exponential trends appear as straight lines. This property can be used to identify exponential relationships within data and to compare different exponential trends.
86 |
87 | ### 2.4. Application in Scientific Research
88 |
89 | **Innate Understanding of Logarithmic Representation**: Studies suggest that humans may intuitively understand logarithmic scales, as they often perceive quantities in relative terms rather than absolute differences.
90 | This innate perception applies to various sensory experiences, such as brightness or sound, indicating the natural fit of log scales in representing certain types of data.
91 |
92 | **Interpreting Scientific Data**: In fields where data spans several orders of magnitude, such as in astronomy or microbiology, logarithmic scales are indispensable for meaningful visualization and comparison.
93 |
94 | ### To sum it up:
95 |
96 | Use logarithmic scales when dealing with multiplicative changes, exponential growth or decay, and data spanning several orders of magnitude. Avoid logarithmic scales for data that includes negative values, zeroes, or when linear relationships are more appropriate. Remember, the choice of scale can significantly affect the interpretation of data. By understanding and applying logarithmic scales appropriately, you can uncover patterns and insights that might otherwise remain hidden in your data visualizations.
97 |
98 | ## 3. Waffle charts (yum!)
99 |
100 | Waffle charts are an innovative way to visualize data, offering a unique and engaging alternative to traditional pie or bar charts. They are particularly effective for displaying part-to-whole relationships and for making comparisons between categories. This tutorial will guide you through the fundamentals of waffle charts, their construction, and their effective use in data visualization.
101 |
102 | 
103 |
104 | ### 3.1. Understanding Waffle Charts
105 |
106 | **What is a Waffle Chart?** A waffle chart is a grid-based visual representation where each cell in the grid represents a small, equal portion of the whole. It’s akin to a square pie chart but offers a more granular view.
107 | Ideal for representing percentages or proportions, waffle charts can effectively communicate data in a more relatable and visually appealing way.
108 |
109 | **Comparative Strengths and Weaknesses**:
110 |
111 | - Strengths: Waffle charts are excellent for illustrating small differences, which might be overlooked in pie charts. They are visually engaging and easy to interpret.
112 | - Weaknesses: Not suitable for large datasets or for displaying changes over time. They can become cluttered and less effective if too many categories are included.
113 |
114 | ### 3.2. Creating a Waffle Chart in R using `ggwaffle`:
115 |
116 | Waffle charts can be effectively created in R using the `ggwaffle` package, which is part of the `ggplot2` ecosystem. This package simplifies the process of creating waffle charts while providing flexibility to customize the charts.
117 |
118 | Install and import the `ggwaffle` package:
119 |
120 | ```R
121 | devtools::install_github("liamgilbey/ggwaffle")
122 | library(ggwaffle)
123 | ```
124 |
125 | From [https://liamgilbey.github.io/ggwaffle/](https://liamgilbey.github.io/ggwaffle/):
126 |
127 | > `ggwaffle` heavily relies on the usage of `ggplot2`. Much like standard `ggplot` graphs, waffle charts are created by adding layers to a base graphic. Because of the inner mechanisms of `ggplot2`, some of the necessary data transformations have to be completed outside of a standard plot creation. The function `waffle_iron` has been added to help with issue.
128 | > `ggwaffle` also introduces a column mapping function, `aes_d`. At this stage I have no idea of how useful this is outside the context of the package, but it seemed a nice way to specify dynamic column renaming. `aes_d` is obviously coined from ggplot’s aes function and has a very similar idea. Here we are mapping column names to feed into a function so they can be renamed for used appropriately.
129 |
130 | ```R
131 | waffle_data <- waffle_iron(mpg, aes_d(group = class))
132 |
133 | ggplot(waffle_data, aes(x, y, fill = group)) +
134 | geom_waffle()
135 | ```
136 |
137 | 
138 |
139 | In short, you first need to prepare the data with `waffle_iron()` and then plot it with `ggplot2` and `geom_waffle()`. See the [documentation for `waffle_iron()`](https://rdrr.io/github/liamgilbey/ggwaffle/man/waffle_iron.html) for more details.
140 |
141 | Let's practice with the `mtcars` dataset by creating a waffle chart grouped by the number of cylinders in each car. Note that `x` and `y` here are columns created automatically by `waffle_iron()` based on the grouping we requested.
142 |
143 | ```R
144 | cyl_data <- waffle_iron(mtcars, aes_d(group = cyl))
145 |
146 | # Creating the waffle chart
147 | ggplot(cyl_data, mapping = aes(x, y, fill = group)) +
148 | geom_waffle() +
149 | ggtitle("Distribution of Cars by Cylinder Count in mtcars Dataset")
150 | ```
151 |
152 | 
153 |
154 | Right now it's treating our `cyl` column as a continuous numerical variable, but we want to treat the number of cylinders as discrete categories. Let's convert it to a factor and see what happens.
155 |
156 | ```R
157 | cyl_data <- waffle_iron(mtcars, aes_d(group = cyl))
158 |
159 | # Creating the waffle chart
160 | ggplot(cyl_data, mapping = aes(x, y, fill = as.factor(group))) +
161 | geom_waffle() +
162 | ggtitle("Distribution of Cars by Cylinder Count in mtcars Dataset")
163 | ```
164 |
165 | 
166 |
167 | Much better! Let's change the theme to the default waffle theme so it looks more like a proper waffle chart.
168 |
169 | ```R
170 | cyl_data <- waffle_iron(mtcars, aes_d(group = cyl))
171 |
172 | # Creating the waffle chart
173 | ggplot(cyl_data, mapping = aes(x, y, fill = as.factor(group))) +
174 | geom_waffle() +
175 | theme_waffle() +
176 | ggtitle("Distribution of Cars by Cylinder Count in mtcars Dataset")
177 | ```
178 |
179 | 
180 |
181 | For more customization, please refer to this handy [quick start documentation for `ggwaffle`](https://liamgilbey.github.io/ggwaffle/).
182 |
183 | ### 3.3. Best Practices for Clarity and Accuracy:
184 |
185 | - **Keep it simple**: Limit the number of categories to ensure the chart is easy to read and interpret.
186 | - **Consistent scale**: Ensure that each cell represents the same value across all categories for accurate comparison.
187 | - **Use color effectively**: Choose distinct, contrasting colors for different categories to enhance readability.
188 | - **Labels**: Include clear labels and a legend to convey the meaning of each color and category. This enhances the user's understanding of the chart. If necessary, include annotations or tooltips for more detailed explanations.
189 | - **Interactivity**: In digital formats, make your waffle charts interactive. Hover-over effects can display exact values or additional information, adding depth to the data presented.
190 |
191 | ### 3.4. Use cases
192 |
193 | Specific examples:
194 | - Displaying election results, survey data, or market share.
195 | - Comparing demographic data like age or income distribution.
196 | - Visualizing progress towards a goal, like fundraising or project milestones.
197 |
198 | ### To sum it up:
199 |
200 | Waffle charts are ideal when you want to emphasize the composition of a dataset and when dealing with a limited number of categories.
201 | Consider the nature of your data, the audience, and the context of your presentation before deciding if a waffle chart is the most effective tool for your visualization needs.
202 | Waffle charts, with their straightforward design and ease of interpretation, can be a powerful tool in your data visualization arsenal. By applying the principles outlined in this tutorial, you can enhance your presentations and make complex data more accessible and engaging for your audience.
203 |
--------------------------------------------------------------------------------
/Week 4/img/Truncated-Y-Axis-Data-Visualizations-Designed-To-Mislead.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/Truncated-Y-Axis-Data-Visualizations-Designed-To-Mislead.jpg
--------------------------------------------------------------------------------
/Week 4/img/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/banner.png
--------------------------------------------------------------------------------
/Week 4/img/task1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/task1.jpg
--------------------------------------------------------------------------------
/Week 4/img/task2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/task2.jpg
--------------------------------------------------------------------------------
/Week 4/img/task3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/task3.jpg
--------------------------------------------------------------------------------
/Week 4/img/task4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/task4.jpg
--------------------------------------------------------------------------------
/Week 4/img/task5.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/task5.jpg
--------------------------------------------------------------------------------
/Week 4/img/task6.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/task6.jpg
--------------------------------------------------------------------------------
/Week 4/img/waffle1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/waffle1.jpg
--------------------------------------------------------------------------------
/Week 4/img/waffle2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/waffle2.png
--------------------------------------------------------------------------------
/Week 4/img/waffle3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/waffle3.jpg
--------------------------------------------------------------------------------
/Week 4/img/waffle4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/waffle4.jpg
--------------------------------------------------------------------------------
/Week 4/img/waffle5.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 4/img/waffle5.jpg
--------------------------------------------------------------------------------
/Week 5/Week5-AE-Notes.md:
--------------------------------------------------------------------------------
1 | 
2 | # COMP4010/5120 - Week 5 Application Exercises
3 | ---
4 |
5 | # A. Application Exercises
6 | Suppose we have a fictional dataset of enrolment at VinUni from 2105 to 2125. We are tasked with creating a series of statistical charts to be presented on the organization's official media channels. Due to [**organizational branding**](https://www.paradigmmarketinganddesign.com/what-is-organizational-branding/) requirements, we have to design our charts to follow the organization's visual language.
7 |
8 | > "Organizational branding is one of the most often misunderstood concepts in marketing and design. Many make the mistake of relegating it to simply the logo or signature colors of a business or organization. But a brand is so much more than that. It is one of, if not the most valuable asset you possess. While it is not always easily defined, it is almost always easily recognized. It is how your audience perceives you and is created by a combination of your reputation, identity, and voice. In other words, organizational branding is the mechanism used to shape, cultivate, and evolve your brand. It involves many elements, including your name, logo, tagline, website, colors, collateral, messaging, positioning, graphic elements, social media, and other outreach platforms." - [What is Organizational Branding?](https://www.paradigmmarketinganddesign.com/what-is-organizational-branding/)
9 |
10 | VinUni, for instance, has comprehensive [brand identity guidelines](https://policy.vinuni.edu.vn/all-policies/brand-identity-manual/). Their design center offers clear instructions on using the VinUni logo, color scheme, and fonts, along with downloadable resources like a PowerPoint template.
11 |
12 | Let's say you want to develop a set of statistical charts for a VinUni presentation in a specific course, and you want to ensure they are both consistent with the university's branding and can be easily replicated. You can use the `ggplot2` package to create a custom theme that aligns with VinUni's brand identity.
13 |
14 | ## Task 1. Simple dodged bar chart for enrolment by college
15 |
16 | From `college_data_normalized.csv`, create a dodged bar chart of `pct` for each `college` with fill by `year`. Only show data from 2120 onwards.
17 |
18 | 
19 |
20 | ## Task 2. Customize with brand identity
21 |
22 | With the [`showtext`](https://cran.rstudio.com/web/packages/showtext/vignettes/introduction.html) library (`install.packages('showtext')`) and the [`ggplot2` theme documentation](https://ggplot2.tidyverse.org/reference/theme.html), customize the chart with corresponding fonts and color schemes.
23 |
24 | You can select any organizational branding scheme you like; the following examples use VinUni's branding.
25 |
26 | 
27 |
28 | ## Task 3. Make customization reusable by creating a function
29 |
30 | Create an R function for your customizations which can be reused for future charts. For example, with the base chart:
31 | `college_plot <- ggplot(...)`, and a function called `theme_vinuni`, we can simply apply the theme to the chart by doing:
32 |
33 | ```R
34 | college_plot +
35 | theme_vinuni()
36 | ```
37 |
38 | Create the same plot as Task 2, but using a function to apply your theme. **Note**: You don't need to include the color palette in your function. That can be done outside of the function using `scale_fill_manual` or `scale_color_manual`.
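A minimal sketch of what such a function could look like (the specific `theme()` settings are only examples, and `my_palette` stands for whatever color vector you define):

```R
# Minimal sketch: wrap a base theme plus your theme() overrides in a reusable function
theme_vinuni <- function(base_size = 11, base_family = "Montserrat") {
  theme_minimal(base_family = base_family, base_size = base_size) +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      plot.subtitle = element_text(hjust = 0.5),
      legend.position = "bottom",
      panel.grid.minor = element_blank()
    )
}

# Usage:
# college_plot + scale_fill_manual(values = my_palette) + theme_vinuni()
```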
39 |
40 | ## Task 4. Create a themed multiple line graph
41 |
42 | Create a multiple line graph that shows the percentage of enrolled students for each college from 2105 to 2125, and apply the theme you created previously. Experiment with the color palette: pick colors that represent the brand you have chosen and support effective visualization.
43 |
44 | 
45 |
46 |
47 | ---
48 |
49 | # B. Reading Material
50 |
51 | ## 1. Hall of Shame
52 |
53 | 
54 |
55 | > The pie chart in the above image shows the percentage of Americans who have tried marijuana in three different years. A pie chart is meant to show the percentages of a whole at a single point in time. Because of this, the audience may mistakenly read the visualization as showing the following:
56 | > - All the people participating in the survey tried marijuana.
57 | > - 51 percent of the population tried marijuana today.
58 | > - 43 percent of them tried it last year.
59 | > - 34 percent of them tried marijuana in 1997.
60 | > However, the reality is entirely different. The above pie chart shows data from three different surveys. The graph is trying to show that
61 | > - Today, 51 percent of the total population has tried marijuana. 49 percent of them haven’t.
62 | > - Last year, 43 percent of the total population tried marijuana. 57 percent of them didn’t.
63 | > - In 1997, only 34 percent of the total population tried marijuana. 66 percent of them didn’t.
64 | > Thus, the above data visualization is using graphic forms in inappropriate ways to distort the data.
65 |
66 | Source: [Codeconquest](https://www.codeconquest.com/blog/12-bad-data-visualization-examples-explained/)
67 |
68 | ## 2. Customizing `ggplot2`
69 |
70 | This week, we will delve into customizing themes in `ggplot2`, an R package widely used for creating elegant data visualizations. Specifically, we will focus on how to recreate a specific plot with customized themes. The customization involves font adjustments using the `showtext` package and applying unique color palettes based on different branding guidelines.
71 |
72 | ### Step 1: Setting Up the Font
73 |
74 | Select a font according to your chosen branding scheme. For VinUni, that is Montserrat, a versatile and modern sans-serif typeface. To use it, we first need to load the `showtext` library and add the Montserrat font.
75 |
76 | ```R
77 | library(showtext)
78 | font_add_google("Montserrat", "Montserrat")
79 | showtext_auto()
80 | ```
81 |
82 | ### Step 2: Defining Color Palettes
83 |
84 | Our customization involves two color palettes based on branding guidelines from VinUni. Let's define these palettes in R:
85 |
86 | ```R
87 | # VinUni color palette
88 | vinuni_palette_main <- c("#35426e", "#d2ae6d", "#c83538", "#2e548a")
89 | vinuni_palette_accents <- c("#5cc6d0", "#a7c4d2", "#d2d3d5", "#4890bd", "#0087c3", "#d2ae6d")
90 | ```
91 |
92 | ### Step 3: Creating the Plot with Custom Themes
93 |
94 | Assuming you have a plot object named `college_plot` (as in the application exercises above), we can customize its theme as follows:
95 |
96 | ```R
97 | main_font = "Montserrat"
98 |
99 | college_plot +
100 | scale_fill_manual(values = vinuni_palette_accents) +
101 | theme_minimal(
102 | base_family = main_font,
103 | base_size = 11
104 | ) +
105 | theme(
106 | plot.title.position = "plot",
107 | plot.title = element_text(hjust = 0.5, face="bold", colour = vinuni_palette_main[3]),
108 | plot.subtitle = element_text(hjust = 0.5, face="bold"),
109 | legend.position = "bottom",
110 | panel.grid.major.x = element_blank(),
111 | panel.grid.minor.x = element_blank(),
112 | panel.grid.major.y = element_line(color = "grey", linewidth = 0.2),
113 | panel.grid.minor.y = element_blank(),
114 | axis.text = element_text(size = rel(1.0)),
115 | axis.text.x = element_text(face="bold", colour = vinuni_palette_main[1]),
116 | axis.text.y = element_text(face="bold", colour = vinuni_palette_main[1]),
117 | legend.text = element_text(size = rel(0.9))
118 | )
119 | ```
120 |
121 | ### Step 4: Finalizing the Plot
122 |
123 | Finally, to ensure the `showtext` settings are applied correctly, we need to call `showtext_auto()` again at the end of the script.
--------------------------------------------------------------------------------
/Week 5/college_data_normalized.csv:
--------------------------------------------------------------------------------
1 | year,college,pct
2 | 2105,CBM,0.598425
3 | 2106,CBM,0.6615047779779688
4 | 2107,CBM,0.3971112533138313
5 | 2108,CBM,0.41129442916793973
6 | 2109,CBM,0.4251509105144849
7 | 2110,CBM,0.43469596320899334
8 | 2111,CBM,0.42034424444598883
9 | 2112,CBM,0.4281175536139794
10 | 2113,CBM,0.434933723243606
11 | 2114,CBM,0.4274814637646496
12 | 2115,CBM,0.4368950555476089
13 | 2116,CBM,0.45135
14 | 2117,CBM,0.4224771553436631
15 | 2118,CBM,0.4020581168683984
16 | 2119,CBM,0.3827852384477107
17 | 2120,CBM,0.3396724986988029
18 | 2121,CBM,0.3363424403658756
19 | 2122,CBM,0.3318401649092207
20 | 2123,CBM,0.3014243451343038
21 | 2124,CBM,0.2848104801538449
22 | 2125,CBM,0.2708468062389401
23 | 2105,CAS,0.0000000000000000
24 | 2106,CAS,0.0000000000000000
25 | 2107,CAS,0.2802358533686809
26 | 2108,CAS,0.2880130359507078
27 | 2109,CAS,0.2777006036420579
28 | 2110,CAS,0.2843893714869698
29 | 2111,CAS,0.2966386359950888
30 | 2112,CAS,0.3150025697332152
31 | 2113,CAS,0.3112730410604281
32 | 2114,CAS,0.2922028222913179
33 | 2115,CAS,0.2437896342917084
34 | 2116,CAS,0.2183250000000002
35 | 2117,CAS,0.2292163289630512
36 | 2118,CAS,0.2471780225758193
37 | 2119,CAS,0.2621973141588241
38 | 2120,CAS,0.2780758297633823
39 | 2121,CAS,0.2759211653813196
40 | 2122,CAS,0.2591968603821454
41 | 2123,CAS,0.2455672890421426
42 | 2124,CAS,0.2567813349718876
43 | 2125,CAS,0.2634276664873313
44 | 2105,CECS,0.221725000000000
45 | 2106,CECS,0.172349305539717
46 | 2107,CECS,0.106522534052472
47 | 2108,CECS,0.082951420714940
48 | 2109,CECS,0.081150708458565
49 | 2110,CECS,0.056642820643842
50 | 2111,CECS,0.063937730210577
51 | 2112,CECS,0.062374433490632
52 | 2113,CECS,0.066156906924902
53 | 2114,CECS,0.072518536235350
54 | 2115,CECS,0.081986363419634
55 | 2116,CECS,0.104075000000000
56 | 2117,CECS,0.139327572506952
57 | 2118,CECS,0.173330613355093
58 | 2119,CECS,0.189468118853759
59 | 2120,CECS,0.224646674940945
60 | 2121,CECS,0.247962376198162
61 | 2122,CECS,0.264568302544993
62 | 2123,CECS,0.300523108873258
63 | 2124,CECS,0.304196364150193
64 | 2125,CECS,0.315810182777259
65 | 2105,CHS,0.1776000000000000
66 | 2106,CHS,0.1620179259698497
67 | 2107,CHS,0.2161303592650151
68 | 2108,CHS,0.2177411141664120
69 | 2109,CHS,0.2159977773848913
70 | 2110,CHS,0.2242718446601941
71 | 2111,CHS,0.2190793893483448
72 | 2112,CHS,0.1945054431621735
73 | 2113,CHS,0.1876363287710635
74 | 2114,CHS,0.2077971777086821
75 | 2115,CHS,0.2373289467410482
76 | 2116,CHS,0.2262500000000000
77 | 2117,CHS,0.2089789431863329
78 | 2118,CHS,0.1774332472006890
79 | 2119,CHS,0.1655493285397060
80 | 2120,CHS,0.1576049965968691
81 | 2121,CHS,0.1397740180546422
82 | 2122,CHS,0.1443946721636406
83 | 2123,CHS,0.1524852569502948
84 | 2124,CHS,0.1542118207240741
85 | 2125,CHS,0.1499153444964689
86 |
--------------------------------------------------------------------------------
/Week 5/demo.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Untitled"
3 | output: html_document
4 | date: "2024-03-19"
5 | ---
6 |
7 | ```{r}
8 | library(tidyverse)
9 | library(scales)
10 |
11 | # Load the data from the CSV file
12 | data <- read_csv("college_data_normalized.csv")
13 |
14 | # Filter the data for years 2120 onwards
15 | filtered_data <- data %>%
16 | filter(year >= 2120)
17 |
18 | # Create a dodged bar chart
19 | college_plot <- ggplot(filtered_data, aes(x = college, y = pct, fill = as.factor(year))) +
20 | geom_col(position = position_dodge2(padding = 0.2)) +
21 | scale_y_continuous(labels = label_percent()) +
22 | labs(
23 | title = "Percentage of Enrolled Students by College",
24 | subtitle = "From 2120 onwards",
25 | x = NULL,
26 | y = NULL,
27 | fill = NULL
28 | )
29 |
30 | college_plot
31 | ```
32 |
33 |
34 | ```{r}
35 |
36 | #install.packages('showtext')
37 | library(showtext)
38 | font_add_google("Montserrat", "Montserrat")
39 |
40 | showtext_auto()
41 |
42 | main_font = "Montserrat"
43 |
44 | # vinuni color palette - accent colors
45 | # based on branding guideline (page 13) - https://policy.vinuni.edu.vn/all-policies/brand-identity-manual/
46 | vinuni_palette_main <- c("#35426e", "#d2ae6d", "#c83538", "#2e548a")
47 | vinuni_palette_accents <- c( "#5cc6d0", "#a7c4d2", "#d2d3d5", "#4890bd", "#0087c3", "#d2ae6d")
48 |
49 | #maybe try google palette: https://partnermarketinghub.withgoogle.com/brands/google-assistant/overview/color-palette-and-typography/
50 | google_palette_main <- c("#EA4335", "#4285F4", "#FBBC04", "#34A853")
51 |
52 | #or MIT color palette: https://brand.mit.edu/color
53 | mit_palette_main <- c("#750014", "#8b959e", "#ff1423", "#000000")
54 |
55 | college_plot +
56 | scale_fill_manual(values = vinuni_palette_accents) +
57 | theme_minimal(
58 | base_family = main_font,
59 | base_size = 11
60 | ) +
61 | theme(
62 | plot.title.position = "plot",
63 | plot.title = element_text(hjust = 0.5, face="bold", colour = vinuni_palette_main[3]),
64 | plot.subtitle = element_text(hjust = 0.5, face="bold"),
65 | legend.position = "bottom",
66 | panel.grid.major.x = element_blank(),
67 | panel.grid.minor.x = element_blank(),
68 | panel.grid.major.y = element_line(color = "grey", linewidth = 0.2),
69 | panel.grid.minor.y = element_blank(),
70 | axis.text = element_text(size = rel(1.0)),
71 | axis.text.x = element_text(face="bold", colour = vinuni_palette_main[1]),
72 | axis.text.y = element_text(face="bold", colour = vinuni_palette_main[1]),
73 | legend.text = element_text(size = rel(0.9))
74 | )
75 |
76 | showtext_auto()
77 | ```
78 |
79 |
80 | ```{r}
81 |
82 | # Saving our theme as a function
83 | theme_vinuni <- function(base_size = 11, base_family = main_font,
84 | base_line_size = base_size / 22,
85 | base_rect_size = base_size / 22) {
86 | # Base our theme on minimal theme
87 | theme_minimal(
88 | base_family = base_family,
89 | base_size = base_size,
90 | base_line_size = base_line_size,
91 | base_rect_size = base_rect_size
92 | ) +
93 | theme(
94 | plot.title.position = "plot",
95 | plot.title = element_text(hjust = 0.5, face="bold", colour = vinuni_palette_main[3]),
96 | plot.subtitle = element_text(hjust = 0.5, face="bold"),
97 | legend.position = "bottom",
98 | panel.grid.major.x = element_blank(),
99 | panel.grid.minor.x = element_blank(),
100 | panel.grid.major.y = element_line(color = "#bbbbbb", linewidth = 0.2),
101 | panel.grid.minor.y = element_blank(),
102 | axis.text = element_text(size = rel(1.0)),
103 | axis.text.x = element_text(face="bold", colour = vinuni_palette_main[1]),
104 | axis.text.y = element_text(face="bold", colour = vinuni_palette_main[1]),
105 | legend.text = element_text(size = rel(0.9))
106 | )
107 | }
108 | ```
109 |
110 | ```{r}
111 | college_plot +
112 | scale_fill_manual(values = vinuni_palette_accents) +
113 | theme_vinuni()
114 | ```
115 |
116 | ```{r}
117 | data |>
118 | mutate(
119 | college = fct_reorder2(.f = college, .x = year, .y = pct)
120 | ) |>
121 | ggplot(aes(x = year, y = pct, color = college)) +
122 | geom_point() +
123 | geom_line() +
124 | scale_x_continuous(limits = c(2105, 2125), breaks = seq(2105, 2125, 5)) +
125 | scale_y_continuous(labels = label_percent()) +
126 | labs(
127 | x = "Year",
128 | y = "Percent of students admitted",
129 | color = "College",
130 | title = "Percentage of Enrolled Students by College",
131 | ) +
132 | theme_vinuni() +
133 | theme(legend.position = "right") +
134 | scale_color_manual(values = vinuni_palette_main)
135 | ```
--------------------------------------------------------------------------------
/Week 5/img/ae1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 5/img/ae1.png
--------------------------------------------------------------------------------
/Week 5/img/ae2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 5/img/ae2.png
--------------------------------------------------------------------------------
/Week 5/img/ae3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 5/img/ae3.png
--------------------------------------------------------------------------------
/Week 5/img/ae4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 5/img/ae4.png
--------------------------------------------------------------------------------
/Week 5/img/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 5/img/banner.png
--------------------------------------------------------------------------------
/Week 5/img/shame.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 5/img/shame.jpg
--------------------------------------------------------------------------------
/Week 6/Week6-AE-Notes.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # COMP4010/5120 - Week 6 Application Exercises
4 | ---
5 |
6 | # A. Application Exercises
7 |
8 | **Data:** [`instructional-staff.csv`](instructional-staff.csv)
9 |
10 | The American Association of University Professors (AAUP) is a nonprofit membership association of faculty and other academic professionals. [This report](https://www.aaup.org/sites/default/files/files/AAUP_Report_InstrStaff-75-11_apr2013.pdf) by the AAUP shows trends in instructional staff employees between 1975 and 2011, and contains an image very similar to the one given below.
11 |
12 | 
13 |
14 | Each row in this dataset represents a faculty type, and the columns are the years for which we have data. The values are the percentage of hires of that faculty type in each year.
15 |
16 | In order to recreate this visualization, we first need to reshape the data to have one variable for faculty type and one variable for year. In other words, we will convert the data from wide format to long format.
17 |
18 | ## Task 1. Recreate the visualization
19 |
20 | Reshape the data so we have one row per faculty type and year, and the percentage of hires as a single column.
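One possible reshaping, using `pivot_longer()` from `tidyr` (see the reading material below for details); the names `year` and `percentage` are only suggestions:

```R
library(tidyverse)

staff <- read_csv("instructional-staff.csv")

staff_long <- staff |>
  pivot_longer(
    cols = -faculty_type,                      # every column except faculty_type is a year
    names_to = "year",
    values_to = "percentage",
    names_transform = list(year = as.integer)  # store years as numbers rather than strings
  )
```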
21 |
22 | ## Task 2. Attempt to recreate the original bar chart as best as you can.
23 |
24 | Don’t worry about theming or color palettes right now. The most important aspects to incorporate:
25 |
26 | - Faculty type on the y-axis with bar segments color-coded based on the year of the survey
27 | - Percentage of instructional staff employees on the x-axis
28 | - Begin the x-axis at 5%
29 | - Label the x-axis at 5% increments
30 | - Match the order of the legend
31 |
32 | > [forcats](https://forcats.tidyverse.org/) contains many functions for defining and adjusting the order of levels for factor variables. Factors are often used to enforce specific ordering of categorical variables in charts.
33 |
34 | ## Task 3. Let’s make it better
35 |
36 | The original plot is not very informative. It’s hard to compare the trends across faculty types.
37 |
38 | Improve the chart by using a relative frequency bar chart with year on the y-axis and faculty type encoded using color.
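A minimal sketch of one approach, assuming the long data from Task 1 is stored in `staff_long` with columns `year`, `faculty_type`, and `percentage`:

```R
staff_long |>
  ggplot(aes(x = percentage, y = factor(year), fill = faculty_type)) +
  geom_col(position = "fill") +                  # normalize each year's bar to 100%
  scale_x_continuous(labels = label_percent()) + # label_percent() comes from the scales package
  labs(x = "Share of instructional staff", y = "Year", fill = "Faculty type")
```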
39 |
40 | What are this chart’s advantages and disadvantages?
41 |
42 | ## Task 4: Let’s instead use a line chart.
43 |
44 | Graph the data with year on the x-axis and percentage of employees on the y-axis. Distinguish each faculty type using an appropriate aesthetic mapping.
45 |
46 | ## Task 5: Cleaning it up
47 |
48 | - Add a proper title and labelling to the chart
49 | - Use an optimized color palette
50 | - Order the legend values by the final value of the percentage variable
51 |
52 | ## Task 6: More improvements!
53 |
54 | Colleges and universities have come to rely more heavily on non-tenure track faculty members over time, in particular part-time faculty (e.g. contingent faculty, adjuncts). We want to show how academia is increasingly relying on part-time faculty.
55 |
56 | With your peers, sketch/design a chart that highlights the trend for part-time faculty. What type of geom would you use? What elements would you include? What would you remove?
57 |
58 | Create the chart you designed above using `ggplot2`.
59 |
60 | # B. Reading Material
61 |
62 | ## 1. Hall of Shame
63 |
64 | 
65 |
66 | >At first glance, the chart might seem okay. However, there are plenty of problems with it.
67 | > The Y-axis of the graph has a break in it. The lower ticks on the Y-axis are spaced \$100M apart, but after \$700M the axis suddenly jumps from \$700M to \$1,700M. Due to this, the revenue of \$490M looks bigger than the \$1,213M of government funding. Hence, it’s extremely misleading to present the scale in a way where 1.2 billion looks smaller than or almost equal to 490 million.
68 | > At first glance, a viewer may conclude that television revenue is about the same as government funding, because the distorted Y-axis makes the blue part of the bar chart almost equal in length to the pink part.
69 | > Another major problem with the above chart is that the revenue and advertising revenue bars should not be separate from the main bar showing total income. They aren’t two separate bars; they merely subdivide the revenue section of the total-income bar. The second bar just shows how the blue part of the first bar is split, and the third shows how the purple part of the second bar is split.
70 | > Thus, we can say that the above chart is an example of bad data visualization as it is intentionally misleading the viewer by distorting the elements of the chart.
71 | [[Source]](https://www.codeconquest.com/blog/12-bad-data-visualization-examples-explained/)
72 |
73 | ## 2. Wide data vs. long data formats
74 |
75 | In data analysis, the terms "*wide*" and "*long*" refer to two different ways of structuring a dataset, especially in the context of handling repeated measures or time series data. Understanding these formats is crucial for data manipulation, visualization, and analysis. Let's define each:
76 |
77 | **Wide Format**:
78 |
79 | - **Characteristic**: In a wide format, each subject or entity is usually in a single row, with multiple columns representing different variables, conditions, time points, or measurements.
80 | - **Example**: Consider a dataset of students with their scores in different subjects. Each student would have a single row, with separate columns for each subject (like Math, Science, English).
81 |
82 | | StudentID | MathScore | ScienceScore | EnglishScore |
83 | |-----------|-----------|--------------|--------------|
84 | | 1 | 85 | 90 | 75 |
85 | | 2 | 88 | 92 | 80 |
86 | | 3 | 90 | 94 | 85 |
87 |
88 | - **Use Case**: The wide format is more intuitive for human reading and is often the format in which data is initially collected. It's well-suited for scenarios where each column represents a distinct category.
89 |
90 | **Long Format**:
91 |
92 | - **Characteristic**: In the long format, each row represents a single observation, often a single time point or condition per subject. There are typically columns indicating the subject/entity, the variable (or time/condition), and the value.
93 | - **Example**: Using the same student dataset, a long format would list each subject's score in a separate row.
94 |
95 | | StudentID | Subject | Score |
96 | |-----------|----------|-------|
97 | | 1 | Math | 85 |
98 | | 1 | Science | 90 |
99 | | 1 | English | 75 |
100 | | 2 | Math | 88 |
101 | | 2 | Science | 92 |
102 | | 2 | English | 80 |
103 | | 3 | Math | 90 |
104 | | 3 | Science | 94 |
105 | | 3 | English | 85 |
106 |
107 | - **Use Case**: The long format is particularly useful for statistical analysis and visualization in many data analysis tools, as it allows for more consistent treatment of variables. It's also the preferred format for many types of time series or repeated measures data analysis.
108 |
109 | Essentially, the wide format is more about having separate columns for different variables, while the long format focuses on having one row per observation, making it easier to apply certain types of data manipulations and analyses. The choice between wide and long formats depends on the specific requirements of your analysis and the tools you are using.
110 |
111 | ## 3. Turning Wide Data into Long Format and Vice Versa in `R`
112 |
113 | In the `tidyverse` package in R, you can use different functions to transform data between wide and long formats. The `tidyr` package, part of the `tidyverse`, provides convenient functions for these transformations: `pivot_longer()` for turning wide data into long format, and `pivot_wider()` for the reverse.
114 |
115 | Let's assume you have a wide data frame named `wide_data`:
116 |
117 | ```R
118 | wide_data <- tibble(
119 | id = 1:3,
120 | score_math = c(90, 85, 88),
121 | score_science = c(92, 89, 87),
122 | score_english = c(93, 90, 85)
123 | )
124 | ```
125 |
126 | To convert this into a long format, you'd use `pivot_longer()`:
127 |
128 | ```R
129 | library(tidyverse)
130 |
131 | long_data <- wide_data %>%
132 | pivot_longer(
133 | cols = starts_with("score"),
134 | names_to = "subject",
135 | names_prefix = "score_",
136 | values_to = "score"
137 | )
138 | ```
139 |
140 | In this example, `cols = starts_with("score")` selects all columns that start with `"score"` to be transformed. `names_to = "subject"` creates a new column named `"subject"` for the former column names. `names_prefix = "score_"` removes the prefix `"score_"` from these names. Finally, `values_to = "score"` specifies that the actual values go into a column named `"score"`.
141 |
142 | Let's look at the reverse process. Assuming you have a long format data frame `long_data`:
143 |
144 | ```R
145 | long_data <- tibble(
146 | id = rep(1:3, each = 3),
147 | subject = rep(c("math", "science", "english"), times = 3),
148 | score = c(90, 92, 93, 85, 89, 90, 88, 87, 85)
149 | )
150 | ```
151 |
152 | To convert this into a wide format, use `pivot_wider()`:
153 |
154 | ```R
155 | wide_data <- long_data %>%
156 | pivot_wider(
157 | names_from = subject,
158 | values_from = score,
159 | names_prefix = "score_"
160 | )
161 | ```
162 |
163 | Here, `names_from = subject` tells the function to create new columns from the unique values in the `"subject"` column. `values_from = score` indicates that the values in these new columns should come from the `"score"` column. `names_prefix = "score_"` adds a prefix to the new column names for clarity.
164 |
165 | Both `pivot_longer()` and `pivot_wider()` are highly versatile and can be customized further to accommodate more complex data reshaping tasks.
166 |
167 |
168 | ## 4. Data manipulation with `mutate()`, `fct_reorder()`, and `if_else()`
169 |
170 | ### 4.1. `mutate()`
171 |
172 | `mutate` is used in `dplyr` to create new columns or modify existing ones in a data frame or tibble. It's a key function for data transformation.
173 |
174 | With `mutate`, you can apply a function to one or more existing columns to create a new column, or you can overwrite an existing column with new values.
175 |
176 | For example:
177 |
178 | ```R
179 | library(dplyr)
180 |
181 | data <- tibble(
182 | x = 1:5,
183 | y = c("A", "B", "A", "B", "C")
184 | )
185 |
186 | data <- data %>%
187 | mutate(z = x * 2) # Creates a new column 'z' which is twice the value of 'x'
188 | ```
189 |
190 | ### 4.2. `fct_reorder()`
191 |
192 | `fct_reorder()` is from the `forcats` package, which is designed for handling factors (categorical data) in `R`. `fct_reorder` reorders factor levels based on the values of another variable, which is particularly useful for plotting.
193 |
194 | This function is typically used when you want the levels of a factor to be ordered based on some quantitative measure. This is often done before creating plots to ensure that factor levels are displayed in a meaningful order.
195 |
196 | Example:
197 |
198 | ```R
199 | library(forcats)
200 |
201 | data <- tibble(
202 | category = factor(c("A", "B", "A", "B", "C")),
203 | value = c(3, 5, 2, 4, 1)
204 | )
205 |
206 | data <- data %>%
207 | mutate(category = fct_reorder(category, value, .fun = mean))
208 | # Reorders 'category' based on the mean of 'value' for each level
209 | ```
210 |
211 | In this example, `fct_reorder` is used inside `mutate` to reorder the levels of the category factor based on the mean of the corresponding values.
212 |
213 | ### 4.3. `if_else()`
214 |
215 | `if_else` is another function from the `dplyr` package in `R`, which is used for conditional logic. It is similar to the base `R` `ifelse` function but is more strict and type-safe. `if_else` tests a condition and returns one value if the condition is true and another if it is false.
216 |
217 | Let's combine `mutate()`, `fct_reorder()`, and `if_else()` in examples to illustrate their usage:
218 |
219 |
220 | **Example 1: Using `mutate()` with `if_else()`**
221 | Imagine a dataset of employees with their sales numbers, and we want to classify them as "High Performer" or "Low Performer" based on whether their sales exceed a certain threshold:
222 |
223 | ```R
224 | library(dplyr)
225 |
226 | # Example dataset
227 | employee_data <- tibble(
228 | employee_id = 1:5,
229 | sales = c(200, 150, 300, 250, 100)
230 | )
231 |
232 | # Classify employees based on their sales
233 | employee_data <- employee_data %>%
234 | mutate(performance = if_else(sales > 200, "High Performer", "Low Performer"))
235 | ```
236 |
237 | Here, `if_else()` is used to create a new column `performance` which classifies employees based on whether their `sales` are greater than 200.
238 |
239 | **Example 2: Using `mutate()` with `fct_reorder()` and `if_else()`**
240 | Now, let's combine both in a more complex scenario:
241 |
242 | ```R
243 | # Combined dataset
244 | data <- tibble(
245 | category = factor(c("A", "B", "A", "B", "C")),
246 | value = c(3, 15, 2, 4, 1),
247 | status = c("New", "Old", "Old", "New", "New")
248 | )
249 |
250 | # Reorder 'category' and create a new 'status_label' column
251 | data <- data %>%
252 | mutate(
253 | category = fct_reorder(category, value, .fun = mean),
254 | status_label = if_else(value > 5, "High", "Low")
255 | )
256 | ```
257 |
258 | In this combined example, we first reorder `category` based on the mean of `value`. Then we use `if_else` to create a new column `status_label` that assigns `"High"` or `"Low"` based on the `value` column. This showcases how `mutate()` can be used to simultaneously perform multiple transformations on a dataset.
259 |
--------------------------------------------------------------------------------
/Week 6/Week6-AE-RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Week6-AE-RMarkdown"
3 | output: html_document
4 | date: "2024-03-27"
5 | ---
6 |
7 | ```{r}
8 | library(tidyverse)
9 | library(scales)
10 | ```
11 |
12 | # Take a sad plot, and make it better
13 |
14 | The American Association of University Professors (AAUP) is a nonprofit membership association of faculty and other academic professionals. [This](https://www.aaup.org/sites/default/files/files/AAUP_Report_InstrStaff-75-11_apr2013.pdf) report by the AAUP shows trends in instructional staff employees between 1975 and 2011, and contains an image very similar to the one given below.
15 |
16 | Each row in this dataset represents a faculty type, and the columns are the years for which we have data. The values are the percentage of hires of that faculty type in each year.
17 |
18 | ```{r}
19 | staff <- read_csv("instructional-staff.csv")
20 | staff
21 | ```
22 |
23 | # Recreate the visualization
24 |
25 | In order to recreate this visualization, we first need to reshape the data to have one variable for faculty type and one variable for year. In other words, we will convert the data from wide format to long format.
26 |
27 | # Task 1: Reshape the data so we have one row per faculty type and year, and the percentage of hires as a single column.
28 |
29 | ```{r}
30 | # YOUR CODE HERE
31 |
32 | ```
33 |
34 | # Task 2: Attempt to recreate the original bar chart as best as you can.
35 |
36 | Don’t worry about theming or color palettes right now. The most important aspects to incorporate:
37 |
38 | - Faculty type on the y-axis with bar segments color-coded based on the year of the survey
39 | - Percentage of instructional staff employees on the x-axis
40 | - Begin the x-axis at 5%
41 | - Label the x-axis at 5% increments
42 | - Match the order of the legend
43 |
44 | > [forcats](https://forcats.tidyverse.org/) contains many functions for defining and adjusting the order of levels for factor variables. Factors are often used to enforce specific ordering of categorical variables in charts.
45 |
46 | ```{r}
47 | # YOUR CODE HERE
48 |
49 | ```
50 |
51 | # Let’s make it better
52 |
53 | The original plot is not very informative. It’s hard to compare the trends across faculty types.
54 |
55 | # Task 3: Improve the chart by using a relative frequency bar chart with year on the y-axis and faculty type encoded using color.
56 |
57 | ```{r}
58 | # YOUR CODE HERE
59 |
60 | ```
61 |
62 | What are this chart’s advantages and disadvantages? - ADD YOUR RESPONSE HERE
63 |
64 | # Task 4: Let’s instead use a line chart.
65 |
66 | Graph the data with year on the x-axis and percentage of employees on the y-axis. Distinguish each faculty type using an appropriate aesthetic mapping.
67 |
68 | ```{r}
69 | # YOUR CODE HERE
70 |
71 | ```
72 |
73 | # Task 5: Cleaning it up.
74 |
75 | - Add a proper title and labelling to the chart
76 | - Use an optimized color palette
77 | - Order the legend values by the final value of the percentage variable
78 |
79 | ```{r}
80 | # YOUR CODE HERE
81 |
82 | ```
83 |
84 | # Task 6: More improvements!
85 | Colleges and universities have come to rely more heavily on non-tenure track faculty members over time, in particular part-time faculty (e.g. contingent faculty, adjuncts). We want to show how academia is increasingly relying on part-time faculty.
86 |
87 | With your peers, sketch/design a chart that highlights the trend for part-time faculty. What type of geom would you use? What elements would you include? What would you remove?
88 |
89 | - ADD YOUR RESPONSE HERE
90 |
91 | Create the chart you designed above using `ggplot2`.
92 |
93 | ```{r}
94 | # YOUR CODE HERE
95 |
96 | ```
97 |
--------------------------------------------------------------------------------
/Week 6/fjc-judges.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 6/fjc-judges.RData
--------------------------------------------------------------------------------
/Week 6/img/banner.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 6/img/banner.jpg
--------------------------------------------------------------------------------
/Week 6/img/shame.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 6/img/shame.jpg
--------------------------------------------------------------------------------
/Week 6/img/staff-employment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 6/img/staff-employment.png
--------------------------------------------------------------------------------
/Week 6/instructional-staff.csv:
--------------------------------------------------------------------------------
1 | faculty_type,1975,1989,1993,1995,1999,2001,2003,2005,2007,2009,2011
2 | Full-Time Tenured Faculty,29,27.6,25,24.8,21.8,20.3,19.3,17.8,17.2,16.8,16.7
3 | Full-Time Tenure-Track Faculty,16.1,11.4,10.2,9.6,8.9,9.2,8.8,8.2,8,7.6,7.4
4 | Full-Time Non-Tenure-Track Faculty,10.3,14.1,13.6,13.6,15.2,15.5,15,14.8,14.9,15.1,15.4
5 | Part-Time Faculty,24,30.4,33.1,33.2,35.5,36,37,39.3,40.5,41.1,41.3
6 | Graduate Student Employees,20.5,16.5,18.1,18.8,18.7,19,20,19.9,19.5,19.4,19.3
--------------------------------------------------------------------------------
/Week 7/Week7-AE-Notes.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # COMP4010/5120 - Week 7 Application Exercises
4 | ---
5 |
6 | # A. Application Exercises
7 |
8 | **Important documentation:** [`ggplot2 annotate()`](https://ggplot2.tidyverse.org/reference/annotate.html)
9 |
10 | **Data:** `wdi_co2_raw <- WDI(country = "all", c("SP.POP.TOTL","EN.ATM.CO2E.PC","NY.GDP.PCAP.KD"), extra = TRUE, start = 1995, end = 2023)` from the `WDI` library by World Bank.
11 |
12 | Although this dataset extends to 2023, the CO2 emissions data is mostly `NA` from 2021 onwards, so we will only use rows up to 2020. Also, to reduce clutter, let's only consider countries with a population greater than 200,000.
13 |
14 | 
15 |
16 | For convenience you should rename some columns to more *usable* names:
17 |
18 | - `SP.POP.TOTL` to `population`,
19 | - `EN.ATM.CO2E.PC` to `co2_emissions`,
20 | - `NY.GDP.PCAP.KD` to `gdp_per_cap`.
21 |
22 | ```R
23 | wdi_clean <- wdi_co2_raw |>
24 | filter(region != "Aggregates") |>
25 | filter(population > 200000) |>
26 | select(iso2c, iso3c, country, year,
27 | population = SP.POP.TOTL,
28 | co2_emissions = EN.ATM.CO2E.PC,
29 | gdp_per_cap = NY.GDP.PCAP.KD,
30 | region, income
31 | )
32 | ```
33 |
34 | Create rankings for countries by CO2 emissions by year:
35 |
36 | ```R
37 | co2_rankings <- wdi_clean |>
38 | # Get rid of all the rows that have missing values in co2_emissions
39 | drop_na(co2_emissions) |>
40 | # Look at each year individually and rank countries based on their emissions that year
41 | mutate(
42 | ranking = rank(co2_emissions),
43 | .by = year
44 | )
45 | ```
46 |
47 | ## Task 1: Prepare data in wide format
48 |
49 | Convert the data frame into wide format to better see the ranking by year.
50 |
51 | 
52 |
53 | ## Task 2. Data wrangling
54 |
55 | In this task you will need to prepare the data for further visualization tasks.
56 |
57 | - Calculate the difference in rankings between any 2 years (e.g. 2020 and 1995) and store in a column called `rank_diff`.
58 | - Create a column called `significant_diff` to indicate whether the rank changed by more than 30 positions, and whether it was a significant increase or significant decrease in ranking.
59 |
60 | 
61 |
62 | ## Task 3. Scatter plot for changes in CO2 emission rankings between 1995 and 2020 (or the years you've chosen)
63 |
64 | Create a basic plot that visualizes the changes in CO2 emission rankings between 1995 and 2020 (or the years you've chosen).
65 |
66 | 
67 |
68 | ## Task 4. Lazy way to show change in rank
69 |
70 | Can you come up with the simplest visual element we can use to demonstrate whether a country increased, decreased, or maintained its ranking? You shouldn't need to change the data in any way.
71 | Add it to your current plot. (You may need to read the [`ggplot2 annotate()`](https://rfortherestofus.com/2023/10/annotate-vs-geoms) examples).
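One simple option is a 45-degree reference line: points above it moved to a higher ranking (higher emissions), points below it moved to a lower ranking, and points on it stayed the same. A minimal sketch, assuming roughly 175 ranked countries:

```R
rank_plot +
  annotate(
    geom = "segment",
    x = 0, y = 0, xend = 175, yend = 175,   # 175 is an assumption; use the number of ranked countries in your data
    linetype = "dashed", color = "grey50"
  )
```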
72 |
73 | ## Task 5. Highlight significant countries
74 |
75 | Add colors to better separate the three groups of countries based on the change in ranking. Use [`geom_label_repel()`](https://r-graph-gallery.com/package/ggrepel.html) from the `ggrepel` library to add the names of the countries which had a significant change in their rank.
76 |
77 | 
78 |
79 | ## Task 6. Additional text annotations
80 |
81 | Since our plot is showing the relationship between the ranking of two different years, it essentially divides the plot into two halves. Add text annotations to provide labels for the two halves of the plot, `"Countries improving"` and `"Countries worsening"`.
82 |
83 | 
84 |
85 | ## Task 7. Using colors to redirect attention
86 |
87 | Earlier, we added colors to highlight the different classes of countries based on their change in ranking. Now use `scale_color_manual()` to make the color of the insignificant countries more *uninteresting*.
88 |
89 | 
90 |
91 | ## Task 8. More geometric annotations
92 |
93 | With `annotate()`, use `segment` (with the `arrow` parameter) and `rect` to create boxes, arrows, and text labels that highlight the regions containing the 25 lowest- and 25 highest-ranked countries, labelled `"Lowest emitters"` and `"Highest emitters"`.
94 |
95 | 
96 |
97 | # B. Reading Material
98 |
99 | ## Hall of Fame!
100 |
101 | A YouTuber has created a graph of all Wikipedia articles, which showcases how much information you can extract from an unfathomably large amount of data.
102 |
103 | [I Made a Graph of Wikipedia... This Is What I Found](https://www.youtube.com/watch?v=JheGL6uSF-4&ab_channel=adumb)
104 | [](https://www.youtube.com/watch?v=JheGL6uSF-4&ab_channel=adumb)
105 |
106 | Within the same graph, using different annotation elements (mostly highlighting), the author presented multiple interesting findings such as trends, relationships, communities, special communities, and isolated articles (dead ends, orphans).
107 |
108 | The video is a great example of the power of extracting *information* from *data* through visualization.
109 |
--------------------------------------------------------------------------------
/Week 7/Week7-AE-RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Week7-AE-RMarkdown"
3 | output: html_document
4 | date: "2024-03-27"
5 | ---
6 |
7 | ```{r}
8 | library(tidyverse)
9 | library(scales)
10 |
11 |
12 | #install.packages('WDI')
13 | library(WDI)
14 | #install.packages('ggrepel')
15 | library(ggrepel)
16 | #install.packages('ggtext')
17 | library(ggtext)
18 | ```
19 | ```{r}
20 | indicators <- c("SP.POP.TOTL", # Population
21 | "EN.ATM.CO2E.PC", # CO2 emissions
22 | "NY.GDP.PCAP.KD") # GDP per capita
23 |
24 | # CO2 emissions data is mostly NA from 2021 onwards...
25 | wdi_co2_raw <- WDI(country = "all", indicators, extra = TRUE,
26 | start = 1995, end = 2023)
27 | ```
28 |
29 | ```{r}
30 | wdi_clean <- wdi_co2_raw |>
31 | filter(region != "Aggregates") |>
32 | select(iso2c, iso3c, country, year,
33 | population = SP.POP.TOTL,
34 | co2_emissions = EN.ATM.CO2E.PC,
35 | gdp_per_cap = NY.GDP.PCAP.KD,
36 | region, income
37 | ) |>
38 | filter(population > 200000)
39 | ```
40 |
41 | ```{r}
42 | co2_rankings <- wdi_clean |>
43 | # Get rid of all the rows that have missing values in co2_emissions
44 | drop_na(co2_emissions) |>
45 | # Look at each year individually and rank countries based on their emissions that year
46 | mutate(
47 | ranking = rank(co2_emissions),
48 | .by = year
49 | )
50 | ```
51 |
52 |
53 | # Task 1: Prepare data in wide format
54 | ```{r}
55 | # YOUR CODE HERE
56 |
57 |
58 |
59 | ```
60 |
61 |
62 | # Task 2: Data wrangling
63 | ```{r}
64 | # YOUR CODE HERE
65 |
66 |
67 |
68 | ```
69 |
70 |
71 | # Task 3: Scatter plot for changes in CO2 emission rankings between 1995 and 2020
72 | ```{r}
73 | # YOUR CODE HERE
74 |
75 |
76 |
77 | ```
78 |
79 |
80 | # Task 4: Lazy way to show change in rank
81 | ```{r}
82 | # YOUR CODE HERE
83 |
84 |
85 |
86 | ```
87 |
88 |
89 | # Task 5: Highlight significant countries
90 | ```{r}
91 | # YOUR CODE HERE
92 |
93 |
94 |
95 | ```
96 |
97 |
98 | # Task 6: Additional text annotations
99 | ```{r}
100 | # YOUR CODE HERE
101 |
102 |
103 |
104 | ```
105 |
106 |
107 | # Task 7: Using colors to redirect attention
108 | ```{r}
109 | # YOUR CODE HERE
110 |
111 |
112 |
113 | ```
114 |
115 |
116 | # Task 8: More geometric annotations
117 | ```{r}
118 | # YOUR CODE HERE
119 |
120 |
121 |
122 | ```
123 |
--------------------------------------------------------------------------------
/Week 7/img/banner.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/banner.gif
--------------------------------------------------------------------------------
/Week 7/img/fame.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/fame.jpg
--------------------------------------------------------------------------------
/Week 7/img/notes1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/notes1.jpg
--------------------------------------------------------------------------------
/Week 7/img/task1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/task1.jpg
--------------------------------------------------------------------------------
/Week 7/img/task2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/task2.jpg
--------------------------------------------------------------------------------
/Week 7/img/task3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/task3.png
--------------------------------------------------------------------------------
/Week 7/img/task4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/task4.png
--------------------------------------------------------------------------------
/Week 7/img/task5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/task5.png
--------------------------------------------------------------------------------
/Week 7/img/task6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/task6.png
--------------------------------------------------------------------------------
/Week 7/img/task7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/task7.png
--------------------------------------------------------------------------------
/Week 7/img/task8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 7/img/task8.png
--------------------------------------------------------------------------------
/Week 9/Week9-AE-Notes.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # COMP4010/5120 - Week 9 Application Exercises
4 | ---
5 |
6 | # A. Application Exercises
7 |
8 | **Data:** [`nurses.csv`](./nurses.csv)
9 |
10 | ```R
11 | nurses <- read_csv("nurses.csv") |> clean_names() # clean_names() comes from the janitor package
12 |
13 | # subset to three states
14 | nurses_subset <- nurses |>
15 | filter(state %in% c("California", "New York", "North Carolina"))
16 | ```
17 |
18 | The following code chunk demonstrates how to add alternative text to a bar chart. The alternative text is added to the chunk header using the `fig-alt` chunk option. The text is written in Markdown and can be as long as needed. Note that `fig-cap` (a visible caption) is not the same as `fig-alt` (alternative text read by screen readers).
19 |
20 | ```R
21 | #| label: nurses-bar
22 | #| fig-cap: "Total employed Registered Nurses"
23 | #| fig-alt: "The figure is a bar chart titled 'Total employed Registered
24 | #| Nurses' that displays the numbers of registered nurses in three states
25 | #| (California, New York, and North Carolina) over a 20 year period, with data
26 | #| recorded in three time points (2000, 2010, and 2020). In each state, the
27 | #| numbers of registered nurses increase over time. The following numbers are
28 | #| all approximate. California started off with 200K registered nurses in 2000,
29 | #| 240K in 2010, and 300K in 2020. New York had 150K in 2000, 160K in 2010, and
30 | #| 170K in 2020. Finally North Carolina had 60K in 2000, 90K in 2010, and 100K
31 | #| in 2020."
32 |
33 | nurses_subset |>
34 | filter(year %in% c(2000, 2010, 2020)) |>
35 | ggplot(aes(x = state, y = total_employed_rn, fill = factor(year))) +
36 | geom_col(position = "dodge") +
37 | scale_fill_viridis_d(option = "E") +
38 | scale_y_continuous(labels = label_number(scale = 1/1000, suffix = "K")) +
39 | labs(
40 | x = "State", y = "Number of Registered Nurses", fill = "Year",
41 | title = "Total employed Registered Nurses"
42 | ) +
43 | theme(
44 | legend.background = element_rect(fill = "white", color = "white"),
45 | legend.position = c(0.85, 0.75)
46 | )
47 | ```
48 |
49 |
50 | 
51 |
52 | ## Task 1. Add alt text to line chart
53 |
54 | Add appropriate chunk options (label, caption, and alt text) to make the following chart accessible to screen readers.
55 |
56 | ```R
57 | # Your label here
58 | # Your caption here
59 | # Your alt text here
60 |
61 | nurses_subset |>
62 | ggplot(aes(x = year, y = annual_salary_median, color = state)) +
63 | geom_line() +
64 | scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
65 | labs(
66 | x = "Year", y = "Annual median salary", color = "State",
67 | title = "Annual median salary of Registered Nurses"
68 | ) +
69 | coord_cartesian(clip = "off") +
70 | theme(
71 | plot.margin = margin(0.1, 0.9, 0.1, 0.1, "in")
72 | )
73 | ```
74 |
75 | 
76 |
77 | ## Task 2. Direct labelling instead of legends
78 |
79 | Direct labelling is when you add text annotations to a plot that describe the visual elements directly, without relying on a legend.
80 |
81 | Create a version of the same line chart but using direct labels instead of legends.
82 |
83 | You can achieve this with just `geom_text()`, but you can also check out https://r-graph-gallery.com/web-line-chart-with-labels-at-end-of-line.html for a fancier approach.
84 |
85 | 
86 |
87 | ## Task 3. Colorblind-friendly plots
88 |
89 | Use `colorblindr` for colorblind-friendly palettes.
90 | ```{r}
91 | #remotes::install_github("wilkelab/cowplot")
92 | #install.packages("colorspace", repos = "http://R-Forge.R-project.org")
93 | #remotes::install_github("clauswilke/colorblindr")
94 | library(colorblindr)
95 | ```
96 |
97 | Try out colorblind simulations at http://hclwizard.org/cvdemulator/ or pipe (`|>`) your plot to `cvd_grid()` to see it under various color-vision-deficiency simulations.
98 |
99 | With the line chart from Task 1, create 3 different plots: one with the default color scale, one with the `viridis` color scale, and one with the `OkabeIto` color scale from `colorblindr`. Show the `cvd_grid()` of each plot and describe the simulated effectiveness of the color scales for colorblind viewers.
100 |
101 | For example, the grid for the default palette should look like this.
102 |
103 | 
104 |
105 | # B. Reading Material
106 |
107 | ## 1. Accessible data visualization
108 |
109 | Accessibility in data visualization is essential because it ensures that all individuals, regardless of disabilities or limitations, have the opportunity to understand and engage with data.
110 |
111 | Data visualizations are powerful tools for conveying complex information quickly and effectively. Ensuring these tools are accessible means that everyone, including people with disabilities such as visual impairments, cognitive differences, or mobility restrictions, can benefit from the data being presented. This democratization of information supports equality and inclusivity.
112 |
113 | Many regions have legal requirements, such as the Americans with Disabilities Act (ADA) in the U.S., which mandate that digital content, including data visualizations, be accessible to all users. Beyond legal compliance, there is an ethical imperative to ensure that no group is excluded from accessing information that could impact their personal, professional, or educational lives.
114 |
115 | Designing for accessibility often results in clearer and more comprehensible visualizations for all users, not just those with disabilities. This concept, known as the "curb-cut effect," originates from the idea that curb cuts, while designed for wheelchairs, benefit everyone including cyclists, parents with strollers, and more. Similarly, accessible data visualizations can provide benefits such as better readability and easier comprehension for a wider audience.
116 |
117 | By making data visualizations accessible, creators can engage a more diverse audience. This diversity can lead to a broader range of feedback and insights, potentially improving the data analysis and communication strategies based on a wider array of perspectives.
118 |
119 | ## 2. Accessibility factors
120 |
121 | **Visual Impairments**: This includes a range of conditions from complete blindness to various forms of low vision and color vision deficiencies (colorblindness). Visualizations should be designed so that they are still comprehensible without reliance solely on color or fine visual details. Techniques such as high contrast, alternative text descriptions, and screen reader compatibility can enhance accessibility.
122 |
123 | **Hearing Impairments**: While data visualizations are primarily visual, any accompanying audio explanations or alerts must be accessible. Providing captions or transcripts for audio content can ensure that users who are deaf or hard of hearing can still access all the information.
124 |
125 | **Cognitive Disabilities**: Individuals with cognitive disabilities, including dyslexia, autism, and intellectual disabilities, might find complex visualizations challenging. Simplifying information, avoiding sensory overload, and allowing users to control interactive elements at their own pace can help make data visualizations more accessible.
126 |
127 | **Mobility and Motor Impairments**: Some users may have difficulty with fine motor control, impacting their ability to interact with highly interactive visualizations. Designing with keyboard navigation in mind and ensuring that interactive elements are large enough to be easily clicked can help.
128 |
129 | **Technological Limitations**: Not all users have access to the latest technology. Some might be using older software or devices with lower processing power or without support for advanced graphical presentations. Ensuring that visualizations are accessible on a variety of platforms and devices is essential.
130 |
131 | **Environmental Factors**: The environment in which a visualization is accessed can impact visibility and interaction. For example, viewing a visualization in a brightly lit area or on a small smartphone screen in direct sunlight can affect how easily the information is consumed. Designing with good contrast and clarity can mitigate some of these issues.
132 |
133 | **Educational Background and Expertise**: The level of a user's expertise and familiarity with the data or the tools used to display the data can influence how effectively they can interpret a visualization. Providing explanatory notes, glossaries, or adjustable complexity levels can help bridge different levels of expertise.
134 |
135 | **Cultural and Linguistic Differences**: Symbols, color meanings, and layout preferences can vary widely between cultures. Additionally, ensuring that visualizations are usable by speakers of different languages by supporting localization and internationalization can expand accessibility.
136 |
137 | ## 3. Colorblindness
138 |
139 | 
140 |
141 | Colorblindness, also known as color vision deficiency, is a condition where people see colors differently than those with typical color vision. There are several types of colorblindness, each affecting how certain colors are perceived. The types are generally categorized based on which particular color sensitivities are impaired. The main types of colorblindness are:
142 |
143 | - **Red-Green Colorblindness**: This is the most common form of colorblindness. It occurs due to a deficiency or absence of red cones (protan) or green cones (deutan) in the eyes.
144 |     - Protanomaly: This is a reduced sensitivity to red light. Individuals with protanomaly see red, orange, and yellow as greener and less bright than typical.
145 |     - Protanopia: This involves a total absence of red cones. Red appears as black, and certain shades of orange, yellow, and green can appear as yellow.
146 |     - Deuteranomaly: This is a reduced sensitivity to green light. It affects the ability to distinguish between some shades of red, orange, yellow, and green, which look more similar.
147 |     - Deuteranopia: This is characterized by the absence of green cones. Green appears as beige, and red looks brownish-yellow, making it hard to differentiate hues in this spectrum.
148 | - **Blue-Yellow Colorblindness**: Less common than red-green colorblindness, this type involves the blue cones in the retina.
149 |     - Tritanomaly: Reduced sensitivity to blue light. Blue appears greener, and it can be difficult to distinguish yellow and red from pink.
150 |     - Tritanopia: This involves a lack of blue photoreceptors. Blue appears as green, and yellow looks light grey or violet.
151 | - **Complete Colorblindness (Achromatopsia)**: This rare condition involves a total absence of color vision. People with achromatopsia see the world in shades of grey. This condition is often associated with sensitivity to light, blurred vision, and involuntary eye movements (nystagmus).
152 | - **Partial Colorblindness (Achromatomaly)**: Also rare, this involves limited color perception, where colors are generally perceived but are washed out and not as vibrant as they appear to those with typical color vision.
153 |
154 | Visit http://hclwizard.org/cvdemulator/ and upload an image to get an idea of how each type of colorblindness affects color perception.
155 |
156 | So how does this affect our data visualizations? Many data visualizations rely on color differences to convey distinct categories, relationships, trends, or priorities. For individuals with colorblindness, colors that might appear distinct to others can be indistinguishable. For example, red and green, often used to signify opposing conditions such as stop and go or decrease and increase, can look nearly identical to someone with red-green colorblindness.
157 |
158 | 
159 | 
160 |
161 |
162 | Furthermore, if a visualization uses color as the sole method for distinguishing data points, colorblind users may miss out on critical information. This can lead to misunderstandings or incomplete interpretations of the data, which can be particularly problematic in fields where accurate data interpretation is crucial, such as in finance, healthcare, and safety-related industries.
163 |
164 | When color is not perceived as intended, colorblind individuals may need to spend additional time and effort to understand the visualization. They might have to rely on contextual clues, legends, or labels more heavily than other viewers, which can make the process of interpreting data more cumbersome and less intuitive.
165 |
166 | Visualizations that are not designed with colorblindness in mind can also be less aesthetically pleasing to colorblind viewers, potentially decreasing engagement. Engagement is critical in many contexts, such as education, marketing, and public information dissemination. When people cannot fully perceive visual data presentations, they might feel excluded from discussions or decisions that are based on such visualizations. This exclusion can impact personal opportunities and broader societal participation.
167 |
168 | 
169 |
170 | If your audience may include colorblind viewers, consider the following measures; a brief ggplot2 sketch combining several of them follows this list:
171 |
172 | - **Use Colorblind-Friendly Palettes**: Opt for palettes that are distinguishable to those with various types of color vision deficiencies. Tools and resources, such as colorblind simulation software, can help designers choose appropriate colors.
173 | - **Add Text Labels and Symbols**: Incorporating text labels, symbols, or textures in addition to color can help convey information clearly to all viewers, regardless of how they perceive color.
174 | - **Employ Contrasting Shades**: Utilize contrasts not just in color but also in brightness and saturation. Different shades can help delineate data more clearly for those with color vision deficiencies.
175 | - **Utilize Patterns and Shapes**: Patterns and shapes can help differentiate elements in a chart where colors might fail. For instance, using different line styles (solid, dashed) or markers (circles, squares) can indicate different data series effectively.
176 |
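As a rough illustration, the sketch below combines several of these measures in ggplot2: a colorblind-friendly (viridis) palette plus redundant linetype and shape encodings, so the grouping still reads even if color is not perceived as intended. The small `demo` tibble is made-up example data, not a dataset used in this course.

```r
library(tidyverse)

# Made-up example data: two groups measured over five years
demo <- tibble(
  year  = rep(2000:2004, times = 2),
  group = rep(c("A", "B"), each = 5),
  value = c(3, 4, 6, 7, 9, 2, 3, 3, 5, 6)
)

# Encode the group redundantly: color (colorblind-friendly), linetype, and shape
ggplot(demo, aes(x = year, y = value,
                 color = group, linetype = group, shape = group)) +
  geom_line() +
  geom_point(size = 3) +
  scale_color_viridis_d(end = 0.8) +
  labs(x = "Year", y = "Value",
       color = "Group", linetype = "Group", shape = "Group")
```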
177 | Besides color, contrast also affects how different people perceive your visualization. To check the contrast between two colors, you can use https://coolors.co/contrast-checker or the package [`coloratio`](https://matt-dray.github.io/coloratio/). A contrast ratio of 4.5:1 or higher is usually desired for foreground-background color pairs.
178 |
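If you prefer to check this programmatically, the WCAG contrast ratio can be computed directly from the two colors' relative luminance, which is what contrast checkers (and `coloratio`) do under the hood. The helper below is a small base-R sketch; the function names `relative_luminance()` and `contrast_ratio()` are just for illustration, not part of any package.

```r
# WCAG contrast ratio = (L_lighter + 0.05) / (L_darker + 0.05),
# where L is the relative luminance of each color.
# A ratio of 4.5 or higher is the usual target for text on a background.
relative_luminance <- function(color) {
  rgb <- grDevices::col2rgb(color)[, 1] / 255          # sRGB values in [0, 1]
  lin <- ifelse(rgb <= 0.03928,
                rgb / 12.92,
                ((rgb + 0.055) / 1.055)^2.4)           # linearize sRGB
  sum(c(0.2126, 0.7152, 0.0722) * lin)                 # weighted luminance
}

contrast_ratio <- function(col1, col2) {
  l <- sort(c(relative_luminance(col1), relative_luminance(col2)))
  (l[2] + 0.05) / (l[1] + 0.05)
}

contrast_ratio("white", "black")   # 21, the maximum possible ratio
contrast_ratio("grey60", "white")  # well below 4.5 -- avoid for text
```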
179 | 
--------------------------------------------------------------------------------
/Week 9/Week9-AE-RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Week9-AE-RMarkdown"
3 | output: html_document
4 | date: "2024-03-27"
5 | ---
6 |
7 | ```{r}
8 | library(tidyverse)
9 | library(readxl)
10 | library(scales)
11 | #install.packages('janitor')
12 | library(janitor)
13 | ```
14 |
15 | ```{r}
16 | nurses <- read_csv("nurses.csv") |> clean_names()
17 |
18 | # subset to three states
19 | nurses_subset <- nurses |>
20 | filter(state %in% c("California", "New York", "North Carolina"))
21 | ```
22 |
23 | The following code chunk demonstrates how to add alternative text to a bar chart. The alternative text is supplied through the `fig-alt` chunk option, written here as `#|` option comments at the top of the chunk. The text is written in Markdown and can be as long as needed. Note that `fig-cap` (the visible caption) is not the same as `fig-alt` (the description read by screen readers).
24 |
25 | ```{r}
26 | #| label: nurses-bar
27 | #| fig-cap: "Total employed Registered Nurses"
28 | #| fig-alt: "The figure is a bar chart titled 'Total employed Registered
29 | #| Nurses' that displays the numbers of registered nurses in three states
30 | #| (California, New York, and North Carolina) over a 20 year period, with data
31 | #| recorded in three time points (2000, 2010, and 2020). In each state, the
32 | #| numbers of registered nurses increase over time. The following numbers are
33 | #| all approximate. California started off with 200K registered nurses in 2000,
34 | #| 240K in 2010, and 300K in 2020. New York had 150K in 2000, 160K in 2010, and
35 | #| 170K in 2020. Finally North Carolina had 60K in 2000, 90K in 2010, and 100K
36 | #| in 2020."
37 |
38 | nurses_subset |>
39 | filter(year %in% c(2000, 2010, 2020)) |>
40 | ggplot(aes(x = state, y = total_employed_rn, fill = factor(year))) +
41 | geom_col(position = "dodge") +
42 | scale_fill_viridis_d(option = "E") +
43 | scale_y_continuous(labels = label_number(scale = 1/1000, suffix = "K")) +
44 | labs(
45 | x = "State", y = "Number of Registered Nurses", fill = "Year",
46 | title = "Total employed Registered Nurses"
47 | ) +
48 | theme(
49 | legend.background = element_rect(fill = "white", color = "white"),
50 | legend.position = c(0.85, 0.75)
51 | )
52 | ```
53 |
54 | # Task 1. Add alt text to line chart
55 | ```{r}
56 | # Your label here
57 | # Your caption here
58 | # Your alt text here
59 |
60 | nurses_subset |>
61 | ggplot(aes(x = year, y = annual_salary_median, color = state)) +
62 | geom_line() +
63 | scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
64 | labs(
65 | x = "Year", y = "Annual median salary", color = "State",
66 | title = "Annual median salary of Registered Nurses"
67 | ) +
68 | coord_cartesian(clip = "off") +
69 | theme(
70 | plot.margin = margin(0.1, 0.9, 0.1, 0.1, "in")
71 | )
72 | ```
73 |
74 | # Task 2. Direct labelling instead of legends
75 |
76 | Create a version of the same line chart but using direct labels instead of legends.
77 |
78 | You can achieve this with just `geom_text()`, but you can also check out https://r-graph-gallery.com/web-line-chart-with-labels-at-end-of-line.html for a fancier approach. A minimal sketch of the basic idea, on made-up data, follows.
79 |
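As a rough starting point, here is the idea on made-up data rather than the nurses data: place a label at the last point of each series with `geom_text()` and drop the legend.

```{r}
# Direct-labelling sketch on made-up data (not the nurses data):
# label the final point of each line and remove the legend.
demo <- tibble(
  year  = rep(2000:2004, times = 2),
  group = rep(c("A", "B"), each = 5),
  value = c(3, 4, 6, 7, 9, 2, 3, 3, 5, 6)
)

demo |>
  ggplot(aes(x = year, y = value, color = group)) +
  geom_line() +
  geom_text(
    data = demo |> group_by(group) |> slice_max(year, n = 1),  # last point per group
    aes(label = group),
    hjust = -0.2, show.legend = FALSE
  ) +
  coord_cartesian(clip = "off") +                  # let labels extend past the panel
  theme(legend.position = "none",
        plot.margin = margin(5.5, 30, 5.5, 5.5))   # room for the labels on the right
```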
80 | ```{r}
81 | # YOUR CODE HERE
82 |
83 | ```
84 |
85 | # Task 3. Colorblind-friendly plots
86 | Use `colorblindr` for colorblind-friendly palettes.
87 | ```{r}
88 | #remotes::install_github("wilkelab/cowplot")
89 | #install.packages("colorspace", repos = "http://R-Forge.R-project.org")
90 | #remotes::install_github("clauswilke/colorblindr")
91 | library(colorblindr)
92 | ```
93 |
94 | Try out colorblind simulations at http://hclwizard.org/cvdemulator/ or pipe (`|>`) your plot to `cvd_grid()` to see it under various color-vision-deficiency simulations; a short example follows.
95 |
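For example, a quick sketch of the workflow on ggplot2's built-in `mpg` data (not the nurses data), using the default color scale:

```{r}
# Build a plot with the default color scale, then view it under simulated
# color-vision deficiencies with colorblindr's cvd_grid().
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

p |> cvd_grid()   # grid of deutan / protan / tritan / desaturated versions of p
```

The grid should show the deutan, protan, tritan, and desaturated simulations side by side, which is usually enough to judge whether a palette holds up.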
96 | With the line chart from Task 1, create 3 different plots: one with the default color scale, one with the `viridis` color scale, and one with the `OkabeIto` color scale from `colorblindr`. Show the `cvd_grid()` of each plot and describe the simulated effectiveness of the color scales for colorblind viewers.
97 |
98 | ```{r}
99 | # YOUR CODE HERE for default color scale
100 |
101 | ```
102 |
103 | What do you think of the default color scale effectiveness for colorblind viewers?
104 |
105 | YOUR ANSWER HERE
106 |
107 | ---
108 |
109 | ```{r}
110 | # YOUR CODE HERE for viridis color scale
111 |
112 | ```
113 |
114 | What do you think of the viridis color scale effectiveness for colorblind viewers?
115 |
116 | YOUR ANSWER HERE
117 |
118 | ---
119 |
120 | ```{r}
121 | # YOUR CODE HERE for OkabeIto color scale
122 |
123 | ```
124 |
125 | What do you think of the OkabeIto color scale effectiveness for colorblind viewers?
126 |
127 | YOUR ANSWER HERE
128 |
129 |
130 |
--------------------------------------------------------------------------------
/Week 9/img/avoid1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/avoid1.jpg
--------------------------------------------------------------------------------
/Week 9/img/avoid2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/avoid2.jpg
--------------------------------------------------------------------------------
/Week 9/img/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/banner.png
--------------------------------------------------------------------------------
/Week 9/img/colorblind-game.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/colorblind-game.png
--------------------------------------------------------------------------------
/Week 9/img/contrast.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/contrast.png
--------------------------------------------------------------------------------
/Week 9/img/dont.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/dont.jpg
--------------------------------------------------------------------------------
/Week 9/img/task0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/task0.png
--------------------------------------------------------------------------------
/Week 9/img/task1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/task1.png
--------------------------------------------------------------------------------
/Week 9/img/task2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/task2.png
--------------------------------------------------------------------------------
/Week 9/img/task3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lennemo09/COMP4010-Spring24/1510166559d09c744687528820554b682c2883c1/Week 9/img/task3.png
--------------------------------------------------------------------------------