├── R_code
│   ├── .Rprofile
│   ├── R_code.Rproj
│   ├── Line_graph.Rmd
│   ├── Line_graph.ipynb
│   ├── Leaflet_map.Rmd
│   ├── Leaflet_map.ipynb
│   └── Guide.Rmd
├── Data
│   ├── GI_map.xlsx
│   ├── GI_det_EW.csv
│   ├── GI_age.csv
│   ├── sexuality_country_gender.csv
│   ├── my_plotly_graph.html
│   └── cleaned_sexuality_df.csv
├── Images
│   ├── ukds.png
│   ├── GH_pages.png
│   └── GH_pages_resized_30.png
├── .gitignore
├── Python_code
│   ├── Folium_map.ipynb
│   ├── Data_cleaning_sexuality.ipynb
│   ├── Line_graph.ipynb
│   └── HTML_files
│       ├── gi_per.html
│       ├── gi_age.html
│       ├── gi_age2.html
│       ├── line.html
│       └── scatter.html
├── .gitattributes
├── README.md
└── Dockerfile
/R_code/.Rprofile:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Data/GI_map.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UKDataServiceOpen/Interactive_visualisations/main/Data/GI_map.xlsx
--------------------------------------------------------------------------------
/Images/ukds.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UKDataServiceOpen/Interactive_visualisations/main/Images/ukds.png
--------------------------------------------------------------------------------
/Images/GH_pages.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UKDataServiceOpen/Interactive_visualisations/main/Images/GH_pages.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Jupyter Notebook checkpoints
2 | .ipynb_checkpoints/
3 |
4 | # R-related files
5 | .RData
6 | .Rhistory
7 | .Rproj.user/
8 |
--------------------------------------------------------------------------------
/Images/GH_pages_resized_30.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UKDataServiceOpen/Interactive_visualisations/main/Images/GH_pages_resized_30.png
--------------------------------------------------------------------------------
/Python_code/Folium_map.ipynb:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:a210c5218e1600323d6b7bf5afbb3c43512398eef67c91a42bf547e4c5301e09
3 | size 29415
4 |
--------------------------------------------------------------------------------
/R_code/R_code.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
1 | Shapefiles/LADs/LAD_MAY_2022_UK_BFE_V3.shp filter=lfs diff=lfs merge=lfs -text
2 | Folium_map.ipynb filter=lfs diff=lfs merge=lfs -text
3 | *.shp filter=lfs diff=lfs merge=lfs -text
4 | Data/map.html filter=lfs diff=lfs merge=lfs -text
5 | Data/map2.html filter=lfs diff=lfs merge=lfs -text
6 |
--------------------------------------------------------------------------------
/Data/GI_det_EW.csv:
--------------------------------------------------------------------------------
1 | England and Wales Code,England and Wales,Gender identity (8 categories) Code,Gender identity (8 categories),Observation
2 | K04000001,England and Wales,-8,Does not apply,0
3 | K04000001,England and Wales,1,Gender identity the same as sex registered at birth,45389635
4 | K04000001,England and Wales,2,Gender identity different from sex registered at birth but no specific identity given,117775
5 | K04000001,England and Wales,3,Trans woman,47572
6 | K04000001,England and Wales,4,Trans man,48435
7 | K04000001,England and Wales,5,Non-binary,30257
8 | K04000001,England and Wales,6,All other gender identities,18074
9 | K04000001,England and Wales,7,Not answered,2914625
10 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Interactive visualisations
2 |
3 |
4 | This repository holds materials for the Interactive Visualisation workshop.
5 | It contains both **Python code** (written in Jupyter notebooks) and **R code** (written in RStudio and converted to .ipynb so it can run in the interactive Binder environment). If you'd rather clone the repo and execute the code in your own computational environment, please do: the R notebooks are also available as R Markdown (.Rmd) files.
6 |
7 | **Datasets used** in this workshop are from the **2021 UK census** and centre on the new voluntary question about **gender identity**. In particular, we explore the relationship between age and gender identity, as well as ethnicity and gender identity.
8 |
9 | Each respective code folder contains a **general guide** to creating interactive visualisations (from simple bar charts to scatter plots), and an additional notebook just focused on **interactive mapping**.
10 |
11 | To access and run the code files interactively, click the button below:
12 |
13 | [](https://mybinder.org/v2/gh/UKDataServiceOpen/Interactive_visualisations/HEAD)
14 |
15 |
16 | Preview the visualisations we'll be coding by visiting our GitHub page:
17 |
18 | [](https://ukdataserviceopen.github.io/blog/2024/05/10/interactive-visualisations-workshop.html)
19 |
20 |
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | # Use the official R-base image as the base image
2 | FROM r-base:4.4.0
3 |
4 | # Add the R Project repository key and repository
5 | RUN apt-get update && \
6 | apt-get install -y gnupg2 software-properties-common && \
7 | gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys B8F25A8A73EACF41 && \
8 | gpg --export --armor B8F25A8A73EACF41 | tee /etc/apt/trusted.gpg.d/cran_debian_key.asc && \
9 | add-apt-repository 'deb http://cloud.r-project.org/bin/linux/debian buster-cran40/' && \
10 | apt-get update
11 |
12 | # Install necessary packages without libglib2.0-0
13 | RUN apt-get install -y \
14 | libglib2.0-bin \
15 | gir1.2-girepository-2.0
16 |
17 | # Install Python 3 and necessary libraries
18 | RUN apt-get install -y \
19 | python3 python3-pip python3-venv python3-dev \
20 | libudunits2-dev libgdal-dev libgeos-dev libproj-dev \
21 | libsqlite3-dev build-essential librsvg2-dev libcairo2-dev sudo
22 |
23 | # Create jovyan user and home directory
24 | RUN useradd -m -s /bin/bash jovyan
25 |
26 | # Create and activate a virtual environment, then install additional Python packages
27 | RUN python3 -m venv /opt/venv && \
28 | . /opt/venv/bin/activate && \
29 | /opt/venv/bin/pip install --upgrade pip && \
30 | /opt/venv/bin/pip install jupyter ipykernel pyarrow pandas geopandas folium plotly statsmodels && \
31 | /opt/venv/bin/python -m ipykernel install --name=venv --user
32 |
33 | # Ensure the virtual environment is used for subsequent commands
34 | ENV PATH="/opt/venv/bin:$PATH"
35 |
36 | # Install R packages including IRkernel which allows R to run on Jupyter Notebook
37 | RUN R -e "install.packages(c('leaflet', 'readr', 'dplyr', 'ggplot2', 'plotly', 'sf', 'IRkernel', 'Cairo', 'rsvg'), dependencies=TRUE, repos='https://cloud.r-project.org/')" && \
38 | R -e "IRkernel::installspec(user = FALSE)"
39 |
40 | # Clean up package lists to reduce image size
41 | RUN apt-get clean && \
42 | rm -rf /var/lib/apt/lists/*
43 |
44 | # Set the working directory to /home/jovyan (default for Binder)
45 | WORKDIR /home/jovyan
46 |
47 | # Copy all contents of the repository into the working directory
48 | COPY . /home/jovyan
49 |
50 | # Debug step: List contents of /home/jovyan
51 | RUN ls -la /home/jovyan
52 |
53 | # Change ownership and permissions of the /home/jovyan directory
54 | RUN chown -R jovyan:jovyan /home/jovyan && chmod -R 775 /home/jovyan
55 |
56 | # Expose the port Jupyter will run on
57 | EXPOSE 8888
58 |
59 | # Switch to jovyan user
60 | USER jovyan
61 |
62 | # Set a default command to run JupyterLab with the virtual environment activated
63 | CMD ["/bin/bash", "-c", ". /opt/venv/bin/activate && exec jupyter lab --ip=0.0.0.0 --port=8888 --notebook-dir=/home/jovyan --no-browser --allow-root"]
64 |
65 |
66 |
67 |
68 |
--------------------------------------------------------------------------------
/Data/GI_age.csv:
--------------------------------------------------------------------------------
1 | England and Wales Code,England and Wales,Gender identity (7 categories) Code,Gender identity (7 categories),Age (6 categories) Code,Age (6 categories),Observation
2 | K04000001,England and Wales,-8,Does not apply,1,Aged 15 years and under,0
3 | K04000001,England and Wales,-8,Does not apply,2,Aged 16 to 24 years,0
4 | K04000001,England and Wales,-8,Does not apply,3,Aged 25 to 34 years,0
5 | K04000001,England and Wales,-8,Does not apply,4,Aged 35 to 49 years,0
6 | K04000001,England and Wales,-8,Does not apply,5,Aged 50 to 64 years,0
7 | K04000001,England and Wales,-8,Does not apply,6,Aged 65 years and over,0
8 | K04000001,England and Wales,1,Gender identity the same as sex registered at birth,1,Aged 15 years and under,0
9 | K04000001,England and Wales,1,Gender identity the same as sex registered at birth,2,Aged 16 to 24 years,5809658
10 | K04000001,England and Wales,1,Gender identity the same as sex registered at birth,3,Aged 25 to 34 years,7518377
11 | K04000001,England and Wales,1,Gender identity the same as sex registered at birth,4,Aged 35 to 49 years,10829667
12 | K04000001,England and Wales,1,Gender identity the same as sex registered at birth,5,Aged 50 to 64 years,10966023
13 | K04000001,England and Wales,1,Gender identity the same as sex registered at birth,6,Aged 65 years and over,10265910
14 | K04000001,England and Wales,2,Gender identity different from sex registered at birth but no specific identity given,1,Aged 15 years and under,0
15 | K04000001,England and Wales,2,Gender identity different from sex registered at birth but no specific identity given,2,Aged 16 to 24 years,16590
16 | K04000001,England and Wales,2,Gender identity different from sex registered at birth but no specific identity given,3,Aged 25 to 34 years,28375
17 | K04000001,England and Wales,2,Gender identity different from sex registered at birth but no specific identity given,4,Aged 35 to 49 years,38280
18 | K04000001,England and Wales,2,Gender identity different from sex registered at birth but no specific identity given,5,Aged 50 to 64 years,21678
19 | K04000001,England and Wales,2,Gender identity different from sex registered at birth but no specific identity given,6,Aged 65 years and over,12852
20 | K04000001,England and Wales,3,Trans woman,1,Aged 15 years and under,0
21 | K04000001,England and Wales,3,Trans woman,2,Aged 16 to 24 years,9186
22 | K04000001,England and Wales,3,Trans woman,3,Aged 25 to 34 years,9835
23 | K04000001,England and Wales,3,Trans woman,4,Aged 35 to 49 years,12607
24 | K04000001,England and Wales,3,Trans woman,5,Aged 50 to 64 years,9449
25 | K04000001,England and Wales,3,Trans woman,6,Aged 65 years and over,6495
26 | K04000001,England and Wales,4,Trans man,1,Aged 15 years and under,0
27 | K04000001,England and Wales,4,Trans man,2,Aged 16 to 24 years,13819
28 | K04000001,England and Wales,4,Trans man,3,Aged 25 to 34 years,8910
29 | K04000001,England and Wales,4,Trans man,4,Aged 35 to 49 years,11700
30 | K04000001,England and Wales,4,Trans man,5,Aged 50 to 64 years,8264
31 | K04000001,England and Wales,4,Trans man,6,Aged 65 years and over,5742
32 | K04000001,England and Wales,5,All other gender identities,1,Aged 15 years and under,0
33 | K04000001,England and Wales,5,All other gender identities,2,Aged 16 to 24 years,23597
34 | K04000001,England and Wales,5,All other gender identities,3,Aged 25 to 34 years,14550
35 | K04000001,England and Wales,5,All other gender identities,4,Aged 35 to 49 years,6628
36 | K04000001,England and Wales,5,All other gender identities,5,Aged 50 to 64 years,2881
37 | K04000001,England and Wales,5,All other gender identities,6,Aged 65 years and over,675
38 | K04000001,England and Wales,6,Not answered,1,Aged 15 years and under,0
39 | K04000001,England and Wales,6,Not answered,2,Aged 16 to 24 years,445463
40 | K04000001,England and Wales,6,Not answered,3,Aged 25 to 34 years,470493
41 | K04000001,England and Wales,6,Not answered,4,Aged 35 to 49 years,627215
42 | K04000001,England and Wales,6,Not answered,5,Aged 50 to 64 years,599783
43 | K04000001,England and Wales,6,Not answered,6,Aged 65 years and over,771669
44 |
--------------------------------------------------------------------------------
/Python_code/Data_cleaning_sexuality.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "69011e1e",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import pandas as pd"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": null,
16 | "id": "6546572f",
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "import plotly.io as pio\n",
21 | "pio.renderers.default = \"notebook_connected\""
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "id": "ff67ce73",
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "# Read the CSV file into a pandas DataFrame\n",
32 | "\n",
33 | "df = pd.read_csv('../Data/sexuality_country_gender.csv')"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "id": "05cc6624",
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "df.head()"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": null,
49 | "id": "c09c11d4",
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "# Fill down 'Country' and 'Gender' values\n",
54 | "df['Country'] = df['Country'].ffill()\n",
55 | "df['Gender'] = df['Gender'].ffill()\n",
56 | "\n",
57 | "# Filter out rows related to \"Weighted base (000s)\" and \"Unweighted sample\" for separate handling\n",
58 | "main_df = df[~df['Gender'].str.contains(\"Weighted base|Unweighted sample\")]\n",
59 | "\n",
60 | "# Drop unnecessary NaN columns\n",
61 | "main_df = main_df.dropna(axis=1, how='all')\n",
62 | "main_df = main_df.dropna(axis=0, how='any')\n",
63 | "\n",
64 | "# Display the cleaned main data to ensure it's structured correctly\n",
65 | "main_df.head()"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "id": "85f1f7b9",
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "year_columns = ['2010', '2011', '2012', '2013', '2014'] # Update this list based on your dataset\n",
76 | "long_format_df = main_df.melt(id_vars=['Country', 'Gender', 'Sexuality'], value_vars=year_columns, var_name='Year', value_name='Percentage')\n",
77 | "\n",
78 | "# Convert 'Percentage' to numeric, as it may be read as string due to the initial NaN values\n",
79 | "long_format_df['Percentage'] = pd.to_numeric(long_format_df['Percentage'], errors='coerce')\n",
80 | "\n",
81 | "# Display the transformed dataset ready for plotting\n",
82 | "long_format_df.head(20)"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": null,
88 | "id": "c0c098c6",
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "# Sort dataframe into the right order\n",
93 | "\n",
94 | "# Sort the DataFrame by 'Country', 'Gender', then 'Year'\n",
95 | "sorted_df = long_format_df.sort_values(by=['Country', 'Gender', 'Year']).reset_index(drop=True)\n",
96 | "\n",
97 | "# Display the sorted DataFrame to check if it flows as expected\n",
98 | "sorted_df.head(20)"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "id": "d5165ea5",
105 | "metadata": {},
106 | "outputs": [],
107 | "source": [
108 | "# Round values in Percentage column to 2 decimal places\n",
109 | "\n",
110 | "sorted_df['Percentage'] = sorted_df['Percentage'].round(2)"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "id": "c172705d",
117 | "metadata": {},
118 | "outputs": [],
119 | "source": [
120 | "# Save df\n",
121 | "\n",
122 | "sorted_df.to_csv('../Data/cleaned_sexuality_df.csv', index = False)"
123 | ]
124 | }
125 | ],
126 | "metadata": {
127 | "kernelspec": {
128 | "display_name": "venv",
129 | "language": "python",
130 | "name": "venv"
131 | },
132 | "language_info": {
133 | "codemirror_mode": {
134 | "name": "ipython",
135 | "version": 3
136 | },
137 | "file_extension": ".py",
138 | "mimetype": "text/x-python",
139 | "name": "python",
140 | "nbconvert_exporter": "python",
141 | "pygments_lexer": "ipython3",
142 | "version": "3.11.9"
143 | },
144 | "toc": {
145 | "base_numbering": 1,
146 | "nav_menu": {},
147 | "number_sections": true,
148 | "sideBar": true,
149 | "skip_h1_title": false,
150 | "title_cell": "Table of Contents",
151 | "title_sidebar": "Contents",
152 | "toc_cell": false,
153 | "toc_position": {},
154 | "toc_section_display": true,
155 | "toc_window_display": true
156 | },
157 | "varInspector": {
158 | "cols": {
159 | "lenName": 16,
160 | "lenType": 16,
161 | "lenVar": 40
162 | },
163 | "kernels_config": {
164 | "python": {
165 | "delete_cmd_postfix": "",
166 | "delete_cmd_prefix": "del ",
167 | "library": "var_list.py",
168 | "varRefreshCmd": "print(var_dic_list())"
169 | },
170 | "r": {
171 | "delete_cmd_postfix": ") ",
172 | "delete_cmd_prefix": "rm(",
173 | "library": "var_list.r",
174 | "varRefreshCmd": "cat(var_dic_list()) "
175 | }
176 | },
177 | "types_to_exclude": [
178 | "module",
179 | "function",
180 | "builtin_function_or_method",
181 | "instance",
182 | "_Feature"
183 | ],
184 | "window_display": false
185 | }
186 | },
187 | "nbformat": 4,
188 | "nbformat_minor": 5
189 | }
190 |
--------------------------------------------------------------------------------
/R_code/Line_graph.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Line_graph"
3 | output: html_document
4 | ---
5 |
6 | ```{r setup, include=FALSE}
7 | knitr::opts_chunk$set(echo = FALSE)
8 | ```
9 |
10 | ```{r}
11 | # Allows us to read-in csv files
12 | library(readr)
13 | # For data manipulation
14 | library(dplyr)
15 | # For regular expression operations
16 | library(stringr)
17 | # Used to create interactive visualisations
18 | library(plotly)
19 | ```
20 |
21 | # Dataset 3
22 |
23 | This dataset includes sexual identity estimates by gender from 2010 to 2014, presented at a UK level and broken down by England, Wales, Scotland and Northern Ireland. I wanted this guide to include a demo of how to make interactive line graphs with gender identity data, but unfortunately, as this is the first year the ONS has collected that data, that wasn't possible. Instead, I found a 2015 release of experimental statistics from the Integrated Household Survey. For more info, you can check out this [ONS link](https://www.ons.gov.uk/peoplepopulationandcommunity/culturalidentity/sexuality/datasets/sexualidentitybyagegroupbycountry).
24 |
25 | ```{r}
26 | # Load in dataset
27 |
28 | df3 <- read_csv('../Data/cleaned_sexuality_df.csv')
29 | ```
30 |
31 | ```{r}
32 | # Brief glimpse at underlying data structure
33 |
34 | head(df3, 10)
35 | ```
36 |
37 | ## Data cleaning
38 |
39 | When I first found this dataset it was very messy and badly formatted, so I cleaned it in a separate Jupyter notebook to avoid cluttering this one and distracting from the main tutorial. If you'd like to see how I cleaned it up, please see the ['Data_cleaning_sexuality.ipynb'](Data_cleaning_sexuality.ipynb) notebook.
40 |
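The notebook's core cleaning steps, condensed into a standalone pandas sketch (the inline CSV is a made-up miniature of `sexuality_country_gender.csv`, just to show the shape):

```python
import io
import pandas as pd

# Tiny stand-in for sexuality_country_gender.csv (same layout, made-up numbers)
raw = """Country,Gender,Sexuality,2010,2011
UK,Men,,,
,,Heterosexual / Straight,93.6,93.8
,,Gay / Lesbian,1.4,1.4
,Women,,,
,,Heterosexual / Straight,94.4,94.4
,Weighted base (000s),,,
,,Males,24308717,24505659
"""
df = pd.read_csv(io.StringIO(raw))

# Fill down the sparse 'Country' and 'Gender' labels
df["Country"] = df["Country"].ffill()
df["Gender"] = df["Gender"].ffill()

# Drop the summary rows and any incomplete rows, then reshape to long format
main = df[~df["Gender"].str.contains("Weighted base|Unweighted sample")]
main = main.dropna(axis=0, how="any")
long_df = main.melt(id_vars=["Country", "Gender", "Sexuality"],
                    var_name="Year", value_name="Percentage")
print(long_df.head())
```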
41 | ## Data pre-processing
42 |
43 | The only pre-processing we'll do is subset our data by country and create two separate datasets, one for Gender = Men and one for Gender = Women. I'll explain why this step is needed shortly.
44 |
45 | ```{r}
46 | # Filter dataset to focus on England
47 | england_df <- df3 %>%
48 | filter(Country == 'England')
49 | ```
50 |
51 | ```{r}
52 | # Let's check it worked..
53 |
54 | unique(england_df$Country)
55 | ```
56 |
57 | ```{r}
58 | # Further filter data for each gender
59 |
60 | men <- england_df %>% filter(Gender == "Men")
61 | women <- england_df %>% filter(Gender == "Women")
62 |
63 | # Let's check it worked
64 |
65 | unique(men$Gender)
66 | unique(women$Gender)
67 | ```
68 |
69 |
70 | ## Interactive linegraph
71 |
72 | Creating a simple line graph in plotly is pretty easy, but where plotly struggles in R is in handling facet plots. A facet plot is a type of visualisation that divides data into subplots based on categorical variables. What I'd like to do is create a facet plot of sexuality percentages in England (2010-2014), with individual subplots for our two genders. This is easy in Python thanks to the plotly.express module, which creates facet plots in a single call. In R, we'll have to take a more long-winded route: manually create an individual plot for each gender, then combine them with the subplot() function. Also, plotly.express automatically manages legends so they're unified across facets, whereas R's plotly requires that we sync the legends manually. Womp womp. Let's get to it.
73 |
74 |
75 |
76 | ```{r}
77 | # Create individual plot for each gender
78 |
79 | # Create plots for each gender
80 | men_plot <- plot_ly(men,
81 | x = ~Year,
82 | y = ~Percentage,
83 | color = ~Sexuality,
84 | type = 'scatter',
85 | # mode used to make sure our data points are connected by lines across the years
86 | mode = 'lines+markers',
87 | hoverinfo = 'text',
88 |                    text = ~paste("Year:", Year, "<br>Percentage:", Percentage, "<br>Sexuality:", Sexuality),
89 | # legendgroup parameter ensures that data points relating to the same category are synced across plots
90 | legendgroup = ~Sexuality,
91 | # showlegend parameter set to TRUE only for this plot to avoid duplicate legends
92 | showlegend = TRUE) %>%
93 | layout(xaxis = list(title = 'Year', tickvals = 2010:2014, ticktext = 2010:2014),
94 | yaxis = list(title = 'Percentage'),
95 | # Here we add an annotation to the graph to label the first subplot "Men"
96 | # Setting xref and yref to 'paper' simply means the annotation won't move if we zoom in or out
97 | annotations = list(
98 | list(x = 0.5, y = 1.05, text = "Men", showarrow = FALSE, xref='paper', yref='paper')))
99 |
100 |
101 | women_plot <- plot_ly(women,
102 | x = ~Year,
103 | y = ~Percentage,
104 | color = ~Sexuality,
105 | type = 'scatter',
106 | mode = 'lines+markers',
107 | hoverinfo = 'text',
108 |                text = ~paste("Year:", Year, "<br>Percentage:", Percentage, "<br>Sexuality:", Sexuality),
109 | legendgroup = ~Sexuality,
110 | showlegend = FALSE) %>%
111 | layout(xaxis = list(title = 'Year', tickvals = 2010:2014, ticktext = 2010:2014),
112 | yaxis = list(title = 'Percentage'),
113 | annotations = list(
114 | list(x = 0.5, y = 1.05, text = "Women", showarrow = FALSE, xref='paper', yref='paper')))
115 |
116 | # Let's take a look at one of these graphs
117 |
118 | women_plot
119 | ```
120 |
121 | ```{r}
122 | # Combine individual plots using subplot
123 | # Within subplot, define number of rows, make sure share same x axes and both axes titles
124 | fig5 <- subplot(men_plot, women_plot, nrows = 2, shareX = TRUE, titleX = TRUE, titleY = TRUE) %>%
125 | layout(
126 | title = list(
127 | text = 'Sexuality Percentages by Gender in England (2010-2014)',
128 | y = 0.98, # Move the title higher up
129 | x = 0.5, # Center the title
130 | xanchor = "center",
131 | yanchor = "top"
132 | ),
133 | margin = list(t = 100), # Add space at the top for the title
134 | height = 800,
135 | width = 1000
136 | )
137 |
138 | fig5
139 | ```
140 |
--------------------------------------------------------------------------------
/Data/sexuality_country_gender.csv:
--------------------------------------------------------------------------------
1 | Country,Gender,Sexuality,,2010,2011,2012,2013,2014
2 | UK,Men,,,,,,,
3 | ,,Heterosexual / Straight,,93.62811882,93.83312,93.23683139,92.26242206,92.46354755
4 | ,,Gay / Lesbian,,1.372509211,1.391262926,1.459347924,1.551245295,1.490815218
5 | ,,Bisexual,,0.36131733,0.386498562,0.325135912,0.360631366,0.321500701
6 | ,,Other,,0.418582671,0.367061623,0.330933952,0.260649684,0.316080858
7 | ,,Don't know/refuse,,3.519557136,3.419206992,3.476622124,3.944665469,3.814776265
8 | ,,Non-response,,0.699914828,0.602849894,1.171128699,1.620386122,1.593279404
9 | ,Women,,,,,,,
10 | ,,Heterosexual / Straight,,94.35353472,94.35996194,93.73789331,93.12693735,93.11510822
11 | ,,Gay / Lesbian,,0.636032812,0.668994001,0.716200431,0.81811695,0.65110956
12 | ,,Bisexual,,0.557008164,0.586374269,0.538799017,0.553086384,0.680992584
13 | ,,Other,,0.352114571,0.29377308,0.25004256,0.266139096,0.331951343
14 | ,,Don't know/refuse,,3.500682946,3.607818251,3.764299753,3.856405125,3.965208136
15 | ,,Non-response,,0.600626788,0.483078464,0.99276493,1.3793151,1.255630157
16 | ,,,,,,,,
17 | ,Weighted base (000s) ,,,,,,,
18 | ,,Males,,24308717.74,24505659.12,24706280.39,24907902.78,25181910.9
19 | ,,Females,,25508484.31,25661932.66,25826203.38,25992983.04,26453346.81
20 | ,Unweighted sample,,,,,,,
21 | ,,Males,,103294,87019,78313,78711,74333
22 | ,,Females,,128111,108426,99884,100109,93888
23 | England,Men,,,,,,,
24 | ,,Heterosexual / Straight,,93.46095169,93.6102045,93.00390992,91.96060373,92.24357479
25 | ,,Gay / Lesbian,,1.409938893,1.428043932,1.51711417,1.648751959,1.540857592
26 | ,,Bisexual,,0.383175378,0.388705396,0.315574803,0.364382383,0.323187693
27 | ,,Other,,0.439337792,0.38275392,0.319925832,0.26104683,0.311220581
28 | ,,Don't know/refuse,,3.675059364,3.614741574,3.662760907,4.151810228,3.947951862
29 | ,,Non-response,,0.631536879,0.575550678,1.180714364,1.613404874,1.633207478
30 | ,Women,,,,,,,
31 | ,,Heterosexual / Straight,,94.30702106,94.21343431,93.51278045,92.93290161,92.82094392
32 | ,,Gay / Lesbian,,0.639942374,0.660549286,0.738128415,0.822970429,0.679285111
33 | ,,Bisexual,,0.609534911,0.61474986,0.547201704,0.553738898,0.689049289
34 | ,,Other,,0.341734012,0.290688591,0.247195243,0.263122197,0.340662553
35 | ,,Don't know/refuse,,3.60653303,3.779135708,3.981289916,4.057598482,4.19149524
36 | ,,Non-response,,0.495234614,0.441442247,0.973404276,1.369668381,1.27856389
37 | ,Weighted base (000s) ,,,,,,,
38 | ,,Males,,20423676.62,20595901.39,20770032.75,20945743.72,21174113.8
39 | ,,Females,,21328711.08,21470378.22,21615047.84,21762276.48,22150559.09
40 | ,Unweighted sample,,,,,,,
41 | ,,Males,,79763,65577,57514,57775,54628
42 | ,,Females,,97868,80892,72938,73024,68306
43 | Wales,Men,,,,,,,
44 | ,,,,,,,,
45 | ,,Heterosexual / Straight,,93.99947253,95.27644151,93.82542704,93.13178673,93.60356113
46 | ,,Gay / Lesbian,,1.209079668,1.018596429,1.153972993,1.2238716,1.476273829
47 | ,,Bisexual,,0.328014664,0.309472418,0.496198245,0.377706214,0.239305016
48 | ,,Other,,0.30393243,0.238524827,0.459964613,0.315820887,0.490810953
49 | ,,Don't know/refuse,,2.955478239,2.326213467,2.782720125,3.046023168,2.948026226
50 | ,,Non-response,,1.204022467,0.830751345,1.28171698,1.9047914,1.24202285
51 | ,Women,,,,,,,
52 | ,,Heterosexual / Straight,,94.80461267,95.0339601,94.66231674,93.92877999,94.24760043
53 | ,,Gay / Lesbian,,0.543387787,0.770042321,0.558070399,0.669843723,0.623326069
54 | ,,Bisexual,,0.196116198,0.327624929,0.356052208,0.527131569,0.703674453
55 | ,,Other,,0.386962695,0.320760393,0.338465432,0.386150572,0.401681864
56 | ,,Don't know/refuse,,2.730869369,2.623739352,2.806198312,2.7790508,3.03566191
57 | ,,Non-response,,1.338051276,0.923872901,1.278896905,1.709043347,0.988055272
58 | ,,,,,,,,
59 | ,Weighted base (000s) ,,,,,,,
60 | ,,Males,,1177133.35,1185029.68,1194028.81,1203112.32,1224169.91
61 | ,,Females,,1250355.67,1255875.13,1262312.07,1269093.03,1282153.98
62 | ,Unweighted sample,,,,,,,
63 | ,,Males,,9582,8979,8914,9093,8645
64 | ,,Females,,11933,11210,11389,11560,11037
65 | Scotland,Men,,,,,,,
66 | ,,Heterosexual / Straight,,94.88507567,94.9442994,94.77618445,94.43224829,94.36574403
67 | ,,Gay / Lesbian,,1.142923503,1.365051204,1.292271099,0.974427117,1.094706743
68 | ,,Bisexual,,0.281250249,0.460850059,0.260115294,0.290933166,0.28803851
69 | ,,Other,,0.323164461,0.313607741,0.283961872,0.141615719,0.247092986
70 | ,,Don't know/refuse,,2.163401415,1.99231078,2.258355705,2.453858406,2.547685028
71 | ,,Non-response,,1.204184705,0.923880814,1.129111585,1.706917301,1.4567327
72 | ,Women,,,,,,,
73 | ,,Heterosexual / Straight,,94.81142823,94.87226614,95.02178483,94.2214845,94.90160028
74 | ,,Gay / Lesbian,,0.684099771,0.847356557,0.762843059,1.06255585,0.563237521
75 | ,,Bisexual,,0.356577513,0.48196109,0.466165064,0.346488158,0.363738828
76 | ,,Other,,0.36605498,0.290627862,0.269216405,0.273659686,0.262959014
77 | ,,Don't know/refuse,,2.631876647,2.713222538,2.426957992,2.59807418,2.625206158
78 | ,,Non-response,,1.149962862,0.794565816,1.053032654,1.497737624,1.283258203
79 | ,,,,,,,,
80 | ,Weighted base (000s) ,,,,,,,
81 | ,,Males,,2030910.2,2041091.2,2052286.09,2062913.65,2084745.54
82 | ,,Females,,2212405.36,2213191.11,2221502.81,2229100.71,2281488.63
83 | ,Unweighted sample,,,,,,,
84 | ,,Males,,12720,11227,10579,10531,9873
85 | ,,Females,,16457,14548,13783,13822,12919
86 | Nireland,Men,,,,,,,
87 | ,,Heterosexual / Straight,,94.25480951,94.72942689,94.65116266,93.41119957,91.45702853
88 | ,,Gay / Lesbian,,1.216222977,1.007409124,0.74581328,0.892526169,1.181726548
89 | ,,Bisexual,,-,0.231545447,0.510330906,0.424800953,0.514181478
90 | ,,Other,,0.278033494,0.276702463,0.578744821,0.506092879,0.363062902
91 | ,,Don't know/refuse,,3.877460003,3.683154295,2.69779486,3.68289489,5.077846871
92 | ,,Non-response1,,*,*,0.81615347,1.082485534,1.406153674
93 | ,Women,,,,,,,
94 | ,,Heterosexual / Straight,,93.53768039,95.97344012,94.90206006,94.17154774,94.45180528
95 | ,,Gay / Lesbian,,0.532979774,0.197922956,0.196524143,*,*
96 | ,,Bisexual,,0.242298527,0.512751627,0.8280934,1.207364535,1.379459847
97 | ,,Other,,0.557117438,0.348159319,0.12263717,0.12496027,0.162893589
98 | ,,Don't know/refuse,,4.375210073,2.967725978,3.063223497,3.574886239,2.932440424
99 | ,,Non-response,,0.754713797,-,0.887461729,0.734279299,0.94722537
100 | ,,,,,,,,
101 | ,Weighted base (000s) ,,,,,,,
102 | ,,Males,,676997.57,683636.85,689932.74,696133.09,698881.65
103 | ,,Females,,717012.2,722488.2,727340.66,732512.82,739145.11
104 | ,Unweighted sample,,,,,,,
105 | ,,Males,,1229,1236,1306,1312,1187
106 | ,,Females,,1853,1776,1774,1703,1626
--------------------------------------------------------------------------------
/Python_code/Line_graph.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "7835cbc9",
6 | "metadata": {},
7 | "source": [
8 | "# Import packages"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": null,
14 | "id": "177942ff",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "# Allows us to read in CSV files and manipulate data\n",
19 | "import pandas as pd\n",
20 | "\n",
21 | "# Used to create regular expressions to match strings\n",
22 | "import re\n",
23 | "\n",
24 | "# Modules used to create interactive visualisations \n",
25 | "import plotly.express as px\n",
26 | "import plotly.graph_objects as go"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "id": "fdb2e3f8",
32 | "metadata": {},
33 | "source": [
34 | "# Dataset 4\n",
35 | "\n",
36 |     "This dataset contains sexual identity estimates by gender from 2010 to 2014, presented at a UK level and broken down by England, Wales, Scotland and Northern Ireland. I originally wanted this guide to include a demo of interactive line graphs using gender identity data, but since the 2021 Census was the first time the ONS collected that data, there are no year-on-year figures to plot. Instead, I found this 2015 release of experimental statistics from the Integrated Household Survey. For more info, you can check out this [ONS link](https://www.ons.gov.uk/peoplepopulationandcommunity/culturalidentity/sexuality/datasets/sexualidentitybyagegroupbycountry). "
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "id": "0741066d",
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "df4 = pd.read_csv('../Data/cleaned_sexuality_df.csv')"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "id": "ad837265",
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "# Brief glimpse at underlying data structure\n",
57 | "\n",
58 | "df4.head(50)"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "id": "d0a19aef",
64 | "metadata": {},
65 | "source": [
66 | "## Data cleaning\n",
67 | "\n",
68 |     "The raw dataset was messy and terribly formatted, so I cleaned it in a separate Jupyter notebook to avoid cluttering this one and distracting from the main tutorial. If you'd like to see how I cleaned it up, please see the ['Data_cleaning_sexuality.ipynb'](Data_cleaning_sexuality.ipynb) notebook."
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "id": "85a5341a",
74 | "metadata": {},
75 | "source": [
76 | "## Data pre-processing\n",
77 | "\n",
78 | "The only pre-processing we're going to do is subset our data so that we have it ready to analyse in the following step."
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "id": "e5d4e369",
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "# Filtering the dataset for England only\n",
89 | "\n",
90 | "england_df = df4[df4['Country'] == 'England']"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "id": "6ffda0fe",
96 | "metadata": {},
97 | "source": [
98 | "## Interactive linegraph\n",
99 | "\n",
100 | "By now you probably know the drill. Just like we had our px.bar and px.scatter methods, we have a corresponding one for linegraphs, appropriately named px.line. The parameters used are the same, with the only difference being that we're using:\n",
101 | "\n",
102 | "* facet_row - when we specify a categorical variable here (Gender), this instructs Plotly to create a separate subplot (a row) for each unique value. \n",
103 | "\n",
104 |     "* facet_col - when we specify a categorical variable here (Country), this instructs Plotly to create a separate subplot (a column) for each unique value.\n",
105 | "\n",
106 |     "Thus, we get our 2x1 grid of line graphs. If we added another country, e.g. Scotland, and used these same parameters, we'd get a 2x2 grid, and so on. "
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "id": "3a66cbaa",
112 | "metadata": {},
113 | "source": [
114 | "## Interactive legends\n",
115 | "\n",
116 |     "Again, the cool thing about Plotly's legends is that they're interactive by default. Clicking a legend entry lets us hide values that dominate the graph and obscure our ability to get to the nitty-gritty of the data.\n"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "id": "6c0a5aea",
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "# Specify hover_data\n",
127 | "\n",
128 | "hover_data = {'Sexuality': True,\n",
129 |     "              'Percentage': ':.2f',\n",
130 | " 'Country': False,\n",
131 | " 'Year': False,\n",
132 | " 'Gender': True}"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "id": "a4bceeb7",
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "\n",
143 | "fig6 = px.line(england_df,\n",
144 | " x='Year',\n",
145 | " y='Percentage',\n",
146 | " color='Sexuality',\n",
147 | " facet_row='Gender',\n",
148 | " facet_col='Country',\n",
149 | " hover_data = hover_data,\n",
150 | " title='Sexuality Percentages by Gender in England (2010-2014)',\n",
151 | " markers=True,\n",
152 | " height = 800,\n",
153 | " width = 1000)\n",
154 | "\n",
155 | "# Enhance the layout for readability\n",
156 | "fig6.update_layout(title_x = 0.15,\n",
157 | " legend_title_text='Sexuality')\n",
158 | "\n",
159 | "fig6.show()"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "id": "0a510c8f",
166 | "metadata": {},
167 | "outputs": [],
168 | "source": [
169 | "# Finally, let's update our x-axis so that it only shows whole years\n",
170 | "\n",
171 |     "# dtick=1 - tells plotly to place a tick at every whole year (our Year column is numeric, not a date, so the date-style \"M12\" interval doesn't apply)\n",
172 |     "fig6.update_xaxes(dtick=1)"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "id": "f3ee60a7",
179 | "metadata": {},
180 | "outputs": [],
181 | "source": []
182 | }
183 | ],
184 | "metadata": {
185 | "kernelspec": {
186 | "display_name": "Python 3 (ipykernel)",
187 | "language": "python",
188 | "name": "python3"
189 | },
190 | "language_info": {
191 | "codemirror_mode": {
192 | "name": "ipython",
193 | "version": 3
194 | },
195 | "file_extension": ".py",
196 | "mimetype": "text/x-python",
197 | "name": "python",
198 | "nbconvert_exporter": "python",
199 | "pygments_lexer": "ipython3",
200 | "version": "3.10.13"
201 | },
202 | "toc": {
203 | "base_numbering": 1,
204 | "nav_menu": {},
205 | "number_sections": true,
206 | "sideBar": true,
207 | "skip_h1_title": false,
208 | "title_cell": "Table of Contents",
209 | "title_sidebar": "Contents",
210 | "toc_cell": false,
211 | "toc_position": {},
212 | "toc_section_display": true,
213 | "toc_window_display": true
214 | },
215 | "varInspector": {
216 | "cols": {
217 | "lenName": 16,
218 | "lenType": 16,
219 | "lenVar": 40
220 | },
221 | "kernels_config": {
222 | "python": {
223 | "delete_cmd_postfix": "",
224 | "delete_cmd_prefix": "del ",
225 | "library": "var_list.py",
226 | "varRefreshCmd": "print(var_dic_list())"
227 | },
228 | "r": {
229 | "delete_cmd_postfix": ") ",
230 | "delete_cmd_prefix": "rm(",
231 | "library": "var_list.r",
232 | "varRefreshCmd": "cat(var_dic_list()) "
233 | }
234 | },
235 | "types_to_exclude": [
236 | "module",
237 | "function",
238 | "builtin_function_or_method",
239 | "instance",
240 | "_Feature"
241 | ],
242 | "window_display": false
243 | }
244 | },
245 | "nbformat": 4,
246 | "nbformat_minor": 5
247 | }
248 |
--------------------------------------------------------------------------------
/Python_code/HTML_files/gi_per.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
6 |
7 |
--------------------------------------------------------------------------------
/Data/my_plotly_graph.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
6 |
7 |
--------------------------------------------------------------------------------
/Python_code/HTML_files/gi_age.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
6 |
7 |
--------------------------------------------------------------------------------
/Python_code/HTML_files/gi_age2.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
6 |
7 |
--------------------------------------------------------------------------------
/R_code/Line_graph.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": "\n"
7 | },
8 | {
9 | "cell_type": "code",
10 | "execution_count": null,
11 | "metadata": {},
12 | "outputs": [],
13 | "source": [
14 | "knitr::opts_chunk$set(echo = FALSE)\n",
15 | "\n"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": "\n"
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {},
27 | "outputs": [],
28 | "source": [
29 |     "# Allows us to read in csv files\n",
30 | "library(readr) \n",
31 | "# For data manipulation\n",
32 | "library(dplyr) \n",
33 | "# For regular expression operations \n",
34 | "library(stringr) \n",
35 |     "# Used to create interactive visualisations\n",
36 | "library(plotly)\n"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "# Dataset 3\n",
44 | "\n",
45 |     "This dataset contains sexual identity estimates by gender from 2010 to 2014, presented at a UK level and broken down by England, Wales, Scotland and Northern Ireland. I originally wanted this guide to include a demo of interactive line graphs using gender identity data, but since the 2021 Census was the first time the ONS collected that data, there are no year-on-year figures to plot. Instead, I found this 2015 release of experimental statistics from the Integrated Household Survey. For more info, you can check out this [ONS link](https://www.ons.gov.uk/peoplepopulationandcommunity/culturalidentity/sexuality/datasets/sexualidentitybyagegroupbycountry). \n"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "# Load in dataset\n",
55 | "\n",
56 | "df3 <- read_csv('../Data/cleaned_sexuality_df.csv')\n"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": "\n"
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": null,
67 | "metadata": {},
68 | "outputs": [],
69 | "source": [
70 | "# Brief glimpse at underlying data structure\n",
71 | "\n",
72 | "head(df3, 10)\n"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "## Data cleaning\n",
80 | "\n",
81 |     "The raw dataset was messy and terribly formatted, so I cleaned it in a separate Jupyter notebook to avoid cluttering this one and distracting from the main tutorial. If you'd like to see how I cleaned it up, please see the ['Data_cleaning_sexuality.ipynb'](Data_cleaning_sexuality.ipynb) notebook. \n",
82 | "\n",
83 | "## Data pre-processing\n",
84 | "\n",
85 | "The only pre-processing we're going to do is subset our data by country, and also create 2 separate datasets for Gender = Men and Gender = Women. I'll explain why this step is needed soon. \n"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {},
92 | "outputs": [],
93 | "source": [
94 | "# Filter dataset to focus on England\n",
95 | "england_df <- df3 %>%\n",
96 | " filter(Country == 'England')\n"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": "\n"
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 |     "# Let's check it worked...\n",
111 | "\n",
112 | "unique(england_df$Country)\n"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": "\n"
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": null,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "# Further filter data for each gender\n",
127 | "\n",
128 | "men <- england_df %>% filter(Gender == \"Men\")\n",
129 | "women <- england_df %>% filter(Gender == \"Women\")\n",
130 | "\n",
131 | "# Let's check it worked\n",
132 | "\n",
133 | "unique(men$Gender)\n",
134 | "unique(women$Gender)\n"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "## Interactive linegraph\n",
142 | "\n",
143 |     "Creating a simple line graph in Plotly is pretty easy, but where Plotly struggles in R is in handling facet plots. A facet plot is a type of visualisation that divides data into subplots based on categorical variables. What I'd like to do is create a facet plot of sexuality percentages in England (2010-2014), with individual subplots for our two genders. This is easy in Python thanks to the plotly.express module, which provides a simple way to create facet plots. In R, unfortunately, we have to take a slightly more long-winded route: we'll manually create an individual plot for each gender, then combine them using the subplot function. Also, plotly.express automatically manages legends to ensure they're unified across facets, whereas R's plotly requires us to sync up these legends manually. Womp womp. Let's get to it. \n"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "# Create individual plot for each gender\n",
153 | "\n",
154 | "# Create plots for each gender\n",
155 | "men_plot <- plot_ly(men, \n",
156 | " x = ~Year, \n",
157 | " y = ~Percentage, \n",
158 | " color = ~Sexuality, \n",
159 | " type = 'scatter', \n",
160 | " # mode used to make sure our data points are connected by lines across the years\n",
161 | " mode = 'lines+markers', \n",
162 | " hoverinfo = 'text',\n",
163 |     "                    text = ~paste(\"Year:\", Year, \"<br>Percentage:\", Percentage, \"<br>Sexuality:\", Sexuality),\n",
164 | " # legendgroup parameter ensures that data points relating to the same category are synced across plots\n",
165 | " legendgroup = ~Sexuality,\n",
166 | " # showlegend parameter set to TRUE only for this plot to avoid duplicate legends\n",
167 | " showlegend = TRUE) %>%\n",
168 | " layout(xaxis = list(title = 'Year', tickvals = 2010:2014, ticktext = 2010:2014),\n",
169 | " yaxis = list(title = 'Percentage'),\n",
170 | " # Here we add an annotation to the graph to label the first subplot \"Men\"\n",
171 | " # Setting xref and yref to 'paper' simply means the annotation won't move if we zoom in or out\n",
172 | " annotations = list(\n",
173 | " list(x = 0.5, y = 1.05, text = \"Men\", showarrow = FALSE, xref='paper', yref='paper')))\n",
174 | "\n",
175 | "\n",
176 | "women_plot <- plot_ly(women, \n",
177 | " x = ~Year, \n",
178 | " y = ~Percentage, \n",
179 | " color = ~Sexuality, \n",
180 | " type = 'scatter', \n",
181 | " mode = 'lines+markers', \n",
182 | " hoverinfo = 'text',\n",
183 |     "                      text = ~paste(\"Year:\", Year, \"<br>Percentage:\", Percentage, \"<br>Sexuality:\", Sexuality),\n",
184 | " legendgroup = ~Sexuality,\n",
185 | " showlegend = FALSE) %>%\n",
186 | " layout(xaxis = list(title = 'Year', tickvals = 2010:2014, ticktext = 2010:2014),\n",
187 | " yaxis = list(title = 'Percentage'),\n",
188 | " annotations = list(\n",
189 | " list(x = 0.5, y = 1.05, text = \"Women\", showarrow = FALSE, xref='paper', yref='paper')))\n",
190 | "\n",
191 | "# Let's take a look at one of these graphs\n",
192 | "\n",
193 | "women_plot\n"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": "\n"
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "# Combine individual plots using subplot\n",
208 | "# Within subplot, define number of rows, make sure share same x axes and both axes titles\n",
209 | "fig5 <- subplot(men_plot, women_plot, nrows = 2, shareX = TRUE, titleX = TRUE, titleY = TRUE) %>%\n",
210 | " layout(\n",
211 | " title = list(\n",
212 | " text = 'Sexuality Percentages by Gender in England (2010-2014)', \n",
213 | " y = 0.98, # Move the title higher up\n",
214 | " x = 0.5, # Center the title\n",
215 | " xanchor = \"center\",\n",
216 | " yanchor = \"top\"\n",
217 | " ),\n",
218 | " margin = list(t = 100), # Add space at the top for the title\n",
219 | " height = 800,\n",
220 | " width = 1000\n",
221 | " )\n",
222 | "\n",
223 | "fig5\n"
224 | ]
225 | }
226 | ],
227 | "metadata": {
228 | "anaconda-cloud": "",
229 | "kernelspec": {
230 | "display_name": "R",
231 |    "language": "R",
232 | "name": "ir"
233 | },
234 | "language_info": {
235 | "codemirror_mode": "r",
236 | "file_extension": ".r",
237 | "mimetype": "text/x-r-source",
238 | "name": "R",
239 | "pygments_lexer": "r",
240 | "version": "3.4.1"
241 | }
242 | },
243 | "nbformat": 4,
244 | "nbformat_minor": 1
245 | }
246 |
--------------------------------------------------------------------------------
/Data/cleaned_sexuality_df.csv:
--------------------------------------------------------------------------------
1 | Country,Gender,Sexuality,Year,Percentage
2 | England,Men,Heterosexual / Straight,2010,93.46
3 | England,Men,Gay / Lesbian,2010,1.41
4 | England,Men,Bisexual,2010,0.38
5 | England,Men,Other,2010,0.44
6 | England,Men,Don't know/refuse,2010,3.68
7 | England,Men,Non-response,2010,0.63
8 | England,Men,Heterosexual / Straight,2011,93.61
9 | England,Men,Gay / Lesbian,2011,1.43
10 | England,Men,Bisexual,2011,0.39
11 | England,Men,Other,2011,0.38
12 | England,Men,Don't know/refuse,2011,3.61
13 | England,Men,Non-response,2011,0.58
14 | England,Men,Heterosexual / Straight,2012,93.0
15 | England,Men,Gay / Lesbian,2012,1.52
16 | England,Men,Bisexual,2012,0.32
17 | England,Men,Other,2012,0.32
18 | England,Men,Don't know/refuse,2012,3.66
19 | England,Men,Non-response,2012,1.18
20 | England,Men,Heterosexual / Straight,2013,91.96
21 | England,Men,Gay / Lesbian,2013,1.65
22 | England,Men,Bisexual,2013,0.36
23 | England,Men,Other,2013,0.26
24 | England,Men,Don't know/refuse,2013,4.15
25 | England,Men,Non-response,2013,1.61
26 | England,Men,Heterosexual / Straight,2014,92.24
27 | England,Men,Gay / Lesbian,2014,1.54
28 | England,Men,Bisexual,2014,0.32
29 | England,Men,Other,2014,0.31
30 | England,Men,Don't know/refuse,2014,3.95
31 | England,Men,Non-response,2014,1.63
32 | England,Women,Heterosexual / Straight,2010,94.31
33 | England,Women,Gay / Lesbian,2010,0.64
34 | England,Women,Bisexual,2010,0.61
35 | England,Women,Other,2010,0.34
36 | England,Women,Don't know/refuse,2010,3.61
37 | England,Women,Non-response,2010,0.5
38 | England,Women,Heterosexual / Straight,2011,94.21
39 | England,Women,Gay / Lesbian,2011,0.66
40 | England,Women,Bisexual,2011,0.61
41 | England,Women,Other,2011,0.29
42 | England,Women,Don't know/refuse,2011,3.78
43 | England,Women,Non-response,2011,0.44
44 | England,Women,Heterosexual / Straight,2012,93.51
45 | England,Women,Gay / Lesbian,2012,0.74
46 | England,Women,Bisexual,2012,0.55
47 | England,Women,Other,2012,0.25
48 | England,Women,Don't know/refuse,2012,3.98
49 | England,Women,Non-response,2012,0.97
50 | England,Women,Heterosexual / Straight,2013,92.93
51 | England,Women,Gay / Lesbian,2013,0.82
52 | England,Women,Bisexual,2013,0.55
53 | England,Women,Other,2013,0.26
54 | England,Women,Don't know/refuse,2013,4.06
55 | England,Women,Non-response,2013,1.37
56 | England,Women,Heterosexual / Straight,2014,92.82
57 | England,Women,Gay / Lesbian,2014,0.68
58 | England,Women,Bisexual,2014,0.69
59 | England,Women,Other,2014,0.34
60 | England,Women,Don't know/refuse,2014,4.19
61 | England,Women,Non-response,2014,1.28
62 | Nireland,Men,Heterosexual / Straight,2010,94.25
63 | Nireland,Men,Gay / Lesbian,2010,1.22
64 | Nireland,Men,Bisexual,2010,
65 | Nireland,Men,Other,2010,0.28
66 | Nireland,Men,Don't know/refuse,2010,3.88
67 | Nireland,Men,Non-response1,2010,
68 | Nireland,Men,Heterosexual / Straight,2011,94.73
69 | Nireland,Men,Gay / Lesbian,2011,1.01
70 | Nireland,Men,Bisexual,2011,0.23
71 | Nireland,Men,Other,2011,0.28
72 | Nireland,Men,Don't know/refuse,2011,3.68
73 | Nireland,Men,Non-response1,2011,
74 | Nireland,Men,Heterosexual / Straight,2012,94.65
75 | Nireland,Men,Gay / Lesbian,2012,0.75
76 | Nireland,Men,Bisexual,2012,0.51
77 | Nireland,Men,Other,2012,0.58
78 | Nireland,Men,Don't know/refuse,2012,2.7
79 | Nireland,Men,Non-response1,2012,0.82
80 | Nireland,Men,Heterosexual / Straight,2013,93.41
81 | Nireland,Men,Gay / Lesbian,2013,0.89
82 | Nireland,Men,Bisexual,2013,0.42
83 | Nireland,Men,Other,2013,0.51
84 | Nireland,Men,Don't know/refuse,2013,3.68
85 | Nireland,Men,Non-response1,2013,1.08
86 | Nireland,Men,Heterosexual / Straight,2014,91.46
87 | Nireland,Men,Gay / Lesbian,2014,1.18
88 | Nireland,Men,Bisexual,2014,0.51
89 | Nireland,Men,Other,2014,0.36
90 | Nireland,Men,Don't know/refuse,2014,5.08
91 | Nireland,Men,Non-response1,2014,1.41
92 | Nireland,Women,Heterosexual / Straight,2010,93.54
93 | Nireland,Women,Gay / Lesbian,2010,0.53
94 | Nireland,Women,Bisexual,2010,0.24
95 | Nireland,Women,Other,2010,0.56
96 | Nireland,Women,Don't know/refuse,2010,4.38
97 | Nireland,Women,Non-response,2010,0.75
98 | Nireland,Women,Heterosexual / Straight,2011,95.97
99 | Nireland,Women,Gay / Lesbian,2011,0.2
100 | Nireland,Women,Bisexual,2011,0.51
101 | Nireland,Women,Other,2011,0.35
102 | Nireland,Women,Don't know/refuse,2011,2.97
103 | Nireland,Women,Non-response,2011,
104 | Nireland,Women,Heterosexual / Straight,2012,94.9
105 | Nireland,Women,Gay / Lesbian,2012,0.2
106 | Nireland,Women,Bisexual,2012,0.83
107 | Nireland,Women,Other,2012,0.12
108 | Nireland,Women,Don't know/refuse,2012,3.06
109 | Nireland,Women,Non-response,2012,0.89
110 | Nireland,Women,Heterosexual / Straight,2013,94.17
111 | Nireland,Women,Gay / Lesbian,2013,
112 | Nireland,Women,Bisexual,2013,1.21
113 | Nireland,Women,Other,2013,0.12
114 | Nireland,Women,Don't know/refuse,2013,3.57
115 | Nireland,Women,Non-response,2013,0.73
116 | Nireland,Women,Heterosexual / Straight,2014,94.45
117 | Nireland,Women,Gay / Lesbian,2014,
118 | Nireland,Women,Bisexual,2014,1.38
119 | Nireland,Women,Other,2014,0.16
120 | Nireland,Women,Don't know/refuse,2014,2.93
121 | Nireland,Women,Non-response,2014,0.95
122 | Scotland,Men,Heterosexual / Straight,2010,94.89
123 | Scotland,Men,Gay / Lesbian,2010,1.14
124 | Scotland,Men,Bisexual,2010,0.28
125 | Scotland,Men,Other,2010,0.32
126 | Scotland,Men,Don't know/refuse,2010,2.16
127 | Scotland,Men,Non-response,2010,1.2
128 | Scotland,Men,Heterosexual / Straight,2011,94.94
129 | Scotland,Men,Gay / Lesbian,2011,1.37
130 | Scotland,Men,Bisexual,2011,0.46
131 | Scotland,Men,Other,2011,0.31
132 | Scotland,Men,Don't know/refuse,2011,1.99
133 | Scotland,Men,Non-response,2011,0.92
134 | Scotland,Men,Heterosexual / Straight,2012,94.78
135 | Scotland,Men,Gay / Lesbian,2012,1.29
136 | Scotland,Men,Bisexual,2012,0.26
137 | Scotland,Men,Other,2012,0.28
138 | Scotland,Men,Don't know/refuse,2012,2.26
139 | Scotland,Men,Non-response,2012,1.13
140 | Scotland,Men,Heterosexual / Straight,2013,94.43
141 | Scotland,Men,Gay / Lesbian,2013,0.97
142 | Scotland,Men,Bisexual,2013,0.29
143 | Scotland,Men,Other,2013,0.14
144 | Scotland,Men,Don't know/refuse,2013,2.45
145 | Scotland,Men,Non-response,2013,1.71
146 | Scotland,Men,Heterosexual / Straight,2014,94.37
147 | Scotland,Men,Gay / Lesbian,2014,1.09
148 | Scotland,Men,Bisexual,2014,0.29
149 | Scotland,Men,Other,2014,0.25
150 | Scotland,Men,Don't know/refuse,2014,2.55
151 | Scotland,Men,Non-response,2014,1.46
152 | Scotland,Women,Heterosexual / Straight,2010,94.81
153 | Scotland,Women,Gay / Lesbian,2010,0.68
154 | Scotland,Women,Bisexual,2010,0.36
155 | Scotland,Women,Other,2010,0.37
156 | Scotland,Women,Don't know/refuse,2010,2.63
157 | Scotland,Women,Non-response,2010,1.15
158 | Scotland,Women,Heterosexual / Straight,2011,94.87
159 | Scotland,Women,Gay / Lesbian,2011,0.85
160 | Scotland,Women,Bisexual,2011,0.48
161 | Scotland,Women,Other,2011,0.29
162 | Scotland,Women,Don't know/refuse,2011,2.71
163 | Scotland,Women,Non-response,2011,0.79
164 | Scotland,Women,Heterosexual / Straight,2012,95.02
165 | Scotland,Women,Gay / Lesbian,2012,0.76
166 | Scotland,Women,Bisexual,2012,0.47
167 | Scotland,Women,Other,2012,0.27
168 | Scotland,Women,Don't know/refuse,2012,2.43
169 | Scotland,Women,Non-response,2012,1.05
170 | Scotland,Women,Heterosexual / Straight,2013,94.22
171 | Scotland,Women,Gay / Lesbian,2013,1.06
172 | Scotland,Women,Bisexual,2013,0.35
173 | Scotland,Women,Other,2013,0.27
174 | Scotland,Women,Don't know/refuse,2013,2.6
175 | Scotland,Women,Non-response,2013,1.5
176 | Scotland,Women,Heterosexual / Straight,2014,94.9
177 | Scotland,Women,Gay / Lesbian,2014,0.56
178 | Scotland,Women,Bisexual,2014,0.36
179 | Scotland,Women,Other,2014,0.26
180 | Scotland,Women,Don't know/refuse,2014,2.63
181 | Scotland,Women,Non-response,2014,1.28
182 | UK,Men,Heterosexual / Straight,2010,93.63
183 | UK,Men,Gay / Lesbian,2010,1.37
184 | UK,Men,Bisexual,2010,0.36
185 | UK,Men,Other,2010,0.42
186 | UK,Men,Don't know/refuse,2010,3.52
187 | UK,Men,Non-response,2010,0.7
188 | UK,Men,Heterosexual / Straight,2011,93.83
189 | UK,Men,Gay / Lesbian,2011,1.39
190 | UK,Men,Bisexual,2011,0.39
191 | UK,Men,Other,2011,0.37
192 | UK,Men,Don't know/refuse,2011,3.42
193 | UK,Men,Non-response,2011,0.6
194 | UK,Men,Heterosexual / Straight,2012,93.24
195 | UK,Men,Gay / Lesbian,2012,1.46
196 | UK,Men,Bisexual,2012,0.33
197 | UK,Men,Other,2012,0.33
198 | UK,Men,Don't know/refuse,2012,3.48
199 | UK,Men,Non-response,2012,1.17
200 | UK,Men,Heterosexual / Straight,2013,92.26
201 | UK,Men,Gay / Lesbian,2013,1.55
202 | UK,Men,Bisexual,2013,0.36
203 | UK,Men,Other,2013,0.26
204 | UK,Men,Don't know/refuse,2013,3.94
205 | UK,Men,Non-response,2013,1.62
206 | UK,Men,Heterosexual / Straight,2014,92.46
207 | UK,Men,Gay / Lesbian,2014,1.49
208 | UK,Men,Bisexual,2014,0.32
209 | UK,Men,Other,2014,0.32
210 | UK,Men,Don't know/refuse,2014,3.81
211 | UK,Men,Non-response,2014,1.59
212 | UK,Women,Heterosexual / Straight,2010,94.35
213 | UK,Women,Gay / Lesbian,2010,0.64
214 | UK,Women,Bisexual,2010,0.56
215 | UK,Women,Other,2010,0.35
216 | UK,Women,Don't know/refuse,2010,3.5
217 | UK,Women,Non-response,2010,0.6
218 | UK,Women,Heterosexual / Straight,2011,94.36
219 | UK,Women,Gay / Lesbian,2011,0.67
220 | UK,Women,Bisexual,2011,0.59
221 | UK,Women,Other,2011,0.29
222 | UK,Women,Don't know/refuse,2011,3.61
223 | UK,Women,Non-response,2011,0.48
224 | UK,Women,Heterosexual / Straight,2012,93.74
225 | UK,Women,Gay / Lesbian,2012,0.72
226 | UK,Women,Bisexual,2012,0.54
227 | UK,Women,Other,2012,0.25
228 | UK,Women,Don't know/refuse,2012,3.76
229 | UK,Women,Non-response,2012,0.99
230 | UK,Women,Heterosexual / Straight,2013,93.13
231 | UK,Women,Gay / Lesbian,2013,0.82
232 | UK,Women,Bisexual,2013,0.55
233 | UK,Women,Other,2013,0.27
234 | UK,Women,Don't know/refuse,2013,3.86
235 | UK,Women,Non-response,2013,1.38
236 | UK,Women,Heterosexual / Straight,2014,93.12
237 | UK,Women,Gay / Lesbian,2014,0.65
238 | UK,Women,Bisexual,2014,0.68
239 | UK,Women,Other,2014,0.33
240 | UK,Women,Don't know/refuse,2014,3.97
241 | UK,Women,Non-response,2014,1.26
242 | Wales,Men,Heterosexual / Straight,2010,94.0
243 | Wales,Men,Gay / Lesbian,2010,1.21
244 | Wales,Men,Bisexual,2010,0.33
245 | Wales,Men,Other,2010,0.3
246 | Wales,Men,Don't know/refuse,2010,2.96
247 | Wales,Men,Non-response,2010,1.2
248 | Wales,Men,Heterosexual / Straight,2011,95.28
249 | Wales,Men,Gay / Lesbian,2011,1.02
250 | Wales,Men,Bisexual,2011,0.31
251 | Wales,Men,Other,2011,0.24
252 | Wales,Men,Don't know/refuse,2011,2.33
253 | Wales,Men,Non-response,2011,0.83
254 | Wales,Men,Heterosexual / Straight,2012,93.83
255 | Wales,Men,Gay / Lesbian,2012,1.15
256 | Wales,Men,Bisexual,2012,0.5
257 | Wales,Men,Other,2012,0.46
258 | Wales,Men,Don't know/refuse,2012,2.78
259 | Wales,Men,Non-response,2012,1.28
260 | Wales,Men,Heterosexual / Straight,2013,93.13
261 | Wales,Men,Gay / Lesbian,2013,1.22
262 | Wales,Men,Bisexual,2013,0.38
263 | Wales,Men,Other,2013,0.32
264 | Wales,Men,Don't know/refuse,2013,3.05
265 | Wales,Men,Non-response,2013,1.9
266 | Wales,Men,Heterosexual / Straight,2014,93.6
267 | Wales,Men,Gay / Lesbian,2014,1.48
268 | Wales,Men,Bisexual,2014,0.24
269 | Wales,Men,Other,2014,0.49
270 | Wales,Men,Don't know/refuse,2014,2.95
271 | Wales,Men,Non-response,2014,1.24
272 | Wales,Women,Heterosexual / Straight,2010,94.8
273 | Wales,Women,Gay / Lesbian,2010,0.54
274 | Wales,Women,Bisexual,2010,0.2
275 | Wales,Women,Other,2010,0.39
276 | Wales,Women,Don't know/refuse,2010,2.73
277 | Wales,Women,Non-response,2010,1.34
278 | Wales,Women,Heterosexual / Straight,2011,95.03
279 | Wales,Women,Gay / Lesbian,2011,0.77
280 | Wales,Women,Bisexual,2011,0.33
281 | Wales,Women,Other,2011,0.32
282 | Wales,Women,Don't know/refuse,2011,2.62
283 | Wales,Women,Non-response,2011,0.92
284 | Wales,Women,Heterosexual / Straight,2012,94.66
285 | Wales,Women,Gay / Lesbian,2012,0.56
286 | Wales,Women,Bisexual,2012,0.36
287 | Wales,Women,Other,2012,0.34
288 | Wales,Women,Don't know/refuse,2012,2.81
289 | Wales,Women,Non-response,2012,1.28
290 | Wales,Women,Heterosexual / Straight,2013,93.93
291 | Wales,Women,Gay / Lesbian,2013,0.67
292 | Wales,Women,Bisexual,2013,0.53
293 | Wales,Women,Other,2013,0.39
294 | Wales,Women,Don't know/refuse,2013,2.78
295 | Wales,Women,Non-response,2013,1.71
296 | Wales,Women,Heterosexual / Straight,2014,94.25
297 | Wales,Women,Gay / Lesbian,2014,0.62
298 | Wales,Women,Bisexual,2014,0.7
299 | Wales,Women,Other,2014,0.4
300 | Wales,Women,Don't know/refuse,2014,3.04
301 | Wales,Women,Non-response,2014,0.99
302 |
--------------------------------------------------------------------------------
/Python_code/HTML_files/line.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
6 |
7 |
--------------------------------------------------------------------------------
/R_code/Leaflet_map.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Interactive Mapping"
3 | output: html_document
4 | editor_options:
5 | markdown:
6 | wrap: sentence
7 | ---
8 |
9 | ```{r setup, include=FALSE}
10 | knitr::opts_chunk$set(echo=TRUE)
11 | ```
12 |
13 | ## Guide
14 |
15 | In this notebook we will create an interactive map of England and Wales which displays the percentage of trans men in each local authority.
16 | There are 331 local authorities in England and Wales, and we are using data collected in the 2021 Census, which included 2 new questions on sexuality and gender identity.
17 | The datasets used are:
18 |
19 | - [Gender identity (detailed)](https://www.ons.gov.uk/datasets/TS070/editions/2021/versions/3) - this dataset classifies usual residents aged 16 years and over in England and Wales by gender identity.
20 | - [Local Authority District Boundaries](https://geoportal.statistics.gov.uk/datasets/bb53f91cce9e4fd6b661dc0a6c734a3f_0/about) - this file contains the digital vector boundaries for Local Authority Districts in the UK as of May 2022.
21 |
22 | ## Install packages
23 |
24 | If you're running this code on your own PC (and not through the Binder link), you'll want to uncomment the lines below to install the requisite packages. Also remember to set your working directory to the correct folder; otherwise, reading in the data will fail.
25 |
26 | ```{r}
27 | # install.packages("leaflet")
28 | # install.packages("sf")
29 | # install.packages("dplyr")
30 | # install.packages("readr")
31 | ```
32 |
33 | ## Import libraries
34 |
35 | ```{r}
36 |
37 | # used to read-in datasets
38 | library(readr)
39 | # used to manipulate datasets
40 | library(dplyr)
41 | # used to read-in spatial data, shapefiles
42 | library(sf)
43 | # used to create interactive maps
44 | library(leaflet)
45 | # used to scrape data from websites
46 | library(httr)
47 |
48 | ```
49 |
50 | ## Read-in dataset
51 |
52 | ```{r}
53 | # First, let's read in our gender identity dataset
54 |
55 | df <- read_csv('../Data/GI_det_EW.csv')
56 | ```
57 |
58 |
59 | ```{r}
60 | # Use head function to check out the first few rows - but can also access df via environment pane
61 |
62 | head(df, 10)
63 | ```
64 |
65 | ## Data Cleaning
66 |
67 | Before we can calculate the percentage of trans men in each local authority, it's good to do some housekeeping and get our dataframe in order.
68 |
69 | There are a few things that need sorting, including:
70 |
71 | 1. renaming columns so they are easier to reference
72 | 2. removing 'Does not apply' from gender identity category
73 |
74 |
75 | ### Pipe operator - %\>%
76 |
77 | The pipe operator is used to pass the result of one function directly into the next one.
78 | E.g. let's say we had some code:
79 |
80 | ```{}
81 | sorted_data <- my_data %>% filter(condition) %>% arrange(sorting_variable)
82 | ```
83 |
84 | What we're doing is using the pipe operator to pass my_data to the filter() function, and the result of this is then passed to the arrange() function.
85 |
86 | Basically, pipes allow us to chain together a sequence of functions in a way that's easy to read and understand.
87 |
88 | In the code below we use the pipe operator to pass our dataframe to the rename function.
89 |
90 | This supplies the rename function with its first argument, which is the dataframe to operate on.
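
Here's a self-contained toy example of a pipe chain, using R's built-in mtcars dataset (this assumes dplyr is loaded, as above):

```{r}
# Keep only 6-cylinder cars, then sort them by descending fuel efficiency
mtcars %>%
  filter(cyl == 6) %>%
  arrange(desc(mpg))
```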
91 |
92 | ### 1
93 |
94 | ```{r}
95 | # Rename columns using the rename function from dplyr
96 | # Specify what you want to rename the column to, and supply the original column string
97 |
98 | df <- df %>%
99 | rename(LA_code = `Lower tier local authorities Code`,
100 | # backticks ` necessary when names are syntactically invalid, e.g. spaces, special characters etc.
101 | LA_name = `Lower tier local authorities`,
102 | GI_code = `Gender identity (8 categories) Code`,
103 | GI_cat = `Gender identity (8 categories)`)
104 | ```
105 |
106 |
107 | ```{r}
108 | # Let's use the colnames function to see if it worked
109 |
110 | colnames(df)
111 | ```
112 |
113 | ### 2
114 |
115 | ### Logical operators - ==, !=, \<, \>, \<=, \>=, &, \|, !
116 |
117 | Logical operators are used to perform comparisons between values or expressions, which result in a logical (Boolean) value of 'TRUE' or 'FALSE'.
118 |
119 | In the code below we use the '!=' 'Does not equal' operator which tests if the GI_cat value in each row of the df does not equal the string 'Does not apply'.
120 |
121 | For each row where GI_cat is not equal to 'Does not apply', the expression evaluates to TRUE.
122 |
123 | We filter so we only keep rows where this expression evaluates to TRUE.
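
As a quick, self-contained illustration of how '!=' behaves on a vector (the category names here are just examples):

```{r}
x <- c("Trans man", "Does not apply", "Trans woman")

# Element-wise comparison returns a logical vector
x != "Does not apply"   # TRUE FALSE TRUE

# Subsetting with that logical vector keeps only the TRUE elements,
# which is essentially what filter() does for rows of a dataframe
x[x != "Does not apply"]
```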
124 |
125 | ```{r}
126 |
127 | # Use dplyr's filter function to get rid of 'Does not apply'
128 | # Use '!=' to keep everything except 'Does not apply' category
129 |
130 | df <- df %>% filter(GI_cat != 'Does not apply')
131 |
132 | ```
133 |
134 | ### Dollar sign operator - $
135 |
136 | This operator is used to access elements, such as columns of a dataframe, by name. Below, we use it to access the gender identity category column so we can view its unique values.
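
A minimal example of '$' on a toy dataframe:

```{r}
toy_df <- data.frame(name = c("A", "B", "B"), value = c(1, 2, 2))

# '$' extracts a single column as a vector
toy_df$name

# unique() then drops any repeated values
unique(toy_df$name)
```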
137 |
138 | ```{r}
139 | # Unique function can be applied to a column in a df to see which values are in that column
140 | # Let's see if 'Does not apply' has been successfully dropped
141 |
142 | unique(df$GI_cat)
143 |
144 | ```
145 |
146 |
147 | ## Data Pre-processing
148 |
149 | Now onto the more interesting stuff.
150 | The data pre-processing stage involves preparing and transforming data into a suitable format for further analysis. It can involve selecting features, transforming variables, and creating new variables. For our purposes, we need to create a new column 'Percentage' which contains the percentage of trans men in each local authority.
151 |
152 | So, we'll need to first calculate the % of each gender identity category for each local authority. Then, we'll want to filter our dataset so that we only keep the responses related to Trans men.
153 |
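Before running this on the real data, here's what the group-wise calculation does on a tiny made-up dataframe (assuming dplyr is loaded):

```{r}
toy <- data.frame(area  = c("X", "X", "Y", "Y"),
                  count = c(30, 70, 10, 90))

# Within each area, each count is divided by that area's total
toy %>%
  group_by(area) %>%
  mutate(pct = round(count / sum(count) * 100, 2))
# X rows become 30% and 70%; Y rows become 10% and 90%
```
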
154 | ```{r}
155 | # Use group_by to group the dataframe by the LA_name column
156 | # Use mutate to perform calculation within each LA_name group, convert result to a % by multiplying by 100
157 | # round() is used to round %'s to 2 decimal places
158 |
159 | df <- df %>%
160 | group_by(LA_name) %>%
161 | mutate(Percentage = round(Observation / sum(Observation) * 100, 2))
162 | ```
163 |
164 |
165 | ```{r}
166 | # Let's check out the results
167 |
168 | head(df, 10)
169 | ```
170 |
171 | ```{r}
172 | # Use filter() to only keep rows where GI_cat equals 'Trans man'
173 | df <- df %>%
174 | filter(GI_cat == 'Trans man') %>%
175 | # Use select() with '-' to remove 'Observation' column
176 | select(-Observation) %>%
177 | # Use distinct() to remove duplicate rows, as a precaution
178 | distinct() %>%
179 |   # Use ungroup() to remove grouping - resetting the dataframe's state after performing group operations is good practice
180 | ungroup()
181 | ```
182 |
183 |
184 | ```{r}
185 | # Let's take a look at the results
186 | head(df)
187 | ```
188 |
189 | ## Read-in shapefile
190 |
191 | Now that we have our gender identity dataset sorted, we can start on the mapping process. And that starts with reading in our shapefile, which we should have downloaded from the geoportal. If (like me) you don't work with spatial data much, you might assume that you only need the .shp file, and you might delete the others that come with the folder. However, a shapefile is not a single .shp file but a collection of files that work together, and each of these files plays a crucial role in defining the shapefile's data and behaviour. When you try to read a shapefile into R, the software expects all components to be present, and missing them can lead to errors or incorrect spatial references. E.g. without the .dbf file you'd lose all attribute data associated with the geographic features, and without the .shx file you might not be able to read the .shp file at all.
192 |
193 | **TLDR: Make sure when you download the shapefile folder you keep all the files!**
194 |
195 | Anyway, let's get started.
196 |
197 | ```{r}
198 | # Download shapefiles from geoportal
199 |
200 | # URL for the direct download of the shapefile
201 | url <- "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Local_Authority_Districts_May_2022_UK_BFE_V3_2022/FeatureServer/replicafilescache/Local_Authority_Districts_May_2022_UK_BFE_V3_2022_3331011932393166417.zip"
202 |
203 | # Create a temporary directory
204 | tmp_dir <- tempdir()
205 | print(paste("Created temporary directory:", tmp_dir))
206 |
207 | # Set destination file path
208 | dest_file <- file.path(tmp_dir, "shapefile.zip")
209 |
210 | # Download the shapefile
211 | response <- GET(url, write_disk(dest_file, overwrite = TRUE))
212 |
213 | # Check if the download was successful; stop with an error if not
214 | if (response$status_code != 200) stop("Download failed with status: ", response$status_code)
215 | print("Download successful")
216 |
217 | # Unzip the file within the temporary directory
218 | unzip(dest_file, exdir = tmp_dir)
219 | print(paste("Files extracted to:", tmp_dir))
220 |
221 | # List all files in the temporary directory to verify extraction
222 | extracted_files <- list.files(tmp_dir)
223 | print("Extracted files:")
224 | print(extracted_files)
225 |
226 | # Define the path to the actual shapefile (.shp)
227 | shapefile_path <- file.path(tmp_dir, 'LAD_MAY_2022_UK_BFE_V3.shp')
228 |
229 | # Read in shapefile to a simple features object
230 | # st_read() reads in spatial data to a 'simple features' object
231 | sf <- st_read(shapefile_path)
232 | print("Shapefile loaded successfully.")
233 |
234 |
235 | ```
236 |
237 |
238 |
239 | ```{r}
240 | # Let's check it out
241 | head(sf)
242 | # Better to just view via environment pane
243 | ```
244 |
245 |
246 | ```{r}
247 | # Inspect dimensions
248 | dim(sf)
249 | ```
250 |
251 | ```{r}
252 | # length() with the unique() function gives us the number of unique values in a column
253 |
254 | length(unique(sf$LAD22NM))
255 | ```
256 |
257 | ## Cleaning shapefile
258 |
259 | Hmm. We have 331 local authorities in our dataset that we want to plot, but there are 374 listed here.
260 | We'll need to remove the local authorities that don't match the ones in our df.
261 |
262 | 1. rename columns to match 'df'
263 | 2. get rid of redundant Local Authorities
264 |
265 | ### 1
266 |
267 | ```{r}
268 | # Use rename function so sf columns match those in original df
269 |
270 | sf <- sf %>%
271 | rename(LA_code = LAD22CD,
272 | LA_name = LAD22NM)
273 |
274 | # Let's see if it worked
275 | colnames(sf)
276 | ```
277 |
278 | ```{r}
279 | # Replace specific values in the LA_name column using recode()
280 |
281 | sf$LA_name <- sf$LA_name %>%
282 | recode(`Bristol, City of` = "Bristol",
283 | `Kingston upon Hull, City of` = "Kingston upon Hull",
284 | `Herefordshire, County of` = "Herefordshire")
285 | ```
286 |
287 | ### 2
288 |
289 | ### %in% operator
290 |
291 | This is used to check whether the elements of one vector are present in another vector.
292 | Much like the logical operators, it returns a logical value, TRUE or FALSE, for each element.
293 | Below, we keep only the rows of 'sf' whose LA_code values are present in the LA_code column of 'df'.
294 |
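A self-contained example of %in% (the codes are made up for illustration):

```{r}
keep  <- c("E06000001", "E06000002")
codes <- c("E06000001", "S12000033", "E06000002")

# One TRUE/FALSE per element of 'codes'
codes %in% keep   # TRUE FALSE TRUE
```
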
295 | ```{r}
296 | # Use filter() with %in% and unique() to only keep LA's that match
297 |
298 | sf <- sf %>%
299 | filter(LA_code %in% unique(df$LA_code))
300 | ```
301 |
302 |
303 | ```{r}
304 | # Let's see how it looks..
305 | # We should have 331 unique LA_codes
306 | length(unique(sf$LA_code))
307 | ```
308 |
309 | ## Pre-processing shapefile
310 |
311 | When it comes to mapping our data, it is important that we know which Coordinate Reference System (CRS) we are working with. Simply put, the CRS describes how the spatial data in the 'sf' object maps to locations on Earth - a way of translating 3D reality into 2D maps. When using mapping libraries like 'leaflet', knowing the CRS matters because leaflet expects coordinates as longitude and latitude on the WGS84 datum, i.e. EPSG:4326. If our CRS isn't in this format then we need to transform it so that it matches what leaflet expects. Let's go ahead and see what our CRS is saying.
312 |
313 | ```{r}
314 | # st_crs() shows our CRS info
315 | st_crs(sf)
316 | ```
317 |
318 |
319 | ```{r}
320 | # To transform our crs to EPSG: 4326, simply use st_transform() and specify the crs
321 | # Note: you don't have to use the %>% pipe operator all the time
322 | sf <- st_transform(sf, crs = 4326)
323 | ```
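
As a self-contained illustration of what st_transform() does (assuming sf is loaded), here's a single made-up point in the British National Grid (EPSG:27700) reprojected to EPSG:4326:

```{r}
pt <- st_sfc(st_point(c(400000, 300000)), crs = 27700)

# After transforming, the coordinates are in longitude/latitude
st_transform(pt, crs = 4326)
```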
324 |
325 | ### Merge datasets
326 |
327 | What we want to do now is merge our 'df' dataframe with our 'sf' spatial object, so that we can directly access the data and map it!
328 |
329 | When you use the merge function in R, the order in which you place the data matters for the result's class type and spatial attributes.
330 | In terms of class type, we have a dataframe and a spatial object. By placing 'sf' first, the result will be a spatial object, which is important because this retains the spatial characteristics and geometry columns of the 'sf' object. We merge on the LA_code and LA_name columns, which are present in both datasets.
331 |
332 | ### 'c' function
333 |
334 | Don't overthink it. It's just a way to group items together in R, whether for defining a set of values to work with, specifying parameters for a function, or any number of other uses where a list of items is needed.
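
Here's a toy merge showing both ideas at once - c() grouping the two key columns, and matching on them (with plain dataframes here, for illustration):

```{r}
left  <- data.frame(code = c("A", "B"), name = c("Alpha", "Beta"), size = c(10, 20))
right <- data.frame(code = c("A", "B"), name = c("Alpha", "Beta"), pct  = c(0.2, 0.5))

# by = c(...) lists the columns to match on; both key columns must agree
merge(left, right, by = c("code", "name"))
```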
335 |
336 | ```{r}
337 | # Merge the dataframes
338 | merged <- merge(sf, df, by = c('LA_code', 'LA_name'))
339 | ```
340 |
341 |
342 | ```{r}
343 | # Let's check it out
344 | head(merged)
345 | ```
346 |
347 | ## Data Analysis
348 |
349 | ## Building our interactive map
350 |
351 | Finally, we can now build our interactive map using leaflet. You can see from the 'geometry' column that we're working with 'MULTIPOLYGON' and 'POLYGON' geometries. Multipolygons are a collection of polygons grouped together as a single geometric entity, which makes them good at representing complex shapes. In total we have 331 shapes to plot, each representing a local authority. You can take a look at an individual shape by using the plot function and indexing the row and column (see below).
352 |
353 | ```{r}
354 | plot(sf[1, 'geometry'])
355 | ```
356 |
357 | The comments in the code below explain what each bit is doing. But, to provide the overall picture: first we define a colour palette that creates a colour scale over the range of values in our 'Percentage' column. Then we create our interactive map, named 'uk_map'. We centre the map, add some default map tiles, add our polygons and colour them, then add the interactive elements such as highlight options (how a shape changes when the cursor hovers over it) and labels (which act as tooltips). Then we add a legend. Finally, we display the interactive map.
358 |
359 |
360 | ```{r}
361 | # Define the color palette for filling in our multipolygon shapes
362 | # domain sets the range of data values that the colour scale should cover
363 | color_palette <- colorNumeric(palette = "YlGnBu", domain = merged$Percentage)
364 | ```
365 |
366 |
367 | ```{r}
368 | # Use leaflet function with 'merged' dataset
369 | uk_map <- leaflet(merged) %>%
370 | # Centers the map on long and lat for UK
371 | setView(lng = -3.0, lat = 53, zoom = 6) %>%
372 | # Adds default map tiles (the visual image of the map)
373 | addTiles() %>%
374 | # Adds multipolygons to the map, and colours them based on the 'Percentage' column
375 | # We use the palette we created above
376 | addPolygons(
377 | fillColor = ~color_palette(Percentage),
378 | weight = 1, # Set the border weight to 1 for thinner borders
379 | color = "#000000",
380 | fillOpacity = 0.7,
381 | highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE),
382 | label = ~paste(LA_name, ":", Percentage, "%"), # This will create tooltips showing the info
383 | labelOptions = labelOptions(
384 | style = list("font-weight" = "normal", padding = "3px 8px"),
385 | textsize = "12px", direction = "auto") # Adjust text size as needed
386 | ) %>%
387 | addLegend(pal = color_palette, values = ~Percentage, opacity = 0.7, title = "Percentage", position = "topright")
388 |
389 | # Render the map
390 | uk_map
391 | ```
392 |
393 |
--------------------------------------------------------------------------------
/R_code/Leaflet_map.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "vscode": {
15 | "languageId": "r"
16 | }
17 | },
18 | "outputs": [],
19 | "source": [
20 | "knitr::opts_chunk$set(echo=TRUE)\n"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "# Guide\n",
28 | "\n",
29 |     "In this notebook we will create an interactive map which displays the percentage of trans men in each local authority in England and Wales.\n",
30 |     "There are 331 lower-tier local authorities in England and Wales, and we are using data collected in the 2021 Census, which included two new questions on sexuality and gender identity.\n",
31 |     "The following datasets are used:\n",
32 | "\n",
33 | "- [Gender identity (detailed)](https://www.ons.gov.uk/datasets/TS070/editions/2021/versions/3) - this dataset classifies usual residents aged 16 years and over in England and Wales by gender identity.\n",
34 | "- [Local Authority District Boundaries](https://geoportal.statistics.gov.uk/datasets/bb53f91cce9e4fd6b661dc0a6c734a3f_0/about) - this file contains the digital vector boundaries for Local Authority Districts in the UK as of May 2022.\n",
35 | "\n",
36 | "# Install packages\n",
37 | "\n",
38 | "If you're running this code on your own PC (and not through the Binder link) then you're going to want to uncomment the lines below so you can install the requisite packages. Another thing to remember is to set your working directory to the correct folder. Otherwise reading in data will be difficult. \n"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {
45 | "vscode": {
46 | "languageId": "r"
47 | }
48 | },
49 | "outputs": [],
50 | "source": [
51 | " # install.packages(\"leaflet\")\n",
52 | " # install.packages(\"sf\")\n",
53 | " # install.packages(\"dplyr\")\n",
54 | " # install.packages(\"readr\")\n"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "# Import libraries\n",
62 | "\n"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "vscode": {
70 | "languageId": "r"
71 | }
72 | },
73 | "outputs": [],
74 | "source": [
75 | "# used to read-in datasets\n",
76 | "library(readr)\n",
77 | "# used to manipulate datasets\n",
78 | "library(dplyr)\n",
79 | "# used to read-in spatial data, shapefiles\n",
80 | "library(sf)\n",
81 | "# used to create interactive maps\n",
82 | "library(leaflet)\n",
83 | "# used to scrape data from websites\n",
84 | "library(httr)\n"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "# Read-in dataset\n",
92 | "\n"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": null,
98 | "metadata": {
99 | "vscode": {
100 | "languageId": "r"
101 | }
102 | },
103 | "outputs": [],
104 | "source": [
105 | "# First, let's read in our gender identity dataset\n",
106 | "\n",
107 |     "df <- read_csv('../Data/GI_det_EW.csv')\n"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "\n",
115 | "\n"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": null,
121 | "metadata": {
122 | "vscode": {
123 | "languageId": "r"
124 | }
125 | },
126 | "outputs": [],
127 | "source": [
128 | "# Use head function to check out the first few rows - but can also access df via environment pane\n",
129 | "\n",
130 | "head(df, 10)\n"
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "# Data Cleaning\n",
138 | "\n",
139 |     "Before we can calculate the percentage of trans men in each local authority, it's good to do some housekeeping and get our dataframe in order.\n",
140 |     "\n",
141 |     "There are a few things that need sorting, including:\n",
142 | "\n",
143 | "1. renaming columns so they are easier to reference\n",
144 | "2. removing 'Does not apply' from gender identity category\n",
145 | "\n",
146 | "\n",
147 | "## Pipe operator - %\\>%\n",
148 | "\n",
149 | "The pipe operator is used to pass the result of one function directly into the next one.\n",
150 | "E.g. let's say we had some code:\n"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {
157 | "vscode": {
158 | "languageId": "r"
159 | }
160 | },
161 | "outputs": [],
162 | "source": [
163 |     "sorted_data <- my_data %>% filter(condition) %>% arrange(sorting_variable)\n",
164 | "\n"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "What we're doing is using the pipe operator to pass my_data to the filter() function, and the result of this is then passed to the arrange() function.\n",
172 | "\n",
173 | "Basically, pipes allow us to chain together a sequence of functions in a way that's easy to read and understand.\n",
174 | "\n",
175 | "In the code below we use the pipe operator to pass our dataframe to the rename function.\n",
176 | "\n",
177 |     "This supplies the rename function with its first argument, which is the dataframe to operate on.\n",
178 | "\n",
179 | "## 1\n"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {
186 | "vscode": {
187 | "languageId": "r"
188 | }
189 | },
190 | "outputs": [],
191 | "source": [
192 | "# Rename columns using the rename function from dplyr\n",
193 | "# Specify what you want to rename the column to, and supply the original column string\n",
194 | "\n",
195 | "df <- df %>% \n",
196 | " rename(LA_code = `Lower tier local authorities Code`,\n",
197 | " # backticks ` necessary when names are syntactically invalid, e.g. spaces, special characters etc.\n",
198 | " LA_name = `Lower tier local authorities`,\n",
199 | " GI_code = `Gender identity (8 categories) Code`,\n",
200 | " GI_cat = `Gender identity (8 categories)`)\n"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "\n",
208 | "\n"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {
215 | "vscode": {
216 | "languageId": "r"
217 | }
218 | },
219 | "outputs": [],
220 | "source": [
221 | "# Let's use the colnames function to see if it worked\n",
222 | "\n",
223 | "colnames(df)\n"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "## 2\n",
231 | "\n",
232 | "### Logical operators - ==, !=, \\<, \\>, \\<=, \\>=, &, \\|, !\n",
233 | "\n",
234 | "Logical operators are used to perform comparisons between values or expressions, which result in a logical (Boolean) value of 'TRUE' or 'FALSE'.\n",
235 | "\n",
236 | "In the code below we use the '!=' 'Does not equal' operator which tests if the GI_cat value in each row of the df does not equal the string 'Does not apply'.\n",
237 | "\n",
238 |     "For each row where GI_cat is not equal to 'Does not apply', the expression evaluates to TRUE.\n",
239 | "\n",
240 | "We filter so we only keep rows where this expression evaluates to TRUE.\n"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": null,
246 | "metadata": {
247 | "vscode": {
248 | "languageId": "r"
249 | }
250 | },
251 | "outputs": [],
252 | "source": [
253 | "# Use dplyr's filter function to get rid of 'Does not apply'\n",
254 | "# Use '!=' to keep everything except 'Does not apply' category\n",
255 | "\n",
256 | "df <- df %>% filter(GI_cat != 'Does not apply')\n"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "### Dollar sign operator - $\n",
264 | "\n",
265 |     "This operator is used to access elements, such as columns of a dataframe, by name. Below, we use it to access the gender identity category column so we can view its unique values.\n"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": null,
271 | "metadata": {
272 | "vscode": {
273 | "languageId": "r"
274 | }
275 | },
276 | "outputs": [],
277 | "source": [
278 | "# Unique function can be applied to a column in a df to see which values are in that column\n",
279 | "# Let's see if 'Does not apply' has been successfully dropped\n",
280 | "\n",
281 | "unique(df$GI_cat)\n"
282 | ]
283 | },
284 | {
285 | "cell_type": "markdown",
286 | "metadata": {},
287 | "source": [
288 | "# Data Pre-processing\n",
289 | "\n",
290 | "Now onto the more interesting stuff.\n",
291 |     "The data pre-processing stage involves preparing and transforming data into a suitable format for further analysis. It can involve selecting features, transforming variables, and creating new variables. For our purposes, we need to create a new column 'Percentage' which contains the percentage of trans men in each local authority.\n",
292 | "\n",
293 | "So, we'll need to first calculate the % of each gender identity category for each local authority. Then, we'll want to filter our dataset so that we only keep the responses related to Trans men.\n"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": null,
299 | "metadata": {
300 | "vscode": {
301 | "languageId": "r"
302 | }
303 | },
304 | "outputs": [],
305 | "source": [
306 | "# Use group_by to group the dataframe by the LA_name column\n",
307 | "# Use mutate to perform calculation within each LA_name group, convert result to a % by multiplying by 100\n",
308 | "# round() is used to round %'s to 2 decimal places\n",
309 | "\n",
310 | "df <- df %>%\n",
311 | " group_by(LA_name) %>%\n",
312 | " mutate(Percentage = round(Observation / sum(Observation) * 100, 2))\n"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "\n",
320 | "\n"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {
327 | "vscode": {
328 | "languageId": "r"
329 | }
330 | },
331 | "outputs": [],
332 | "source": [
333 | "# Let's check out the results\n",
334 | "\n",
335 | "head(df, 10)\n"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "\n"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {
349 | "vscode": {
350 | "languageId": "r"
351 | }
352 | },
353 | "outputs": [],
354 | "source": [
355 | "# Use filter() to only keep rows where GI_cat equals 'Trans man'\n",
356 | "df <- df %>% \n",
357 | " filter(GI_cat == 'Trans man') %>%\n",
358 | " # Use select() with '-' to remove 'Observation' column\n",
359 | " select(-Observation) %>% \n",
360 | " # Use distinct() to remove duplicate rows, as a precaution\n",
361 | " distinct() %>% \n",
362 |     "  # Use ungroup() to remove grouping - resetting the dataframe's state after performing group operations is good practice\n",
363 | " ungroup()\n"
364 | ]
365 | },
366 | {
367 | "cell_type": "markdown",
368 | "metadata": {},
369 | "source": [
370 | "\n",
371 | "\n"
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": null,
377 | "metadata": {
378 | "vscode": {
379 | "languageId": "r"
380 | }
381 | },
382 | "outputs": [],
383 | "source": [
384 | "# Let's take a look at the results\n",
385 | "head(df)\n"
386 | ]
387 | },
388 | {
389 | "cell_type": "markdown",
390 | "metadata": {},
391 | "source": [
392 | "# Read-in shapefile\n",
393 | "\n",
394 |     "Now that we have our gender identity dataset sorted, we can start on the mapping process. And that starts with reading in our shapefile, which we should have downloaded from the geoportal. If (like me) you don't work with spatial data much, you might assume that you only need the .shp file, and you might delete the others that come with the folder. However, a shapefile is not a single .shp file but a collection of files that work together, and each of these files plays a crucial role in defining the shapefile's data and behaviour. When you try to read a shapefile into R, the software expects all components to be present, and missing them can lead to errors or incorrect spatial references. E.g. without the .dbf file you'd lose all attribute data associated with the geographic features, and without the .shx file you might not be able to read the .shp file at all.\n",
395 | "\n",
396 | "**TLDR: Make sure when you download the shapefile folder you keep all the files!**\n",
397 | "\n",
398 | "Anyway, let's get started.\n"
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": null,
404 | "metadata": {
405 | "vscode": {
406 | "languageId": "r"
407 | }
408 | },
409 | "outputs": [],
410 | "source": [
411 | "# Download shapefiles from geoportal \n",
412 | "\n",
413 | "# URL for the direct download of the shapefile\n",
414 | "url <- \"https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Local_Authority_Districts_May_2022_UK_BFE_V3_2022/FeatureServer/replicafilescache/Local_Authority_Districts_May_2022_UK_BFE_V3_2022_3331011932393166417.zip\"\n",
415 | "\n",
416 | "# Create a temporary directory\n",
417 | "tmp_dir <- tempdir()\n",
418 | "print(paste(\"Created temporary directory:\", tmp_dir))\n",
419 | "\n",
420 | "# Set destination file path\n",
421 | "dest_file <- file.path(tmp_dir, \"shapefile.zip\")\n",
422 | "\n",
423 | "# Download the shapefile\n",
424 | "response <- GET(url, write_disk(dest_file, overwrite = TRUE))\n",
425 | "\n",
426 |     "# Check if the download was successful; stop with an error if not\n",
427 |     "if (response$status_code != 200) stop(\"Download failed with status: \", response$status_code)\n",
428 |     "print(\"Download successful\")\n",
429 |     "\n",
430 |     "# Unzip the file within the temporary directory\n",
431 |     "unzip(dest_file, exdir = tmp_dir)\n",
432 |     "print(paste(\"Files extracted to:\", tmp_dir))\n",
433 |     "\n",
434 |     "# List all files in the temporary directory to verify extraction\n",
435 |     "extracted_files <- list.files(tmp_dir)\n",
436 |     "print(\"Extracted files:\")\n",
437 |     "print(extracted_files)\n",
438 |     "\n",
439 |     "# Define the path to the actual shapefile (.shp)\n",
440 |     "shapefile_path <- file.path(tmp_dir, 'LAD_MAY_2022_UK_BFE_V3.shp')\n",
441 |     "\n",
442 |     "# Read in shapefile to a simple features object\n",
443 |     "# st_read() reads in spatial data to a 'simple features' object\n",
444 |     "sf <- st_read(shapefile_path)\n",
445 |     "print(\"Shapefile loaded successfully.\")"
447 | ]
448 | },
449 | {
450 | "cell_type": "markdown",
451 | "metadata": {},
452 | "source": [
453 | "\n",
454 | "\n"
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": null,
460 | "metadata": {
461 | "vscode": {
462 | "languageId": "r"
463 | }
464 | },
465 | "outputs": [],
466 | "source": [
467 | "# Let's check it out \n",
468 | "head(sf)\n",
469 | "# Better to just view via environment pane\n"
470 | ]
471 | },
472 | {
473 | "cell_type": "markdown",
474 | "metadata": {},
475 | "source": [
476 | "\n",
477 | "\n"
478 | ]
479 | },
480 | {
481 | "cell_type": "code",
482 | "execution_count": null,
483 | "metadata": {
484 | "vscode": {
485 | "languageId": "r"
486 | }
487 | },
488 | "outputs": [],
489 | "source": [
490 | "# Inspect dimensions\n",
491 | "dim(sf)\n"
492 | ]
493 | },
494 | {
495 | "cell_type": "markdown",
496 | "metadata": {},
497 | "source": [
498 | "\n"
499 | ]
500 | },
501 | {
502 | "cell_type": "code",
503 | "execution_count": null,
504 | "metadata": {
505 | "vscode": {
506 | "languageId": "r"
507 | }
508 | },
509 | "outputs": [],
510 | "source": [
511 | "# length() with the unique() function gives us the number of unique values in a column\n",
512 | "\n",
513 | "length(unique(sf$LAD22NM))\n"
514 | ]
515 | },
516 | {
517 | "cell_type": "markdown",
518 | "metadata": {},
519 | "source": [
520 | "# Cleaning shapefile\n",
521 | "\n",
522 |     "Hmm. We have 331 local authorities in our dataset that we want to plot, but there are 374 listed here.\n",
523 | "We'll need to remove the local authorities that don't match the ones in our df.\n",
524 | "\n",
525 | "1. rename columns to match 'df'\n",
526 | "2. get rid of redundant Local Authorities\n",
527 | "\n",
528 | "## 1\n"
529 | ]
530 | },
531 | {
532 | "cell_type": "code",
533 | "execution_count": null,
534 | "metadata": {
535 | "vscode": {
536 | "languageId": "r"
537 | }
538 | },
539 | "outputs": [],
540 | "source": [
541 | "# Use rename function so sf columns match those in original df\n",
542 | "\n",
543 | "sf <- sf %>% \n",
544 | " rename(LA_code = LAD22CD, \n",
545 | " LA_name = LAD22NM)\n",
546 | "\n",
547 | "# Let's see if it worked\n",
548 | "colnames(sf)\n"
549 | ]
550 | },
551 | {
552 | "cell_type": "markdown",
553 | "metadata": {},
554 | "source": [
555 | "\n"
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": null,
561 | "metadata": {
562 | "vscode": {
563 | "languageId": "r"
564 | }
565 | },
566 | "outputs": [],
567 | "source": [
568 | "# Replace specific values in the LA_name column using recode()\n",
569 | "\n",
570 | "sf$LA_name <- sf$LA_name %>% \n",
571 | " recode(`Bristol, City of` = \"Bristol\", \n",
572 | " `Kingston upon Hull, City of` = \"Kingston upon Hull\", \n",
573 | " `Herefordshire, County of` = \"Herefordshire\")\n"
574 | ]
575 | },
576 | {
577 | "cell_type": "markdown",
578 | "metadata": {},
579 | "source": [
580 | "## 2\n",
581 | "\n",
582 | "### %in% operator\n",
583 | "\n",
584 |     "This is used to check whether the elements of one vector are present in another vector.\n",
585 |     "Much like the logical operators, it returns a logical value, TRUE or FALSE, for each element.\n",
586 |     "Below, we keep only the rows of 'sf' whose LA_code values are present in the LA_code column of 'df'.\n"
587 | ]
588 | },
589 | {
590 | "cell_type": "code",
591 | "execution_count": null,
592 | "metadata": {
593 | "vscode": {
594 | "languageId": "r"
595 | }
596 | },
597 | "outputs": [],
598 | "source": [
599 | "# Use filter() with %in% and unique() to only keep LA's that match \n",
600 | "\n",
601 | "sf <- sf %>% \n",
602 | " filter(LA_code %in% unique(df$LA_code))\n"
603 | ]
604 | },
605 | {
606 | "cell_type": "markdown",
607 | "metadata": {},
608 | "source": [
609 | "\n",
610 | "\n"
611 | ]
612 | },
613 | {
614 | "cell_type": "code",
615 | "execution_count": null,
616 | "metadata": {
617 | "vscode": {
618 | "languageId": "r"
619 | }
620 | },
621 | "outputs": [],
622 | "source": [
623 | "# Let's see how it looks.. \n",
624 | "# We should have 331 unique LA_codes\n",
625 | "length(unique(sf$LA_code))\n"
626 | ]
627 | },
628 | {
629 | "cell_type": "markdown",
630 | "metadata": {},
631 | "source": [
632 | "# Pre-processing shapefile\n",
633 | "\n",
634 | "When it comes to mapping our data, it is important to know which Coordinate Reference System (CRS) we are working with. Simply put, the CRS describes how the spatial data in the 'sf' object maps to locations on Earth: a way of translating 3D reality onto 2D maps. Knowing the CRS matters when using mapping libraries like 'leaflet', because leaflet expects coordinates in a specific format (latitude and longitude), which is EPSG:4326. If our CRS isn't in this format, we'll need to transform it so that it matches what leaflet expects. Let's go ahead and see what our CRS is. \n"
635 | ]
636 | },
637 | {
638 | "cell_type": "code",
639 | "execution_count": null,
640 | "metadata": {
641 | "vscode": {
642 | "languageId": "r"
643 | }
644 | },
645 | "outputs": [],
646 | "source": [
647 | "# st_crs() shows our CRS info\n",
648 | "st_crs(sf)\n"
649 | ]
650 | },
651 | {
652 | "cell_type": "markdown",
653 | "metadata": {},
654 | "source": [
655 | "\n",
656 | "\n"
657 | ]
658 | },
659 | {
660 | "cell_type": "code",
661 | "execution_count": null,
662 | "metadata": {
663 | "vscode": {
664 | "languageId": "r"
665 | }
666 | },
667 | "outputs": [],
668 | "source": [
669 | "# To transform our CRS to EPSG:4326, simply use st_transform() and specify the crs\n",
670 | "# Note: you don't have to use the %>% pipe operator all the time\n",
671 | "sf <- st_transform(sf, crs = 4326)\n"
672 | ]
673 | },
674 | {
675 | "cell_type": "markdown",
676 | "metadata": {},
677 | "source": [
678 | "## Merge datasets\n",
679 | "\n",
680 | "What we want to do now is merge our 'df' dataframe with our 'sf' spatial object, so that we can directly access the data and map it!\n",
681 | "\n",
682 | "When you use the merge function in R, the order in which you place the data matters in terms of the result's class type and spatial attributes. \n",
683 | "So, in terms of class type, we have a dataframe and a spatial object. By placing 'sf' first, the result will be a spatial object, which is important because this retains the spatial characteristics and geometry columns of the 'sf' object. We merge the columns on the LA_code and LA_name columns which are present in both datasets. \n",
684 | "\n",
685 | "### 'c' function\n",
686 | "\n",
687 | "Don't overthink it. It's just a way to group items together in R, whether for defining a set of values to work with, specifying parameters for a function, or any number of other uses where a list of items is needed. \n"
688 | ]
689 | },
690 | {
691 | "cell_type": "code",
692 | "execution_count": null,
693 | "metadata": {
694 | "vscode": {
695 | "languageId": "r"
696 | }
697 | },
698 | "outputs": [],
699 | "source": [
700 | "# Merge the dataframes\n",
701 | "merged <- merge(sf, df, by = c('LA_code', 'LA_name'))\n"
702 | ]
703 | },
704 | {
705 | "cell_type": "markdown",
706 | "metadata": {},
707 | "source": [
708 | "\n",
709 | "\n"
710 | ]
711 | },
712 | {
713 | "cell_type": "code",
714 | "execution_count": null,
715 | "metadata": {
716 | "vscode": {
717 | "languageId": "r"
718 | }
719 | },
720 | "outputs": [],
721 | "source": [
722 | "# Let's check it out\n",
723 | "head(merged)\n"
724 | ]
725 | },
726 | {
727 | "cell_type": "markdown",
728 | "metadata": {},
729 | "source": [
730 | "# Data Analysis\n",
731 | "\n",
732 | "## Building our interactive map\n",
733 | "\n",
734 | "Finally, we can now build our interactive map using leaflet. You can see from the 'geometry' column that we're working with 'MULTIPOLYGON' and 'POLYGON' geometries. Multipolygons are a collection of polygons grouped together as a single geometric entity, which makes them good at representing complex shapes. We also have some standard polygons too. In total we have 331 shapes to plot, each representing a local authority. You can take a look at these separate shapes by using the plot function and indexing the row and column (see below). \n"
735 | ]
736 | },
737 | {
738 | "cell_type": "code",
739 | "execution_count": null,
740 | "metadata": {
741 | "vscode": {
742 | "languageId": "r"
743 | }
744 | },
745 | "outputs": [],
746 | "source": [
747 | "plot(sf[1, 'geometry'])\n",
748 | "\n"
749 | ]
750 | },
751 | {
752 | "cell_type": "markdown",
753 | "metadata": {},
754 | "source": [
755 | "The code below has helpful comments that should help you grasp what each bit is doing. But, to give the overall picture: first we define a colour palette, which creates a colour scale for the range of values in our 'Percentage' column. Then, we create our interactive map, which we've named 'uk_map'. We centre our map, add some default map tiles, add our polygons and colour them, then add the interactive elements such as highlight options (how a shape is highlighted when the cursor hovers over it) and label (which specifies the tooltips). Then, we add a legend. Finally, we can display the interactive map. \n",
756 | "\n"
757 | ]
758 | },
759 | {
760 | "cell_type": "code",
761 | "execution_count": null,
762 | "metadata": {
763 | "vscode": {
764 | "languageId": "r"
765 | }
766 | },
767 | "outputs": [],
768 | "source": [
769 | "# Define the color palette for filling in our multipolygon shapes\n",
770 | "# domain sets the range of data values that the colour scale should cover\n",
771 | "color_palette <- colorNumeric(palette = \"YlGnBu\", domain = merged$Percentage)\n"
772 | ]
773 | },
774 | {
775 | "cell_type": "markdown",
776 | "metadata": {},
777 | "source": [
778 | "\n",
779 | "\n"
780 | ]
781 | },
782 | {
783 | "cell_type": "code",
784 | "execution_count": null,
785 | "metadata": {
786 | "vscode": {
787 | "languageId": "r"
788 | }
789 | },
790 | "outputs": [],
791 | "source": [
792 | "# Use leaflet function with 'merged' dataset\n",
793 | "uk_map <- leaflet(merged) %>%\n",
794 | " # Centers the map on long and lat for UK\n",
795 | " setView(lng = -3.0, lat = 53, zoom = 6) %>%\n",
796 | " # Adds default map tiles (the visual image of the map)\n",
797 | " addTiles() %>%\n",
798 | " # Adds multipolygons to the map, and colours them based on the 'Percentage' column\n",
799 | " # We use the palette we created above\n",
800 | " addPolygons(\n",
801 | " fillColor = ~color_palette(Percentage),\n",
802 | " weight = 1, # Set the border weight to 1 for thinner borders\n",
803 | " color = \"#000000\",\n",
804 | " fillOpacity = 0.7,\n",
805 | " highlightOptions = highlightOptions(color = \"white\", weight = 2, bringToFront = TRUE),\n",
806 | " label = ~paste(LA_name, \":\", Percentage, \"%\"), # This will create tooltips showing the info\n",
807 | " labelOptions = labelOptions(\n",
808 | " style = list(\"font-weight\" = \"normal\", padding = \"3px 8px\"),\n",
809 | " textsize = \"12px\", direction = \"auto\") # Adjust text size as needed\n",
810 | " ) %>%\n",
811 | " addLegend(pal = color_palette, values = ~Percentage, opacity = 0.7, title = \"Percentage\", position = \"topright\")\n",
812 | "\n",
813 | "# Render the map\n",
814 | "uk_map\n"
815 | ]
816 | }
817 | ],
818 | "metadata": {
819 | "anaconda-cloud": "",
820 | "kernelspec": {
821 | "display_name": "R",
822 | "language": "R",
823 | "name": "ir"
824 | },
825 | "language_info": {
826 | "codemirror_mode": "r",
827 | "file_extension": ".r",
828 | "mimetype": "text/x-r-source",
829 | "name": "R",
830 | "pygments_lexer": "r",
831 | "version": "4.4.0"
832 | },
833 | "toc": {
834 | "base_numbering": 1,
835 | "nav_menu": {},
836 | "number_sections": true,
837 | "sideBar": true,
838 | "skip_h1_title": false,
839 | "title_cell": "Table of Contents",
840 | "title_sidebar": "Contents",
841 | "toc_cell": false,
842 | "toc_position": {},
843 | "toc_section_display": true,
844 | "toc_window_display": true
845 | }
846 | },
847 | "nbformat": 4,
848 | "nbformat_minor": 4
849 | }
850 |
--------------------------------------------------------------------------------
/R_code/Guide.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "R Notebook"
3 | output: html_notebook
4 | ---
5 |
6 |
7 | ```{r setup, include=FALSE}
8 | knitr::opts_chunk$set(eval = FALSE)
9 |
10 | ```
11 |
12 | 
13 |
14 | # Guide to Interactive Visualisations
15 |
16 | In this guide, you'll be shown how to make two key types of interactive visualisations:
17 |
18 | * Basic bar chart + group and stacked
19 | * Scatterplot + dropdown menu
20 |
21 | To create these visualisations, we'll be using the **'plotly'** package.
22 |
23 | # Census Data
24 |
25 | Datasets used in this workshop are from the 2021 census, and involve the new voluntary question which focuses on gender identity. In particular, we explore the relationship between age and gender identity, as well as ethnicity and gender identity.
26 |
27 | **However, please note:**
28 |
29 | The Office for Statistics Regulation confirmed on 12/09/2024 that the gender identity estimates from Census 2021 in England and Wales are no longer 'accredited official statistics' and are **now classified as 'official statistics in development'**. For further information, please see: [Sexual orientation and gender identity quality information for Census 2021.](https://www.ons.gov.uk/peoplepopulationandcommunity/culturalidentity/sexuality/methodologies/sexualorientationandgenderidentityqualityinformationforcensus2021)
30 |
31 | # Let's begin...
32 |
33 | Let's get started by importing the necessary packages.
34 |
35 | **NOTE:** If you're not following along with Binder, and you have your own computational environment, make sure you install the necessary packages through the command line before proceeding to import.
36 |
37 | ## Install packages
38 |
39 | Uncomment the lines below to install the packages if you're not working in Binder.
40 |
41 | ```{r}
42 | # install.packages("readr")
43 | # install.packages("dplyr")
44 | # install.packages("stringr")
45 | # install.packages("shiny")
46 | # install.packages("plotly")
47 | ```
48 |
49 | ## Load in packages
50 |
51 | ```{r}
52 | # Allows us to read-in csv files
53 | library(readr)
54 | # For data manipulation
55 | library(dplyr)
56 | # For regular expression operations
57 | library(stringr)
58 | # Used to create interactive visualisations
59 | library(plotly)
60 | ```
61 |
62 | # Dataset 1
63 |
64 | The first dataset that we'll be focusing on is a really simple dataset which shows the total counts for 8 gender identity categories across England and Wales. We'll do a bit of data cleaning, remove unnecessary categories (such as 'Does not apply'), and then calculate the % of each gender identity category. Then, we'll create a simple interactive bar chart which displays the percentage by gender identity category, whilst enabling some interactivity when we hover over each bar.
65 |
66 | ```{r}
67 | # Load in dataset
68 |
69 | df <- read_csv('../Data/GI_det_EW.csv')
70 | ```
71 | * chr - stands for "character" and represents text data. Columns of type 'chr' contain strings, meaning any text or combination of letters, numbers, or symbols treated as text, e.g. "hello" or "123abc" would be of type 'chr'.
72 | * dbl - stands for "double" and refers to double-precision floating-point numbers, i.e. numerical data that can have decimal points, e.g. 3.14, 0.001, 29393.
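
A quick way to check these types yourself (a toy example, not from the dataset):

```{r}
# class() reports the type R has assigned to a value
class("123abc")  # [1] "character"
class(3.14)      # [1] "numeric" (doubles display as 'numeric'; readr reports them as 'dbl')
```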
73 |
74 | ```{r}
75 | # Brief glimpse of data structure
76 | # But can also click on the dataset in the Environment pane
77 | head(df, 10)
78 | ```
79 |
80 |
81 | ## Data cleaning
82 |
83 | * Clean column names
84 | * Filter out unnecessary categories
85 | * Clean gender identity category values - too wordy
86 | * Ensure gender_identity column is a factor with levels in desired order
87 |
88 |
89 | ```{r}
90 | # str_replace_all() finds all substrings which match the regex and replaces them with an empty string
91 | # First, let's remove any bracketed text (and the space before it)
92 | colnames(df) <- str_replace_all(colnames(df), "\\s*\\([^)]*\\)", "")
93 |
94 | # Lowercase column text and replace empty spaces with "_"
95 | colnames(df) <- tolower(colnames(df))
96 | colnames(df) <- str_replace_all(colnames(df), " ", "_")
97 |
98 | # Let's see if it worked..
99 | colnames(df)
100 |
101 | ```
102 |
103 |
104 | ### Pipes and other operators..
105 |
106 | So, we've already come across the assignment operator '<-' which is used to assign a value. E.g. df <- read_csv('Data/GI_age.csv'), here we assign our csv file to a dataframe variable called 'df'.
107 |
108 | But, we're now going to encounter the pipe operator '%>%' which can seem intimidating at first but is actually pretty simple. It's used to pass the result of one function directly into the next function. E.g. df <- df %>% filter(gender_identity_code != -8), here we start with our df and pass it to the filter function using the pipe operator. This basically supplies the filter() function with its first argument, which is the dataframe to filter on. And here we encounter a logical operator '!=' within the filter() function, which specifies that we should only keep rows where gender_identity_code is not equal to -8.
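
To make the pipe concrete, here's a throwaway example (toy values, not our census data) showing that piping is equivalent to a nested call — dplyr is already loaded above:

```{r}
# A small toy data frame
toy <- data.frame(gender_identity_code = c(1, 2, -8))

# Nested call...
filter(toy, gender_identity_code != -8)

# ...is equivalent to the piped version
toy %>% filter(gender_identity_code != -8)
```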
109 |
110 | ### Dollar sign operator - $
111 |
112 | This operator is used to access elements, such as columns of a dataframe, by name.
113 | Below, we use it to access the gender identity code column, where we want to view the unique values.
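
For instance, with a toy data frame (not our census data):

```{r}
toy <- data.frame(n = c(1, 2, 3))

# $ pulls out the 'n' column as a vector
toy$n  # [1] 1 2 3
```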
114 |
115 | ```{r}
116 | # Get rid of redundant categories
117 | df <- df %>%
118 | filter(gender_identity_code != -8)
119 |
120 | # Use unique and access column to output its unique values
121 |
122 | unique(df$gender_identity_code)
123 | ```
124 |
125 | ```{r}
126 | # Let's take a look at our unique values in our gender_identity category column
127 |
128 | unique(df$gender_identity)
129 | ```
130 | ```{r}
131 | # Use combo of mutate and recode to replace multiple values in column
132 | # .default ensures that any value not matching those specified are left unchanged
133 |
134 | df <- df %>%
135 | mutate(gender_identity = recode(gender_identity,
136 | "Gender identity the same as sex registered at birth" = "Cisgender",
137 | "Gender identity different from sex registered at birth but no specific identity given" = "Gender identity different from sex",
138 | .default = gender_identity))
139 | ```
140 |
141 | ```{r}
142 | # Let's see if it worked...
143 | unique(df$gender_identity)
144 | ```
145 |
146 | ```{r}
147 | # We use factor to convert gender_identity column to a factor with specified levels
148 | # This tells Plotly the exact order in which to display categories
149 | # Otherwise, R plots categorical data in alphabetical order..
150 |
151 | df$gender_identity <- factor(df$gender_identity, levels = c(
152 | "Cisgender",
153 | "Gender identity different from sex",
154 | "Trans woman",
155 | "Trans man",
156 | "All other gender identities",
157 | "Not answered"
158 | ))
159 | ```
160 |
161 | ```{r}
162 | class(df$gender_identity)
163 |
164 | ```
165 |
166 | ## Data pre-processing
167 |
168 | Before we can plot our data, we need to calculate the percentage of each gender identity category.
169 |
170 | The mutate() function adds a new column 'percentage' to df, and applies the following calculation to each row.
171 |
172 | ```{r}
173 | # mutate() is used to add new variables to a df or modify existing ones
174 | df <- df %>% mutate(percentage = round(observation / sum(observation) * 100, 2))
175 | ```
176 |
177 | ```{r}
178 | # Let's take a look..
179 | head(df$percentage)
180 | ```
181 |
182 |
183 |
184 | ## Basic interactive bar chart
185 |
186 | Now we can create our first simple interactive visualisation. To do so we use Plotly's plot_ly function, and supply the parameters with the necessary arguments. You'll notice that we use the tilde operator (~) quite a bit when building our graph. By preceding relevant variables with ~ it tells R to look for that variable within the dataframe.
187 |
188 |
189 | ```{r}
190 | # Create the bar chart visualization with percentages on the y-axis
191 | fig <- plot_ly(data = df, x = ~gender_identity, y = ~percentage, type = 'bar',
192 | # defines how the bars should be styled
193 | marker = list(color = 'rgb(158,202,225)', line = list(color = 'rgb(8,48,107)', width = 1.5)),
194 | width = 800, height = 600)
195 | ```
196 |
197 |
198 | ```{r}
199 | # Let's check it out
200 | fig
201 |
202 | ```
203 |
204 | ## Using layout() method
205 |
206 | Once a graph has been created, we can use the layout method to customise the appearance and layout. This allows you to modify things such as titles, legend details, axis properties, etc, without needing to recreate the figure from scratch.
207 |
208 | ```{r}
209 | # Let's apply a log scale to our y-axis so this graph is easier to interpret
210 |
211 | fig <- layout(fig,
212 | title = 'Percentage of Each Gender Identity in England and Wales',
213 | # set showline to true, otherwise it disappears when we apply log scale
214 | xaxis = list(title = 'Gender Identity', showline = TRUE),
215 | yaxis = list(type = 'log', title = 'Percentage (Log Scale)'))
216 | ```
217 |
218 | ```{r}
219 | fig
220 | ```
221 |
222 | ## Tooltips
223 |
224 | When using different R libraries that are geared towards interactive visualisations, you'll often come across 'tooltips'. These are small boxes that provide information when a user hovers over a part of a data visualisation, such as: a point on a graph, a bar in a bar chart, or a segment in a pie chart. They are used to display additional information about the data point or object, providing more context without cluttering up the chart. In Plotly, tooltips are configured through hover attributes such as 'hovertext' and 'hoverinfo'.
225 |
226 | All interactive plotly graphs come with default hover data, so when you scroll over a bar or a scatterplot data point it will display the specific x-axis value and y-axis value. But, variety is the spice of life and there's going to be times when you want to leverage this feature to include interesting info that isn't included by default. For instance, for our bar chart, I'd like to add in data from the 'Observation' column, which shows the raw count for each gender identity category.
227 |
228 | Doing this is quite easy. We use the hovertext and hoverinfo parameters in the plot_ly function, with hovertext defining the variables we'd like to include and how they should appear, and hoverinfo ensuring that only this text is displayed in the tooltips. So, let's create the graph again, but this time let's specify our tooltips.
229 |
230 | ```{r}
231 | new_fig <- plot_ly(data = df, x = ~gender_identity, y = ~percentage, type = 'bar',
232 | # ~paste combines multiple pieces of text and data into one string
233 |                    hovertext = ~paste(# <br> is HTML code for a line break
234 |                                       # sprintf - used to format strings
235 |                                       "<br>Percentage: ", sprintf("%.2f%%", percentage),
236 |                                       "<br>Observations: ", observation),
237 | # tells plotly to only display the text provided in hovertext
238 | hoverinfo = 'text',
239 | marker = list(color = 'rgb(158,202,225)', line = list(color = 'rgb(8,48,107)', width = 1.5)),
240 | width = 800, height = 600)
241 |
242 |
243 |
244 | # Apply a log scale to the y-axis
245 | new_fig <- layout(new_fig,
246 | title = 'Percentage of Each Gender Identity in England and Wales',
247 | xaxis = list(title = 'Gender Identity', showline = TRUE),
248 | yaxis = list(type = 'log', title = 'Percentage (Log Scale)'))
249 |
250 | ```
251 |
252 | ```{r}
253 | new_fig
254 | ```
255 |
256 |
257 |
258 |
259 | # Dataset 2
260 |
261 | This dataset classifies residents by gender identity and age, with the unit of analysis being England and Wales.
262 |
263 | ```{r}
264 | # Load in dataset
265 |
266 | df2 <- read_csv('../Data/GI_age.csv')
267 | ```
268 |
269 |
270 | ```{r}
271 | # Brief glimpse of data structure
272 | head(df2, 10)
273 | ```
274 |
275 | ```{r}
276 | # Let's check out the dimensions
277 |
278 | dim(df2)
279 | ```
280 |
281 | ## Data Cleaning
282 |
283 | * Clean column names
284 | * Filter out unnecessary categories
285 | * Clean gender identity category values - too wordy
286 | * Ensure gender_identity column is a factor with levels in desired order
287 | * Clean age category values - too wordy
288 |
289 | We'll whiz through this, because it's the same stuff we did for the last dataset.
290 |
291 | ```{r}
292 | # str_replace_all() finds all substrings which match the regex and replaces them with an empty string
293 | # First, let's remove any bracketed text (and the space before it)
294 | colnames(df2) <- str_replace_all(colnames(df2), "\\s*\\([^)]*\\)", "")
295 |
296 | # Lowercase column text and replace empty spaces with "_"
297 | colnames(df2) <- tolower(colnames(df2))
298 | colnames(df2) <- str_replace_all(colnames(df2), " ", "_")
299 |
300 | # Let's see if it worked..
301 | colnames(df2)
302 |
303 | ```
304 |
305 | ```{r}
306 | # Get rid of values that do not apply
307 | df2 <- df2 %>%
308 | filter(gender_identity_code != -8)
309 |
310 | # Use unique and access column to output its unique values
311 |
312 | unique(df2$age_code)
313 | ```
314 |
315 | ```{r}
316 | # Get rid of redundant age category
317 | # Further filter data
318 | df2 <- df2 %>%
319 | filter(age_code != 1)
320 |
321 | ```
322 |
323 | ```{r}
324 | # Clean up the values in the 'age' column. Let's shorten them.
325 |
326 | # Chain str_replace() calls together to apply multiple string replacements in succession
327 | # Each str_replace() call is applied to the result of the previous one
328 | df2$age <- df2$age %>%
329 | str_replace('Aged ', '') %>%
330 | str_replace('to', '-') %>%
331 | str_replace('years', '') %>%
332 | str_replace('and over', '+') %>%
333 | str_replace(' - ', '-')
334 |
335 | # We can pass our df to the select function, where we specify the column we're interested in.
336 | # Then, we pipe the output to the head function.
337 | df2 %>%
338 | select(age) %>%
339 | head(5)
340 | ```
341 |
342 | ```{r}
343 | # Use combo of mutate and recode to replace multiple values in column
344 | # .default ensures that any value not matching those specified are left unchanged
345 |
346 | df2 <- df2 %>%
347 | mutate(gender_identity = recode(gender_identity,
348 | "Gender identity the same as sex registered at birth" = "Cisgender",
349 | "Gender identity different from sex registered at birth but no specific identity given" = "Gender identity different from sex",
350 | .default = gender_identity))
351 |
352 |
353 | ```
354 |
355 |
356 | ```{r}
357 |
358 | unique(df2$gender_identity)
359 | ```
360 |
361 | ```{r}
362 | # We use factor to convert gender_identity column to a factor with specified levels
363 | # This tells Plotly the exact order in which to display categories
364 |
365 | df2$gender_identity <- factor(df2$gender_identity, levels = c(
366 | "Cisgender",
367 | "Gender identity different from sex",
368 | "Trans woman",
369 | "Trans man",
370 | "All other gender identities",
371 | "Not answered"
372 | ))
373 | ```
374 |
375 |
376 | ## Question
377 |
378 | How is gender identity distributed among different age groups?
379 |
380 | Some subquestions that this can help us answer:
381 |
382 | * What % of trans women are aged 16-24 years?
383 | * Are older age groups over-represented in the 'non-response' category?
384 |
385 | ## Data pre-processing
386 |
387 | ### Calculate percentages
388 |
389 | Below, we use the group_by function to group the data by 'gender_identity' and calculate the percentage within each group. The mutate() function then adds a new column 'percentage' to df2, which (for each group) divides the observation by the sum of observations, multiplies it by 100, and rounds it to 2 decimal places. We then use the ungroup function when we're done with the grouping operation.
390 |
391 | ```{r}
392 | df2 <- df2 %>%
393 | group_by(gender_identity) %>%
394 | mutate(percentage = round((observation / sum(observation) * 100), 2)) %>%
395 | ungroup()
396 |
397 | head(df2)
398 | ```
399 |
400 |
401 | ## Interactive grouped bar chart
402 |
403 | When creating grouped bar charts, there are a few subtle differences that you'll need to account for in the code.
404 | We'll need a way to colour each bar in each group according to age categories, which we can do with the 'color' and 'colors' parameters.
405 |
406 | ```{r}
407 | # Create a grouped bar chart with hover information
408 | fig2 <- plot_ly(data = df2, x = ~gender_identity, y = ~percentage, type = 'bar',
409 | # color specifies which variable to colour by
410 | # colors specifies the colour palette to use, and how many colours are required
411 | color = ~age, colors = RColorBrewer::brewer.pal(length(unique(df2$age)), "Set2"),
412 | hoverinfo = 'text',
413 |                hovertext = ~paste("Observation: ", observation,
414 |                                   "<br>Percentage: ", sprintf("%.2f%%", percentage),
415 |                                   "<br>Age group: ", age),
416 | marker = list(line = list(color = 'rgba(255,255,255, 0.5)', width = 0.5)),
417 | width = 800, height = 600)
418 |
419 | ```
420 |
421 | ```{r}
422 | fig2
423 | ```
424 |
425 | ```{r}
426 | fig2 <- layout(fig2,title = 'Distribution of Gender Identity Categories Among Age Groups',
427 | xaxis = list(title = 'Gender Identity'),
428 | yaxis = list(title = 'Percentage'),
429 | legend = list(title = list(text = 'Age Group')))
430 |
431 | ```
432 |
433 | ```{r}
434 | fig2
435 | ```
436 |
437 | ## Stacked bar chart
438 |
439 | The method I show below simply converts the previously made grouped bar chart 'fig2' to a stacked bar chart. Stacked bar charts are created by using the layout() function to change the barmode, as the default barmode produces a grouped bar chart.
440 |
441 |
442 |
443 | ```{r}
444 | # Convert to stacked bar chart
445 |
446 | st_fig <- layout(fig2,
447 | barmode = 'stack')
448 |
449 | st_fig
450 | ```
451 |
452 |
453 |
454 | ## Dataset 3
455 |
456 | This dataset classifies residents by gender identity and ethnic group, with the unit of analysis being the 331 local authorities across England and Wales.
457 |
458 | ```{r}
459 | # Load in dataset
460 |
461 | df3 <- read_csv('../Data/GI_ethnic.csv')
462 | ```
463 |
464 |
465 | ```{r}
466 | # Brief glimpse at underlying data structure
467 | head(df3, 10)
468 | ```
469 |
470 | ## Data Cleaning
471 |
472 | * Clean column names
473 | * Filter out unnecessary categories
474 |
475 | Below, I provide another method, 'gsub()', which can be used instead of the str_replace_all() method demonstrated in the previous cleaning sections. Basically, it looks for a pattern and applies the replacement to any column names which match the pattern.
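
As a standalone illustration of the pattern (a toy string, not one of our column names):

```{r}
# Removes " (years)" including the leading space
gsub("\\s*\\([^)]*\\)", "", "Age (years)")  # [1] "Age"
```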
476 |
477 | ```{r}
478 | # Remove all text within parentheses from column names and replace it with an empty string
479 |
480 | # tilde operator (~) used to apply function 'gsub' to each colname
481 | # .x represents each colname that gsub will be applied to
482 | df3 <- df3 %>%
483 | rename_with(~ gsub("\\s*\\([^)]*\\)", "", .x))
484 | ```
485 |
486 | ```{r}
487 | # Lowercase all text in column names and replace spaces with underscores
488 | df3 <- df3 %>%
489 | rename_with(~ tolower(gsub(" ", "_", .x)))
490 | ```
491 |
492 | ```{r}
493 | # Shorten the local authority column names as they are way too long
494 | df3 <- df3 %>%
495 | rename(LA_code = lower_tier_local_authorities_code,
496 | LA_name = lower_tier_local_authorities)
497 |
498 | ```
499 |
500 | ```{r}
501 | # Let's see if it worked
502 | colnames(df3)
503 | ```
504 |
505 |
506 | ```{r}
507 | # Remove 'Does not apply' categories for the gender identity and ethnic group columns
508 | df3 <- df3 %>%
509 | filter(gender_identity_code != -8, ethnic_group_code != -8)
510 | ```
511 |
512 | ```{r}
513 | # Let's see if it worked..
514 | unique(df3$gender_identity_code)
515 | ```
516 |
517 | ```{r}
518 | # Let's see if it worked..
519 | unique(df3$ethnic_group_code)
520 | ```
521 |
522 | ## Question
523 |
524 | How does the rate of 'non-response' on gender identity vary among different ethnic groups across local authorities in England and Wales?
525 |
526 | A subquestion this could help us answer:
527 |
528 | Does the relationship between non-response and ethnic group % for local authorities differ between the 'White' categories and other ethnic groups?
529 |
530 | ## Data pre-processing
531 |
532 | Given that I want to explore the question above, I'd like to create a scatterplot which explores the relationship between the % of certain ethnic groups within local authorities and their non-response rates. Therefore, I'll need to prep my x and y variables: the percentage of each ethnic group in each LA, and that ethnic group's non-response rate within each LA.
533 |
534 | ### Calculate % of each ethnic group in each LA
535 |
536 | ```{r}
537 | # First, we're going to group our data by LA_name, ethnic group, and sum our observations
538 | # This leaves us with the total of each ethnic group in each local authority
539 | ethnic_totals <- df3 %>%
540 | group_by(LA_name, ethnic_group) %>%
541 | summarise(Ethnic_sum = sum(observation, na.rm = TRUE)) %>%
542 | ungroup()
543 |
544 | # Print the first few rows to check
545 | head(ethnic_totals)
546 | ```
547 |
548 |
549 | ```{r}
550 | # Calculate total observations for each local authority by grouping df3 by local authority and summing up obs
551 | la_totals <- df3 %>%
552 | group_by(LA_name) %>%
553 | summarise(LA_sum = sum(observation, na.rm = TRUE)) %>%
554 | ungroup()
555 |
556 | # Print the first few rows to check
557 | head(la_totals)
558 | ```
559 |
560 | ```{r}
561 | # Merge the ethnic_totals and la_totals dataframes together
562 | # by parameter specifies which column to perform merge on
563 |
564 | grp_pct <- merge(ethnic_totals, la_totals, by = "LA_name")
565 | ```
566 |
567 | ```{r}
568 | # Calculate the percentage of each ethnic group within each local authority
569 | # Store results in new column
570 |
571 | grp_pct <- grp_pct %>%
572 | mutate(Percentage = round((Ethnic_sum / LA_sum * 100), 2))
573 | ```
574 |
575 |
576 | ```{r}
577 | # Print the first few rows to check
578 | head(grp_pct, 10)
579 | ```
580 |
581 | ### Calculate Ethnic Group Non-Response Rates (%'s) Within LAs
582 |
583 | ```{r}
584 | # We already have our ethnic group totals which we can re-use...
585 |
586 | ethnic_totals
587 | ```
588 |
589 | ```{r}
590 | # Calculate sum of non-responses for each ethnic group within each LA
591 | # Filter df3 so that we only have non-response rows
592 | # Group by LA and ethnic group then sum non-response obs and store the results in new column
593 |
594 | non_response_totals <- df3 %>%
595 | filter(gender_identity == 'Not answered') %>%
596 | group_by(LA_name, ethnic_group) %>%
597 | summarise(NR_total = sum(observation, na.rm = TRUE)) %>%
598 | ungroup()
599 | ```
600 |
601 | ```{r}
602 |
603 | # Let's check it out..
604 | head(non_response_totals)
605 | ```
606 |
607 |
608 | ```{r}
609 | # Merge the ethnic group totals with the ethnic group non-response totals
610 | # c - used when we're referencing more than one column
611 | # all.x - performs a left join
612 | grp_nr <- merge(ethnic_totals, non_response_totals, by = c("LA_name", "ethnic_group"), all.x = TRUE)
613 |
614 | ```
615 |
616 | ```{r}
617 | # Let's check it out..
618 |
619 | head(grp_nr)
620 | ```
621 |
622 |
623 | ```{r}
624 | # Calculate the non-response percentage for each ethnic group within each LA
625 | # Store results in new column
626 |
627 | grp_nr <- grp_nr %>%
628 | mutate(Eth_NR_Perc = round((NR_total / Ethnic_sum * 100), 2))
629 | ```
630 |
631 |
632 | ```{r}
633 | # Quick glance..
634 | head(grp_nr)
635 | ```
636 | ### Merge both datasets
637 |
638 | Now that we've completed the necessary calculations, we are left with two datasets:
639 |
640 | * grp_pct - details the % of each ethnic_group in each LA
641 | * grp_nr - details the ethnic group non-response % in each LA
642 |
643 | All we need to do now is merge these datasets together so that we can access the new columns and plot them:
644 |
645 | * Percentage
646 | * Eth_NR_Perc
647 |
648 | ```{r}
649 | # Merge the non-response data with the percentage of each ethnic group within each LA
650 | # Use select() to keep only the columns we need from grp_pct; LA_sum is redundant...
651 |
652 | nr <- merge(grp_nr, select(grp_pct, LA_name, ethnic_group, Percentage), by = c("LA_name", "ethnic_group"))
653 | ```
654 |
655 | ```{r}
656 | # Quick glance
657 |
658 | head(nr)
659 | ```
660 |
661 |
662 | ## Interactive scatterplot
663 |
664 | In this section we're going to:
665 |
666 | 1. Create a simple scatterplot exploring the relationship between the percentage of Asian citizens within local authorities and their non-response rates
667 |
668 | 2. Implement a dropdown widget to update our scatterplot
669 |
670 | ```{r}
671 | # Subset dataframe so we only have responses from the asian ethnic group
672 |
673 | asian <- nr %>%
674 | filter(ethnic_group == 'Asian, Asian British or Asian Welsh')
675 | ```
676 |
677 |
678 | ```{r}
679 | # Check it out..
680 |
681 | head(asian)
682 | ```
683 |
684 | ```{r}
685 | # Initialize figure
686 | fig3 <- plot_ly(data = asian,
687 | x = ~Percentage,
688 | y = ~Eth_NR_Perc,
689 | text = ~paste('LA Name:', LA_name,
690 | '<br>Non-response Total:', NR_total,
691 | '<br>Ethnic Group Total:', Ethnic_sum),
692 | hoverinfo = "text",
693 | mode = 'markers', # Specify marker points
694 | type = 'scatter', # Graph type - scatterplot
695 | name = 'Asian') # Default visible graph
696 |
697 |
698 | # Customize layout
699 | fig3 <- fig3 %>%
700 | layout(title = 'Non-Response Rates of the Asian Ethnic Group Across Local Authorities',
701 | xaxis = list(title = 'Percentage of Ethnic Group'),
702 | yaxis = list(title = 'Non-response Rate'),
703 | width = 700,
704 | height = 700)
705 |
706 | # Show the plot
707 | fig3
708 | ```
709 |
710 | ## Dropdown selection
711 |
712 | We're now going to use Plotly's 'updatemenus' in conjunction with the 'update' method to create a dropdown that lets us switch between the Asian ethnic group and the White ethnic group to make some comparisons.
713 |
714 | ### Step 1: Initialise figure and add traces
715 |
716 | We'll start by creating a plot_ly figure with no data or variables specified, because we're going to use add_trace to add our two sets of data points to the plot. A 'trace' refers to a set of data, so in our example we want to add one trace with the data points for the Asian ethnic group and another for the White ethnic group. This will start to make sense when we look at the code below.
717 |
718 | ```{r}
719 | # Initialize a Plotly figure
720 | fig4 <- plot_ly()
721 |
722 | # Let's take a look..
723 | # This is our building block
724 | fig4
725 |
726 | ```
727 |
728 |
729 |
730 | ```{r}
731 | # Subset dataframe so we only have responses from the white ethnic group
732 | white <- nr %>%
733 | filter(ethnic_group == 'White: English, Welsh, Scottish, Northern Irish or British')
734 |
735 | ```
736 |
737 | ```{r}
738 | # Quick check...
739 | head(white)
740 | ```
741 |
742 |
743 | ```{r}
744 | # Add trace for the Asian ethnic group
745 |
746 | fig4 <- fig4 %>% add_trace(
747 | data = asian,
748 | x = ~Percentage,
749 | y = ~Eth_NR_Perc,
750 | text = ~paste('LA Name:', LA_name,
751 | '<br>Non-response Total:', NR_total,
752 | '<br>Ethnic Group Total:', Ethnic_sum),
753 | type = 'scatter',
754 | mode = 'markers',
755 | name = 'Asian',
756 | hoverinfo = 'text',
757 | # visible parameter sets initial visibility of each trace when plot is first rendered
758 | visible = TRUE
759 | )
760 |
761 |
762 | # Add trace for the White ethnic group
763 | fig4 <- fig4 %>% add_trace(
764 | data = white,
765 | x = ~Percentage,
766 | y = ~Eth_NR_Perc,
767 | text = ~paste('LA Name:', LA_name,
768 | '<br>Non-response Total:', NR_total,
769 | '<br>Ethnic Group Total:', Ethnic_sum),
770 | type = 'scatter',
771 | mode = 'markers',
772 | name = 'White',
773 | hoverinfo = 'text',
774 | visible = FALSE
775 | )
776 |
777 | fig4
778 | ```
779 |
780 | ### Step 2: Configure dropdown buttons and implement update method
781 |
782 | ```{r}
783 |
784 | # Define dropdown buttons for interactivity
785 | fig4 <- fig4 %>% layout(
786 | title = "Non-Response Rates Across Local Authorities",
787 | xaxis = list(title = "Percentage of Ethnic Group"),
788 | yaxis = list(title = "Non-response Rate"),
789 | # Hide the legend, as the dropdown will handle trace visibility
790 | showlegend = FALSE,
791 | # Add dropdown menu for interactive plot updates
792 | updatemenus = list(
793 | list(
794 | type = "dropdown",
795 | buttons = list(
796 | list(
797 | # the update method changes plot attributes when a button is clicked
798 | method = "update",
799 | # First button makes Asian data visible and hides the White data
800 | # Used to dynamically update the visibility of the traces based on user interaction
801 | args = list(list("visible" = list(TRUE, FALSE)),
802 | # Update the title specific to the Asian data
803 | list("title" = "Non-Response Rates of the Asian Ethnic Group Across Local Authorities")),
804 | # Specify button label
805 | label = "Asian"
806 | ),
807 | list(
808 | method = "update",
809 | args = list(list("visible" = list(FALSE, TRUE)),
810 | list("title" = "Non-Response Rates of the White Ethnic Group Across Local Authorities")),
811 | label = "White"
812 | )
813 | )
814 | )
815 | )
816 | )
817 |
818 | # Display the figure
819 | fig4
820 | ```
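If the nesting above is hard to parse: each button's args is just a pair of plain R lists, where the first updates trace attributes (here, visibility flags, in the order the traces were added) and the second updates layout attributes (here, the title). Stripped of the Plotly call, one button looks like this (title shortened for readability):

```{r}
# One dropdown button as a bare R list - no Plotly needed to inspect it
button_asian <- list(
  method = "update",
  args = list(
    list(visible = list(TRUE, FALSE)),  # trace update: show trace 1 (Asian), hide trace 2 (White)
    list(title = "Asian ethnic group")  # layout update: replacement title
  ),
  label = "Asian"
)

str(button_asian)
```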
821 |
822 |
823 |
824 | # Sharing your interactive graphs online
825 |
826 | First, I'll show you a really simple way to host Plotly graphs specifically; then we'll look at other, more complex options that work with many visualisation packages.
827 |
828 | 1. Use Plotly's ['Chart Studio'](https://chart-studio.plotly.com/). You can upload your visualisations directly from your coding environment and then get a link to share them online. You'll need to sign up for an account, but it's free; if you want to share the link privately, though, you'll need to upgrade your account. Otherwise, for data that's fine being out in the open, this is a good option.
829 |
830 | 2. Embed your graphs in GitHub Pages. I'm not going to go into this fully, but if you're interested in doing something like this I recommend looking at GitHub's tutorial: https://pages.github.com/. This is what I used to create a GitHub Pages site for the UKDS, which now acts as a lil website where we can show off cool projects like this one! Might be something to consider if you're a researcher looking to show off your work.
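For either route you'll need your figure as a standalone HTML file first. One way to get that (assuming you have the htmlwidgets package installed; the filename here is just an example) is:

```{r}
# Save the interactive figure as a single HTML file
# selfcontained = TRUE bundles the JavaScript so the file works on its own
library(htmlwidgets)
saveWidget(fig4, "my_plotly_graph.html", selfcontained = TRUE)
```

The resulting file is what you'd commit to a GitHub Pages repository and link to, or embed, like any other page.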
831 |
832 |
833 |
--------------------------------------------------------------------------------
/Python_code/HTML_files/scatter.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
6 |
7 |
--------------------------------------------------------------------------------