├── data
│   └── git.keep
├── fun_data_files
│   └── figure-html
│       ├── unnamed-chunk-10-1.png
│       ├── unnamed-chunk-12-1.png
│       ├── unnamed-chunk-3-1.png
│       ├── unnamed-chunk-6-1.png
│       ├── unnamed-chunk-7-1.png
│       └── unnamed-chunk-9-1.png
├── fun_data.Rmd
└── README.md

/data/git.keep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-10-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-10-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-12-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-12-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-3-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-3-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-6-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-6-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-7-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-7-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-9-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-9-1.png
--------------------------------------------------------------------------------
/fun_data.Rmd:
--------------------------------------------------------------------------------
---
title: 'Fun data: open data that is fun to analyse'
author: "Joe Russell, Adnan Fiaz"
output:
  html_document:
    toc: true
    keep_md: true
---

```{r setup, include=FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, cache=TRUE,
                      out.width='900px', dpi=200)
library(tidyverse)
```

Jeremy Singer-Vine sends out a [newsletter](https://tinyletter.com/data-is-plural) every week where he highlights a number of interesting open datasets (you can explore all the datasets [here](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0)). At Mango we are all for open data, so we thought we would also share some of the open datasets we think are fun to explore.

### Open Food Facts
[Open Food Facts](https://world.openfoodfacts.org/) is a collaborative, free and open database of food products, and a prime example of how effective crowdsourcing data can be. People from around the world have collected details about more than 300,000 food products and uploaded the information through mobile apps. The data is available as a MongoDB dump, a CSV export and an experimental API. We have downloaded the CSV export and will try to visualise the ingredients across all products.
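Parsing a file this size can take a little while. If `read_tsv()` proves slow, a quicker import is possible with `data.table::fread()`; a minimal sketch, assuming the `data.table` package is installed (not run here):

```{r, eval=FALSE}
# Sketch only: fread() auto-detects the tab separator and is typically much
# faster on large delimited files; as_tibble() converts the result back to
# the tibble that the rest of the pipeline expects.
library(data.table)
foodFacts <- as_tibble(fread("data/en.openfoodfacts.org.products.csv"))
```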
```{r, collapse=TRUE}
# http://world.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
foodFacts <- read_tsv("data/en.openfoodfacts.org.products.csv")
dim(foodFacts)
```

```{r}
library(stringr)
ingredients <- foodFacts %>%
  # ideally, the unnest_tokens function is what we want but it was too slow
  # tidytext::unnest_tokens(ingredient, ingredients_text,
  #                         token = stringr::str_split, pattern=",|\\(|\\)|\\[|\\]") %>%
  # so instead we chose a different approach involving transmute and unlist
  # transmute will give us a list-column
  transmute(ingredients = str_split(ingredients_text, pattern=",|\\(|\\)|\\[|\\]")) %>%
  # unlist will convert the list-column into a character vector
  unlist() %>%
  # enframe will convert the vector into a data frame which is easier to group by
  enframe() %>%
  # now we clean up some of the text
  # (longer alternative first, otherwise "organic" only loses its "org")
  mutate(value = str_replace(value, "organic|org", ""),
         value = str_replace(value, "-> en:\\w+", ""),
         value = str_replace_all(value, "'", ""),
         value = str_trim(value)) %>%
  # and finally filter out some of the weirder entries
  filter(value!="", value!=".",
         !str_detect(value, "completed|\\d")) %>%
  # to then group by and count the ingredients
  count(value) %>%
  arrange(desc(n))
head(ingredients, 10)
```

There are no surprises at the top, but further down there are a few ingredients that are odd. Let's create a word cloud to show the relative frequencies.

```{r}
library(wordcloud)
top100 <- head(ingredients, 100)
wordcloud::wordcloud(top100$value, top100$n)
```

Ingredients are only one aspect of this very interesting dataset. We could go on and look at the co-occurrence of certain ingredients (using network analysis) and then continue analysing their quantities. We could also include the data on nutritional value and calculate correlations. The data could also use some more cleaning, considering some ingredients appear in different languages (e.g. water and eau).
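As a small illustration of that last point, here is a minimal sketch of folding a few French entries into their English equivalents before re-counting. The translation table is ours and purely illustrative, not part of the dataset:

```{r, eval=FALSE}
# Illustrative lookup table; extend with further translations as needed
translations <- c("sel" = "salt", "sucre" = "sugar", "eau" = "water",
                  "lait" = "milk")
ingredients %>%
  mutate(value = recode(value, !!!translations)) %>%
  # re-count so that merged entries are summed together
  count(value, wt = n, sort = TRUE) %>%
  head(10)
```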
### Food prices
Continuing the edible theme, [VAM](http://vam.wfp.org/) collate commodity prices from the world's poorer nations and use them to help identify food insecurity hotspots. The data we will be using can be downloaded [here](https://data.humdata.org/dataset/wfp-food-prices), and with it we'll attempt to visualise how prices have changed over the past 20 years.

```{r, collapse=TRUE}
# https://data.humdata.org/dataset/wfp-food-prices
# Providing column names and types to make workings easier later on
colnames <- c("Country_ID", "Country", "Region_ID", "Region", "Market_ID", "Market",
              "Commodity_ID", "Commodity", "Currency_ID", "Currency", "Sector_ID", "Sector",
              "Units_ID", "Units", "Month", "Year", "Price", "Commodity_Source")
coltypes <- cols(Year = "n", Price = "n", Month = "n")
foodPrices <- read_csv("data/FoodPrices/WFPVAM_FoodPrices_13-03-2017.csv",
                       col_names = colnames,
                       col_types = coltypes) %>%
  filter(row_number() != 1)

# Large number of commodities - we won't be able to plot them all!
length(unique(foodPrices$Commodity))
```

```{r}
# Overall price trend - trend of all commodity prices over time
overallPriceTrend <- foodPrices %>%
  group_by(Commodity, Year) %>%
  # As different commodities will clearly have different prices, we'll make
  # them comparable by scaling based on their max price within our timeframe
  summarise(globalAveragePrice = mean(Price)/max(Price))

# Food price trend over time, grouped by commodity - same as above,
# this time selecting a smaller sample for plotting
commodities <- c("Wheat", "Milk", "Coffee", "Bananas", "Sugar")
priceTrend <- foodPrices %>%
  select(Price, Commodity, Year) %>%
  # selecting our reduced set of commodities
  filter(Commodity %in% commodities) %>%
  group_by(Commodity, Year) %>%
  summarise(globalAveragePrice = mean(Price)/max(Price))

priceTrend
```
We can see what our data looks like for each commodity.

Now let's create a graphic.
```{r}
library(forcats)
# We'll create a ggplot graphic, using geom_smooth
# Specify some colours semi-related to the commodities' genuine appearance
colours <- c("Bananas" = "#fea814", "Coffee" = "#383838", "Sugar" = "#4fcfff",
             "Milk" = "#cccccc", "Wheat" = "#005692")
# Specify commodity levels for use in our legend
fctLevels <- c("Coffee", "Bananas", "Milk", "Sugar", "Wheat")
ggplot(priceTrend, aes(x = Year, y = globalAveragePrice)) +
  geom_smooth(aes(colour = fct_relevel(Commodity, fctLevels)), se = FALSE, size = 1.8, linetype = 5) +
  geom_smooth(data = overallPriceTrend, colour = "red", se = FALSE, size = 3.5) +
  geom_line(aes(size = "All Commodities", linetype = NA), colour = "red") +
  scale_colour_manual(values = colours) +
  scale_x_continuous(breaks = seq(1992, 2017, 2)) +
  scale_y_continuous(limits = c(0, 1), labels = c("Min", "Max"), breaks = c(0, 1)) +
  labs(title = "Average Global Commodity Prices over Time",
       subtitle = "Commodity Price Relative to Max in Period",
       caption = "Data from https://data.humdata.org/dataset/wfp-food-prices",
       x = "",
       y = "",
       colour = "Commodity",
       size = "") +
  theme_classic()
```

The trend for our selected commodities seems to show a gradual decrease in prices, as all but coffee and milk prices are now lower than at the beginning of our timeframe. This is somewhat reflected in the overall price trend, which also shows a slight decline, although that has been somewhat negated in the past three years.

In our analysis we took only a peek into the dataset. For example, we could look at seasonality trends, or subset by country, region or even market. Indeed, it could be taken a step further, as VAM have done, and be used as a tool for predicting when and where food insecurity will occur in the future.
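Seasonality in particular is only a small step away. A sketch for a single commodity, reusing the max-price scaling from above (illustrative, not run here):

```{r, eval=FALSE}
# Average the scaled price by calendar month, across all years
foodPrices %>%
  filter(Commodity == "Wheat") %>%
  group_by(Month) %>%
  summarise(scaledPrice = mean(Price)/max(Price)) %>%
  ggplot(aes(x = Month, y = scaledPrice)) +
  geom_col() +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Wheat: scaled average price by calendar month",
       x = "Month", y = "Scaled price")
```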
### North Korea Missile Tests
The next dataset is from the James Martin Center for Nonproliferation Studies (CNS) North Korea Missile Test Database. We agree it's not a fun topic, but we also wanted to show the breadth of open data that is out there. You can get the data as an [Excel file](https://www.nti.org/documents/2137/north_korea_missile_test_database.xlsx) but it is also hosted on [data.world](https://data.world/ian/the-cns-north-korea-missile-test-database). And fortunately for the R community, data.world have an R package to access their API.

```{r}
library(data.world)
# We've set the configuration in a previous session
path <- "ian/the-cns-north-korea-missile-test-database"
missiles <- query(qry_sql("SELECT f1, facility_latitude, facility_longitude,
                          distance_travelled_km, facility_name
                          FROM missile_tests
                          WHERE success='Success'"),
                  path)
# additional filtering outside of the query
missiles <- missiles %>%
  filter(distance_travelled_km!="Unknown", distance_travelled_km!="N/A") %>%
  drop_na() %>%
  mutate(distance_travelled_km = as.integer(distance_travelled_km),
         facility_name = substr(facility_name, 1, 20))
head(missiles)
```

The data contains information on the launch location but not on the precise location of where the missile landed. However, we can use the distance travelled to approximate this by drawing a circle of that radius around each launch site.
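The magic numbers in the next chunk come from a rough spherical-Earth conversion: one degree of latitude is about 111 km everywhere, while one degree of longitude shrinks with the cosine of the latitude (57.3 is just the number of degrees per radian). A quick sanity check of the arithmetic:

```{r, eval=FALSE}
1000 / 111                       # a 1000 km range spans ~9.0 degrees of latitude
1000 / 111 / cos(40 * pi / 180)  # but ~11.8 degrees of longitude at 40 degrees north
```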
```{r}
# slightly adapted from https://stackoverflow.com/questions/34183049/plot-circle-with-a-certain-radius-around-point-on-a-map-in-ggplot2#34187454

# drop duplicate locations and distances
dups <- duplicated(missiles %>% select(facility_name, distance_travelled_km))
missiles <- missiles %>% filter(!dups)

# define the circle we want for each missile
circles <- data_frame(f1 = rep(missiles$f1, each = 100),
                      angle = rep(seq(0, 2*pi, length.out = 100), nrow(missiles)))

meanLatitude <- mean(missiles$facility_latitude)

missile_circles <- missiles %>%
  # the length of a degree of longitude changes with latitude, so we need a correction
  mutate(radiusLon = distance_travelled_km/111/cos(meanLatitude/57.3),
         radiusLat = distance_travelled_km/111) %>%
  left_join(circles) %>%
  mutate(longitude = facility_longitude + radiusLon * cos(angle),
         latitude = facility_latitude + radiusLat * sin(angle))
```

Now that we have our circles, we can plot them on a map.

```{r}
library(ggmap)
nk <- get_map(location = c(lon = mean(missile_circles$facility_longitude),
                           lat = mean(missile_circles$facility_latitude)),
              zoom = 5, maptype = "terrain")
ggmap(nk, extent = "panel") +
  geom_point(aes(x = facility_longitude, y = facility_latitude, colour=facility_name),
             data = missiles) +
  # add the circles
  geom_polygon(data = missile_circles, aes(longitude, latitude, group = f1,
                                           colour=facility_name), alpha = 0)
```

These are obviously not true missile ranges, and none of the missiles will have gone over China. The CNS have also created [visualisations](http://www.nti.org/analysis/articles/cns-north-korea-missile-test-database/) with this data, from which we can see that the time dimension is important. That is something we could add to our visualisation, or we could perform more analyses on the success/fail dimension.

### Flight Delays
The [Bureau of Transportation Statistics](https://www.bts.gov/) are a leading source of data on U.S. transportation systems, helping to shape transportation policy and research projects across the US. We pulled their [data](https://www.transtats.bts.gov/DL_SelectFields.asp), selecting the Month and DayOfWeek fields, as well as the Departure and Arrival delays. The site only allows us to download data one month at a time, so we begin by reading in 12 files and binding them together.

```{r, collapse=TRUE}
# https://www.transtats.bts.gov/DL_SelectFields.asp
# Month files were downloaded to separate csvs, so we'll read them in and combine them
files <- list.files(path = "data/FlightData/", pattern = "*.csv", full.names = TRUE)
flightsRaw <- map_df(files, read_csv)
```

Now let's recode our Months and Weekdays to be character names rather than just integers. Then we can prepare our data for plotting. In this plot, we'll be aiming to show the difference between departure and arrival delays, grouped by Weekday and Month.
```{r}
# Begin by renaming integer values to days of the week and months
flightData <- flightsRaw %>%
  mutate(DAY_OF_WEEK = as.factor(DAY_OF_WEEK),
         DAY_OF_WEEK = fct_recode(DAY_OF_WEEK, Monday = "1", Tuesday = "2", Wednesday = "3",
                                  Thursday = "4", Friday = "5", Saturday = "6", Sunday = "7"),
         MONTH = as.factor(MONTH),
         MONTH = fct_recode(MONTH, January = "1", February = "2", March = "3", April = "4",
                            May = "5", June = "6", July = "7", August = "8", September = "9",
                            October = "10", November = "11", December = "12"))

# Summarise the data to show the mean departure/arrival delay split by Month and Weekday
monthDayDelays <- flightData %>%
  select(DEP_DELAY, ARR_DELAY, MONTH, DAY_OF_WEEK) %>%
  group_by(MONTH, DAY_OF_WEEK) %>%
  # Calculate means split by Month and Weekday
  summarise(Departures = mean(DEP_DELAY, na.rm = TRUE), Arrivals = mean(ARR_DELAY, na.rm = TRUE)) %>%
  filter(!is.na(MONTH)) %>%
  ungroup() %>%
  # Gather our data so Departure and Arrival delay times are stored within one column
  gather(key = delayType, value = Delay, Arrivals, Departures) %>%
  # Reorder the Departure/Arrival factor so Departures appear first in our graphic
  mutate(delayType = fct_rev(delayType))
monthDayDelays
```
We now have our Delay times in one column, with delayType either arrival or departure. We also have our Month and Weekday columns for grouping our data in the plot.
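As an aside, base R ships the `month.name` constant, which makes the recoding above more compact; a sketch that should be equivalent to the `fct_recode()` calls:

```{r, eval=FALSE}
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
flightData <- flightsRaw %>%
  mutate(MONTH = factor(month.name[MONTH], levels = month.name),
         DAY_OF_WEEK = factor(days[DAY_OF_WEEK], levels = days))
```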
Now let's create a graphic. Note that the colours used for our graphic were obtained from [Color Brewer 2.0](http://colorbrewer2.org).
```{r}
# Specify the colours vector for use in scale_fill_manual
colours <- c("#ffffb2", "#fed976", "#feb24c", "#fd8d3c", "#fc4e2a", "#e31a1c", "#b10026")
# Graphic
ggplot(monthDayDelays, aes(x = MONTH, y = Delay)) +
  # Adding reference lines to our (soon to become) circular chart
  geom_hline(yintercept = seq(0, 20, by = 5), alpha = 0.1) +
  # Bars for every day of the week, grouped by month
  geom_bar(stat = "identity", aes(fill = DAY_OF_WEEK), position = "dodge", colour = "black") +
  # This bit is hacky. To get a single label to appear on our reference lines at zero
  # degrees, we select a single datapoint from January (the Month at zero degrees) and
  # use it as our data argument.
  # monthDayDelays[1, ] is the datapoint for the mean Arrival delay in January
  geom_text(data = monthDayDelays[1, ], y = 5, label = "5", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 20, label = "20", size = 3) +
  # monthDayDelays[85, ] is the datapoint for the mean Departure delay in January
  geom_text(data = monthDayDelays[85, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 20, label = "20", size = 3) +
  # Specify colours for the chart
  scale_fill_manual(values = rev(colours)) +
  # Remove the y-axis scale from the side of the plot
  scale_y_continuous(breaks = NULL) +
  # Make the plot circular, starting at due north
  coord_polar(theta = "x", start = -pi/12) +
  # Separate our departure and arrival plots
  facet_wrap(~delayType, labeller = label_parsed) +
  # Add labels
  labs(title = "Average Minute Delay of American Flights - 2016",
       x = "",
       y = "",
       fill = "",
       caption = "Data from https://www.transtats.bts.gov/DL_SelectFields.asp") +
  theme_minimal(base_size = 8) +
  # Make our facet titles pretty
  theme(strip.text.x = element_text(size = 15, colour = "#f03b20"))
```

Interestingly, we can see that even if your flight is delayed, you can expect that delay to have been reduced by the time you arrive. In fact, on average 5 minutes and 25 seconds are recovered during air time.
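That figure is simply the average gap between departure and arrival delay; a sketch of one way to check it:

```{r, eval=FALSE}
flightData %>%
  summarise(meanRecovery = mean(DEP_DELAY - ARR_DELAY, na.rm = TRUE))
# roughly 5.42 minutes, i.e. about 5 minutes and 25 seconds
```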
Perhaps not so interestingly, we can see that the months with the longest delays fall in the summer and winter holidays. Conversely, if you're not bound by school holidays, November looks to be an excellent time to travel, as you can expect to arrive a whole 2 minutes early!

Again with this analysis, there are many more pathways we could explore. For example, [FiveThirtyEight](https://projects.fivethirtyeight.com/flights/) have produced a similar delay chart, only this time grouped by airport. This, when combined with our brief analysis, makes your dream of an undelayed vacation a possible reality!

This blogpost turned out longer than we expected, but that's what happens when you have fun. If you have a cool dataset that's fun to analyse, let us know on [Twitter](https://twitter.com/MangotheCat). You can find the code for this blogpost on [GitHub](https://github.com/MangoTheCat/blog_fun_data).
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Fun data: open data that is fun to analyse
Joe Russell, Adnan Fiaz

Jeremy Singer-Vine sends out a [newsletter](https://tinyletter.com/data-is-plural) every week where he highlights a number of interesting open datasets (you can explore all the datasets [here](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0)). At Mango we are all for open data, so we thought we would also share some of the open datasets we think are fun to explore.

### Open Food Facts
[Open Food Facts](https://world.openfoodfacts.org/) is a collaborative, free and open database of food products, and a prime example of how effective crowdsourcing data can be. People from around the world have collected details about more than 300,000 food products and uploaded the information through mobile apps. The data is available as a MongoDB dump, a CSV export and an experimental API. We have downloaded the CSV export and will try to visualise the ingredients across all products.

```r
# http://world.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
foodFacts <- read_tsv("data/en.openfoodfacts.org.products.csv")
dim(foodFacts)
## [1] 295958    162
```

```r
library(stringr)
ingredients <- foodFacts %>%
  # ideally, the unnest_tokens function is what we want but it was too slow
  # tidytext::unnest_tokens(ingredient, ingredients_text,
  #                         token = stringr::str_split, pattern=",|\\(|\\)|\\[|\\]") %>%
  # so instead we chose a different approach involving transmute and unlist
  # transmute will give us a list-column
  transmute(ingredients = str_split(ingredients_text, pattern=",|\\(|\\)|\\[|\\]")) %>%
  # unlist will convert the list-column into a character vector
  unlist() %>%
  # enframe will convert the vector into a data frame which is easier to group by
  enframe() %>%
  # now we clean up some of the text
  # (longer alternative first, otherwise "organic" only loses its "org")
  mutate(value = str_replace(value, "organic|org", ""),
         value = str_replace(value, "-> en:\\w+", ""),
         value = str_replace_all(value, "'", ""),
         value = str_trim(value)) %>%
  # and finally filter out some of the weirder entries
  filter(value!="", value!=".",
         !str_detect(value, "completed|\\d")) %>%
  # to then group by and count the ingredients
  count(value) %>%
  arrange(desc(n))
head(ingredients, 10)
```

```
## # A tibble: 10 x 2
##          value      n
##          <chr>  <int>
## 1         salt 122183
## 2        sugar  88463
## 3        water  80037
## 4          sel  38852
## 5        sucre  29971
## 6          eau  28502
## 7  citric acid  28475
## 8   riboflavin  21527
## 9         milk  21265
## 10      niacin  21201
```

There are no surprises at the top, but further down there are a few ingredients that are odd. Let's create a word cloud to show the relative frequencies.

```r
library(wordcloud)
top100 <- head(ingredients, 100)
wordcloud::wordcloud(top100$value, top100$n)
```

*(figure: word cloud of the 100 most common ingredients)*

Ingredients are only one aspect of this very interesting dataset. We could go on and look at the co-occurrence of certain ingredients (using network analysis) and then continue analysing their quantities. We could also include the data on nutritional value and calculate correlations. The data could also use some more cleaning, considering some ingredients appear in different languages (e.g. water and eau).

### Food prices
Continuing the edible theme, [VAM](http://vam.wfp.org/) collate commodity prices from the world's poorer nations and use them to help identify food insecurity hotspots. The data we will be using can be downloaded [here](https://data.humdata.org/dataset/wfp-food-prices), and with it we'll attempt to visualise how prices have changed over the past 20 years.
```r
# https://data.humdata.org/dataset/wfp-food-prices
# Providing column names and types to make workings easier later on
colnames <- c("Country_ID", "Country", "Region_ID", "Region", "Market_ID", "Market",
              "Commodity_ID", "Commodity", "Currency_ID", "Currency", "Sector_ID", "Sector",
              "Units_ID", "Units", "Month", "Year", "Price", "Commodity_Source")
coltypes <- cols(Year = "n", Price = "n", Month = "n")
foodPrices <- read_csv("data/FoodPrices/WFPVAM_FoodPrices_13-03-2017.csv",
                       col_names = colnames,
                       col_types = coltypes) %>%
  filter(row_number() != 1)

# Large number of commodities - we won't be able to plot them all!
length(unique(foodPrices$Commodity))
## [1] 304
```

```r
# Overall price trend - trend of all commodity prices over time
overallPriceTrend <- foodPrices %>%
  group_by(Commodity, Year) %>%
  # As different commodities will clearly have different prices, we'll make
  # them comparable by scaling based on their max price within our timeframe
  summarise(globalAveragePrice = mean(Price)/max(Price))

# Food price trend over time, grouped by commodity - same as above,
# this time selecting a smaller sample for plotting
commodities <- c("Wheat", "Milk", "Coffee", "Bananas", "Sugar")
priceTrend <- foodPrices %>%
  select(Price, Commodity, Year) %>%
  # selecting our reduced set of commodities
  filter(Commodity %in% commodities) %>%
  group_by(Commodity, Year) %>%
  summarise(globalAveragePrice = mean(Price)/max(Price))

priceTrend
```

```
## # A tibble: 77 x 3
## # Groups:   Commodity [?]
##    Commodity  Year globalAveragePrice
##        <chr> <dbl>              <dbl>
## 1    Bananas  2008          1.0000000
## 2    Bananas  2009          0.4487272
## 3    Bananas  2010          0.1727820
## 4    Bananas  2011          0.2909874
## 5    Bananas  2012          0.4470188
## 6    Bananas  2013          0.1271515
## 7    Bananas  2014          0.1263784
## 8    Bananas  2015          0.1129815
## 9    Bananas  2016          0.2050480
## 10   Bananas  2017          0.2630983
## # ... with 67 more rows
```
We can see what our data looks like for each commodity.

Now let's create a graphic.
```r
library(forcats)
# We'll create a ggplot graphic, using geom_smooth
# Specify some colours semi-related to the commodities' genuine appearance
colours <- c("Bananas" = "#fea814", "Coffee" = "#383838", "Sugar" = "#4fcfff",
             "Milk" = "#cccccc", "Wheat" = "#005692")
# Specify commodity levels for use in our legend
fctLevels <- c("Coffee", "Bananas", "Milk", "Sugar", "Wheat")
ggplot(priceTrend, aes(x = Year, y = globalAveragePrice)) +
  geom_smooth(aes(colour = fct_relevel(Commodity, fctLevels)), se = FALSE, size = 1.8, linetype = 5) +
  geom_smooth(data = overallPriceTrend, colour = "red", se = FALSE, size = 3.5) +
  geom_line(aes(size = "All Commodities", linetype = NA), colour = "red") +
  scale_colour_manual(values = colours) +
  scale_x_continuous(breaks = seq(1992, 2017, 2)) +
  scale_y_continuous(limits = c(0, 1), labels = c("Min", "Max"), breaks = c(0, 1)) +
  labs(title = "Average Global Commodity Prices over Time",
       subtitle = "Commodity Price Relative to Max in Period",
       caption = "Data from https://data.humdata.org/dataset/wfp-food-prices",
       x = "",
       y = "",
       colour = "Commodity",
       size = "") +
  theme_classic()
```

*(figure: scaled average commodity prices over time, selected commodities against the overall trend)*

The trend for our selected commodities seems to show a gradual decrease in prices, as all but coffee and milk prices are now lower than at the beginning of our timeframe. This is somewhat reflected in the overall price trend, which also shows a slight decline, although that has been somewhat negated in the past three years.

In our analysis we took only a peek into the dataset. For example, we could look at seasonality trends, or subset by country, region or even market. Indeed, it could be taken a step further, as VAM have done, and be used as a tool for predicting when and where food insecurity will occur in the future.

### North Korea Missile Tests
The next dataset is from the James Martin Center for Nonproliferation Studies (CNS) North Korea Missile Test Database. We agree it's not a fun topic, but we also wanted to show the breadth of open data that is out there. You can get the data as an [Excel file](https://www.nti.org/documents/2137/north_korea_missile_test_database.xlsx) but it is also hosted on [data.world](https://data.world/ian/the-cns-north-korea-missile-test-database). And fortunately for the R community, data.world have an R package to access their API.
```r
library(data.world)
# We've set the configuration in a previous session
path <- "ian/the-cns-north-korea-missile-test-database"
missiles <- query(qry_sql("SELECT f1, facility_latitude, facility_longitude,
                          distance_travelled_km, facility_name
                          FROM missile_tests
                          WHERE success='Success'"),
                  path)
# additional filtering outside of the query
missiles <- missiles %>%
  filter(distance_travelled_km!="Unknown", distance_travelled_km!="N/A") %>%
  drop_na() %>%
  mutate(distance_travelled_km = as.integer(distance_travelled_km),
         facility_name = substr(facility_name, 1, 20))
head(missiles)
```

```
## # A tibble: 6 x 5
##      f1 facility_latitude facility_longitude distance_travelled_km
##   <int>             <dbl>              <dbl>                 <int>
## 1   101          39.65960           124.7057                  1000
## 2   102          39.65960           124.7057                  1000
## 3    12          40.85000           129.6667                   500
## 4    40          38.99083           127.6236                   200
## 5    41          38.99083           127.6236                   200
## 6    42          38.99083           127.6236                   200
## # ... with 1 more variables: facility_name <chr>
```

The data contains information on the launch location but not on the precise location of where the missile landed. However, we can use the distance travelled to approximate this by drawing a circle of that radius around each launch site.

```r
# slightly adapted from https://stackoverflow.com/questions/34183049/plot-circle-with-a-certain-radius-around-point-on-a-map-in-ggplot2#34187454

# drop duplicate locations and distances
dups <- duplicated(missiles %>% select(facility_name, distance_travelled_km))
missiles <- missiles %>% filter(!dups)

# define the circle we want for each missile
circles <- data_frame(f1 = rep(missiles$f1, each = 100),
                      angle = rep(seq(0, 2*pi, length.out = 100), nrow(missiles)))

meanLatitude <- mean(missiles$facility_latitude)

missile_circles <- missiles %>%
  # the length of a degree of longitude changes with latitude, so we need a correction
  mutate(radiusLon = distance_travelled_km/111/cos(meanLatitude/57.3),
         radiusLat = distance_travelled_km/111) %>%
  left_join(circles) %>%
  mutate(longitude = facility_longitude + radiusLon * cos(angle),
         latitude = facility_latitude + radiusLat * sin(angle))
```

Now that we have our circles, we can plot them on a map.

```r
library(ggmap)
nk <- get_map(location = c(lon = mean(missile_circles$facility_longitude),
                           lat = mean(missile_circles$facility_latitude)),
              zoom = 5, maptype = "terrain")
ggmap(nk, extent = "panel") +
  geom_point(aes(x = facility_longitude, y = facility_latitude, colour=facility_name),
             data = missiles) +
  # add the circles
  geom_polygon(data = missile_circles, aes(longitude, latitude, group = f1,
                                           colour=facility_name), alpha = 0)
```

*(figure: map of North Korean launch facilities with distance-travelled circles)*

These are obviously not true missile ranges, and none of the missiles will have gone over China. The CNS have also created [visualisations](http://www.nti.org/analysis/articles/cns-north-korea-missile-test-database/) with this data, from which we can see that the time dimension is important. That is something we could add to our visualisation, or we could perform more analyses on the success/fail dimension.
### Flight Delays
The [Bureau of Transportation Statistics](https://www.bts.gov/) are a leading source of data on U.S. transportation systems, helping to shape transportation policy and research projects across the US. We pulled their [data](https://www.transtats.bts.gov/DL_SelectFields.asp), selecting the Month and DayOfWeek fields, as well as the Departure and Arrival delays. The site only allows us to download data one month at a time, so we begin by reading in 12 files and binding them together.

```r
# https://www.transtats.bts.gov/DL_SelectFields.asp
# Month files were downloaded to separate csvs, so we'll read them in and combine them
files <- list.files(path = "data/FlightData/", pattern = "*.csv", full.names = TRUE)
flightsRaw <- map_df(files, read_csv)
```

Now let's recode our Months and Weekdays to be character names rather than just integers. Then we can prepare our data for plotting. In this plot, we'll be aiming to show the difference between departure and arrival delays, grouped by Weekday and Month.

```r
# Begin by renaming integer values to days of the week and months
flightData <- flightsRaw %>%
  mutate(DAY_OF_WEEK = as.factor(DAY_OF_WEEK),
         DAY_OF_WEEK = fct_recode(DAY_OF_WEEK, Monday = "1", Tuesday = "2", Wednesday = "3",
                                  Thursday = "4", Friday = "5", Saturday = "6", Sunday = "7"),
         MONTH = as.factor(MONTH),
         MONTH = fct_recode(MONTH, January = "1", February = "2", March = "3", April = "4",
                            May = "5", June = "6", July = "7", August = "8", September = "9",
                            October = "10", November = "11", December = "12"))

# Summarise the data to show the mean departure/arrival delay split by Month and Weekday
monthDayDelays <- flightData %>%
  select(DEP_DELAY, ARR_DELAY, MONTH, DAY_OF_WEEK) %>%
  group_by(MONTH, DAY_OF_WEEK) %>%
  # Calculate means split by Month and Weekday
  summarise(Departures = mean(DEP_DELAY, na.rm = TRUE), Arrivals = mean(ARR_DELAY, na.rm = TRUE)) %>%
  filter(!is.na(MONTH)) %>%
  ungroup() %>%
  # Gather our data so Departure and Arrival delay times are stored within one column
  gather(key = delayType, value = Delay, Arrivals, Departures) %>%
  # Reorder the Departure/Arrival factor so Departures appear first in our graphic
  mutate(delayType = fct_rev(delayType))
monthDayDelays
```

```
## # A tibble: 168 x 4
##       MONTH DAY_OF_WEEK delayType       Delay
##      <fctr>      <fctr>     <chr>       <dbl>
## 1   January      Monday  Arrivals  3.08526171
## 2   January     Tuesday  Arrivals  0.49934650
## 3   January   Wednesday  Arrivals -0.64372809
## 4   January    Thursday  Arrivals -0.61050874
## 5   January      Friday  Arrivals  3.37778214
## 6   January    Saturday  Arrivals  0.05099645
## 7   January      Sunday  Arrivals  4.17985079
## 8  February      Monday  Arrivals  1.30443327
## 9  February     Tuesday  Arrivals  5.09620906
## 10 February   Wednesday  Arrivals  3.00267779
## # ... with 158 more rows
```
We now have our Delay times in one column, with delayType either arrival or departure. We also have our Month and Weekday columns for grouping our data in the plot.

Now let's create a graphic. Note that the colours used for our graphic were obtained from [Color Brewer 2.0](http://colorbrewer2.org).
```r
# Specify the colours vector for use in scale_fill_manual
colours <- c("#ffffb2", "#fed976", "#feb24c", "#fd8d3c", "#fc4e2a", "#e31a1c", "#b10026")
# Graphic
ggplot(monthDayDelays, aes(x = MONTH, y = Delay)) +
  # Adding reference lines to our (soon to become) circular chart
  geom_hline(yintercept = seq(0, 20, by = 5), alpha = 0.1) +
  # Bars for every day of the week, grouped by month
  geom_bar(stat = "identity", aes(fill = DAY_OF_WEEK), position = "dodge", colour = "black") +
  # This bit is hacky. To get a single label to appear on our reference lines at zero
  # degrees, we select a single datapoint from January (the Month at zero degrees) and
  # use it as our data argument.
  # monthDayDelays[1, ] is the datapoint for the mean Arrival delay in January
  geom_text(data = monthDayDelays[1, ], y = 5, label = "5", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 20, label = "20", size = 3) +
  # monthDayDelays[85, ] is the datapoint for the mean Departure delay in January
  geom_text(data = monthDayDelays[85, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 20, label = "20", size = 3) +
  # Specify colours for the chart
  scale_fill_manual(values = rev(colours)) +
  # Remove the y-axis scale from the side of the plot
  scale_y_continuous(breaks = NULL) +
  # Make the plot circular, starting at due north
  coord_polar(theta = "x", start = -pi/12) +
  # Separate our departure and arrival plots
  facet_wrap(~delayType, labeller = label_parsed) +
  # Add labels
  labs(title = "Average Minute Delay of American Flights - 2016",
       x = "",
       y = "",
       fill = "",
       caption = "Data from https://www.transtats.bts.gov/DL_SelectFields.asp") +
  theme_minimal(base_size = 8) +
  # Make our facet titles pretty
  theme(strip.text.x = element_text(size = 15, colour = "#f03b20"))
```

*(figure: circular bar charts of average delay in minutes by month and weekday, departures vs arrivals)*

Interestingly, we can see that even if your flight is delayed, you can expect that delay to have been reduced by the time you arrive. In fact, on average 5 minutes and 25 seconds are recovered during air time.

Perhaps not so interestingly, we can see that the months with the longest delays fall in the summer and winter holidays. Conversely, if you're not bound by school holidays, November looks to be an excellent time to travel, as you can expect to arrive a whole 2 minutes early!

Again with this analysis, there are many more pathways we could explore. For example, [FiveThirtyEight](https://projects.fivethirtyeight.com/flights/) have produced a similar delay chart, only this time grouped by airport. This, when combined with our brief analysis, makes your dream of an undelayed vacation a possible reality!

This blogpost turned out longer than we expected, but that's what happens when you have fun. If you have a cool dataset that's fun to analyse, let us know on [Twitter](https://twitter.com/MangotheCat). You can find the code for this blogpost on [GitHub](https://github.com/MangoTheCat/blog_fun_data).
--------------------------------------------------------------------------------