├── data
│   └── git.keep
├── fun_data_files
│   └── figure-html
│       ├── unnamed-chunk-10-1.png
│       ├── unnamed-chunk-12-1.png
│       ├── unnamed-chunk-3-1.png
│       ├── unnamed-chunk-6-1.png
│       ├── unnamed-chunk-7-1.png
│       └── unnamed-chunk-9-1.png
├── fun_data.Rmd
└── README.md

/data/git.keep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-10-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-10-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-12-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-12-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-3-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-3-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-6-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-6-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-7-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-7-1.png
--------------------------------------------------------------------------------
/fun_data_files/figure-html/unnamed-chunk-9-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MangoTheCat/blog_fun_data/master/fun_data_files/figure-html/unnamed-chunk-9-1.png
--------------------------------------------------------------------------------
/fun_data.Rmd:
--------------------------------------------------------------------------------
---
title: 'Fun data: open data that is fun to analyse'
author: "Joe Russell, Adnan Fiaz"
output:
  html_document:
    toc: true
    keep_md: true
---

```{r setup, include=FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, cache=TRUE,
                      out.width='900px', dpi=200)
library(tidyverse)
```

Jeremy Singer-Vine sends out a [newsletter](https://tinyletter.com/data-is-plural) every week where he highlights a number of interesting open datasets (you can explore all the datasets [here](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0)). At Mango we are all for open data, so we thought we would also share some of the open datasets we think are fun to explore.

### Open Food Facts
[Open Food Facts](https://world.openfoodfacts.org/) is a collaborative, free and open database of food products, and a prime example of how effective crowdsourcing data can be. People from around the world have collected details about more than 300,000 food products and uploaded the information through mobile apps. The data is available as a MongoDB dump, a CSV export and an experimental API. We have downloaded the CSV export and will try to visualise the ingredients across all products.
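Parsing a file this size can take a little while. If `read_tsv()` proves slow, a quicker import is possible with `data.table::fread()`; a minimal sketch, assuming the `data.table` package is installed (not run here):

```{r, eval=FALSE}
# Sketch only: fread() auto-detects the tab separator and is typically much
# faster on large delimited files; as_tibble() converts the result back to
# the tibble that the rest of the pipeline expects.
library(data.table)
foodFacts <- as_tibble(fread("data/en.openfoodfacts.org.products.csv"))
```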
```{r, collapse=TRUE}
# http://world.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
foodFacts <- read_tsv("data/en.openfoodfacts.org.products.csv")
dim(foodFacts)
```

```{r}
library(stringr)
ingredients <- foodFacts %>%
  # ideally, the unnest_tokens function is what we want but it was too slow
  # tidytext::unnest_tokens(ingredient, ingredients_text,
  #                         token = stringr::str_split, pattern=",|\\(|\\)|\\[|\\]") %>%
  # so instead we chose a different approach involving transmute and unlist
  # transmute will give us a list-column
  transmute(ingredients = str_split(ingredients_text, pattern=",|\\(|\\)|\\[|\\]")) %>%
  # unlist will convert the list-column into a character vector
  unlist() %>%
  # enframe will convert the vector into a data frame which is easier to group by
  enframe() %>%
  # now we clean up some of the text
  # (longer alternative first, otherwise "organic" only loses its "org")
  mutate(value = str_replace(value, "organic|org", ""),
         value = str_replace(value, "-> en:\\w+", ""),
         value = str_replace_all(value, "'", ""),
         value = str_trim(value)) %>%
  # and finally filter out some of the weirder entries
  filter(value!="", value!=".",
         !str_detect(value, "completed|\\d")) %>%
  # to then group by and count the ingredients
  count(value) %>%
  arrange(desc(n))
head(ingredients, 10)
```

There are no surprises at the top, but further down there are a few ingredients that are odd. Let's create a word cloud to show the relative frequencies.

```{r}
library(wordcloud)
top100 <- head(ingredients, 100)
wordcloud::wordcloud(top100$value, top100$n)
```

Ingredients are only one aspect of this very interesting dataset. We could go on and look at the co-occurrence of certain ingredients (using network analysis) and then continue analysing their quantities. We could also include the data on nutritional value and calculate correlations. The data could also use some more cleaning, considering some ingredients appear in different languages (e.g. water and eau).
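As a small illustration of that last point, here is a minimal sketch of folding a few French entries into their English equivalents before re-counting. The translation table is ours and purely illustrative, not part of the dataset:

```{r, eval=FALSE}
# Illustrative lookup table; extend with further translations as needed
translations <- c("sel" = "salt", "sucre" = "sugar", "eau" = "water",
                  "lait" = "milk")
ingredients %>%
  mutate(value = recode(value, !!!translations)) %>%
  # re-count so that merged entries are summed together
  count(value, wt = n, sort = TRUE) %>%
  head(10)
```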
### Food prices
Continuing the edible theme, [VAM](http://vam.wfp.org/) collate commodity prices from the world's poorer nations and use them to help identify food insecurity hotspots. The data we will be using can be downloaded [here](https://data.humdata.org/dataset/wfp-food-prices), and with it we'll attempt to visualise how prices have changed over the past 20 years.

```{r, collapse=TRUE}
# https://data.humdata.org/dataset/wfp-food-prices
# Providing column names and types to make workings easier later on
colnames <- c("Country_ID", "Country", "Region_ID", "Region", "Market_ID", "Market",
              "Commodity_ID", "Commodity", "Currency_ID", "Currency", "Sector_ID", "Sector",
              "Units_ID", "Units", "Month", "Year", "Price", "Commodity_Source")
coltypes <- cols(Year = "n", Price = "n", Month = "n")
foodPrices <- read_csv("data/FoodPrices/WFPVAM_FoodPrices_13-03-2017.csv",
                       col_names = colnames,
                       col_types = coltypes) %>%
  filter(row_number() != 1)

# Large number of commodities - we won't be able to plot them all!
length(unique(foodPrices$Commodity))
```

```{r}
# Overall price trend - trend of all commodity prices over time
overallPriceTrend <- foodPrices %>%
  group_by(Commodity, Year) %>%
  # As different commodities will clearly have different prices, we'll make
  # them comparable by scaling based on their max price within our timeframe
  summarise(globalAveragePrice = mean(Price)/max(Price))

# Food price trend over time, grouped by commodity - same as above,
# this time selecting a smaller sample for plotting
commodities <- c("Wheat", "Milk", "Coffee", "Bananas", "Sugar")
priceTrend <- foodPrices %>%
  select(Price, Commodity, Year) %>%
  # selecting our reduced set of commodities
  filter(Commodity %in% commodities) %>%
  group_by(Commodity, Year) %>%
  summarise(globalAveragePrice = mean(Price)/max(Price))

priceTrend
```
We can see what our data looks like for each commodity.

Now let's create a graphic.
```{r}
library(forcats)
# We'll create a ggplot graphic, using geom_smooth
# Specify some colours semi-related to the commodities' genuine appearance
colours <- c("Bananas" = "#fea814", "Coffee" = "#383838", "Sugar" = "#4fcfff",
             "Milk" = "#cccccc", "Wheat" = "#005692")
# Specify commodity levels for use in our legend
fctLevels <- c("Coffee", "Bananas", "Milk", "Sugar", "Wheat")
ggplot(priceTrend, aes(x = Year, y = globalAveragePrice)) +
  geom_smooth(aes(colour = fct_relevel(Commodity, fctLevels)), se = FALSE, size = 1.8, linetype = 5) +
  geom_smooth(data = overallPriceTrend, colour = "red", se = FALSE, size = 3.5) +
  geom_line(aes(size = "All Commodities", linetype = NA), colour = "red") +
  scale_colour_manual(values = colours) +
  scale_x_continuous(breaks = seq(1992, 2017, 2)) +
  scale_y_continuous(limits = c(0, 1), labels = c("Min", "Max"), breaks = c(0, 1)) +
  labs(title = "Average Global Commodity Prices over Time",
       subtitle = "Commodity Price Relative to Max in Period",
       caption = "Data from https://data.humdata.org/dataset/wfp-food-prices",
       x = "",
       y = "",
       colour = "Commodity",
       size = "") +
  theme_classic()
```

The trend for our selected commodities seems to show a gradual decrease in prices, as all but coffee and milk prices are now lower than at the beginning of our timeframe. This is somewhat reflected in the overall price trend, which also shows a slight decline, although that has been somewhat negated in the past three years.

In our analysis we took only a peek into the dataset. For example, we could look at seasonality trends, or subset by country, region or even market. Indeed, it could be taken a step further, as VAM have done, and be used as a tool for predicting when and where food insecurity will occur in the future.
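Seasonality in particular is only a small step away. A sketch for a single commodity, reusing the max-price scaling from above (illustrative, not run here):

```{r, eval=FALSE}
# Average the scaled price by calendar month, across all years
foodPrices %>%
  filter(Commodity == "Wheat") %>%
  group_by(Month) %>%
  summarise(scaledPrice = mean(Price)/max(Price)) %>%
  ggplot(aes(x = Month, y = scaledPrice)) +
  geom_col() +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Wheat: scaled average price by calendar month",
       x = "Month", y = "Scaled price")
```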
### North Korea Missile Tests
The next dataset is from the James Martin Center for Nonproliferation Studies (CNS) North Korea Missile Test Database. We agree it's not a fun topic, but we also wanted to show the breadth of open data that is out there. You can get the data as an [Excel file](https://www.nti.org/documents/2137/north_korea_missile_test_database.xlsx) but it is also hosted on [data.world](https://data.world/ian/the-cns-north-korea-missile-test-database). And fortunately for the R community, data.world have an R package to access their API.

```{r}
library(data.world)
# We've set the configuration in a previous session
path <- "ian/the-cns-north-korea-missile-test-database"
missiles <- query(qry_sql("SELECT f1, facility_latitude, facility_longitude,
                          distance_travelled_km, facility_name
                          FROM missile_tests
                          WHERE success='Success'"),
                  path)
# additional filtering outside of the query
missiles <- missiles %>%
  filter(distance_travelled_km!="Unknown", distance_travelled_km!="N/A") %>%
  drop_na() %>%
  mutate(distance_travelled_km = as.integer(distance_travelled_km),
         facility_name = substr(facility_name, 1, 20))
head(missiles)
```

The data contains information on the launch location but not on the precise location of where the missile landed. However, we can use the distance travelled to approximate this by drawing a circle of that radius around each launch site.
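The magic numbers in the next chunk come from a rough spherical-Earth conversion: one degree of latitude is about 111 km everywhere, while one degree of longitude shrinks with the cosine of the latitude (57.3 is just the number of degrees per radian). A quick sanity check of the arithmetic:

```{r, eval=FALSE}
1000 / 111                       # a 1000 km range spans ~9.0 degrees of latitude
1000 / 111 / cos(40 * pi / 180)  # but ~11.8 degrees of longitude at 40 degrees north
```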
```{r}
# slightly adapted from https://stackoverflow.com/questions/34183049/plot-circle-with-a-certain-radius-around-point-on-a-map-in-ggplot2#34187454

# drop duplicate locations and distances
dups <- duplicated(missiles %>% select(facility_name, distance_travelled_km))
missiles <- missiles %>% filter(!dups)

# define the circle we want for each missile
circles <- data_frame(f1 = rep(missiles$f1, each = 100),
                      angle = rep(seq(0, 2*pi, length.out = 100), nrow(missiles)))

meanLatitude <- mean(missiles$facility_latitude)

missile_circles <- missiles %>%
  # the length of a degree of longitude changes with latitude, so we need a correction
  mutate(radiusLon = distance_travelled_km/111/cos(meanLatitude/57.3),
         radiusLat = distance_travelled_km/111) %>%
  left_join(circles) %>%
  mutate(longitude = facility_longitude + radiusLon * cos(angle),
         latitude = facility_latitude + radiusLat * sin(angle))
```

Now that we have our circles, we can plot them on a map.

```{r}
library(ggmap)
nk <- get_map(location = c(lon = mean(missile_circles$facility_longitude),
                           lat = mean(missile_circles$facility_latitude)),
              zoom = 5, maptype = "terrain")
ggmap(nk, extent = "panel") +
  geom_point(aes(x = facility_longitude, y = facility_latitude, colour=facility_name),
             data = missiles) +
  # add the circles
  geom_polygon(data = missile_circles, aes(longitude, latitude, group = f1,
                                           colour=facility_name), alpha = 0)
```

These are obviously not true missile ranges, and none of the missiles will have gone over China. The CNS have also created [visualisations](http://www.nti.org/analysis/articles/cns-north-korea-missile-test-database/) with this data, from which we can see that the time dimension is important. That is something we could add to our visualisation, or we could perform more analyses on the success/fail dimension.

### Flight Delays
The [Bureau of Transportation Statistics](https://www.bts.gov/) are a leading source of data on U.S. transportation systems, helping to shape transportation policy and research projects across the US. We pulled their [data](https://www.transtats.bts.gov/DL_SelectFields.asp), selecting the Month and DayOfWeek fields, as well as the Departure and Arrival delays. The site only allows us to download data one month at a time, so we begin by reading in 12 files and binding them together.

```{r, collapse=TRUE}
# https://www.transtats.bts.gov/DL_SelectFields.asp
# Month files were downloaded to separate csvs, so we'll read them in and combine them
files <- list.files(path = "data/FlightData/", pattern = "*.csv", full.names = TRUE)
flightsRaw <- map_df(files, read_csv)
```

Now let's recode our Months and Weekdays to be character names rather than just integers. Then we can prepare our data for plotting. In this plot, we'll be aiming to show the difference between departure and arrival delays, grouped by Weekday and Month.
```{r}
# Begin by renaming integer values to days of the week and months
flightData <- flightsRaw %>%
  mutate(DAY_OF_WEEK = as.factor(DAY_OF_WEEK),
         DAY_OF_WEEK = fct_recode(DAY_OF_WEEK, Monday = "1", Tuesday = "2", Wednesday = "3",
                                  Thursday = "4", Friday = "5", Saturday = "6", Sunday = "7"),
         MONTH = as.factor(MONTH),
         MONTH = fct_recode(MONTH, January = "1", February = "2", March = "3", April = "4",
                            May = "5", June = "6", July = "7", August = "8", September = "9",
                            October = "10", November = "11", December = "12"))

# Summarise the data to show the mean departure/arrival delay split by Month and Weekday
monthDayDelays <- flightData %>%
  select(DEP_DELAY, ARR_DELAY, MONTH, DAY_OF_WEEK) %>%
  group_by(MONTH, DAY_OF_WEEK) %>%
  # Calculate means split by Month and Weekday
  summarise(Departures = mean(DEP_DELAY, na.rm = TRUE), Arrivals = mean(ARR_DELAY, na.rm = TRUE)) %>%
  filter(!is.na(MONTH)) %>%
  ungroup() %>%
  # Gather our data so Departure and Arrival delay times are stored within one column
  gather(key = delayType, value = Delay, Arrivals, Departures) %>%
  # Reorder the Departure/Arrival factor so Departures appear first in our graphic
  mutate(delayType = fct_rev(delayType))
monthDayDelays
```
We now have our Delay times in one column, with delayType either arrival or departure. We also have our Month and Weekday columns for grouping our data in the plot.
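As an aside, base R ships the `month.name` constant, which makes the recoding above more compact; a sketch that should be equivalent to the `fct_recode()` calls:

```{r, eval=FALSE}
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
flightData <- flightsRaw %>%
  mutate(MONTH = factor(month.name[MONTH], levels = month.name),
         DAY_OF_WEEK = factor(days[DAY_OF_WEEK], levels = days))
```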
Now let's create a graphic. Note that the colours used for our graphic were obtained from [Color Brewer 2.0](http://colorbrewer2.org).
```{r}
# Specify the colours vector for use in scale_fill_manual
colours <- c("#ffffb2", "#fed976", "#feb24c", "#fd8d3c", "#fc4e2a", "#e31a1c", "#b10026")
# Graphic
ggplot(monthDayDelays, aes(x = MONTH, y = Delay)) +
  # Adding reference lines to our (soon to become) circular chart
  geom_hline(yintercept = seq(0, 20, by = 5), alpha = 0.1) +
  # Bars for every day of the week, grouped by month
  geom_bar(stat = "identity", aes(fill = DAY_OF_WEEK), position = "dodge", colour = "black") +
  # This bit is hacky. To get a single label to appear on our reference lines at zero
  # degrees, we select a single datapoint from January (the Month at zero degrees) and
  # use it as our data argument.
  # monthDayDelays[1, ] is the datapoint for the mean Arrival delay in January
  geom_text(data = monthDayDelays[1, ], y = 5, label = "5", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 20, label = "20", size = 3) +
  # monthDayDelays[85, ] is the datapoint for the mean Departure delay in January
  geom_text(data = monthDayDelays[85, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 20, label = "20", size = 3) +
  # Specify colours for the chart
  scale_fill_manual(values = rev(colours)) +
  # Remove the y-axis scale from the side of the plot
  scale_y_continuous(breaks = NULL) +
  # Make the plot circular, starting at due north
  coord_polar(theta = "x", start = -pi/12) +
  # Separate our departure and arrival plots
  facet_wrap(~delayType, labeller = label_parsed) +
  # Add labels
  labs(title = "Average Minute Delay of American Flights - 2016",
       x = "",
       y = "",
       fill = "",
       caption = "Data from https://www.transtats.bts.gov/DL_SelectFields.asp") +
  theme_minimal(base_size = 8) +
  # Make our facet titles pretty
  theme(strip.text.x = element_text(size = 15, colour = "#f03b20"))
```

Interestingly, we can see that even if your flight is delayed, you can expect that delay to have been reduced by the time you arrive. In fact, on average 5 minutes and 25 seconds are recovered during air time.
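That figure is simply the average gap between departure and arrival delay; a sketch of one way to check it:

```{r, eval=FALSE}
flightData %>%
  summarise(meanRecovery = mean(DEP_DELAY - ARR_DELAY, na.rm = TRUE))
# roughly 5.42 minutes, i.e. about 5 minutes and 25 seconds
```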
Perhaps not so interestingly, we can see that the months with the longest delays fall in the summer and winter holidays. Conversely, if you're not bound by school holidays, November looks to be an excellent time to travel, as you can expect to arrive a whole 2 minutes early!

Again with this analysis, there are many more pathways we could explore. For example, [FiveThirtyEight](https://projects.fivethirtyeight.com/flights/) have produced a similar delay chart, only this time grouped by airport. This, when combined with our brief analysis, makes your dream of an undelayed vacation a possible reality!

This blogpost turned out longer than we expected, but that's what happens when you have fun. If you have a cool dataset that's fun to analyse, let us know on [Twitter](https://twitter.com/MangotheCat). You can find the code for this blogpost on [GitHub](https://github.com/MangoTheCat/blog_fun_data).
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Fun data: open data that is fun to analyse
Joe Russell, Adnan Fiaz

Jeremy Singer-Vine sends out a [newsletter](https://tinyletter.com/data-is-plural) every week where he highlights a number of interesting open datasets (you can explore all the datasets [here](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0)). At Mango we are all for open data, so we thought we would also share some of the open datasets we think are fun to explore.

### Open Food Facts
[Open Food Facts](https://world.openfoodfacts.org/) is a collaborative, free and open database of food products, and a prime example of how effective crowdsourcing data can be. People from around the world have collected details about more than 300,000 food products and uploaded the information through mobile apps. The data is available as a MongoDB dump, a CSV export and an experimental API. We have downloaded the CSV export and will try to visualise the ingredients across all products.

```r
# http://world.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
foodFacts <- read_tsv("data/en.openfoodfacts.org.products.csv")
dim(foodFacts)
## [1] 295958    162
```

```r
library(stringr)
ingredients <- foodFacts %>%
  # ideally, the unnest_tokens function is what we want but it was too slow
  # tidytext::unnest_tokens(ingredient, ingredients_text,
  #                         token = stringr::str_split, pattern=",|\\(|\\)|\\[|\\]") %>%
  # so instead we chose a different approach involving transmute and unlist
  # transmute will give us a list-column
  transmute(ingredients = str_split(ingredients_text, pattern=",|\\(|\\)|\\[|\\]")) %>%
  # unlist will convert the list-column into a character vector
  unlist() %>%
  # enframe will convert the vector into a data frame which is easier to group by
  enframe() %>%
  # now we clean up some of the text
  # (longer alternative first, otherwise "organic" only loses its "org")
  mutate(value = str_replace(value, "organic|org", ""),
         value = str_replace(value, "-> en:\\w+", ""),
         value = str_replace_all(value, "'", ""),
         value = str_trim(value)) %>%
  # and finally filter out some of the weirder entries
  filter(value!="", value!=".",
         !str_detect(value, "completed|\\d")) %>%
  # to then group by and count the ingredients
  count(value) %>%
  arrange(desc(n))
head(ingredients, 10)
```

```
## # A tibble: 10 x 2
##          value      n
##          <chr>  <int>
## 1         salt 122183
## 2        sugar  88463
## 3        water  80037
## 4          sel  38852
## 5        sucre  29971
## 6          eau  28502
## 7  citric acid  28475
## 8   riboflavin  21527
## 9         milk  21265
## 10      niacin  21201
```

There are no surprises at the top, but further down there are a few ingredients that are odd. Let's create a word cloud to show the relative frequencies.

```r
library(wordcloud)
top100 <- head(ingredients, 100)
wordcloud::wordcloud(top100$value, top100$n)
```

*(figure: word cloud of the 100 most common ingredients)*

Ingredients are only one aspect of this very interesting dataset. We could go on and look at the co-occurrence of certain ingredients (using network analysis) and then continue analysing their quantities. We could also include the data on nutritional value and calculate correlations. The data could also use some more cleaning, considering some ingredients appear in different languages (e.g. water and eau).

### Food prices
Continuing the edible theme, [VAM](http://vam.wfp.org/) collate commodity prices from the world's poorer nations and use them to help identify food insecurity hotspots. The data we will be using can be downloaded [here](https://data.humdata.org/dataset/wfp-food-prices), and with it we'll attempt to visualise how prices have changed over the past 20 years.
```r
# https://data.humdata.org/dataset/wfp-food-prices
# Providing column names and types to make workings easier later on
colnames <- c("Country_ID", "Country", "Region_ID", "Region", "Market_ID", "Market",
              "Commodity_ID", "Commodity", "Currency_ID", "Currency", "Sector_ID", "Sector",
              "Units_ID", "Units", "Month", "Year", "Price", "Commodity_Source")
coltypes <- cols(Year = "n", Price = "n", Month = "n")
foodPrices <- read_csv("data/FoodPrices/WFPVAM_FoodPrices_13-03-2017.csv",
                       col_names = colnames,
                       col_types = coltypes) %>%
  filter(row_number() != 1)

# Large number of commodities - we won't be able to plot them all!
length(unique(foodPrices$Commodity))
## [1] 304
```

```r
# Overall price trend - trend of all commodity prices over time
overallPriceTrend <- foodPrices %>%
  group_by(Commodity, Year) %>%
  # As different commodities will clearly have different prices, we'll make
  # them comparable by scaling based on their max price within our timeframe
  summarise(globalAveragePrice = mean(Price)/max(Price))

# Food price trend over time, grouped by commodity - same as above,
# this time selecting a smaller sample for plotting
commodities <- c("Wheat", "Milk", "Coffee", "Bananas", "Sugar")
priceTrend <- foodPrices %>%
  select(Price, Commodity, Year) %>%
  # selecting our reduced set of commodities
  filter(Commodity %in% commodities) %>%
  group_by(Commodity, Year) %>%
  summarise(globalAveragePrice = mean(Price)/max(Price))

priceTrend
```

```
## # A tibble: 77 x 3
## # Groups:   Commodity [?]
##    Commodity  Year globalAveragePrice
##        <chr> <dbl>              <dbl>
## 1    Bananas  2008          1.0000000
## 2    Bananas  2009          0.4487272
## 3    Bananas  2010          0.1727820
## 4    Bananas  2011          0.2909874
## 5    Bananas  2012          0.4470188
## 6    Bananas  2013          0.1271515
## 7    Bananas  2014          0.1263784
## 8    Bananas  2015          0.1129815
## 9    Bananas  2016          0.2050480
## 10   Bananas  2017          0.2630983
## # ... with 67 more rows
```
We can see what our data looks like for each commodity.

Now let's create a graphic.
```r
library(forcats)
# We'll create a ggplot graphic, using geom_smooth
# Specify some colours semi-related to the commodities' genuine appearance
colours <- c("Bananas" = "#fea814", "Coffee" = "#383838", "Sugar" = "#4fcfff",
             "Milk" = "#cccccc", "Wheat" = "#005692")
# Specify commodity levels for use in our legend
fctLevels <- c("Coffee", "Bananas", "Milk", "Sugar", "Wheat")
ggplot(priceTrend, aes(x = Year, y = globalAveragePrice)) +
  geom_smooth(aes(colour = fct_relevel(Commodity, fctLevels)), se = FALSE, size = 1.8, linetype = 5) +
  geom_smooth(data = overallPriceTrend, colour = "red", se = FALSE, size = 3.5) +
  geom_line(aes(size = "All Commodities", linetype = NA), colour = "red") +
  scale_colour_manual(values = colours) +
  scale_x_continuous(breaks = seq(1992, 2017, 2)) +
  scale_y_continuous(limits = c(0, 1), labels = c("Min", "Max"), breaks = c(0, 1)) +
  labs(title = "Average Global Commodity Prices over Time",
       subtitle = "Commodity Price Relative to Max in Period",
       caption = "Data from https://data.humdata.org/dataset/wfp-food-prices",
       x = "",
       y = "",
       colour = "Commodity",
       size = "") +
  theme_classic()
```

*(figure: scaled average commodity prices over time, selected commodities against the overall trend)*

The trend for our selected commodities seems to show a gradual decrease in prices, as all but coffee and milk prices are now lower than at the beginning of our timeframe. This is somewhat reflected in the overall price trend, which also shows a slight decline, although that has been somewhat negated in the past three years.

In our analysis we took only a peek into the dataset. For example, we could look at seasonality trends, or subset by country, region or even market. Indeed, it could be taken a step further, as VAM have done, and be used as a tool for predicting when and where food insecurity will occur in the future.

### North Korea Missile Tests
The next dataset is from the James Martin Center for Nonproliferation Studies (CNS) North Korea Missile Test Database. We agree it's not a fun topic, but we also wanted to show the breadth of open data that is out there. You can get the data as an [Excel file](https://www.nti.org/documents/2137/north_korea_missile_test_database.xlsx) but it is also hosted on [data.world](https://data.world/ian/the-cns-north-korea-missile-test-database). And fortunately for the R community, data.world have an R package to access their API.
```r
library(data.world)
# We've set the configuration in a previous session
path <- "ian/the-cns-north-korea-missile-test-database"
missiles <- query(qry_sql("SELECT f1, facility_latitude, facility_longitude,
                          distance_travelled_km, facility_name
                          FROM missile_tests
                          WHERE success='Success'"),
                  path)
# additional filtering outside of the query
missiles <- missiles %>%
  filter(distance_travelled_km!="Unknown", distance_travelled_km!="N/A") %>%
  drop_na() %>%
  mutate(distance_travelled_km = as.integer(distance_travelled_km),
         facility_name = substr(facility_name, 1, 20))
head(missiles)
```

```
## # A tibble: 6 x 5
##      f1 facility_latitude facility_longitude distance_travelled_km
##   <int>             <dbl>              <dbl>                 <int>
## 1   101          39.65960           124.7057                  1000
## 2   102          39.65960           124.7057                  1000
## 3    12          40.85000           129.6667                   500
## 4    40          38.99083           127.6236                   200
## 5    41          38.99083           127.6236                   200
## 6    42          38.99083           127.6236                   200
## # ... with 1 more variables: facility_name <chr>
```

The data contains information on the launch location but not on the precise location of where the missile landed. However, we can use the distance travelled to approximate this by drawing a circle of that radius around each launch site.

```r
# slightly adapted from https://stackoverflow.com/questions/34183049/plot-circle-with-a-certain-radius-around-point-on-a-map-in-ggplot2#34187454

# drop duplicate locations and distances
dups <- duplicated(missiles %>% select(facility_name, distance_travelled_km))
missiles <- missiles %>% filter(!dups)

# define the circle we want for each missile
circles <- data_frame(f1 = rep(missiles$f1, each = 100),
                      angle = rep(seq(0, 2*pi, length.out = 100), nrow(missiles)))

meanLatitude <- mean(missiles$facility_latitude)

missile_circles <- missiles %>%
  # the length of a degree of longitude changes with latitude, so we need a correction
  mutate(radiusLon = distance_travelled_km/111/cos(meanLatitude/57.3),
         radiusLat = distance_travelled_km/111) %>%
  left_join(circles) %>%
  mutate(longitude = facility_longitude + radiusLon * cos(angle),
         latitude = facility_latitude + radiusLat * sin(angle))
```

Now that we have our circles, we can plot them on a map.

```r
library(ggmap)
nk <- get_map(location = c(lon = mean(missile_circles$facility_longitude),
                           lat = mean(missile_circles$facility_latitude)),
              zoom = 5, maptype = "terrain")
ggmap(nk, extent = "panel") +
  geom_point(aes(x = facility_longitude, y = facility_latitude, colour=facility_name),
             data = missiles) +
  # add the circles
  geom_polygon(data = missile_circles, aes(longitude, latitude, group = f1,
                                           colour=facility_name), alpha = 0)
```

*(figure: map of North Korean launch facilities with distance-travelled circles)*

These are obviously not true missile ranges, and none of the missiles will have gone over China. The CNS have also created [visualisations](http://www.nti.org/analysis/articles/cns-north-korea-missile-test-database/) with this data, from which we can see that the time dimension is important. That is something we could add to our visualisation, or we could perform more analyses on the success/fail dimension.
### Flight Delays
The [Bureau of Transportation Statistics](https://www.bts.gov/) are a leading source of data on U.S. transportation systems, helping to shape transportation policy and research projects across the US. We pulled their [data](https://www.transtats.bts.gov/DL_SelectFields.asp), selecting the Month and DayOfWeek fields, as well as the Departure and Arrival delays. The site only allows us to download data one month at a time, so we begin by reading in 12 files and binding them together.

```r
# https://www.transtats.bts.gov/DL_SelectFields.asp
# Month files were downloaded to separate csvs, so we'll read them in and combine them
files <- list.files(path = "data/FlightData/", pattern = "*.csv", full.names = TRUE)
flightsRaw <- map_df(files, read_csv)
```

Now let's recode our Months and Weekdays to be character names rather than just integers. Then we can prepare our data for plotting. In this plot, we'll be aiming to show the difference between departure and arrival delays, grouped by Weekday and Month.

```r
# Begin by renaming integer values to days of the week and months
flightData <- flightsRaw %>%
  mutate(DAY_OF_WEEK = as.factor(DAY_OF_WEEK),
         DAY_OF_WEEK = fct_recode(DAY_OF_WEEK, Monday = "1", Tuesday = "2", Wednesday = "3",
                                  Thursday = "4", Friday = "5", Saturday = "6", Sunday = "7"),
         MONTH = as.factor(MONTH),
         MONTH = fct_recode(MONTH, January = "1", February = "2", March = "3", April = "4",
                            May = "5", June = "6", July = "7", August = "8", September = "9",
                            October = "10", November = "11", December = "12"))

# Summarise the data to show the mean departure/arrival delay split by Month and Weekday
monthDayDelays <- flightData %>%
  select(DEP_DELAY, ARR_DELAY, MONTH, DAY_OF_WEEK) %>%
  group_by(MONTH, DAY_OF_WEEK) %>%
  # Calculate means split by Month and Weekday
  summarise(Departures = mean(DEP_DELAY, na.rm = TRUE), Arrivals = mean(ARR_DELAY, na.rm = TRUE)) %>%
  filter(!is.na(MONTH)) %>%
  ungroup() %>%
  # Gather our data so Departure and Arrival delay times are stored within one column
  gather(key = delayType, value = Delay, Arrivals, Departures) %>%
  # Reorder the Departure/Arrival factor so Departures appear first in our graphic
  mutate(delayType = fct_rev(delayType))
monthDayDelays
```

```
## # A tibble: 168 x 4
##       MONTH DAY_OF_WEEK delayType       Delay
##      <fctr>      <fctr>     <chr>       <dbl>
## 1   January      Monday  Arrivals  3.08526171
## 2   January     Tuesday  Arrivals  0.49934650
## 3   January   Wednesday  Arrivals -0.64372809
## 4   January    Thursday  Arrivals -0.61050874
## 5   January      Friday  Arrivals  3.37778214
## 6   January    Saturday  Arrivals  0.05099645
## 7   January      Sunday  Arrivals  4.17985079
## 8  February      Monday  Arrivals  1.30443327
## 9  February     Tuesday  Arrivals  5.09620906
## 10 February   Wednesday  Arrivals  3.00267779
## # ... with 158 more rows
```
We now have our Delay times in one column, with delayType either arrival or departure. We also have our Month and Weekday columns for grouping our data in the plot.

Now let's create a graphic. Note that the colours used for our graphic were obtained from [Color Brewer 2.0](http://colorbrewer2.org).
```r
# Specify the colours vector for use in scale_fill_manual
colours <- c("#ffffb2", "#fed976", "#feb24c", "#fd8d3c", "#fc4e2a", "#e31a1c", "#b10026")
# Graphic
ggplot(monthDayDelays, aes(x = MONTH, y = Delay)) +
  # Adding reference lines to our (soon to become) circular chart
  geom_hline(yintercept = seq(0, 20, by = 5), alpha = 0.1) +
  # Bars for every day of the week, grouped by month
  geom_bar(stat = "identity", aes(fill = DAY_OF_WEEK), position = "dodge", colour = "black") +
  # This bit is hacky. To get a single label to appear on our reference lines at zero
  # degrees, we select a single datapoint from January (the Month at zero degrees) and
  # use it as our data argument.
  # monthDayDelays[1, ] is the datapoint for the mean Arrival delay in January
  geom_text(data = monthDayDelays[1, ], y = 5, label = "5", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 20, label = "20", size = 3) +
  # monthDayDelays[85, ] is the datapoint for the mean Departure delay in January
  geom_text(data = monthDayDelays[85, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 20, label = "20", size = 3) +
  # Specify colours for the chart
  scale_fill_manual(values = rev(colours)) +
  # Remove the y-axis scale from the side of the plot
  scale_y_continuous(breaks = NULL) +
  # Make the plot circular, starting at due north
  coord_polar(theta = "x", start = -pi/12) +
  # Separate our departure and arrival plots
  facet_wrap(~delayType, labeller = label_parsed) +
  # Add labels
  labs(title = "Average Minute Delay of American Flights - 2016",
       x = "",
       y = "",
       fill = "",
       caption = "Data from https://www.transtats.bts.gov/DL_SelectFields.asp") +
  theme_minimal(base_size = 8) +
  # Make our facet titles pretty
  theme(strip.text.x = element_text(size = 15, colour = "#f03b20"))
```

*(figure: circular bar charts of average delay in minutes by month and weekday, departures vs arrivals)*

Interestingly, we can see that even if your flight is delayed, you can expect that delay to have been reduced by the time you arrive. In fact, on average 5 minutes and 25 seconds are recovered during air time.

Perhaps not so interestingly, we can see that the months with the longest delays fall in the summer and winter holidays. Conversely, if you're not bound by school holidays, November looks to be an excellent time to travel, as you can expect to arrive a whole 2 minutes early!

Again with this analysis, there are many more pathways we could explore. For example, [FiveThirtyEight](https://projects.fivethirtyeight.com/flights/) have produced a similar delay chart, only this time grouped by airport. This, when combined with our brief analysis, makes your dream of an undelayed vacation a possible reality!

This blogpost turned out longer than we expected, but that's what happens when you have fun. If you have a cool dataset that's fun to analyse, let us know on [Twitter](https://twitter.com/MangotheCat). You can find the code for this blogpost on [GitHub](https://github.com/MangoTheCat/blog_fun_data).
--------------------------------------------------------------------------------