├── .Rbuildignore ├── .editorconfig ├── .gitignore ├── .here ├── .lintr ├── .nvmrc ├── DESCRIPTION ├── LICENSE ├── NAMESPACE ├── R ├── analyze.R ├── functions.R ├── process.R └── visualize.R ├── README.md ├── config.R ├── data ├── cache │ └── .gitkeep ├── out │ └── .gitkeep ├── processed │ └── .gitkeep └── raw │ └── .gitkeep ├── plots └── .gitkeep ├── reports ├── .gitkeep └── notebook.Rmd ├── run.R ├── scrape └── scrape.R └── startr.Rproj /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | -------------------------------------------------------------------------------- /.editorconfig: -------------------------------------------------------------------------------- 1 | # EditorConfig helps developers define and maintain consistent 2 | # coding styles between different editors and IDEs 3 | # editorconfig.org 4 | 5 | root = true 6 | 7 | 8 | [*] 9 | 10 | # Change these settings to your own preference 11 | indent_style = space 12 | indent_size = 2 13 | 14 | # We recommend you to keep these unchanged 15 | end_of_line = lf 16 | charset = utf-8 17 | trim_trailing_whitespace = true 18 | insert_final_newline = true 19 | 20 | [*.md] 21 | trim_trailing_whitespace = false 22 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | .DS_Store 6 | data/out/* 7 | !data/out/.gitkeep 8 | data/processed/* 9 | !data/processed/.gitkeep 10 | plots/* 11 | !plots/.gitkeep 12 | reports/*.html 13 | node_modules 14 | .dropbox 15 | Icon 16 | Icon\r 17 | "Icon\r" 18 | -------------------------------------------------------------------------------- /.here: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/globeandmail/startr/060f0e7863f5dc198a9217c46bb328da79d7ad55/.here 
-------------------------------------------------------------------------------- /.lintr: -------------------------------------------------------------------------------- 1 | linters: with_defaults( 2 | absolute_path_linter, 3 | nonportable_path_linter, 4 | pipe_continuation_linter, 5 | assignment_linter, 6 | camel_case_linter, 7 | commas_linter, 8 | cyclocomp_linter, 9 | equals_na_linter, 10 | extraction_operator_linter, 11 | function_left_parentheses_linter, 12 | implicit_integer_linter, 13 | line_length_linter(100), 14 | no_tab_linter, 15 | object_length_linter(25), 16 | object_name_linter('snake_case'), 17 | open_curly_linter, 18 | paren_brace_linter, 19 | semicolon_terminator_linter, 20 | seq_linter, 21 | single_quotes_linter, 22 | spaces_inside_linter, 23 | spaces_left_parentheses_linter, 24 | todo_comment_linter, 25 | trailing_blank_lines_linter, 26 | trailing_whitespace_linter, 27 | T_and_F_symbol_linter, 28 | undesirable_function_linter, 29 | undesirable_operator_linter, 30 | unneeded_concatenation_linter, 31 | ) 32 | -------------------------------------------------------------------------------- /.nvmrc: -------------------------------------------------------------------------------- 1 | 8.11.1 2 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: startR 2 | Type: Package 3 | Title: A Template For Data Journalism Projects In R 4 | Version: 1.1.0 5 | Author: Michael Pereira and Tom Cardoso 6 | Maintainer: Michael Pereira and Tom Cardoso 7 | Description: This project structures the data analysis process around an expected set of files and steps. 8 | This lowers the upfront effort of starting and maintaining a project and supports easier verification by providing reviewers with an expected and logically organized project. Think of it like Ruby on Rails or React, but for R analysis. 
9 | License: MIT 10 | Encoding: UTF-8 11 | LazyData: true 12 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2019 The Globe and Mail 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | exportPattern("^[[:alpha:]]+") 2 | -------------------------------------------------------------------------------- /R/analyze.R: -------------------------------------------------------------------------------- 1 | # ======================================================================= 2 | # This file handles the primary analysis using the tidied data as input. 
3 | # Should never read from `dir_data_raw()`, only `dir_data_processed()`. 4 | # ======================================================================= 5 | 6 | # sample <- read_feather(dir_data_processed('sample.feather')) %>% 7 | # group_by(cma) %>% 8 | # arrange(desc(date)) %>% 9 | # mutate(sale_avg_3mo = rollmean(sale_avg, k = 3, fill = NA)) %>% 10 | # ungroup() %>% 11 | # drop_na() 12 | -------------------------------------------------------------------------------- /R/functions.R: -------------------------------------------------------------------------------- 1 | # ======================================================================= 2 | # Project-specific functions. 3 | # ======================================================================= 4 | -------------------------------------------------------------------------------- /R/process.R: -------------------------------------------------------------------------------- 1 | # ======================================================================= 2 | # Read raw data, clean it and save it out to `dir_data_processed()` here 3 | # before moving to analysis. If run from `run.R`, all variables generated 4 | # in this file will be wiped after completion to keep the environment 5 | # clean. 
If your process step is complex, you can break it into several 6 | # files like so: `source(dir_src('process_files', 'process_step_1.R'))` 7 | # ======================================================================= 8 | 9 | # sample.raw <- read_csv(sample.raw.file) %>% 10 | # rename( 11 | # cma = 'CMA', 12 | # date = 'Date', 13 | # index = 'Index', 14 | # pairs = 'Pairs', 15 | # sale_avg = 'SaleAverage', 16 | # mom = 'MoM', 17 | # yoy = 'YoY', 18 | # ytd = 'YTD' 19 | # ) %>% 20 | # arrange(cma, desc(date)) 21 | # 22 | # write_feather(sample.raw, dir_data_processed('sample.feather')) 23 | -------------------------------------------------------------------------------- /R/visualize.R: -------------------------------------------------------------------------------- 1 | # ======================================================================= 2 | # Graphics. Use the `write_plot` function to write the plot directly 3 | # to the `plots/` folder, using the variable name as the filename. 4 | # ======================================================================= 5 | 6 | # plot_house_price_change <- sample %>% 7 | # filter(cma != 'C11') %>% 8 | # ggplot(aes(x = reorder(cma, yoy), y = yoy)) + 9 | # geom_bar(colour = 'white', stat = 'identity') + 10 | # scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) + 11 | # coord_flip() + 12 | # labs( 13 | # title = 'Year-over-year house price change in Canada\'s biggest cities', 14 | # caption = 'THE GLOBE AND MAIL, SOURCE: TERANET-NATIONAL BANK', 15 | # x = '', 16 | # y = '' 17 | # ) + 18 | # theme_classic() 19 | # 20 | # plot_house_price_change 21 | # 22 | # write_plot(plot_house_price_change) 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # startr 2 | 3 | A template for data journalism projects in R. 
4 | 5 | This project structures the data analysis process, reducing the amount of time you'll spend setting up and maintaining a project. Essentially, it's an "opinionated framework" like Django, Ruby on Rails or React, but for data journalism. 6 | 7 | Broadly, `startr` does a few things: 8 | 9 | * **Standardizes your projects**: Eliminates the need to think about project structure so you can focus on the analysis. 10 | * **Breaks analysis into discrete steps**: Supports a flexible analysis workflow with clearly-defined steps which can be shared easily across a team. 11 | * **Helps you catch mistakes**: With structure and workflow baked in, you can focus on writing analysis code, reducing the opportunities for mistakes. 12 | * **Bakes in flexibility**: Has a format that works for both large (multi-month) and small (single-day) projects. 13 | * **De-clutters your code**: Improves the painstaking data verification/fact-checking process by cutting down on spaghetti code. 14 | * **Improves communication**: Documents the analysis steps and questions to be answered for large, multi-disciplinary teams (say, developers, data journalists and traditional reporters). 15 | * **Simplifies the generation of charts and reports**: Generates easily updatable RMarkdown reports, Adobe Illustrator-ready graphics, and datasets during analysis. 16 | 17 | ## Table of contents 18 | * [startr](#startr) 19 | * [Table of contents](#table-of-contents) 20 | * [Installation](#installation) 21 | * [Philosophy on data analysis](#philosophy-on-data-analysis) 22 | * [Workflow](#workflow) 23 | 1. [Set up your project](#step-1-set-up-your-project) 24 | 2. [Import and process data](#step-2-import-and-process-data) 25 | 3. [Analyze](#step-3-analyze) 26 | 4. [Visualize](#step-4-visualize) 27 | 5. 
[Write a notebook](#step-5-write-a-notebook) 28 | * [Helper functions](#helper-functions) 29 | * [Tips](#tips) 30 | * [Directory structure](#directory-structure) 31 | * [See also](#see-also) 32 | * [Version](#version) 33 | * [License](#license) 34 | * [Get in touch](#get-in-touch) 35 | 36 | ## Installation 37 | 38 | This template works with R and RStudio, so you'll need both of those installed. To scaffold a new `startr` project, we recommend using our command-line tool, [`startr-cli`](https://github.com/globeandmail/startr-cli), which will copy down the folder structure, rename some files, configure the project and initialize an empty Git repository. 39 | 40 | Using [`startr-cli`](https://github.com/globeandmail/startr-cli), you can scaffold a new project by simply running `create-startr` in your terminal and following the prompts: 41 | 42 | ![startr-cli interface GIF](http://i.imgur.com/4qtiJar.gif) 43 | 44 | Alternatively, you can run: 45 | ```sh 46 | git clone https://github.com/globeandmail/startr.git 47 | ``` 48 | 49 | (But, if you do that, be sure to rename your `startr.Rproj` file to `.Rproj` and set up your settings in `config.R` manually.) 50 | 51 | Once a fresh project is ready, double-click on the `.Rproj` file to start a scoped RStudio instance. 52 | 53 | You can then start copying in your data and writing your analysis. At The Globe, we like to work in a code editor like Atom or Sublime Text, and use something like [`r-exec`](https://atom.io/packages/r-exec) to send code chunks to RStudio. 54 | 55 | ## Philosophy on data analysis 56 | 57 | This analysis framework is designed to be flexible, reproducible and easy to jump into for a new user. `startr` works best when you assume The Globe’s own philosophy on data analysis: 58 | 59 | - **Raw data is immutable**: Treat the files in `data/raw` as read-only. This means you only ever alter them programmatically, and never edit or overwrite files in that folder. 
If you need to manually rewrite certain columns in a raw data file, do so by creating a new spreadsheet with the new values, then join it to the original data file during the [processing step](#step-2-import-and-process-data). 60 | - **Outputs are disposable**: Treat all project outputs (everything in `data/processed`, `data/out/`, `data/cache` and `plots/`) as disposable products. By default, this project's `.gitignore` file ignores those files, so they're never checked into source management tools. Unless absolutely necessary, do not alter `.gitignore` to check in those files — the analysis pipeline should be able to reproduce them all from your raw data files. 61 | - **Shorter is not always better**: Your code should, as much as possible, be self-documenting. Keep it clean and as simple as possible. If an analysis chain is becoming particularly long or complex, break it out into smaller chunks, or consider writing a function to abstract out the complexity in your code. 62 | - **Only optimize your code for performance when necessary**: It's easy to fall into a premature optimization rabbit hole, especially on larger or more complex projects. In most cases, there's no need to optimize your code for performance — only do this if your analysis process is taking several minutes or longer. 63 | - **Never overwrite variables**: No variables should ever be overwritten or reassigned. Same goes for fields generated via [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html). 64 | - **Order matters**: We only ever run our R code sequentially, which prevents reproducibility issues resulting from users running code chunks in different orders. For instance, do not run a block of code at line 22, then code at line 11, then some more code at line 37, since that may lead to unexpected results that another journalist won't be able to reproduce. 
65 | - **Wipe your environment often**: If using RStudio (our preferred tool for work in R), restart and clear the environment often to make sure your code is reproducible. 66 | - **Use the tidyverse**: For coding style, we rely on the [tidyverse style guide](https://style.tidyverse.org/). 67 | 68 | ## Workflow 69 | 70 | The heart of the project lies in these three files: 71 | 72 | * **`process.R`**: Import your source data, tidy it, fix any errors, set types, apply upfront manipulations and save out a file ready for analysis. We recommend saving out a [`.feather`](https://github.com/wesm/feather) file, which will retain types and is designed to read extremely quickly — but you can also use a .CSV, .RDS file, shapefile or something else if you'd prefer. 73 | 74 | * **`analyze.R`**: Here you'll consume the data files saved out by `process.R`. This is where all of the true "analysis" occurs, including grouping, summarizing, filtering, etc. If your analysis is complex enough, you may want to split it out into additional `analyze_step_X.R` files as required, and then call those files from `analyze.R` using `source()`. 75 | 76 | * **`visualize.R`**: Draw and save out your graphics. 77 | 78 | There's also an optional (but recommended) RMarkdown file (**`notebook.Rmd`**) you can use to generate an HTML codebook – especially useful for longer-term projects where you need to document the questions you're asking. 79 | 80 | #### Step 1: Set up your project 81 | 82 | The bulk of any `startr` project's code lives within the `R` directory, in files that are sourced and run in sequence by the `run.R` at the project's root. 83 | 84 | Many of the core functions for this project are managed by a specialty package, [**upstartr**](https://github.com/globeandmail/upstartr). That package is installed and imported in `run.R` automatically. 85 | 86 | Before starting an analysis, you'll need to set up your `config.R` file. 
87 | 88 | That file uses the [`initialize_startr()`](https://globeandmail.github.io/upstartr/reference/initialize_startr.html) function to prepare the environment for analysis. It will also load all the packages you'll need. For instance, you might want to add the [`cancensus`](https://github.com/mountainMath/cancensus) library. To do that, just add `'cancensus'` to the `packages` vector. Package suggestions for GIS work, scraping, dataset summaries, etc. are included in commented-out form to avoid bloat. The function also takes several other optional parameters — for a full list, see our [documentation](https://globeandmail.github.io/upstartr/reference/initialize_startr.html). 89 | 90 | Once you've listed the packages you want to import, you'll want to reference your raw data filenames so that you can read them in during `process.R`. For instance, if you're adding pizza delivery data, you'd add this line to the filenames block in `config.R`: 91 | 92 | ```R 93 | pizza.raw.file <- dir_data_raw('Citywide Pizza Deliveries 1998-2016.xlsx') 94 | ``` 95 | 96 | Our naming convention is to append `.raw` to variables that reference raw data, and `.file` to variables that are just filename strings. 97 | 98 | #### Step 2: Import and process data 99 | 100 | In `process.R`, you'll read in the data for the filename variables you assigned in `config.R`, do some clean-up, rename variables, deal with any errors, convert multiple files to a common data structure if necessary, and finally save out the result. 
It might look something like this: 101 | 102 | ```R 103 | pizza.raw <- read_excel(pizza.raw.file, skip = 2) %>% 104 | select(-one_of('X1', 'X2')) %>% 105 | rename( 106 | date = 'Date', 107 | time = 'Time', 108 | day = 'Day', 109 | occurrence_id = 'Occurrence Identification Number', 110 | lat = 'Latitude', 111 | lng = 'Longitude', 112 | person = 'Delivery Person', 113 | size = 'Pizza Size (in inches)', 114 | price = 'Pizza bill \n after taxes' 115 | ) %>% 116 | mutate( 117 | price = parse_number(price), 118 | year_month = format(date, '%Y-%m-01'), 119 | date = ymd(date) 120 | ) %>% 121 | filter(!is.na(date)) 122 | 123 | write_feather(pizza.raw, dir_data_processed('pizza.feather')) 124 | ``` 125 | 126 | When called via the [`run_process()`](https://globeandmail.github.io/upstartr/reference/run_process.html) function in `run.R`, variables generated during processing will be removed once the step is completed to keep the working environment clean for analysis. 127 | 128 | We prefer to write out our processed files using the binary [`.feather`](https://github.com/wesm/feather) format, which is designed to read and write files extremely quickly (at roughly 600 MB/s). Feather files can also be opened in other analysis frameworks (e.g. Jupyter Notebooks) and, most importantly, embed column types into the data so that you don't have to re-declare a column as logicals, dates or characters later on. If you'd rather save out files in a different format, you can just use a different function, like the tidyverse's [`write_csv()`](https://readr.tidyverse.org/reference/write_delim.html). 129 | 130 | Output files are written to `data/processed` using the [`dir_data_processed()`](https://globeandmail.github.io/upstartr/reference/dir-data_processed.html) function. By design, processed files aren't checked into Git: you should be able to reproduce the analysis-ready files from someone else's project by running `process.R`.
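
As a rough illustration of the point about embedded column types, the sketch below writes the same small data frame to a Feather file and to a CSV, then reads both back and compares column classes. It assumes the `feather` package is installed; `pizza_sample` and the temporary paths are hypothetical stand-ins for illustration, not part of `startr`. In a real project you would write to `dir_data_processed()` instead.

```r
library(feather)

# Hypothetical stand-in for a processed dataset (not from the project).
pizza_sample <- data.frame(
  date = as.Date(c('2016-01-04', '2016-01-05')),
  person = c('Alex', 'Sam'),
  price = c(18.5, 22),
  was_delivered = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Temporary paths keep the sketch self-contained; a startr project would
# use something like `dir_data_processed('pizza.feather')` here.
feather_path <- tempfile(fileext = '.feather')
csv_path <- tempfile(fileext = '.csv')

write_feather(pizza_sample, feather_path)
write.csv(pizza_sample, csv_path, row.names = FALSE)

from_feather <- read_feather(feather_path)
from_csv <- read.csv(csv_path)

# Compare the column classes after each round trip.
sapply(from_feather, class)
sapply(from_csv, class)
```

In this sketch the Feather copy should come back with `date` still stored as a `Date`, while the CSV copy returns it as plain text that would need to be re-parsed before analysis.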
131 | 132 | #### Step 3: Analyze 133 | 134 | This part's as simple as consuming that file in `analyze.R` and running with it. It might look something like this: 135 | 136 | ```R 137 | pizza <- read_feather(dir_data_processed('pizza.feather')) 138 | 139 | delivery_person_counts <- pizza %>% 140 | group_by(person) %>% 141 | count() %>% 142 | arrange(desc(n)) 143 | 144 | deliveries_monthly <- pizza %>% 145 | group_by(year_month) %>% 146 | summarise( 147 | n = n(), 148 | unique_persons = n_distinct(person) 149 | ) 150 | ``` 151 | 152 | #### Step 4: Visualize 153 | 154 | You can use `visualize.R` to consume the variables created in `analyze.R`. For instance: 155 | 156 | ```R 157 | plot_delivery_persons <- delivery_person_counts %>% 158 | ggplot(aes(x = person, y = n)) + 159 | geom_col() + 160 | coord_flip() 161 | 162 | plot_delivery_persons 163 | 164 | write_plot(plot_delivery_persons) 165 | 166 | plot_deliveries_monthly <- deliveries_monthly %>% 167 | ggplot(aes(x = year_month, y = n)) + 168 | geom_col() 169 | 170 | plot_deliveries_monthly 171 | 172 | write_plot(plot_deliveries_monthly) 173 | ``` 174 | 175 | #### Step 5: Write a notebook 176 | 177 | TKTKTKTK 178 | 179 | ## Helper functions 180 | 181 | `startr`'s companion package [`upstartr`](https://github.com/globeandmail/upstartr) comes with several functions to support `startr`, plus helpers we've found useful in daily data journalism tasks. A full list can be found on the [reference page here](https://globeandmail.github.io/upstartr/reference/index.html). Below is a partial list of some of its most handy functions: 182 | 183 | - [`simplify_string()`](https://globeandmail.github.io/upstartr/reference/simplify_string.html): By default, takes strings and simplifies them by force-uppercasing, replacing accents with non-accented characters, removing every non-alphanumeric character, and simplifying double/multi-spaces into single spaces.
Very useful when dealing with messy human-entry data with people's names, corporations, etc. 184 | 185 | ```r 186 | pizza_deliveries %>% 187 | mutate(customer_simplified = simplify_string(customer_name)) 188 | ``` 189 | 190 | - [`clean_columns()`](https://globeandmail.github.io/upstartr/reference/clean_columns.html): Renaming columns to something that doesn't have to be referenced with backticks (`` `Column Name!` ``) or square brackets (`.[['Column Name!']]`) gets tedious. This function speeds up the process by forcing everything to lowercase and using underscores – the tidyverse's preferred naming convention for columns. If there are many columns with the same name during cleanup, they'll be appended with an index number. 191 | 192 | ```r 193 | pizza_deliveries %>% 194 | rename_all(clean_columns) 195 | ``` 196 | 197 | - [`convert_str_to_logical()`](https://globeandmail.github.io/upstartr/reference/convert_str_to_logical.html): Does the work of cleaning up your True, TRUE, true, T, False, FALSE, false, F, etc. strings to logicals. 
198 | 199 | ```r 200 | pizza_deliveries %>% 201 | mutate(was_delivered_logi = convert_str_to_logical(was_delivered)) 202 | ``` 203 | 204 | - [`calc_index()`](https://globeandmail.github.io/upstartr/reference/calc_index.html): Calculate percentage growth by indexing values to the first value: 205 | 206 | ```r 207 | pizza_deliveries %>% 208 | mutate(year = year(date)) %>% 209 | group_by(size, year) %>% 210 | summarise(total_deliveries = n()) %>% 211 | arrange(year) %>% 212 | mutate(indexed_deliveries = calc_index(total_deliveries)) 213 | ``` 214 | 215 | - [`calc_mode()`](https://globeandmail.github.io/upstartr/reference/calc_mode.html): Calculate the mode for a given field: 216 | 217 | ```r 218 | pizza_deliveries %>% 219 | group_by(pizza_shop) %>% 220 | summarise(most_common_size = calc_mode(size)) 221 | ``` 222 | 223 | - [`write_excel()`](https://globeandmail.github.io/upstartr/reference/write_excel.html): Writes out an Excel file to `data/out` using the variable name as the file name. Useful for quickly generating summary tables for sharing with others. By design, doesn't take any arguments to keep things as simple as possible. If `should_timestamp_output_files` is set to TRUE in `config.R`, will append a timestamp to the filename in the format `%Y%m%d%H%M%S`. 224 | 225 | ```r 226 | undelivered_pizzas <- pizza_deliveries %>% 227 | filter(!was_delivered_logi) 228 | 229 | write_excel(undelivered_pizzas) 230 | ``` 231 | 232 | - [`write_plot()`](https://globeandmail.github.io/upstartr/reference/write_plot.html): Similar to [`write_excel()`](https://globeandmail.github.io/upstartr/reference/write_excel.html), designed to quickly save out a plot directly to `/plots`. Takes all the same arguments as [`ggsave()`](https://ggplot2.tidyverse.org/reference/ggsave.html). 
233 | 234 | ```r 235 | plot_undelivered_pizzas <- undelivered_pizzas %>% 236 | group_by(year) %>% 237 | summarise(n = n()) %>% 238 | ggplot(aes(x = year, y = n)) + 239 | geom_col() 240 | 241 | write_plot(plot_undelivered_pizzas) 242 | ``` 243 | 244 | - [`read_all_excel_sheets()`](https://globeandmail.github.io/upstartr/reference/read_all_excel_sheets.html): Combines all Excel sheets in a given file into a single dataframe, adding an extra column called `sheet` for the sheet name. Takes all the same arguments as [`readxl`](https://readxl.tidyverse.org/)'s [`read_excel()`](https://readxl.tidyverse.org/reference/read_excel.html). 245 | 246 | ```r 247 | pizza_deliveries <- read_all_excel_sheets( 248 | pizza_deliveries.file, 249 | skip = 3 250 | ) 251 | ``` 252 | 253 | - [`combine_csvs()`](https://globeandmail.github.io/upstartr/reference/combine_csvs.html): Read all CSVs in a given directory and concatenate them into a single dataframe. Takes all the same arguments as [`read_csv()`](https://readr.tidyverse.org/reference/read_delim.html). 254 | 255 | ```r 256 | pizzas <- combine_csvs(dir_data_raw()) 257 | ``` 258 | 259 | - [`combine_excels()`](https://globeandmail.github.io/upstartr/reference/combine_excels.html): Read all Excel spreadsheets in a given directory and concatenate them. 260 | 261 | ```r 262 | pizzas_in_excel <- combine_excels(dir_data_raw()) 263 | ``` 264 | 265 | - [`unaccent()`](https://globeandmail.github.io/upstartr/reference/unaccent.html): Remove accents from strings. 266 | 267 | ```r 268 | unaccent('Montréal') 269 | # [1] "Montreal" 270 | ``` 271 | 272 | - [`remove_non_utf8()`](https://globeandmail.github.io/upstartr/reference/remove_non_utf8.html): Remove non-UTF-8 characters from strings.
273 | 274 | ```r 275 | non_utf8 <- 'fa\xE7ile' 276 | Encoding(non_utf8) <- 'latin1' 277 | remove_non_utf8(non_utf8) 278 | # [1] "façile" 279 | ``` 280 | 281 | - [`%not_in%`](https://globeandmail.github.io/upstartr/reference/grapes-not_in-grapes.html): The opposite of the [`%in%`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/match.html) operator. 282 | 283 | ```r 284 | c(1, 2, 3, 4, 5) %not_in% c(4, 5, 6, 7, 8) 285 | # [1] TRUE TRUE TRUE FALSE FALSE 286 | ``` 287 | 288 | - [`not.na()`](https://globeandmail.github.io/upstartr/reference/not.na.html): The opposite of the [`is.na`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/NA) function. 289 | - [`not.null()`](https://globeandmail.github.io/upstartr/reference/not.null.html): The opposite of the [`is.null`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/NULL) function. 290 | 291 | ## Tips 292 | 293 | - **You don't always need to process your data**: If your [processing step](#step-2-import-and-process-data) takes a while and you've already generated your processed files during a previous run, you can tell `startr` to skip this step by setting `should_process_data` to `FALSE` in `config.R`'s [`initialize_startr()`](https://globeandmail.github.io/upstartr/reference/initialize_startr.html) function. Just be sure to set it back to `TRUE` if your processing code changes! 294 | - **Consider timestamping your output files**: If you're using [`upstartr`](https://github.com/globeandmail/upstartr)'s [`write_excel()`](https://globeandmail.github.io/upstartr/reference/write_excel.html) helper, you can automatically timestamp your filenames by setting `should_timestamp_output_files` to `TRUE` in [`initialize_startr()`](https://globeandmail.github.io/upstartr/reference/initialize_startr.html). 
295 | - **Use the functions file**: Reduce repetition in your code by writing functions and putting them in the `functions.R` file, which gets `source()`'d when [`run_config()`](https://globeandmail.github.io/upstartr/reference/run_config.html) is run. 296 | - **Help us make `startr` better**: Using this package? Find yourself wishing the structure were slightly different, or have an often-used function you're tired of copying and pasting between projects? Please [send us your feedback](#get-in-touch). 297 | 298 | ## Directory structure 299 | 300 | ``` 301 | ├── data/ 302 | │   ├── raw/ # The original data files. Treat this directory as read-only. 303 | │   ├── cache/ # Cached files, mostly used when scraping or dealing with packages such as `cancensus`. Disposable, ignored by version control software. 304 | │   ├── processed/ # Imported and tidied data used throughout the analysis. Disposable, ignored by version control software. 305 | │   └── out/ # Exports of data at key steps or as a final output. Disposable, ignored by version control software. 306 | ├── R/ 307 | │   ├── process.R # Basic data processing (fixing column types, setting dates, pre-emptive filtering, etc.) ahead of analysis. 308 | │   ├── analyze.R # Your exploratory data analysis. 309 | │   ├── visualize.R # Where your visualization code goes. 310 | │   └── functions.R # Project-specific functions. 311 | ├── plots/ # Your generated graphics go here. 312 | ├── reports/ 313 | │   └── notebook.Rmd # Your analysis notebook. Will be compiled into an .html file by `run.R`. 314 | ├── scrape/ 315 | │   └── scrape.R # Scraping scripts that save collected data to the `/data/raw/` directory. 316 | ├── config.R # Global project variables including packages, key project paths and data sources. 317 | ├── run.R # Wrapper file to run the analysis steps, either inline or sourced from component R files. 
318 | └── startr.Rproj # Rproj file for RStudio 319 | ``` 320 | 321 | An `.nvmrc` is included at the project root for Node.js-based scraping. If you prefer to scrape with Python, be sure to add `venv` and `requirements.txt` files, or a `Gemfile` if working in Ruby. 322 | 323 | ## See also 324 | 325 | `startr` is part of a small ecosystem of R utilities. Those include: 326 | 327 | - [**upstartr**](https://github.com/globeandmail/upstartr), a library of functions that support `startr` and daily data journalism tasks 328 | - [**tgamtheme**](https://github.com/globeandmail/tgamtheme), The Globe and Mail's graphics theme 329 | - [**startr-cli**](https://github.com/globeandmail/startr-cli), a command-line tool that scaffolds new `startr` projects 330 | 331 | ## Version 332 | 333 | 1.1.0 334 | 335 | ## License 336 | 337 | startr © 2020 The Globe and Mail. It is free software, and may be redistributed under the terms specified in our MIT license. 338 | 339 | ## Get in touch 340 | 341 | If you've got any questions, feel free to send us an email, or give us a shout on Twitter: 342 | 343 | [![Tom Cardoso](https://avatars0.githubusercontent.com/u/2408118?v=3&s=65)](https://github.com/tomcardoso) | [![Michael Pereira](https://avatars0.githubusercontent.com/u/212666?v=3&s=65)](https://github.com/monkeycycle) 344 | ---|--- 345 | [Tom Cardoso](mailto:tcardoso@globeandmail.com)
[@tom_cardoso](https://www.twitter.com/tom_cardoso) | [Michael Pereira](mailto:hello@monkeycycle.org)
[@__m_pereira](https://www.twitter.com/__m_pereira) 346 | -------------------------------------------------------------------------------- /config.R: -------------------------------------------------------------------------------- 1 | # ================================================================= 2 | # This file configures the project by specifying filenames, loading 3 | # packages and setting up some project-specific variables. 4 | # ================================================================= 5 | 6 | initialize_startr( 7 | title = 'startr', 8 | author = 'Firstname Lastname ', 9 | timezone = 'America/Toronto', 10 | should_render_notebook = FALSE, 11 | should_process_data = TRUE, 12 | should_timestamp_output_files = FALSE, 13 | packages = c( 14 | 'tidyverse', 'glue', 'lubridate', 'readxl', 'feather', 'scales', 'knitr' 15 | # 'rvest', 'janitor', 'zoo', 16 | # 'sf', 'tidymodels', 17 | # 'gganimate', 'tgamtheme', 18 | # 'cansim', 'cancensus' 19 | ) 20 | ) 21 | 22 | # Refer to your source data here. These can be either references to files in 23 | # your `data/raw` folder, or paths to files hosted on the web. 
24 | # For example:
25 | # sample.raw.file <- dir_data_raw('your-filename-here.csv')
26 | # sample.raw.path <- 'https://github.com/tidyverse/dplyr/raw/master/data-raw/starwars.csv'
27 |
--------------------------------------------------------------------------------
/data/cache/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/globeandmail/startr/060f0e7863f5dc198a9217c46bb328da79d7ad55/data/cache/.gitkeep
--------------------------------------------------------------------------------
/data/out/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/globeandmail/startr/060f0e7863f5dc198a9217c46bb328da79d7ad55/data/out/.gitkeep
--------------------------------------------------------------------------------
/data/processed/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/globeandmail/startr/060f0e7863f5dc198a9217c46bb328da79d7ad55/data/processed/.gitkeep
--------------------------------------------------------------------------------
/data/raw/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/globeandmail/startr/060f0e7863f5dc198a9217c46bb328da79d7ad55/data/raw/.gitkeep
--------------------------------------------------------------------------------
/plots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/globeandmail/startr/060f0e7863f5dc198a9217c46bb328da79d7ad55/plots/.gitkeep
--------------------------------------------------------------------------------
/reports/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/globeandmail/startr/060f0e7863f5dc198a9217c46bb328da79d7ad55/reports/.gitkeep
--------------------------------------------------------------------------------
/reports/notebook.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "`r getOption('startr.title')`"
3 | date: "`r format(Sys.Date(), '%B %d, %Y')`"
4 | author: "`r getOption('startr.author')`"
5 | output:
6 |   html_notebook:
7 |     code_folding: hide
8 |     df_print: kable
9 |     self_contained: yes
10 |     theme: cosmo
11 |     toc: yes
12 |     toc_depth: 3
13 | ---
14 |
15 | ## First heading
16 |
17 | Text goes here.
18 |
19 | ```{r}
20 | # R code goes here
21 | ```
22 |
--------------------------------------------------------------------------------
/run.R:
--------------------------------------------------------------------------------
1 | if (!require('upstartr')) install.packages('upstartr'); library('upstartr')
2 |
3 | run_config()
4 | run_process()
5 | run_analyze()
6 | run_visualize()
7 | run_notebook()
8 |
--------------------------------------------------------------------------------
/scrape/scrape.R:
--------------------------------------------------------------------------------
1 | # =======================================================================
2 | # Put any scraping code here. This file doesn't get called by `run.R`.
3 | # =======================================================================
4 |
5 | if (!require('upstartr')) install.packages('upstartr'); library('upstartr')
6 |
7 | run_config()
8 |
9 | # Scraping code goes here.
10 |
--------------------------------------------------------------------------------
/startr.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: No
4 | SaveWorkspace: No
5 | AlwaysSaveHistory: Default
6 | QuitChildProcessesOnExit: Default
7 |
8 | EnableCodeIndexing: Yes
9 | UseSpacesForTab: Yes
10 | NumSpacesForTab: 2
11 | Encoding: UTF-8
12 |
13 | RnwWeave: knitr
14 | LaTeX: pdfLaTeX
15 |
16 | AutoAppendNewline: Yes
17 | StripTrailingWhitespace: Yes
18 |
19 | BuildType: Package
20 | PackageUseDevtools: Yes
21 | PackageInstallArgs: --no-multiarch --with-keep.source
22 |
--------------------------------------------------------------------------------