├── LICENSE
├── README.md
├── common_recipes.md
├── index.html
├── make.md
├── static
└── custom.css
└── styleguide.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2016 datamade
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Making Data, the DataMade Way
2 |
3 | This is [DataMade's](http://datamade.us) guide to **extracting**, **transforming**, and **loading** (ETL) data using [Make](http://en.wikipedia.org/wiki/Make_%28software%29), a common command line utility.
4 |
5 | This guide is part of a body of technical and process documentation maintained by DataMade. Head over to [`datamade/how-to`](
6 | https://github.com/datamade/how-to/) for other guides on topics ranging from AWS to work practices!
7 |
8 | ## What is ETL?
9 |
10 | ETL refers to the general process of:
11 |
12 | 1. taking raw **source data** (*"Extract"*)
13 | 2. doing some stuff to get the data in shape, possibly involving intermediate **derived files** (*"Transform"*)
14 | 3. producing **final output** in a more usable form (for *"Loading"* into something that consumes the data - be it an app, a system, a visualization, etc.)
15 |
16 | Having a standard ETL workflow helps us make sure that our work is clean, consistent, and easy to reproduce. By following these guidelines you'll be able to keep your work up to date and share it with the world in a standard format - all with as few headaches as possible.
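To preview where this guide is headed, the three steps above can be sketched as a tiny Makefile. This is a hypothetical sketch - the filenames, URL, and commands are stand-ins, not a real DataMade pipeline:

```make
# Hypothetical ETL sketch: each rule builds its target from its dependencies.

# "Load": put the cleaned data where the consuming app expects it
final/output.csv: transformed.csv
	mkdir -p final
	cp transformed.csv final/output.csv

# "Transform": keep only the first two columns of the raw data
transformed.csv: source.csv
	csvcut -c "1,2" source.csv > transformed.csv

# "Extract": download the raw source data (the URL is a placeholder)
source.csv:
	wget --no-use-server-timestamps -O source.csv https://example.com/source.csv
```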
17 |
18 | ## Basic Principles
19 |
20 | These five principles inform all of our data work:
21 |
22 | 1. **Never destroy data** - treat source data as immutable, and show your work when you modify it
23 | 2. Be able to deterministically **produce the final data with one command**
24 | 3. Write as **little custom code** as possible
25 | 4. Use **[standard tools](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#4-standard-toolkit)** whenever possible
26 | 5. Keep source data under **version control**
27 |
28 | Unsure how to follow these principles? Read on!
29 |
30 | ## The Guide
31 |
32 | 1. [Make & Makefile Overview](https://github.com/datamade/data-making-guidelines/blob/master/make.md)
33 | - [Why Use Make/Makefiles?](https://github.com/datamade/data-making-guidelines/blob/master/make.md#1-why-use-makemakefiles)
34 | - [Makefile 101](https://github.com/datamade/data-making-guidelines/blob/master/make.md#2-makefile-101)
35 | - [Makefile 201 - Some Fancy Things Built Into Make](https://github.com/datamade/data-making-guidelines/blob/master/make.md#3-makefile-201---some-fancy-things-built-into-make)
36 | 2. [ETL Styleguide](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md)
37 | - [Makefile Best Practices](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#1-makefile-best-practices)
38 | - [Variables](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#2-variables)
39 | - [Processors](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#3-processors)
40 | - [Standard Toolkit](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#4-standard-toolkit)
41 | - [ETL Workflow Directory Structure](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#5-etl-workflow-directory-structure)
42 |
43 | ## Code examples
44 | - [Some Annotated ETL Code Examples with Make](http://datamade.github.io/data-making-guidelines/)
45 | - [Recipes for Common Makefile Operations](https://github.com/datamade/data-making-guidelines/blob/master/common_recipes.md)
46 | - [Chicago Lead](https://github.com/City-Bureau/chicago-lead) - data work with a clear README and Makefile
47 | - [EITC Works](https://github.com/datamade/eitc-map/tree/master/data) - adding data attributes to Illinois House and Senate district shapefiles and outputting as GeoJSON
48 |
49 | ## Further reading
50 | - [Makefile Style Guide by Clark Grubb](http://clarkgrubb.com/makefile-style-guide#data-workflows)
51 | - [Why Use Make by Mike Bostock](http://bost.ocks.org/mike/make/)
52 |
--------------------------------------------------------------------------------
/common_recipes.md:
--------------------------------------------------------------------------------
1 | 1. Set up phony targets
2 |
3 | ```make
4 | .PHONY: all clean
5 |
6 | all: $(GENERATED_FILES)
7 |
8 | clean:
9 | rm -Rf finished/*
10 | ```
11 |
12 | 2. Downloading a zip archive
13 |
14 | ```make
15 | parcels.zip:
16 | wget --no-use-server-timestamps \
17 | http://maps.indiana.edu/download/Reference/Land_Parcels_County_IDHS.zip -O $@
18 | ```
19 | Notice the use of `--no-use-server-timestamps`. Without this flag, the downloaded file
20 | would keep the last-modified timestamp of the copy on the server. With it, the file is
21 | stamped with the time it was actually downloaded, which is what Make's dependency checking expects.
22 |
23 |
24 | 3. Unzipping a zip archive
25 |
26 | ```make
27 | .INTERMEDIATE: chicomm.shp
28 | chicomm.shp: chicomm.zip
29 | unzip -o $<
30 | ```
31 |
32 | 4. Converting Excel to CSV
33 |
34 | ```make
35 | .INTERMEDIATE: parcel_survey.csv
36 | parcel_survey.csv: parcel_survey.xlsx
37 | in2csv $< > $@
38 | ```
39 |
40 | 5. Grabbing select columns from an Excel doc, and creating a CSV with a new header
41 |
42 | ```make
43 | school_id_lookup.csv: School_data_8-3-14.xlsx
44 | in2csv $< |\
45 | csvcut -c "1,2" |\
46 | (echo "school_id,school_name"; tail -n +2) > finished/$(notdir $@)
47 | ```
48 |
49 | 6. Joining CSVs, using an implicit rule
50 |
51 | ```make
52 | %hourly.joined.csv: %hourly.csv stations.csv
53 | csvjoin -c "3,4" $< $(word 2,$^) > finished/$(notdir $@)
54 | ```
55 |
56 | 7. Substituting many versions of the same thing, such as different URLs for each
57 | year of an annual report, into a common recipe
58 |
59 | ```make
60 | BASE_URL=www.mydatais.cool
61 | URL_2010=$(BASE_URL)/2010/summary.csv
62 | URL_2011=$(BASE_URL)/2011/summery.csv
63 | URL_2012=$(BASE_URL)/2012/data-summary.csv
64 |
65 | YEARS=2010 2011 2012
66 |
67 | COOL_DATA=$(patsubst %, data_%.csv, $(YEARS))
68 |
69 |
70 | data_%.csv :
71 | wget --no-use-server-timestamps -O $@ $(URL_$*)
72 | ```
73 |
74 | Run `make data_2010.csv` to build a single year, or set `$(COOL_DATA)` as a
75 | dependency of another target to automagically run your pattern recipe for all defined URLs.
76 |
77 | This pattern is also useful when grabbing data from large datasets where column
78 | names change over time.
79 |
80 | Read more on `patsubst` [in the Make docs](https://www.gnu.org/software/make/manual/html_node/Text-Functions.html).
81 |
--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/make.md:
--------------------------------------------------------------------------------
1 | ## [Making Data, the DataMade Way](https://github.com/datamade/data-making-guidelines/blob/master/README.md)
2 | 1. **Make & Makefile Overview**
3 | 2. [ETL Styleguide](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md)
4 |
5 | # Make & Makefile Overview
6 |
7 | To achieve a reproducible data workflow, DataMade uses [GNU's Make](http://en.wikipedia.org/wiki/Make_%28software%29).
8 |
9 | Make is a *build automation tool* – it helps build files from source code by keeping track of dependencies and executing shell commands. Out in the wider world, you're most likely to see people using Make to compile software. But it has a bunch of nice properties that make it useful for processing all kinds of data, too.
10 |
11 | Make runs in the command line. While Makefiles have a syntax of their own, their recipes are written in shell (typically bash). To understand this guide, you should be comfortable writing basic scripts in bash and editing files in the Unix filesystem. If you feel shaky on the command line (or just want to brush up), we recommend you start with Ryan Chadwick's [Linux](http://ryanstutorials.net/linuxtutorial/) and [bash scripting](http://ryanstutorials.net/bash-scripting-tutorial/) tutorials and come back when you're ready.
12 |
13 | On this page we discuss our reasons for using Make, and provide a brief introduction to the way it works.
14 |
15 | ## Contents
16 |
17 | 1. [Why Use Make/Makefiles?](#1-why-use-makemakefiles)
18 | 2. [Makefile 101](#2-makefile-101)
19 | 3. [Makefile 201 - Some Fancy Things Built Into Make](#3-makefile-201---some-fancy-things-built-into-make)
20 |
21 |
22 | ### 1. Why Use Make/Makefiles?
23 |
24 | As a build automation tool, Make generates files (called *targets*), each of which can depend upon the existence of other files (called *dependencies*). Targets, dependencies, and instructions for how to build them (called *recipes*) are defined in a special file called a Makefile.
25 |
26 | The nice thing about Makefiles is that once you specify the ways in which your files depend on one another (a "dependency graph"), Make will look at the files you have and do the work of figuring out the individual steps required to build the output that you want. If you're trying to make a target and you already have some of its dependencies, Make will skip the steps required to produce those dependencies; if you're missing a bunch of dependencies, on the other hand, it will figure out the fastest way of producing them. This property of Make means that you can change a rule or a dependency and rebuild your output without having to rerun every single step along the way.
27 |
28 | In short, **Make is a particularly nifty tool for data processing because**:
29 | - Make allows you to produce your final data with a single command
30 | - Writing a Makefile forces you to make your data processing steps explicit
31 | - Make is smart about only building what's necessary, because it keeps track of dependencies
32 | - Make is efficient, and gives you parallel processing for nearly free
33 | - If you have a Mac or run Linux, Make is already on your computer! (If you run Windows, I'm sorry - you may have to [install it manually.](http://gnuwin32.sourceforge.net/packages/make.htm))
34 |
35 | For a more eloquent argument in favor of using Make for data processing, see ["Why Use Make" by Mike Bostock](http://bost.ocks.org/mike/make/).
36 |
37 | ### 2. Makefile 101
38 |
39 | #### Rules
40 |
41 | The basic building block of a Makefile is a *"rule"*: a small block of code that executes one step of your data-making process. Each "rule" consists of (1) a *target*, (2) the target's *dependencies*, and (3) the target's *recipe* (the commands for creating the target).
42 |
43 | **The general structure of a single Make "rule" looks like this:**
44 | ```
45 | <target>: <dependencies>
46 | [tab] <recipe>
47 | ```
48 | *A note about __tabs__: one big difference between bash scripting and Makefiles is that in Make, recipes absolutely must be indented with a tab (and not spaces). This can be a common source of strange errors.*
49 |
50 | The **target** is what you want the rule to generate. Up until this point we've assumed that the target will be a file (and most often it is), but one of Make's brilliant properties is that the target doesn't have to be a *literal filename* – it can also be a [phony target](#phony-targets) or a [variable](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#2-variables), two special kinds of targets. Phony targets and variables are very useful for writing Makefiles, but you don't have to understand what they do yet! The most important concept to understand is that a target is the identifying label for a rule that allows the user to reference it.
51 |
52 | A rule's **dependencies** are everything that needs to exist in order to make the target. Like targets, Make expects dependencies to be files, but they can also be phony targets or variables. Unlike targets, dependencies are *optional* - if your target has no dependencies, Make will go ahead and run whatever code is contained in the recipe, whether or not you have all of the things it needs to run. Dependencies are useful for making sure that all of your files are ready to generate your target.
53 |
54 | A **recipe** lists the commands that Make needs to run to generate the target. Any command you can run on the command line is fair game for recipes, and any recipe should be able to be run on its own in your shell. This means you're free to use command line utilities, edit the filesystem, and run scripts in your recipe. Just remember: any command line utility you use that isn't built into the bash language or included in the repo will need to be clearly identified (ideally with installation instructions) in your README, or else users with different machines might not be able to run your Makefile.
55 |
56 | In the end, a **Makefile** is mostly a collection of rules for generating targets, as well as the dependencies for those targets. In this way, a Makefile is a lot like any old program you might write: it defines functions (rules) for running code (bash commands) to modify files (perform ETL). One of the biggest differences between a Makefile and a standard program, however, is that compared to most programs, writing a Makefile requires thinking *backwards*.
57 |
58 | #### Thinking backwards
59 |
60 | What does it mean to think backwards? Well, consider the simplest way of thinking about editing data: as a series of discrete steps. You start with a source file, then you run it through a script, and then you redirect the output to a new file. We might describe this way of conceptualizing code as "thinking *forward*", since you start at the input and step "forward" toward the output.
61 |
62 | When making data, however, you can also think *backwards* - in terms of the outputs that you want to produce, and the files that those outputs are derived from. Thinking backwards is a more powerful way of expressing a data workflow, since dependencies aren't always linear, and sometimes your dependencies change (when you receive an updated source file, for example).
63 |
64 | To illustrate this kind of thinking, imagine that you want to produce a [table of test scores for every school in Illinois](https://github.com/datamade/school-report-cards). In this case, generating the table itself is the last step, step 3, where you join a table containing a list of every school in Illinois and its corresponding ID number to a table containing ID numbers and test scores; step 2 is cleaning raw data sources to edit and format all of the columns that you need; and step 1 is scraping the raw data from the web.
65 |
66 | Thinking backwards (and in pseudocode, with the fake commands `join`, `edit`, and `format`), we might represent this workflow something like this:
67 |
68 | ```bash
69 | # step 3 - build the output
70 | final_table: clean_schools.csv clean_scores.csv
71 | join clean_schools.csv clean_scores.csv > final_table
72 |
73 | # step 2b - clean the raw data
74 | clean_schools.csv: raw_schools.csv
75 | edit raw_schools.csv | format > clean_schools.csv
76 |
77 | # step 2a - clean the raw data
78 | clean_scores.csv: raw_scores.csv
79 | edit raw_scores.csv | format > clean_scores.csv
80 |
81 | # step 1b - scrape the web
82 | raw_schools.csv:
83 | wget -O raw_schools.csv https://url.for/raw/schools
84 |
85 | # step 1a - scrape the web
86 | raw_scores.csv:
87 | wget -O raw_scores.csv https://url.for/raw/scores
88 |
89 | ```
90 |
91 | Even though you "thought backwards" to produce this table, when you're done writing your Makefile, you'll typically move things into a more standard "forward" order to help your users read your work.
92 |
93 | This kind of "backwards" thinking can be a tricky concept to get your head around. Let's take a look at how Make runs to see what it means in practice.
94 |
95 | #### Running Make
96 |
97 | Every time you run the `make` command, you tell it which target file you want it to build (with the syntax `make <target>`). Extending the example above, you would run the following command to generate your table:
98 |
99 | ```
100 | make final_table
101 | ```
102 |
103 | Now, Make will look for a Makefile in your working directory, find the target that you're after (`final_table`), and check to see if its dependencies (`clean_schools.csv` and `clean_scores.csv`) exist and are up to date. If a dependency (say, `clean_schools.csv`) is too old or doesn't exist, Make will step back, treat that dependency as a new "target", and check the dependencies *of that dependency* (in this case, `raw_schools.csv`). In this way, Make thinks both *backwards* and *recursively*, generating a dependency tree starting with your output and extending all the way back to the most basic missing dependency.
104 |
105 | ### 3. Makefile 201 - Some Fancy Things Built Into Make
106 |
107 | By now, you should have a decent understanding of what Make is for and what a Makefile typically looks like. If you still feel confused, take a look at our [annotated examples of Make rules](https://datamade.github.io/data-making-guidelines/) or browse our [suggested reading](https://github.com/datamade/data-making-guidelines#further-reading).
108 |
109 | In this section we describe some of the most common tools DataMade uses that come built-in with Make. None of these tools are required for a Makefile to run, but they'll help keep your work clean and concise. What follows is nowhere near complete documentation of Make functionality - for that, you should [read the docs](https://www.gnu.org/software/make/manual/make.html), of course.
110 |
111 | #### Phony Targets
112 |
113 | By default, Make assumes that targets are files - and usually they are. However, sometimes it is useful to define rules that a user can run that do not produce a single file, but instead perform some set of commands. For example, you might want to make all of the targets in the Makefile at once, or you might want to remove everything you've created from your directory - in both of these cases Make needs to do a bunch of things to a bunch of files, rather than run commands to produce one single file. In cases like these we can use "phony targets": targets that aren't defined by one specific file, but that instead act more like function names, encapsulating a set of useful commands.
114 |
115 | To define phony targets, you must explicitly tell Make that they are not associated with files, like so:
116 |
117 | ```
118 | .PHONY: all clean
119 | ```
120 |
121 | Now, Make will understand that the `all` target and the `clean` target are phony targets: it won't look for files with those names, and it will run their recipes every time they're called.
122 |
123 | In fact, these are the two most common phony targets that we define: `all` typically makes all targets defined in the Makefile at once, while `clean` usually removes all generated targets from the directory. The rules for `all` and `clean` often look something like this:
124 |
125 | ```bash
126 | # Make all targets
127 | all: $(GENERATED_FILES)
128 |
129 | # Remove all generated targets
130 | clean:
131 | rm -Rf finished/*
132 | ```
133 |
134 | In this case, the `$(GENERATED_FILES)` dependency should point to a list of all final output targets in the Makefile, and the directory `finished/` should contain all of the files that have been generated.
135 |
136 | #### Automatic Variables
137 |
138 | GNU Make comes with some [automatic variables](http://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html#Automatic-Variables) that you can use in your recipe to refer to specific targets or dependencies. Automatic variables act as a shorthand for these targets/dependencies.
139 |
140 | The most common automatic variables we use:
141 |
142 | | variable | what it refers to |
143 | |---| --- |
144 | | ```$@``` | the filename of the target |
145 | | ```$^``` | the filenames of all dependencies |
146 | | ```$?``` | the filenames of all dependencies that are newer than the target |
147 | | ```$<``` | the filename of the first dependency |
148 |
149 | To see an example of automatic variables in action, let's look back at our rule for generating test scores. The final rule (in pseudocode) looked like this:
150 |
151 | ```
152 | final_table: clean_schools.csv clean_scores.csv
153 | join clean_schools.csv clean_scores.csv > final_table
154 | ```
155 |
156 | Make will accept this syntax, but if the target or dependencies have long filenames it might look messy. We could clean it up using automatic variables like so:
157 |
158 | ```
159 | final_table: clean_schools.csv clean_scores.csv
160 | join $^ > $@
161 | ```
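The table's other two variables are handy as well. For example, `$<` grabs just the first dependency, which is useful when later dependencies are scripts rather than input data (a hypothetical sketch - `processors/clean.py` is a stand-in):

```make
# $< is only the first dependency (raw_scores.csv). The script is a dependency
# too, so editing it triggers a rebuild, but it isn't passed in as input data.
clean_scores.csv: raw_scores.csv processors/clean.py
	python processors/clean.py < $< > $@
```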
162 |
163 | #### Pattern Rules (Implicit Rules)
164 |
165 | In cases where you don't want to state targets explicitly, you can write an [implicit rule](https://www.gnu.org/software/make/manual/html_node/Pattern-Rules.html) by including `%` in the target and dependencies. `%` will match any nonempty substring, and the match is called the *stem*.
166 |
167 | For an example of a pattern rule, let's look at step 2 of the school Makefile:
168 |
169 | ```
170 | clean_schools.csv: raw_schools.csv
171 | edit raw_schools.csv | format > clean_schools.csv
172 |
173 | clean_scores.csv: raw_scores.csv
174 | edit raw_scores.csv | format > clean_scores.csv
175 | ```
176 |
177 | Since our cleaning process is identical for both of these rules, we can collapse them using a pattern rule. Note that `%` is only matched in the target and dependency lines; inside the recipe, refer to the matched stem with `$*`:
178 | 
179 | ```
180 | clean_%.csv: raw_%.csv
181 | edit raw_$*.csv | format > clean_$*.csv
182 | ```
183 |
184 | Note that we can simplify this even further using automatic variables:
185 |
186 | ```
187 | clean_%.csv: raw_%.csv
188 | edit $^ | format > $@
189 | ```
190 |
191 | **Note:** If your pattern rule fails, check the dependencies. If you've fat-fingered something or omitted the directory by mistake, Make will fail saying a recipe for the target doesn't exist (`make: *** No rule to make target BLAH. Stop.`), [when in fact it's the dependency that's missing](https://stackoverflow.com/a/5194141/7142170).
192 |
193 | #### Functions for Filenames
194 |
195 | There are some convenient [functions](https://www.gnu.org/software/make/manual/html_node/File-Name-Functions.html) for working with a filename or multiple filenames.
196 |
197 | Some useful filename functions:
198 |
199 | | filename function | what it does |
200 | |---|---|
201 | | ```$(dir [filepaths])``` | returns only the directory path |
202 | | ```$(notdir [filepaths])``` | returns only the file name |
203 |
204 | For example, ```$(dir finished/file1.csv finished/file2.csv)``` = ```'finished/ finished/'```, and ```$(notdir finished/file1.csv finished/file2.csv)``` = ```'file1.csv file2.csv'```
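These functions shine inside recipes; for example (continuing the pseudocode from above, with the fake `format` command and hypothetical filenames):

```make
# $(dir $@) is 'finished/'; $(notdir $@) is 'clean_scores.csv'
finished/clean_scores.csv: raw_scores.csv
	mkdir -p $(dir $@)
	format < raw_scores.csv > $@
	@echo "wrote $(notdir $@) into $(dir $@)"
```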
205 |
206 | ## Further Reading
207 |
208 | By now, you might feel like a Make expert. That's great! Move on to our [ETL styleguide ("Makefile 301")](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md) to dig into the specifics of how to write a beautiful, DataMade-ready Makefile.
209 |
210 | You might also feel really confused about this whole Makefile business. You may ask yourself, "What the heck is a target – is it a file or a variable?" Or, "What's the deal with this phony target thing?" If that sounds like you, don't fret! Make can be confusing if you've never done ETL work before, and with time and practice you'll get the hang of it. Take a look at our [annotated Make examples](http://datamade.github.io/data-making-guidelines/), and then try to annotate your own Makefile. We recommend starting with the [Chicago Lead](https://github.com/City-Bureau/chicago-lead), a well-documented piece of DataMade ETL. For a more challenging Makefile, try to annotate our [Illinois school report cards Makefile](https://github.com/datamade/school-report-cards).
211 |
--------------------------------------------------------------------------------
/static/custom.css:
--------------------------------------------------------------------------------
1 | .codeblock{
2 | background: #d0e9de;
3 | font-family: "Courier";
4 | font-size: .8em;
5 | padding: 15px;
6 | overflow: scroll;
7 | white-space: nowrap;
8 | }
9 |
10 | .pseudocode{
11 | background: #F3F3F3;
12 | width: 60%;
13 | }
14 |
15 | .heavy{
16 | font-weight: 700;
17 | }
18 |
19 | .more{
20 | background-color: #FFF0A5;
21 | }
--------------------------------------------------------------------------------
/styleguide.md:
--------------------------------------------------------------------------------
1 | ## [Making Data, the DataMade Way](https://github.com/datamade/data-making-guidelines/blob/master/README.md)
2 | 1. [Make & Makefile Overview](https://github.com/datamade/data-making-guidelines/blob/master/make.md)
3 | 2. **ETL Styleguide**
4 |
5 | # ETL Styleguide
6 | *AKA "Makefile 301"*
7 |
8 | This page defines the DataMade styleguide for processing data, a collection of "best practices" for writing Makefiles.
9 |
10 | To make use of this page, you should already have a good understanding of [what Make is and why we use it.](https://github.com/datamade/data-making-guidelines/blob/master/make.md) You should also be comfortable with [bash syntax and the Unix filesystem](http://ryanstutorials.net/bash-scripting-tutorial/).
11 |
12 | ## Contents
13 |
14 | 1. [Makefile Best Practices](#1-makefile-best-practices)
15 | 2. [Variables](#2-variables)
16 | 3. [Processors](#3-processors)
17 | 4. [Standard Toolkit](#4-standard-toolkit)
18 | 5. [ETL Workflow Directory Structure](#5-etl-workflow-directory-structure)
19 |
20 | ### 1. Makefile Best Practices
21 |
22 | Some loose notes on best practices:
23 |
24 | #### Show your work
25 | - Some transformations, especially those chaining Unix tools, are obscure. Consider printing the purpose of the transformation (as in: `@echo "Downcasing the header of this csv"`)
26 | - Do not suppress output to the console. Being able to see commands as they execute helps users debug errors.
27 | - Avoid redirecting stderr to `/dev/null`.
28 | - When possible, list recipes in rough order of processing steps (from start to finish).
29 |
30 | #### Stay [DRY](https://en.wikipedia.org/wiki/Don't_repeat_yourself)
31 | - To limit verbosity, use [arg flags](https://gobyexample.com/command-line-flags) and [automatic variables](https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html).
32 | - Define 'all' and 'clean' phony targets for keeping the repo clean.
33 | - Prefer implicit patterns over explicit recipes. Implicit patterns are DRY, and files created by implicit patterns will automatically be cleaned up.
34 |
35 | #### Keep things clean
36 | - When implicit patterns are not attractive, set intermediate build files as dependencies of the `.INTERMEDIATE` target. Make will clean up these files for you.
37 | - Makefile directives go at the top of the file, followed by variables, followed by 'all' and 'clean' targets.
38 |
39 | #### A note on cleanliness
40 | It's best to use implicit rules and .INTERMEDIATE targets to have Make clean up intermediate files for you. Sometimes, though, this is not so easy. In particular, if a step in the build process emits multiple files, only some of which will be dependencies, it's not very easy to get Make to clean up everything. This happens frequently when working with ESRI shapefiles which include a .prj, .dbf, .xml, and other files in addition to the .shp file.
41 |
42 | To handle these types of issues, define a .PHONY cleanup target for everything that Make misses, and document the cleanup target in the README.
43 |
44 | ### 2. Variables
45 | Variables are names defined in a Makefile to refer to files, directories, targets, or just about anything that you can represent with text. They are defined in `ALL_CAPS` and referenced with the syntax `$(ALL_CAPS)`.
46 |
47 | **A few common variables we define:**
48 |
49 | | variable | description |
50 | |---|---|
51 | | ```GENERATED_FILES``` | A list of all the final output targets that the Makefile can build. This is used as a shorthand way of calling everything in the `all` [phony target](https://github.com/datamade/data-making-guidelines/blob/master/make.md#phony-targets). |
52 | | ```DATA_DIRS``` | If there is a master Makefile with imported sub-Makefiles, this includes all of the sub-directories. It is useful for the `all` [phony target](https://github.com/datamade/data-making-guidelines/blob/master/make.md#phony-targets) in the master Makefile. |
53 | | ```DIR``` | Points to the directory that contains the master Makefile. |
54 | | ```PYTHON_BIN``` | If a Python Virtual Environment needs to be created, this points to the ```bin/``` directory inside that virtual environment. |
55 | | ```PROCESSOR_DIR``` | If the workflow requires [processors](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#3-processors), this points at the directory containing the processors. |
56 |
57 | If you have a master Makefile and multiple sub-Makefiles, you should define ```GENERATED_FILES``` in each sub-Makefile, and the other variables above in the master Makefile.
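A sketch of how these definitions might look (all paths and filenames here are hypothetical):

```make
# In the master Makefile:
DIR = $(shell pwd)
PYTHON_BIN = $(DIR)/.venv/bin
PROCESSOR_DIR = $(DIR)/processors
DATA_DIRS = data/schools data/scores

# In a sub-Makefile (e.g. data/schools/Makefile):
GENERATED_FILES = finished/schools.csv finished/school_id_lookup.csv

all: $(GENERATED_FILES)
```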
58 |
59 | ### 3. Processors
60 | When processing a target requires more than can be accomplished with our [standard toolkit](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#4-standard-toolkit), a processor (i.e. a script for a single operation, often written in Python) can be written.
61 |
62 | For the sake of easier reuse, each processor should be modular, only handling one operation on a file. Each processor should be configured to accept input on ```STDIN``` and write output to ```STDOUT```, so that it's easy to chain processors and operations.
63 |
64 | All processors should live in a ```processors/``` directory in the root of the repository. To make processors available to all Makefiles, define the path to ```processors/``` in the ```PROCESSOR_DIR``` [variable](https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#2-variables).
65 |
66 | Some examples of single-purpose processors:
67 | - [excel date column -> ISO formatted date column](https://github.com/datamade/gary-counts-data/blob/master/data/processors/convert_excel_time.py)
68 | - ['NA' or 'N/A' -> None](https://github.com/datamade/gary-counts-data/blob/master/data/processors/make_real_nulls.py)
69 | - [delete empty rows from a csv](https://github.com/datamade/gary-counts-data/blob/master/data/processors/delete_empty_rows.py)
70 | - [strip whitespace in a csv](https://github.com/datamade/gary-counts-data/blob/master/data/processors/strip_whitespace.py)
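In the same spirit as the examples above, here is a minimal sketch of the STDIN/STDOUT convention - a simplified, hypothetical whitespace-stripping processor (not the actual script linked above):

```python
import csv
import sys


def strip_whitespace(infile, outfile):
    """Strip leading/trailing whitespace from every cell of a CSV."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        writer.writerow([cell.strip() for cell in row])


if __name__ == '__main__':
    # Read from STDIN and write to STDOUT so the processor chains in a pipeline.
    strip_whitespace(sys.stdin, sys.stdout)
```

Because it reads STDIN and writes STDOUT, it slots straight into a recipe, e.g. `in2csv $< | python $(PROCESSOR_DIR)/strip_whitespace.py > $@`.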
71 |
72 | ### 4. Standard Toolkit
73 |
74 | - For fetching content on the web, use **wget**. Use the `--no-use-server-timestamps` flag so the file is stamped with its download time rather than the server's timestamp, keeping Make's dependency checking predictable; use the `-O` flag to define a custom filepath for the output.
75 | - For manipulating geo files use **GDAL/OGR**.
76 | - For concatenating, cropping, inspecting, or extracting single values from PDF files use the appropriate [**Poppler**](https://packages.debian.org/sid/poppler-utils) (`brew install poppler` or `apt-get install poppler-utils`) utility.
77 | - For scraping entire PDFs, use [**tabula-java**](https://github.com/tabulapdf/tabula-java). (See [get-the-lead-out](https://github.com/City-Bureau/get-the-lead-out/blob/master/Makefile) for an example.)
78 | - Use **CSVkit** for manipulating spreadsheets, or things that can be made into spreadsheets. In particular:
79 | - [```in2csv```](https://csvkit.readthedocs.org/en/0.9.1/scripts/in2csv.html)
80 | - [```csvcut```](https://csvkit.readthedocs.org/en/0.9.1/scripts/csvcut.html)
81 | - [```csvjoin```](https://csvkit.readthedocs.org/en/0.9.1/scripts/csvjoin.html)
82 | - For simple SQL-like queries use CSVkit; for more complicated queries use **PostgreSQL**.
83 | - For geospatial queries use **PostGIS**.
84 | - For text manipulation use **[perl](https://luv.asn.au/overheads/perl/man.html)** (like sed but without portability issues), unless it's substantially easier to do it with awk.
85 | - Use **unzip**, **gzip**, and **tar** for decompressing files. If you are compressing files and have the option, use **tar zcvf**.
86 | - For custom transform code, write **Python scripts**.
87 |
88 | In general, prefer [simple Unix tools](http://physics.oregonstate.edu/~landaur/nacphy/coping-with-unix/node145.html) to custom scripts (processors). When a processor is absolutely necessary, try to follow the [Unix philosophy](http://www.faqs.org/docs/artu/ch01s06.html) ("do one thing and do it well").
89 |
90 | ### 5. ETL Workflow Directory Structure
91 |
92 | In the case that a project has multiple separate data components, you can define a master Makefile at the root of the repository, along with sub-directories that each have a sub-Makefile at the root. When using this type of nested structure, all data processing/transformation should be defined in the sub-Makefiles - the master Makefile should only handle setting up the environment, defining variables/targets used by multiple sub-Makefiles, and calling sub-Makefiles.
93 |
94 | ```
95 | |-- Makefile # the master makefile
96 | |-- README.md
97 | |-- data/
98 | | |-- <data source 1>/
99 | | | |-- Makefile # a sub-makefile
100 | | | |-- README.md # documents the data source & how data gets processed
101 | | | |-- finished/ # a directory for finished files
102 | | |-- <data source 2>/
103 | | | |-- Makefile # a sub-makefile
104 | | | `-- < ... etc ... >
105 | |-- processors
106 | | |-- <processor name>.py
107 | | `-- <processor name>.sh
108 | `-- requirements.txt # lists install requirements for the pipeline
109 | ```
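Under this structure, the master Makefile mostly delegates. A hedged sketch (directory names are placeholders):

```make
# Hypothetical master Makefile: set up shared config, then call sub-Makefiles
DATA_DIRS = data/schools data/scores

.PHONY: all clean $(DATA_DIRS)

all: $(DATA_DIRS)

# Run the default target of each sub-Makefile in its own directory
$(DATA_DIRS):
	$(MAKE) -C $@

clean:
	for dir in $(DATA_DIRS); do $(MAKE) -C $$dir clean; done
```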
110 |
--------------------------------------------------------------------------------