├── README.md
└── datagov100.json

/README.md:
--------------------------------------------------------------------------------
# JSON Lines Guide

Tutorial on streaming JSON data analysis on the command line.

#### Table of Contents

- [Introduction](#introduction)
- [Installation](#installing-json-lines-tools)
- [`jsonfilter`](#jsonfilter)
- [`jsonmap`](#jsonmap)
- [`jsonstats`](#jsonstats)
- [`ndjson-reduce`](#ndjson-reduce)
- [useful pipelines](#useful-pipelines)

## Introduction

[JSON Lines](http://jsonlines.org/) (also known as [Newline Delimited JSON](http://ndjson.org/)) is a really simple way to store JSON that makes it very friendly for data processing and analysis. To store data in JSON Lines format you simply write one JSON-stringified object per line.

Usually you would store multiple JSON objects in a single, pretty-printed JSON structure in a file like this:

```js
[
  {
    "url": "https://tigerweb.geo.census.gov/arcgis/services/TIGERweb/tigerWMS_Census2010/MapServer/WmsServer",
    "date": "2017-01-27T03:41:03.083Z"
  },
  {
    "url": "https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/PUMA_TAD_TAZ_UGA_ZCTA/MapServer",
    "date": "2017-01-27T03:41:03.121Z"
  }
]
```

The JSON Lines version of the same information would be this:

```
{"url":"https://tigerweb.geo.census.gov/arcgis/services/TIGERweb/tigerWMS_Census2010/MapServer/WmsServer","date":"2017-01-27T03:41:03.083Z"}
{"url":"https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/PUMA_TAD_TAZ_UGA_ZCTA/MapServer","date":"2017-01-27T03:41:03.121Z"}
```

This is a little harder to read, but the advantage is that it's much easier to process on the command line using streams, especially if instead of two objects you had a million!
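
To make the two representations concrete, here is a minimal Node.js sketch that converts an array of objects to JSON Lines and parses it back one line at a time. The helper names `toJsonLines` and `parseJsonLines` are hypothetical and are not part of any tool in this guide, and the URLs are placeholders.

```javascript
// Hypothetical helpers for illustration; not part of the jsonlines tools.

// Serialize an array of values as JSON Lines: one JSON value per line.
function toJsonLines(rows) {
  return rows.map((row) => JSON.stringify(row) + '\n').join('');
}

// Lazily parse JSON Lines text, yielding one value per non-empty line.
function* parseJsonLines(text) {
  for (const line of text.split('\n')) {
    if (line.trim() !== '') yield JSON.parse(line);
  }
}

// Placeholder rows shaped like the example above.
const rows = [
  { url: 'https://example.com/a', date: '2017-01-27T03:41:03.083Z' },
  { url: 'https://example.com/b', date: '2017-01-27T03:41:03.121Z' },
];

const text = toJsonLines(rows);

// Each row is parsed and handled on its own; no enclosing array is needed.
for (const row of parseJsonLines(text)) {
  console.log(row.date);
}
```

This sketch keeps the whole text in a string for brevity. For a large file you would more likely read from a stream, for example with Node's built-in `readline` module, so that only one line is held in memory at a time.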

The main advantage of JSON Lines is that you can process the file one row at a time without having to read the entire file into memory, which is very important for larger datasets.

This guide assumes you are on a UNIX machine, but all of the `json*` tools work on Windows as well.

## Installing JSON Lines Tools

The `jsonlines` GitHub organization houses a collection of handy modules for processing JSON Lines data on the command line.

You should also check out the `ndjson-cli` suite from Mike Bostock, which offers even more tools and is available at https://github.com/mbostock/ndjson-cli. The concepts in this guide are the same.

To install the tools do the following:

1. [Install a recent Node.js](https://nodejs.org/en/download/) version (LTS recommended) on your computer
2. Using `npm`, install the JSON Lines CLI tools

```
$ npm install -g jsonfilter jsonmap jsonstats
```

If you receive permission errors, you may need to use `sudo` or [change your permissions](https://docs.npmjs.com/getting-started/fixing-npm-permissions).

## jsonfilter

The [`jsonfilter`](https://github.com/jsonlines/jsonfilter) command takes in a stream of JSON Lines and, based on a filter expression, writes a subset of the data back out as JSON Lines.

As an example, this repository contains the metadata for 100 Data.gov datasets in JSON Lines format.

You can download it like this:

```
curl https://raw.githubusercontent.com/jsonlines/guide/master/datagov100.json > data.json
```

Then you can use `jsonfilter` to find the names of all the datasets:

```
$ cat data.json | jsonfilter name
"va-national-formulary"
"safer-company-snapshot-safer-company-snapshot"
"fatality-analysis-reporting-system-fars-ftp-raw-data"
"tiger-line-shapefile-2013-state-alabama-current-county-subdivision-state-based"
"tiger-line-shapefile-2013-state-virginia-current-county-subdivision-state-based"
"national-motor-vehicle-crash-causation-survey-nmvccs-nmvccs-xml-case-viewer"
...
```

You should see output similar to the above: lots of strings. This is also valid JSON Lines output, because a line can be any JSON type: strings, numbers, objects, or arrays.

You can also filter on nested properties like this:

```
$ cat data.json | jsonfilter organization.name
"va-gov"
"dot-gov"
"dot-gov"
"census-gov"
"census-gov"
"dot-gov"
"census-gov"
"census-gov"
...
```

You can use `sort` and `uniq` to find out which organization is most active:

```
$ cat data.json | jsonfilter organization.name | sort | uniq -c
  58 "census-gov"
  14 "dot-gov"
   2 "gsa-gov"
   1 "nsf-gov"
   8 "opm-gov"
   1 "ssa-gov"
  11 "usgs-gov"
   5 "va-gov"
```

You can view more examples in the [jsonfilter README](https://github.com/jsonlines/jsonfilter).

## jsonmap

This command lets you morph data from one form into another. Unlike `jsonfilter`, which takes a JSON selector expression, `jsonmap` takes a short JavaScript expression that gets applied to each row of incoming JSON and returns a new value for that row.

For example, if we wanted to grab just a couple of fields from each Data.gov metadata item we could do this:

```
$ cat data.json | jsonmap "{name: this.name, organization: this.organization.name}"
{"name":"va-national-formulary","organization":"va-gov"}
{"name":"safer-company-snapshot-safer-company-snapshot","organization":"dot-gov"}
{"name":"fatality-analysis-reporting-system-fars-ftp-raw-data","organization":"dot-gov"}
...
```

The variable `this` in the expression holds the data for the current row of JSON. You can also use template strings:

```
$ cat data.json | jsonmap '`Notes: ${this.notes.slice(0, 60)}...`'
"Notes: The VA National Formulary is a listing of products (drugs an..."
"Notes: The Company Snapshot is a concise electronic record of compa..."
"Notes: The program collects data for analysis of traffic safety cra..."
...
```

If your expression starts with something other than an object or a template literal, it is used as the body of a function that gets executed on each row. You can either modify `this`, which gets returned at the end of the function:

```
$ cat data.json | jsonmap "if (this.maintainer) this.maintainer = this.maintainer.toUpperCase()"
{"license_title":"Creative Commons CCZero","maintainer":"DON LEES","relationships_...
{"license_title":"Other License Specified","maintainer":"JAMIE VASSER","relationsh...
{"license_title":"U.S. Government Work","maintainer":"LIXIN ZHAO","relationships_a...
...
```

Or return your own custom data:

```
$ cat data.json | jsonmap "if (this.license_id === 'cc-zero') { return 'Open' } else { return 'Closed'}"
"Open"
"Closed"
"Closed"
...
```

For more examples you can check out the [jsonmap README](https://github.com/jsonlines/jsonmap/).
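
The expression styles described in this section can be modeled with a short Node.js sketch. This is not `jsonmap`'s actual source: `compileExpression` and `mapLines` are hypothetical names, and detecting object or template literals by their first character is an assumption made purely for illustration.

```javascript
// Illustrative model only; NOT jsonmap's real implementation.

// Turn an expression string into a function that runs with `this` bound to a row.
function compileExpression(expr) {
  const source = expr.trim();
  // Assumption: object and template literals are recognized by their first character.
  const isLiteral = source.startsWith('{') || source.startsWith('`');
  const body = isLiteral
    ? `return (${source});` // evaluate the literal and return its value
    : `${source}\n;return this;`; // run as a function body, falling back to `this`
  return new Function(body);
}

// Apply the compiled expression to every non-empty line of JSON Lines text.
function mapLines(text, expr) {
  const fn = compileExpression(expr);
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.stringify(fn.call(JSON.parse(line))))
    .join('\n');
}

// Made-up rows that reuse field names from the examples above.
const sample = [
  '{"name":"a","license_id":"cc-zero","maintainer":"Don Lees"}',
  '{"name":"b","license_id":"other","maintainer":null}',
].join('\n');

console.log(mapLines(sample, '{name: this.name}'));
console.log(mapLines(sample, 'if (this.maintainer) this.maintainer = this.maintainer.toUpperCase()'));
console.log(mapLines(sample, "if (this.license_id === 'cc-zero') { return 'Open' } else { return 'Closed' }"));
```

Appending `return this` after the user's code is what lets an expression either mutate the row in place or return early with a value of its own. One limitation of this simplified detection rule is that a function body that happens to begin with `{` would be misread as an object literal.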

## jsonstats

Sometimes you want to get numerical statistics from your data. For example, the metadata includes some page view count metrics:

```
$ cat data.json | jsonfilter tracking_summary.total | head -n5
11754
2241
1925
1139
3247
```

You can pipe a stream of numbers to `jsonstats`, and when the input ends it prints some summary statistics. The trailing `json` command pretty-prints the result; it is described under useful pipelines below.

```
$ cat data.json | jsonfilter tracking_summary.total | jsonstats | json
{
  "max": 11754,
  "min": 3,
  "n": 100,
  "_geometric_mean": 2.0157057404897112e+252,
  "_reciprocal_sum": 1.2014730007931738,
  "mean": 741.4800000000001,
  "ss": 165161200.95999998,
  "sum": 74148,
  "_seen_this": 1,
  "_mode": 11754,
  "_mode_valid": true,
  "variance": 1651612.0095999998,
  "standard_deviation": 1285.1505785704646,
  "geometric_mean": 333.4604034827265,
  "harmonic_mean": 83.23116702080132,
  "mode": 11754,
  "_max_seen": 1,
  "_last": 189
}
```

## ndjson-reduce

If you want to take some JSON Lines output and combine the lines into a normal single JSON array of objects that you can use with `JSON.parse` in your programs, you can use the `ndjson-reduce` command.

The `ndjson-reduce` command gets installed if you run `npm install ndjson-cli -g`, along with some other great tools from the [NDJSON CLI](http://npmjs.org/ndjson-cli) package by Mike Bostock, which offers similar functionality to the other tools in this guide.

To generate a single JS array with all of the unique organization names:

```
$ cat data.json | jsonfilter organization.name | sort | uniq | ndjson-reduce
["census-gov","dot-gov","gsa-gov","nsf-gov","opm-gov","ssa-gov","usgs-gov","va-gov"]
```

The only catch with `ndjson-reduce` is that it isn't streaming: it assumes your data fits easily in memory and is small enough for Node to call `JSON.stringify` on, the limit of which is usually around 500MB or so of JSON.

If you are looking for a streaming alternative to `ndjson-reduce`, check out [`json-write-stream`](https://www.npmjs.com/package/json-write-stream).

## useful pipelines

Here are some more tools that work really well in combination with JSON Lines tools:

### `head`

Built in to Unix. Lets you see the first N lines of a stream:

```
$ cat data.json | head -n1
{"license_title":"Creative Commons CCZero","maintainer":"Don Lees","relationships_...
```

### `json`

Available from npm via `npm install json -g`, it pretty-prints a JSON object:

```
$ cat data.json | head -n1 | json
{
  "license_title": "Creative Commons CCZero",
  "maintainer": "Don Lees",
  "relationships_as_object": [],
  "private": false,
  ...
```

### `wc`

Built in to Unix for doing word counts, but it can also count lines. This is very useful with JSON Lines for counting how many lines are in a file or come out of a filter:

```
$ cat data.json | wc -l
100
```

### `grep`

Built in to Unix, useful if you just want to filter lines based on a regular expression:

```
$ cat data.json | grep CCZero | wc -l
8
```
--------------------------------------------------------------------------------