├── README.md ├── cookbook.md ├── hadoop.md ├── humongous-survival-guide.md ├── nfu └── sql.md /README.md: -------------------------------------------------------------------------------- 1 | # nfu: Numeric Fu for your shell 2 | **NOTE:** nfu is unlikely to receive any more major updates, as I'm currently 3 | working on its successor [ni](https://github.com/spencertipping/ni). 4 | 5 | `nfu` is a text data hub and transformation tool with a large set of composable 6 | functions and source/sink adapters. For example, if you wanted to do a map-side 7 | inner join between a PostgreSQL table, a CSV from the Internet, and stuff on 8 | HDFS and gather the results into a sorted/uniqued text file: 9 | 10 | ```sh 11 | $ nfu sql:P@:'%*mytable' \ 12 | -i0 @[ http://data.com/csv -F , ] \ 13 | -H@::H. [ -i0 hdfsjoin:/path/to/hdfs/data ] ^gcf1. \ 14 | -g \ 15 | > output 16 | 17 | # equivalent long version 18 | $ nfu sql:Pdbname:'select * from mytable' \ 19 | --index 0 @[ http://data.com/csv --fieldsplit , ] \ 20 | --hadoop /tmp/temp-resharded-upload-path [ ] [ ] \ 21 | --hadoop . [ --index 0 hdfsjoin:/path/to/hdfs/data ] \ 22 | [ --group --count --fields 1. ] \ 23 | --group \ 24 | > output 25 | ``` 26 | 27 | Then if you wanted to plot a cumulative histogram of the `.metadata.size` JSON 28 | field from the third column values, binned to the nearest 100: 29 | 30 | ```sh 31 | $ nfu output -m 'jd(%2).metadata.size' -q100ocOs1f10p %l 32 | 33 | # equivalent long version 34 | $ nfu output --map 'json_decode($_[2]).metadata.size' \ 35 | --quant 100 --order --count --rorder \ 36 | --sum 1 --fields 10 --plot 'with lines' 37 | ``` 38 | 39 | ## Documentation 40 | - [The Humongous Survival Guide](humongous-survival-guide.md) 41 | - [The nfu Cookbook](cookbook.md) 42 | - [nfu and Hadoop Streaming](hadoop.md) 43 | - [nfu and SQL databases](sql.md) 44 | 45 | ## Contributors 46 | - [Spencer Tipping](https://github.com/spencertipping) 47 | - [Factual, Inc](https://github.com/Factual) 48 | 49 | MIT license as usual. 50 | 51 | ## Options and stuff 52 | If you invoke `nfu` with no arguments, it will give you the following summary: 53 | 54 | ```sh 55 | usage: nfu [prefix-commands...] [input-files...] commands... 
56 | where each command is one of the following: 57 | 58 | -A|--aggregate (1) 59 | --append (1) 60 | -a|--average (0) -- window size (0 for full average) -- running average 61 | -b|--branch (1) 62 | -R|--buffer (1) 63 | -c|--count (0) -- counts by first column value; like uniq -c 64 | -S|--delta (0) -- value -> difference from last value 65 | -D|--drop (0) -- number of records to drop 66 | --duplicate (2) 67 | -e|--each (1) 68 | --entropy (0) -- running entropy of relative probabilities/frequencies 69 | -E|--every (1) 70 | -L|--exp (0) -- optional base (default e) 71 | -f|--fields (0) -- string of digits, each a zero-indexed column selector 72 | -F|--fieldsplit (1) 73 | --fold (1) 74 | -g|--group (0) -- sorts ascending, takes optional column list 75 | -H|--hadoop (3) 76 | --http (1) 77 | -i|--index (2) 78 | -I|--indexouter (2) 79 | -z|--intify (0) -- convert column to dense integers (linear space) 80 | -j|--join (2) 81 | -J|--joinouter (2) 82 | -k|--keep (1) 83 | -l|--log (0) -- optional base (default e) 84 | -m|--map (1) 85 | --mplot (1) 86 | -N|--ntiles (1) 87 | -n|--number (0) -- prepends line number to each line 88 | --octave (1) 89 | -o|--order (0) -- sorts ascending by general numeric value 90 | --partition (2) 91 | --pipe (1) 92 | -p|--plot (1) 93 | -M|--pmap (1) 94 | -P|--poll (2) 95 | --prepend (1) 96 | --preview (0) 97 | -q|--quant (1) 98 | -r|--read (0) -- reads pseudofiles from the data stream 99 | -K|--remove (1) 100 | --repeat (2) 101 | -G|--rgroup (0) -- sorts descending, takes optional column list 102 | -O|--rorder (0) -- sorts descending by general numeric value 103 | --sample (1) 104 | --sd (0) -- running standard deviation 105 | --splot (1) 106 | -Q|--sql (3) 107 | -s|--sum (0) -- value -> total += value 108 | -T|--take (0) -- n to take first n, +n to take last n 109 | --tcp (1) 110 | --tee (1) 111 | -C|--uncount (0) -- the opposite of --count; repeats each row N times 112 | -V|--variance (0) -- running variance 113 | -w|--with (1) 114 | 115 | and prefix commands are: 116 | 117 | documentation (not used with normal commands): 118 | --explain 119 | --expand-pseudofile 120 | --expand-code 121 | --expand-gnuplot 122 | --expand-sql 123 | 124 | pipeline modifiers: 125 | --quote -- quotes args: eval $(nfu --quote ...) 
126 | --use 127 | --run 128 | 129 | argument bracket preprocessing: 130 | 131 | ^stuff -> [ -stuff ] 132 | 133 | [ ] nfu as function: [ -gc ] == "$(nfu --quote -gc)" 134 | @[ ] nfu as data: @[ -gc foo ] == sh:"$(nfu --quote -gc foo)" 135 | q[ ] quote things: q[ foo bar ] == "foo bar" 136 | 137 | pseudofile patterns: 138 | 139 | file.bz2 decompress file with bzip2 -dc 140 | file.gz decompress file with gzip -dc 141 | file.lzo decompress file with lzop -dc 142 | file.xz decompress file with xz -dc 143 | hdfs:path read HDFS file(s) with hadoop fs -text 144 | hdfsjoin:path mapside join pseudofile (a subset of hdfs:path) 145 | http[s]://url retrieve url with curl 146 | id:X verbatim text X 147 | n:number numbers from 1 to n, inclusive 148 | perl:expr perl -e 'print "$_\n" for (expr)' 149 | s3://url access S3 using s3cmd 150 | sh:stuff run sh -c "stuff", take stdout 151 | sql:db:query results of query as TSV 152 | user@host:x remote data access (x can be a pseudofile) 153 | 154 | gnuplot expansions: 155 | 156 | %d -> ' with dots' 157 | %i -> ' with impulses' 158 | %l -> ' with lines' 159 | %p -> ' lc palette ' 160 | %t -> ' title ' 161 | %u -> ' using ' 162 | %v -> ' with vectors ' 163 | 164 | SQL expansions: 165 | 166 | %\* -> ' select * from ' 167 | %c -> ' select count(1) from ' 168 | %d -> ' select distinct * from ' 169 | %g -> ' group by ' 170 | %j -> ' inner join ' 171 | %l -> ' outer left join ' 172 | %r -> ' outer right join ' 173 | %w -> ' where ' 174 | 175 | database prefixes: 176 | 177 | P = PostgreSQL 178 | S = SQLite 3 179 | 180 | environment variables: 181 | 182 | NFU_ALWAYS_VERBOSE if set, nfu will be verbose all the time 183 | NFU_HADOOP_COMMAND hadoop executable; e.g. hadoop jar, hadoop fs -ls 184 | NFU_HADOOP_OPTIONS -D options for hadoop streaming jobs 185 | NFU_HADOOP_STREAMING absolute location of hadoop-streaming.jar 186 | NFU_HADOOP_TMPDIR default /tmp; temp dir for hadoop uploads 187 | NFU_MAX_FILEHANDLES default 64; maximum #subprocesses for --partition 188 | NFU_NO_PAGER if set, nfu will not use "less" to preview stdout 189 | NFU_PMAP_PARALLELISM number of subprocesses for -M 190 | NFU_SORT_BUFFER default 256M; size of in-memory sort for -g and -o 191 | NFU_SORT_COMPRESS default none; compression program for sort tempfiles 192 | NFU_SORT_PARALLEL default 4; number of concurrent sorts to run 193 | 194 | see https://github.com/spencertipping/nfu for documentation 195 | ``` 196 | -------------------------------------------------------------------------------- /cookbook.md: -------------------------------------------------------------------------------- 1 | # The nfu Cookbook 2 | An ongoing collection of real-world tasks I ended up using nfu to solve. I 3 | recommend using `nfu --explain ...` and `nfu --expand-code '...'` on the 4 | examples below if you're new to nfu. 5 | 6 | ## First record within each of N categories 7 | I had a series of JSON records, each of which had a "category" field. I wanted 8 | to get 10000 output records, each in a different category. 
9 | 10 | ```sh 11 | $ nfu records -m 'my $j = jd(%0); row $j.metadata.category, %0' \ 12 | -gA '${%1}[0]' \ 13 | -T10000 14 | ``` 15 | 16 | Initially I also wanted to make sure none of the categories were bogus, so I 17 | previewed with the category keys still in the first column like this: 18 | 19 | ```sh 20 | $ nfu records -m 'my $j = jd(%0); row $j.metadata.category, %0' \ 21 | -gA 'row $_, ${%1}[0]' \ 22 | -T10000 23 | ``` 24 | 25 | ## Comparing field values across different JSON formats 26 | There were two files, one with lines in this format: 27 | 28 | ``` 29 | $ head -n1 file1 30 | {"metadata": {"category": "foo", ...}, "id": 1, "name": "bar", ...} 31 | ``` 32 | 33 | and the other, derived from the first, with lines in this format: 34 | 35 | ``` 36 | $ head -n1 file2 37 | {"category": "foo", "data": [{"id": 1, "record": {"name": "bar", ...}, ...}]} 38 | ``` 39 | 40 | I wanted to count up the number of names that had changed. The second file's 41 | rows weren't in the same order as the first, so I needed to join by ID. 42 | 43 | ```sh 44 | $ nfu file1 -m 'my $j = jd(%0); row $j.id, $j.name' > file1-by-id 45 | $ nfu file2 \ 46 | -m 'my $j = jd(%0); row $j.data->[0].id, $j.data->[0].record.name' \ 47 | -i0 file1-by-id \ 48 | -m 'row %1 ne %2' \ 49 | -sT+1 50 | ``` 51 | 52 | `file1-by-id` contains a TSV of `id, name`, and we end up doing the same thing 53 | to the contents of `file2` before joining (`-i0 file1-by-id`) to get the 54 | combined inner join, `id, file1-name, file2-name`. Perl's `ne` operator returns 55 | zero for equal strings and 1 for unequal strings, so we use `-s` to sum the 1's 56 | up, taking just the last record to get the total. 57 | 58 | ## Compressing field values into a dense integer range 59 | I had a bunch of rows, each of the form `UUID, lat, lng` (with repeated UUIDs), 60 | and I wanted the UUIDs to be integers so I could 3D-plot the coordinates. 61 | 62 | ```sh 63 | $ nfu data -gm 'row $::n{%0} //= $::i++, %1, %2' --splot 64 | ``` 65 | 66 | ## Removing outliers from plotted data 67 | The 3D plot above was scaled wrong due to a few outliers. Ideally I'd just be 68 | looking at stuff between the 5th and 95th percentiles. 69 | 70 | ```sh 71 | $ nfu data --run '($::a, $::b) = (read_lines "sh:nfu data -f1N100")[5, 95]; 72 | ($::c, $::d) = (read_lines "sh:nfu data -f2N100")[5, 95]' \ 73 | -k '%1 > $::a && %1 < $::b && 74 | %2 > $::c && %2 < $::d' \ 75 | > clipped 76 | ``` 77 | 78 | I preferred to leave it all as a single command so I could tweak stuff, so I 79 | used variable substitution to eliminate the duplication that would otherwise 80 | result: 81 | 82 | ```sh 83 | $ nfu --run '($::a, $::b) = (rl "%data -f1N100")[%lo, %hi]; 84 | ($::c, $::d) = (rl "%data -f2N100")[%lo, %hi]' \ 85 | -k '%1 > $::a && %1 < $::b && 86 | %2 > $::c && %2 < $::d' \ 87 | %data \ 88 | -gm 'row $::n{%0} //= $::i++, %1, %2' --splot \ 89 | % data='sh:nfu data' lo=5 hi=95 90 | ``` 91 | -------------------------------------------------------------------------------- /hadoop.md: -------------------------------------------------------------------------------- 1 | # nfu and Hadoop Streaming 2 | nfu provides tight integration with Hadoop Streaming, preserving its usual 3 | pipe-chain semantics while leveraging HDFS for intermediate data storage. 
It 4 | does this by providing the `--hadoop` (`-H`) function: 5 | 6 | - `--hadoop 'outpath' 'mapper' 'reducer'`: emit data to specified output path, 7 | printing the output path name as a pseudofile (this lets you say things like 8 | `nfu $(nfu --hadoop ....) ...`. 9 | - `--hadoop . 'mapper' 'reducer'`: if `outpath` is `.`, then print output data. 10 | - `--hadoop @ 'mapper' 'reducer'`: if `outpath` is `@`, create a temporary path 11 | and print its pseudofile name. 12 | 13 | The "mapper" and "reducer" arguments are arbitrary shell commands that may or 14 | may not involve nfu. "reducer" can be set to `NONE`, `-`, or `_` to run a 15 | map-only job. The mapper, reducer, or both can be `:` as a shorthand for the 16 | identity job. As a result of this and the way nfu handles short options, the 17 | following idioms are supported: 18 | 19 | - `-H@`: hadoop and output filename 20 | - `-H.`: hadoop and cat output 21 | - `-H@::`: reshard data, as in preparation for a mapside join 22 | - `-H.: ^gcf1.`: a distributed version of `sort | uniq` 23 | 24 | Normally you'd write the mapper and reducer either as external commands, or by 25 | using `nfu --quote ...`. However, nfu provides two shorthand notations for 26 | quoted forms: 27 | 28 | - `[ -gc ]` is the same as `"$(nfu --quote -gc)"` (NB: spaces around brackets 29 | are required) 30 | - `^gc` is the same as `[ -gc ]` 31 | 32 | Quoted nfu jobs can involve `--use` clauses, which are turned into `--run` 33 | before hadoop sends the command to the workers. 34 | 35 | Hadoop jobs support some magic to simplify data transfer: 36 | 37 | ```sh 38 | # upload from stdin 39 | $ seq 100 | nfu --hadoop . "$(nfu --quote -m 'row %0, %0 + 1')" _ 40 | $ seq 100 | nfu --hadoop . [ -m 'row %0, %0 + 1' ] _ 41 | 42 | # upload from pseudofiles 43 | $ nfu sh:'seq 100' --hadoop ... 44 | 45 | # use data already on HDFS 46 | $ nfu hdfs:/path/to/data --hadoop ... 47 | ``` 48 | 49 | As a special case, nfu looks for cases where stdin contains lines beginning 50 | with `hdfs:/` and interprets this as a list of HDFS files to process (rather 51 | than being verbatim data). This allows you to chain `--hadoop` jobs without 52 | downloading/uploading all of the intermediate results: 53 | 54 | ```sh 55 | # two hadoop jobs; intermediate results stay on HDFS and are never downloaded 56 | $ seq 100 | nfu -H@ [ -m 'row %0 % 10, %0 + 1' ] ^gc \ 57 | -H. ^C _ 58 | ``` 59 | 60 | nfu detects when it's being run as a hadoop streaming job and changes its 61 | verbose behavior to create hadoop counters. This means you can get the same 62 | kind of throughput statistics by using the `-v` option: 63 | 64 | ```sh 65 | $ seq 10000 | nfu --hadoop . ^vgc _ 66 | ``` 67 | 68 | Because `hdfs:` is a pseudofile prefix, you can also transparently download 69 | HDFS data to process locally: 70 | 71 | ```sh 72 | $ nfu hdfs:/path/to/data -gc 73 | ``` 74 | 75 | ## Mapside joins 76 | You can use nfu's `--index`, `--indexouter`, `--join`, and `--joinouter` 77 | functions to join arbitrary data on HDFS. Because HDFS data is often large and 78 | consistently partitioned, nfu provides a `hdfsjoin:path` pseudofile that 79 | assumes Hadoop default partitioning and expands into a list of partfiles 80 | sufficient to cover all keys that coincide with the current mapper's data. 
81 | Here's an example of how you might use it: 82 | 83 | ```sh 84 | # take 10000 words at random and generate [word, length] in /tmp/nfu-jointest 85 | # NB: you need a reducer here (even though it's a no-op); otherwise Hadoop 86 | # won't partition your mapper outputs. 87 | $ nfu sh:'shuf /usr/share/dict/words' \ 88 | --take 10000 \ 89 | --hadoop /tmp/nfu-jointest [ -vm 'row %0, length %0' ] ^v 90 | 91 | # now inner-join against that data 92 | $ nfu /usr/share/dict/words \ 93 | --hadoop . [ -vi0 hdfsjoin:/tmp/nfu-jointest ] _ 94 | ``` 95 | 96 | ## Examples 97 | ### Word count 98 | ```sh 99 | # local version: 100 | $ nfu hadoop.md -m 'map row($_, 1), split /\s+/, %0' \ 101 | -gA 'row $_, sum @{%1}' 102 | 103 | # process on hadoop, download outputs: 104 | $ nfu hadoop.md -H. [ -m 'map row($_, 1), split /\s+/, %0' ] \ 105 | [ -A 'row $_, sum @{%1}' ] 106 | 107 | # leave on HDFS, download separately using hdfs: pseudofile 108 | $ nfu hadoop.md -H /tmp/nfu-wordcount-outputs \ 109 | [ -m 'map row($_, 1), split /\s+/, %0' ] \ 110 | [ -A 'row $_, sum @{%1}' ] 111 | $ nfu hdfs:/tmp/nfu-wordcount-outputs -g 112 | ``` 113 | -------------------------------------------------------------------------------- /humongous-survival-guide.md: -------------------------------------------------------------------------------- 1 | # The Humongous nfu Survival Guide 2 | ## Introduction 3 | nfu is all about tab-delimited text data. It does a number of things to make 4 | this data easier to work with; for example: 5 | 6 | ```sh 7 | $ git clone git://github.com/spencertipping/nfu 8 | $ cd nfu 9 | $ ./nfu README.md # behaves like 'less' 10 | $ gzip README.md 11 | $ ./nfu README.md.gz # transparent decompression (+ xz, bz2, lzo) 12 | ``` 13 | 14 | Now let's do some basic word counting. We can get a word list by using nfu's 15 | `-m` operator, which takes a snippet of Perl code and executes it once for each 16 | line. Then we sort (`-g`, or `--group`), count-distinct (`-c`), and 17 | reverse-numeric-sort (`-O`, or `--rorder`) to get a histogram descending by 18 | frequency: 19 | 20 | ```sh 21 | $ nfu README.md -m 'split /\W+/, %0' -gcO 22 | 48 23 | 28 nfu 24 | 20 seq 25 | 19 100 26 | ... 27 | $ 28 | ``` 29 | 30 | `%0` is shorthand for `$_[0]`, which is how you access the first element of 31 | Perl's function-arguments (`@_`) variable. Any Perl code you give to nfu will 32 | be run inside a subroutine, and the arguments are usually tab-separated field 33 | values. 34 | 35 | Commands you issue to nfu are chained together using shell pipes. This means 36 | that the following are equivalent: 37 | 38 | ```sh 39 | $ nfu README.md -m 'split /\W+/, %0' -gcO 40 | $ nfu README.md | nfu -m 'split /\W+/, %0' \ 41 | | nfu -g \ 42 | | nfu -c \ 43 | | nfu -O 44 | ``` 45 | 46 | nfu uses a number of shorthands whose semantics may become confusing. To see 47 | what's going on, you can use its documentation options: 48 | 49 | ```sh 50 | $ nfu --expand-code 'split /\W+/, %0' 51 | split /\W+/, $_[0] 52 | $ nfu --explain README.md -m 'split /\W+/, %0' -gcO 53 | file README.md 54 | --map 'split /\W+/, %0' 55 | --group 56 | --count 57 | --rorder 58 | --preview 59 | $ 60 | ``` 61 | 62 | You can also run nfu with no arguments to see a usage summary. 
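If the `%N` shorthand is new, a quick sanity check is to hand nfu a couple of
fabricated tab-separated rows (a minimal sketch; `printf` just manufactures two
TSV lines, and `row` joins its arguments back together with tabs):

```sh
$ printf 'a\t1\nb\t2\n' | nfu -m 'row %1, %0'    # swap the two columns
```

Running `nfu --expand-code 'row %1, %0'` confirms that the shorthands become
`$_[1]` and `$_[0]` before the snippet is compiled.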
63 | 64 | ## Basic idioms 65 | ### Extracting data 66 | - `-m 'split /\W+/, %0'`: convert text file to one word per line 67 | - `-m 'map {split /\W+/} @_'`: same thing for text files with tabs 68 | - `-F '\W+'`: convert file to one word per column, preserving lines 69 | - `-m '@_'`: reshape to a single column, flattening into rows 70 | - `seq 10 | tr '\n' '\t'`: reshape to a single row, flattening into columns 71 | 72 | The `-F` operator resplits lines by the regexp you provide. So to parse 73 | /etc/passwd, for example, you'd say `nfu -F : /etc/passwd ...`. 74 | 75 | ### Generating data 76 | - `-P 5 'cat /proc/loadavg'`: run 'cat /proc/loadavg' every five seconds, 77 | collecting stdout 78 | - `--repeat 10 README.md`: read README.md 10 times in a row (this is more 79 | useful than it looks; see "Pipelines, Combination, and Quotation" below) 80 | 81 | ### Basic transformations 82 | - `-n`: prepend line numbers as first column 83 | - `-m 'row @_, %0 * 2'`: keep all existing columns, appending `%0 * 2` as a new 84 | one 85 | - `-m '%1 =~ s/foo/bar/g; row @_'`: transform second column by replacing 'foo' 86 | with 'bar' 87 | - `-m 'row %0, %1 =~ s/foo/bar/gr, @_[2..$#_]'`: same thing, but without 88 | in-place modification of `%1` 89 | 90 | `-M` is a variant of `-m` that runs a pool of parallel subprocesses (by default 91 | 16). This doesn't preserve row ordering, but can be useful if you're doing 92 | something latency-bound like fetching web documents: 93 | 94 | ```sh 95 | $ nfu url-list -M 'row %0, qx(curl %0)' 96 | ``` 97 | 98 | In this example, Perl's `qx()` operator could easily produce a string 99 | containing newlines; in fact most shell commands are written this way. Because 100 | of this, nfu's `row()` function strips the newlines from each of its input 101 | strings. This guarantees that `row()` will produce exactly one line of output. 
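If sixteen workers is the wrong number for whatever you're fetching, the pool
size comes from an environment variable rather than an option. A hedged sketch,
reusing the `url-list` file from the example above:

```sh
# NFU_PMAP_PARALLELISM controls the number of -M subprocesses
# (it's listed under "environment variables" in the usage summary)
$ NFU_PMAP_PARALLELISM=4 nfu url-list -M 'row %0, length qx(curl -s %0)'
```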
102 | 103 | ### Filtering 104 | - `-k '%2 eq "nfu"'`: keep any row whose third column is the text "nfu" 105 | - `-k '%0 < 10'`: keep any row whose first column parses to a number < 10 106 | - `-k '@_ < 5'`: keep any row with fewer than five columns 107 | - `-K '@_ < 5'`: reject any row with fewer than five columns (`-K` vs `-k`) 108 | - `-k 'length %0 < 10'` 109 | - `-k '%0 eq -+-%0'`: keep every row whose first column is numeric 110 | 111 | ### Row slicing 112 | - `-T5`: take the first 5 lines 113 | - `-T+5`: take the last 5 lines (drop all others) 114 | - `-D5`: drop the first 5 lines 115 | - `--sample 0.01`: take 1% of rows randomly 116 | - `-E100`: take every 100th row deterministically 117 | 118 | ### Column slicing 119 | - `-f012`: keep the first three columns (fields) in their original order 120 | - `-f10`: swap the first two columns, drop the others 121 | - `-f00.`: duplicate the first column, pushing others to the right 122 | - `-f10.`: swap the first two columns, keep the others in their original order 123 | - `-m 'row(reverse @_)'`: reverse the fields within each row (`row()` is a 124 | function that keeps an array on one row; otherwise you'd flatten the columns 125 | across multiple rows) 126 | - `-m 'row(grep /^-/, @_)'`: keep fields beginning with `-` 127 | 128 | ### Histograms (group, count) 129 | - `-gcO`: descending histogram of most frequent values 130 | - `-gcOl`: descending histogram of most frequent values, log-scaled 131 | - `-gcOs`: cumulative histogram, largest values first 132 | - `-gcf1.`: list of unique values (group, count, fields 1..n) 133 | 134 | Sorting and counting operators support field selection: 135 | 136 | - `-g1`: sort by second column 137 | - `-c0`: count unique values of field 0 138 | - `-c01`: count unique combinations of fields 0 and 1 jointly 139 | 140 | ### Common numeric operations 141 | - `-q0.05`: round (quantize) each number to the nearest 0.05 142 | - `-q10`: quantize each number to the nearest 10 143 | - `-s`: running sum 144 | - `-S`: delta (inverse of `-s`) 145 | - `-l`: log-transform each number, base e 146 | - `-L`: inverse log-transform (exponentiate) each number 147 | - `-a`: running average 148 | - `-V`: running variance 149 | - `--sd`: running sample standard deviation 150 | 151 | Each of these operations can be applied to a specified set of columns. For 152 | example: 153 | 154 | - `seq 10 | nfu -f00s1`: first column is 1..10, second is running sum of first 155 | - `seq 10 | nfu -f00a1`: first column is 1..10, second is running mean of first 156 | 157 | Some of these commands take an optional argument; for example, you can get a 158 | windowed average if you specify a second argument to `-a`: 159 | 160 | - `seq 10 | nfu -f00a1,5`: second column is a 5-value sliding average 161 | - `seq 10 | nfu -f00q1,5`: second column quantized to 5 162 | - `seq 10 | nfu -f00l1,5`: second column log base-5 163 | - `seq 10 | nfu -f00L1,5`: second column 5x 164 | 165 | Multiple-digit fields are interpreted as multiple single-digit fields: 166 | 167 | - `seq 10 | nfu -f00a01,5`: calculate 5-average of fields 0 and 1 independently 168 | 169 | The only ambiguous case happens when you specify only one argument: should it 170 | be interpreted as a column selector, or as a numeric parameter? nfu resolves 171 | this by using it as a parameter if the function requires an argument (e.g. 172 | `-q`), otherwise treating it as a column selector. 173 | 174 | ### Plotting 175 | Note: all plotting requires that `gnuplot` be in your `$PATH`. 
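A quick way to check that before starting a long pipeline (plain shell, nothing
nfu-specific):

```sh
$ command -v gnuplot >/dev/null || echo 'gnuplot not found in $PATH'
```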
176 | 177 | - `seq 100 | nfu -p`: 2D plot; input values are Y coordinates 178 | - `seq 100 | nfu -m 'row @_, %0 * %0' -p`: 2D plot; first column is X, second 179 | is Y 180 | - `seq 100 | nfu -p %l`: plot with lines 181 | - `seq 100 | nfu -m 'row %0, sin(%0), cos(%0)' --splot`: 3D plot 182 | 183 | ```sh 184 | $ seq 1000 | nfu -m '%0 * 0.1' \ 185 | -m 'row %0, sin(%0), cos(%0)' \ 186 | --splot %l 187 | ``` 188 | 189 | You can use `nfu --expand-gnuplot '%l'`, for example, to see how nfu is 190 | transforming your gnuplot options. (There's also a list of these shorthands in 191 | nfu's usage documentation.) 192 | 193 | ### Progress reporting 194 | If you're doing something with a large amount of data, it's sometimes hard to 195 | know whether it's worth hitting `^C` and optimizing stuff. To help with this, 196 | nfu has a `--verbose` (`-v`) option that activates throughput metrics for each 197 | operation in the pipeline. For example: 198 | 199 | ```sh 200 | $ seq 100000000 | nfu -o # this might take a while 201 | $ seq 100000000 | nfu -v -o # keep track of lines and kb 202 | ``` 203 | 204 | ## Advanced usage (assumes some Perl knowledge) 205 | ### JSON 206 | nfu provides two functions, `jd` (or `json_decode`) and `je`/`json_encode`, 207 | that are available within any code you write: 208 | 209 | ```sh 210 | $ ip_addrs=$(seq 10 | tr '\n' '\r' | nfu -m 'join ",", map "%0.4.4.4", @_') 211 | $ query_url="www.datasciencetoolkit.org/ip2coordinates/$ip_addrs" 212 | $ curl "$query_url" \ 213 | | nfu -m 'my $json = jd(%0); 214 | map row($_, ${$json}{$_}.locality), keys %$json' 215 | ``` 216 | 217 | This code uses another shorthand, `.locality`, which expands to a Perl hash 218 | dereference `->{"locality"}`. There isn't a similar shorthand for arrays, which 219 | means you need to explicitly dereference those: 220 | 221 | ```sh 222 | $ echo '[1,2,3]' | nfu -m 'jd(%0)[0]' # won't work! 223 | $ echo '[1,2,3]' | nfu -m '${jd(%0)}[0]' 224 | ``` 225 | 226 | ### Multi-plotting 227 | You can setup a multiplot by creating multiple columns of data. gnuplot then 228 | lets you refer to these with its `using N` construct, which nfu lets you write 229 | as `%uN`: 230 | 231 | ```sh 232 | $ seq 1000 | nfu -m '%0 * 0.01' | gzip > numbers.gz 233 | $ nfu numbers.gz -m 'row sin(%0), cos(%0)' \ 234 | --mplot '%u1%l%t"sin(x)"; %u2%l%t"cos(x)"' 235 | $ nfu numbers.gz -m 'sin %0' \ 236 | -f00a1 \ 237 | --mplot '%u1%l%t"sin(x)"; %u2%l%t"average(sin(x))"' 238 | $ nfu numbers.gz -m 'sin %0' \ 239 | -f00a1 \ 240 | -m 'row @_, %1-%0' \ 241 | --mplot '%u1%l%t"sin(x)"; 242 | %u2%l%t"average(sin(x))"; 243 | %u3%l%t"difference"' 244 | ``` 245 | 246 | The semicolon notation is something nfu requires. It works this way because 247 | internally nfu scripts gnuplot like this: 248 | 249 | ``` 250 | plot "tempfile-name" using 1 with lines title "sin(x)" 251 | plot "tempfile-name" using 2 with lines title "average(sin(x))" 252 | plot "tempfile-name" using 3 with lines title "difference" 253 | ``` 254 | 255 | ### Local map-reduce 256 | nfu provides an aggregation operator for sorted data. This groups adjacent rows 257 | by their first column and hands you a series of array references, one for each 258 | column's values within that group. For example, here's word-frequency again, 259 | this time using `-A`: 260 | 261 | ```sh 262 | $ nfu README.md -m 'split /\W+/, %0' \ 263 | -m 'row %0, 1' \ 264 | -gA 'row $_, sum @{%1}' 265 | ``` 266 | 267 | A couple of things are happening here. 
First, the current group key is stored 268 | in `$_`; this allows you to avoid the more cumbersome (but equivalent) 269 | `${%0}[0]`. Second, `%1` is now an array reference containing the second field 270 | of all grouped rows. `sum` is provided by nfu and does what you'd expect. 271 | 272 | In addition to map/reduce functions, nfu also gives you `--partition`, which 273 | you can use to send groups of records to different files. For example: 274 | 275 | ```sh 276 | $ nfu README.md -m 'split /\W+/, %0' \ 277 | --partition 'substr(%0, 0, 1)' \ 278 | 'cat > words-starting-with-{}' 279 | ``` 280 | 281 | `--partition` will keep up to 256 subprocesses running; if you have more groups 282 | than that, it will close and reopen pipes as necessary, which will cause your 283 | subprocesses to be restarted. (For this reason, `cat > ...` isn't a great 284 | subprocess; `cat >> ...` is better.) 285 | 286 | ### Loading Perl code 287 | nfu provides a few utility functions: 288 | 289 | - `sum @array` 290 | - `mean @array` 291 | - `uniq @array` 292 | - `frequencies @array` 293 | - `read_file "filename"`: returns a string 294 | - `read_lines "filename"`: returns an array of chomped strings 295 | 296 | But sometimes you'll need more definitions to write application-specific code. 297 | For this nfu gives you two options, `--use` and `--run`: 298 | 299 | ```sh 300 | $ nfu --use myfile.pl ... 301 | $ nfu --run 'sub foo {...}' ... 302 | ``` 303 | 304 | Any definitions will be available inside `-m`, `-A`, and other code-evaluating 305 | operators. 306 | 307 | A common case where you'd use `--run` is to precompute some kind of data 308 | structure before using it within a row function. For example, to count up all 309 | words that never appear at the beginning of a line: 310 | 311 | ```sh 312 | $ nfu README.md -F '\s+' -f0 > first-words 313 | $ nfu --run '$::seen{$_} = 1 for read_lines "first-words"' \ 314 | -m 'split /\W+/, %0' \ 315 | -K '$::seen{%0}' 316 | ``` 317 | 318 | Notice that we're package-scoping `%::seen`. This is required because while row 319 | functions reside in the same package as `--run` and `--use` code, they're in a 320 | different lexical scope. This means that any `my` or `our` variables are 321 | invisible and will trigger compile-time errors if you try to refer to them from 322 | other compiled code. 323 | 324 | ### Pseudofiles 325 | Gzipped data is uncompressed automatically by an abstraction that nfu calls a 326 | pseudofile. In addition to uncompressing things, several other pseudofile forms 327 | are recognized: 328 | 329 | ```sh 330 | $ nfu http://factual.com # uses stdout from curl 331 | $ nfu sh:ls # uses stdout from a command 332 | $ nfu [user@]host:other-file # pipe file over ssh -C 333 | $ nfu hdfs:/path/to/data # uses hadoop fs -text 334 | $ nfu psql:'query' # uses psql -c and exports as TSV 335 | ``` 336 | 337 | nfu supports pseudofiles everywhere it expects a filename, including in 338 | `read_file` and `read_lines`. 339 | 340 | ### Pipelines, combination, and quotation 341 | nfu gives you several commands that let you gather data from other sources. 
For 342 | example: 343 | 344 | ```sh 345 | $ nfu README.md -m 'split /\W+/, %0' --prepend README.md 346 | $ nfu README.md -m 'split /\W+/, %0' --append README.md 347 | $ nfu README.md --with sh:'tac README.md' 348 | $ nfu --repeat 10 README.md 349 | $ nfu README.md --pipe tac 350 | $ nfu README.md --tee 'cat > README2.md' 351 | $ nfu README.md --duplicate 'cat > README2.md' 'tac > README-reverse.md' 352 | ``` 353 | 354 | Here's what these things do: 355 | 356 | - `--prepend`: prepends a pseudofile's contents to the current data 357 | - `--append`: appends a pseudofile 358 | - `--with`: joins a pseudofile column-wise, ending when either side runs out of 359 | rows 360 | - `--repeat`: repeats a pseudofile the specified number of times, forever if n 361 | = 0; ignores any prior data 362 | - `--pipe`: same thing as a shell pipe, but doesn't lose nfu state 363 | - `--tee`: duplicates data to a shell process, collecting its _stdout into your 364 | data stream_ (you can avoid this by using `> /dev/null`) 365 | - `--duplicate`: sends your data to two shell processes, combining their 366 | stdouts 367 | 368 | Sometimes you'll want to use nfu itself as a shell command, but this can become 369 | difficult due to nested quotation. To get around this, nfu provides the 370 | `--quote` operator, which generates a properly quoted command line: 371 | 372 | ```sh 373 | $ nfu --repeat 10 sh:"$(nfu --quote README.md -m 'split /\W+/, %0')" 374 | ``` 375 | 376 | This is clunky, so nfu goes a step further and provides bracket syntax: 377 | 378 | ```sh 379 | $ nfu --repeat 10 nfu[ README.md -m 'split /\W+/, %0' ] 380 | ``` 381 | 382 | ### Keyed joins 383 | This works on sorted data, and behaves like SQL's JOIN construct. Under the 384 | hood, nfu takes care of the sorting and the voodoo associated with getting 385 | `sort` and `join` to work together, so you can write something simple like 386 | this: 387 | 388 | ```sh 389 | $ nfu /usr/share/dict/words -m 'row %0, length %0' > bytes-per-word 390 | $ nfu README.md -m 'split /\W+/, %0' \ 391 | -I0 bytes-per-word \ 392 | -m 'row %0, %1 // 0' \ 393 | -gA 'row $_, sum @{%1}' 394 | ``` 395 | 396 | Here's what's going on: 397 | 398 | - `-I0 bytes-per-word`: outer left join using field 0 from the data, adjoining 399 | all columns after the key field from the pseudofile 'bytes-per-word' 400 | - `-m 'row %0, %1 // 0'`: when we didn't get any join data, default to 0 (`//` 401 | is Perl's defined-or-else operator) 402 | - `-gA 'row $_, sum @{%1}'`: reduce by word, summing total bytes 403 | 404 | We could sidestep all nonexistent words by using `-i0` for an inner join 405 | instead. This drops all rows with no corresponding entry in the lookup table. 
406 | -------------------------------------------------------------------------------- /nfu: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # nfu: Command-line numeric fu | Spencer Tipping 3 | # Licensed under the terms of the MIT source code license 4 | 5 | use v5.14; 6 | use strict; 7 | use warnings; 8 | use utf8; 9 | 10 | use Fcntl; 11 | use Socket; 12 | use Time::HiRes qw/time/; 13 | use POSIX qw/dup2 mkfifo setsid :sys_wait_h/; 14 | use File::Temp qw/tmpnam/; 15 | use Math::Trig; 16 | 17 | use constant VERBOSE_INTERVAL => 50; 18 | use constant HADOOP_VERBOSE_INTERVAL => 10000; 19 | use constant DELAY_BEFORE_VERBOSE => 500; 20 | 21 | use constant SQL_INFER_PEEK_LINES => 20; 22 | 23 | use constant LOG_2 => log(2); 24 | 25 | # 64-bit hex constants in geohash encoder won't work on 32-bit architectures 26 | no warnings 'portable'; 27 | 28 | our $diamond_has_data = 1; 29 | 30 | ++$|; 31 | 32 | # Setup child capture. All we need to do is wait for child pids; there's no 33 | # formal teardown. 34 | $SIG{CHLD} = sub { 35 | local ($!, $?); 36 | 1 while waitpid(-1, WNOHANG) > 0; 37 | }; 38 | 39 | # NB: This import is not used in nfu directly; it's here so you can use these 40 | # functions inside aggregators. 41 | use List::Util qw(first max maxstr min minstr reduce shuffle sum); 42 | 43 | sub prod { 44 | my $p = 1; 45 | $p *= $_ for @_; 46 | $p; 47 | } 48 | 49 | # Same for this, which is especially useful from aggregators because multiple 50 | # values create multiple output rows, not multiple columns on the same output 51 | # row. 52 | sub row {join "\t", map s/\n//gr, @_} 53 | 54 | # Order-preserving unique values for strings. This is just too useful not to 55 | # provide. 56 | sub uniq { 57 | local $_; 58 | my %seen; 59 | my @order; 60 | $seen{$_}++ or push @order, $_ for @_; 61 | @order; 62 | } 63 | 64 | sub frequencies { 65 | local $_; 66 | my %freqs; 67 | ++$freqs{$_} for @_; 68 | %freqs; 69 | } 70 | 71 | sub reductions(&$@) { 72 | my ($f, $x, @xs) = @_; 73 | my @ys; 74 | push @ys, $x = $f->($x, $_) for @xs; 75 | @ys; 76 | } 77 | 78 | sub cp { 79 | # Cartesian product of N arrays, each passed in as a ref 80 | return () if @_ == 0; 81 | return map [$_], @{$_[0]} if @_ == 1; 82 | 83 | my @ns = map scalar(@$_), @_; 84 | my @shifts = reverse reductions {$_[0] * $_[1]} 1 / $ns[0], reverse @ns; 85 | map { 86 | my $i = $_; 87 | [map $_[$_][int($i / $shifts[$_]) % $ns[$_]], 0..$#_]; 88 | } 0..prod(@ns) - 1; 89 | } 90 | 91 | sub round_to { 92 | my ($x, $quantum) = @_; 93 | $quantum ||= 1; 94 | my $sign = $x < 0 ? -1 : 1; 95 | int(abs($x) / $quantum + 0.5) * $quantum * $sign; 96 | } 97 | 98 | sub mean {scalar @_ && sum(@_) / @_} 99 | sub log2 {log($_[0]) / LOG_2} 100 | 101 | sub entropy { 102 | local $_; 103 | my $s = sum(@_) || 1; 104 | my $t = 0; 105 | $t -= ($_ / $s) * log($_ / $s) for @_; 106 | $t / LOG_2; 107 | } 108 | 109 | sub dot { 110 | local $_; 111 | my ($u, $v) = @_; 112 | my $s = 0; 113 | $s += $$u[$_] * $$v[$_] for 0 .. min($#{$u}, $#{$v}); 114 | $s; 115 | } 116 | 117 | sub dist { 118 | # Euclidean distance (you specify the deltas) 119 | local $_; 120 | my $s = 0; 121 | $s += $_*$_ for @_; 122 | sqrt($s); 123 | } 124 | 125 | sub sdist { 126 | # Spherical-coordinate great circle distance; you specify theta1, phi1, 127 | # theta2, phi2, each in degrees; radius is assumed to be 1. 
Math from 128 | # http://stackoverflow.com/questions/27928/how-do-i-calculate-distance-between-two-latitude-longitude-points 129 | local $_; 130 | my ($t1, $p1, $t2, $p2) = map $_ / 180 * pi, @_; 131 | my $dt = $t2 - $t1; 132 | my $dp = $p2 - $p1; 133 | my $a = sin($dp / 2) * sin($dp / 2) 134 | + cos($p1) * cos($p2) * sin($dt / 2) * sin($dt / 2); 135 | 2 * atan2(sqrt($a), sqrt(1 - $a)); 136 | } 137 | 138 | sub edist { 139 | # Earth distance between two latitudes/longitudes, in km; up to 0.5% error 140 | my ($lat1, $lng1, $lat2, $lng2) = @_; 141 | 6371 * sdist $lng1, $lat1, $lng2, $lat2; 142 | } 143 | 144 | sub line_opposite { 145 | # Returns true if two points are on opposite sides of the line starting at 146 | # (x0, y0) and whose direction is (dx, dy). 147 | my ($x0, $y0, $dx, $dy, $x1, $y1, $x2, $y2) = @_; 148 | return (($x1 - $x0) * $dy - ($y1 - $y0) * $dx) 149 | * (($x2 - $x0) * $dy - ($y2 - $y0) * $dx) < 0; 150 | } 151 | 152 | sub evens(@) {local $_; @_[map $_ * 2, 0 .. $#_ >> 1]} 153 | sub odds(@) {local $_; @_[map $_ * 2 + 1, 0 .. $#_ >> 1]} 154 | 155 | sub parse_wkt { 156 | my @rings = map [/([-0-9.]+)\s+([-0-9.]+)/g], split /\)\s*,\s*\(/, $_[0]; 157 | { 158 | rings => [map [map $_ + 0, @$_], @rings], 159 | ylimit => 1 + max(map max(@$_), @rings), 160 | bounds => [min(map evens(@$_), @rings), max(map evens(@$_), @rings), 161 | min(map odds(@$_), @rings), max(map odds(@$_), @rings)], 162 | } 163 | } 164 | 165 | sub in_poly { 166 | # Returns true if a point resides in the given parsed polygon. 167 | my ($x, $y, $parsed) = @_; 168 | my $ylimit = $parsed->{ylimit}; 169 | my @bounds = @{$parsed->{bounds}}; 170 | return 0 if $x < $bounds[0] || $x > $bounds[1] 171 | || $y < $bounds[2] || $y > $bounds[3]; 172 | 173 | my $hits = 0; 174 | for my $r (@{$parsed->{rings}}) { 175 | my ($lx, $ly) = @$r[0, 1]; 176 | for (my $i = 2; $i < @$r; $i += 2) { 177 | my $cx = $$r[$i]; 178 | my $cy = $$r[$i + 1]; 179 | ++$hits if $lx <= $x && $x < $cx || $lx >= $x && $x > $cx 180 | and line_opposite $lx, $ly, $cx - $lx, $cy - $ly, 181 | $x, $y, $x, $ylimit; 182 | $lx = $cx; 183 | $ly = $cy; 184 | } 185 | } 186 | $hits & 1; 187 | } 188 | 189 | sub rect_polar { (dist(@_), atan2($_[0], $_[1]) / pi * 180) } 190 | sub polar_rect { ($_[0] * sin($_[1] / 180 * pi), 191 | $_[0] * cos($_[1] / 180 * pi)) } 192 | 193 | sub degrees_radians { $_[0] / 180 * pi } 194 | sub radians_degrees { $_[0] / 180 * pi } 195 | 196 | # JSON support (if available) 197 | our $json; 198 | if (eval {require JSON}) { 199 | JSON->import; 200 | no warnings qw(uninitialized); 201 | $json = JSON->new->allow_nonref->utf8(1); 202 | } elsif (eval {require JSON::PP}) { 203 | JSON::PP->import; 204 | no warnings qw(uninitialized); 205 | $json = JSON::PP->new->allow_nonref->utf8(1); 206 | } else { 207 | print STDERR "note: no JSON support detected (try 'cpan install JSON')\n"; 208 | print STDERR "nfu will soon have its own JSON parser rather than using "; 209 | print STDERR "a native library for this. 
Sorry for the inconvenience."; 210 | } 211 | 212 | # These are callable from evaled code 213 | sub expand_filename_shorthands; 214 | sub read_file { 215 | open my $fh, expand_filename_shorthands $_[0], 1; 216 | my $result = join '', <$fh>; 217 | close $fh; 218 | $result; 219 | } 220 | 221 | sub write_file { 222 | open my $fh, '>', $_[0]; 223 | $fh->print($_[1]); 224 | close $fh; 225 | $_[0]; 226 | } 227 | 228 | sub read_lines { 229 | local $_; 230 | open my $fh, expand_filename_shorthands $_[0], 1; 231 | my @result; 232 | chomp, push @result, $_ for <$fh>; 233 | close $fh; 234 | @result; 235 | } 236 | 237 | sub write_lines { 238 | local $_; 239 | my $filename = shift @_; 240 | open my $fh, '>', $filename; 241 | $fh->print($_, "\n") for @_; 242 | close $fh; 243 | $filename; 244 | } 245 | 246 | sub open_file { 247 | open my $fh, expand_filename_shorthands $_[0], 1; 248 | $fh; 249 | } 250 | 251 | sub json_encode {$json->encode(@_)} 252 | sub json_decode {$json->decode(@_)} 253 | 254 | sub hadoop_counter { 255 | printf STDERR "reporter:counter:nfu,%s_%s,%d\n", $_[0] =~ y/,/_/r, 256 | $_[1] =~ y/,/_/r, 257 | $_[2] // 1; 258 | } 259 | 260 | sub F {my $l = shift @_; (split /\t/, $l)[@_]} 261 | 262 | our @gh_alphabet = split //, '0123456789bcdefghjkmnpqrstuvwxyz'; 263 | our %gh_decode = map(($gh_alphabet[$_], $_), 0..$#gh_alphabet); 264 | 265 | sub gap_bits { 266 | my ($x) = @_; 267 | $x |= $x << 16; $x &= 0x0000ffff0000ffff; 268 | $x |= $x << 8; $x &= 0x00ff00ff00ff00ff; 269 | $x |= $x << 4; $x &= 0x0f0f0f0f0f0f0f0f; 270 | $x |= $x << 2; $x &= 0x3333333333333333; 271 | return ($x | $x << 1) & 0x5555555555555555; 272 | } 273 | 274 | sub ungap_bits { 275 | my ($x) = @_; $x &= 0x5555555555555555; 276 | $x ^= $x >> 1; $x &= 0x3333333333333333; 277 | $x ^= $x >> 2; $x &= 0x0f0f0f0f0f0f0f0f; 278 | $x ^= $x >> 4; $x &= 0x00ff00ff00ff00ff; 279 | $x ^= $x >> 8; $x &= 0x0000ffff0000ffff; 280 | return ($x ^ $x >> 16) & 0x00000000ffffffff; 281 | } 282 | 283 | sub geohash_encode { 284 | local $_; 285 | my ($lat, $lng, $precision) = @_; 286 | $precision //= 12; 287 | my $bits = $precision > 0 ? $precision * 5 : -$precision; 288 | my $gh = (gap_bits(int(($lat + 90) / 180 * 0x40000000)) | 289 | gap_bits(int(($lng + 180) / 360 * 0x40000000)) << 1) 290 | >> 60 - $bits; 291 | 292 | $precision > 0 ? join '', reverse map $gh_alphabet[$gh >> $_ * 5 & 31], 293 | 0 .. 
$precision - 1 294 | : $gh; 295 | } 296 | 297 | sub geohash_decode { 298 | local $_; 299 | my ($gh, $bits) = @_; 300 | unless (defined $bits) { 301 | # Decode gh from base-32 302 | $bits = length($gh) * 5; 303 | my $n = 0; 304 | $n = $n << 5 | $gh_decode{lc $_} for split //, $gh; 305 | $gh = $n; 306 | } 307 | $gh <<= 60 - $bits; 308 | return (ungap_bits($gh) / 0x40000000 * 180 - 90, 309 | ungap_bits($gh >> 1) / 0x40000000 * 360 - 180); 310 | } 311 | 312 | # HTTP 313 | sub http_send { sprintf "HTTP/1.0 %s\nContent-Length: %d\n\n%s\n", 314 | $_[0], 315 | length($_[1]), 316 | $_[1] } 317 | 318 | sub httpok { sprintf "HTTP/1.0 200 OK\nContent-Length: %d\n\n%s\n", 319 | length($_[0]), 320 | $_[0] } 321 | 322 | sub chunk { sprintf "%x\r\n%s\r\n", length($_[0]), $_[0] } 323 | sub endchunk { chunk '' } 324 | 325 | # Function shorthands 326 | BEGIN { 327 | *dr = \°rees_radians; 328 | *rd = \&radians_degrees; 329 | *je = \&json_encode; 330 | *jd = \&json_decode; 331 | 332 | *hc = \&hadoop_counter; 333 | 334 | *rf = \&read_file; 335 | *rl = \&read_lines; 336 | *wf = \&write_file; 337 | *wl = \&write_lines; 338 | 339 | *of = \&open_file; 340 | 341 | *ghe = \&geohash_encode; 342 | *ghd = \&geohash_decode; 343 | } 344 | 345 | # File functions 346 | sub random128 {join '', map sprintf('%04x', rand 65536), 0 .. 8} 347 | sub hadoop_ls; 348 | sub hadoop_matched_partfiles; 349 | sub shell_quote; 350 | sub expand_sql_shorthands; 351 | sub expand_sqlite_db; 352 | sub expand_postgres_db; 353 | 354 | sub expand_filename_shorthands { 355 | # NB: we prepend a shell comment containing the original $f so anyone 356 | # downstream can get it back. This is currently used by Hadoop to reuse data 357 | # stored on HDFS if you write it on the command-line. (Otherwise nfu would 358 | # hadoop fs -text to download it, then re-upload to a tempfile.) 359 | 360 | my ($f, $always_make_a_command) = @_; 361 | my $result; 362 | my $original = ($f =~ s/^/#/mgr) . "\n"; 363 | 364 | no warnings 'newline'; 365 | 366 | if (-e $f || $f =~ s/^file://) { 367 | # It's really a filename, so push it onto @ARGV. If it's compressed, run it 368 | # through the appropriate decompressor first. 369 | my $piped = $f =~ s/^(.*\.gz)/cat '$1' | gzip -d/ri 370 | =~ s/^(.*\.bz2)/cat '$1' | bzip2 -d/ri 371 | =~ s/^(.*\.xz)/cat '$1' | xz -d/ri 372 | =~ s/^(.*\.lzo)/cat '$1' | lzop -d/ri; 373 | $result = $piped =~ /\|/ ? "$original$piped |" : $piped; 374 | } elsif ($f =~ /^https?:\/\//) { 375 | # Assume a URL and curl it 376 | $f = shell_quote $f; 377 | $result = "${original}curl $f |"; 378 | } elsif ($f =~ s/^sh://) { 379 | # Execute a command and capture stdout 380 | $result = "$original$f |"; 381 | } elsif ($f =~ /^s3:/) { 382 | # Use s3cmd and cat to stdout 383 | $result = "${original}s3cmd get '$f' - |"; 384 | } elsif ($f =~ s/^hdfs-ls:(?:\/\/)?//) { 385 | # Just list a directory. This is used by hdfs: below to provide 386 | # indirection. 387 | $result = "${original}$0 n:1 -m '" 388 | . q{map "hdfs:" . shell_quote($_), hadoop_ls %[q[} . $f . q{]%]} 389 | . "' |"; 390 | } elsif ($f =~ s/^hdfs:(?:\/\/)?//) { 391 | # Use Hadoop commands to read stuff. The command itself needs to be 392 | # deferred because the hdfs path might not exist when the pseudofile is 393 | # resolved. Indirected through $0 hdfs-ls:X because some HDFS directories 394 | # contain enough files that we'll get arg-list-too-long errors if we invoke 395 | # hadoop fs -text directly. This workaround lets us stream the output from 396 | # hadoop fs -ls. 
397 | # 398 | # Redirecting errors to /dev/null because hadoop fs -text complains loudly 399 | # if you close its output stream, and this is really annoying. 400 | $f = shell_quote "hdfs-ls:$f"; 401 | $result = "${original}$0 $f | xargs hadoop fs -text 2>/dev/null |"; 402 | } elsif ($f =~ s/^hdfsjoin://) { 403 | $result = "${original}$0 " 404 | . shell_quote(map "hdfs:$_", 405 | hadoop_matched_partfiles $f, $ENV{map_input_file}) 406 | . " |"; 407 | } elsif ($f =~ s/^perl://) { 408 | # Evaluate a Perl expression 409 | $f =~ s/'/'"'"'/g; 410 | $result = "${original}perl -e 'print \$_, \"\\n\" for ($f)' |"; 411 | } elsif ($f =~ s/^n://) { 412 | $result = "${original}perl -e 'print \$_, \"\\n\" for (1..($f))' |"; 413 | } elsif ($f =~ s/^id://) { 414 | $result = "${original}echo " . shell_quote($f) . " |"; 415 | } elsif ($f =~ s/^sql://) { 416 | # Run a postgres or sqlite3 query, exporting results as TSV 417 | my ($db, $query) = split /:/, $f, 2; 418 | $query = expand_sql_shorthands $query; 419 | 420 | if ($db =~ s/^P//) { 421 | # postgres 422 | $db = expand_postgres_db $db; 423 | $query = shell_quote "COPY ($query) TO STDOUT WITH NULL AS ''"; 424 | $result = "${original}psql -c $query $db |"; 425 | } elsif ($db =~ s/^S//) { 426 | # sqlite3 427 | $db = shell_quote expand_sqlite_db $db; 428 | $result = "${original}echo " . shell_quote(".mode tabs\n$query;") 429 | . "| sqlite3 $db |"; 430 | } else { 431 | die "unknown database prefix " . substr($db, 0, 1) 432 | . " for pseudofile $original (valid prefixes are P and S)"; 433 | } 434 | } elsif ($f =~ /(\w*@?[^:]+):(.*)$/) { 435 | # Access file over SSH. We need to make sure nfu is running on the remote 436 | # end, since the remote file might be gzipped or some such. Because the 437 | # remote machine might not have nfu, this involves us piping ourselves over 438 | # to that machine. 439 | $result = "${original}cat '$0' | ssh -C '$1' perl - '$2' |"; 440 | } else { 441 | return undef; 442 | } 443 | 444 | $always_make_a_command && $result !~ /\|/ ? "${original}cat '$result' |" 445 | : $result; 446 | } 447 | 448 | sub unexpand_filename_shorthands { 449 | my ($f) = @_; 450 | $f =~ /^#(.*)/ ? $1 : $f; 451 | } 452 | 453 | my %pseudofile_docs = ( 454 | 'file.gz' => 'decompress file with gzip -dc', 455 | 'file.bz2' => 'decompress file with bzip2 -dc', 456 | 'file.xz' => 'decompress file with xz -dc', 457 | 'file.lzo' => 'decompress file with lzop -dc', 458 | 'http[s]://url' => 'retrieve url with curl', 459 | 'sh:stuff' => 'run sh -c "stuff", take stdout', 460 | 's3://url' => 'access S3 using s3cmd', 461 | 'hdfs:path' => 'read HDFS file(s) with hadoop fs -text', 462 | 'hdfs-ls:path' => 'pseudofiles parsed from hadoop fs -ls', 463 | 'hdfsjoin:path' => 'mapside join pseudofile (a subset of hdfs:path)', 464 | 'sql:db:query' => 'results of query as TSV', 465 | 'perl:expr' => 'perl -e \'print "$_\n" for (expr)\'', 466 | 'n:number' => 'numbers from 1 to n, inclusive', 467 | 'id:X' => 'verbatim text X', 468 | 'user@host:x' => 'remote data access (x can be a pseudofile)', 469 | ); 470 | 471 | # Flags 472 | our $is_child = 0; 473 | our $verbose = 0; 474 | our $n_lines = 0; 475 | our $n_bytes = 0; 476 | our $start_time = undef; 477 | 478 | our $verbose_command = ''; 479 | our @verbose_args; 480 | our $verbose_command_formatted = undef; 481 | our $inside_hadoop_job = length $ENV{mapred_job_id}; 482 | our $verbose_interval = $inside_hadoop_job ? 
HADOOP_VERBOSE_INTERVAL 483 | : VERBOSE_INTERVAL; 484 | our $empirical_verbosity = 0; 485 | 486 | $verbose ||= $inside_hadoop_job; 487 | $verbose ||= length $ENV{NFU_ALWAYS_VERBOSE}; 488 | 489 | our $last_verbose_report = 0; 490 | our $verbose_row = 0; 491 | 492 | # Call it like this: 493 | # while (<>) { 494 | # be_verbose_as_appropriate length; 495 | # ... 496 | # } 497 | sub be_verbose_as_appropriate { 498 | return if $is_child; 499 | return unless $verbose; 500 | local $_; 501 | my ($record_length) = @_; 502 | $n_lines += !!$record_length; 503 | $n_bytes += $record_length; 504 | my $now = time; 505 | return unless $record_length == 0 506 | || ($now - $last_verbose_report) * 1000 > $verbose_interval; 507 | 508 | $last_verbose_report = $now; 509 | $verbose_command_formatted //= join ' ', $verbose_command, @verbose_args; 510 | $start_time //= $now; 511 | my $runtime = $now - $start_time || 0.001; 512 | 513 | return if $runtime * 1000 < DELAY_BEFORE_VERBOSE; 514 | ++$empirical_verbosity; 515 | 516 | unless ($inside_hadoop_job) { 517 | # Print status updates straight to the terminal 518 | printf STDERR "\033[%d;1H\033[K%10dl %8.1fl/s %10dk %8.1fkB/s %s", 519 | $verbose_row, 520 | $n_lines, 521 | $n_lines / $runtime, 522 | $n_bytes / 1024, 523 | $n_bytes / 1024 / $runtime, 524 | substr($verbose_command_formatted, 0, 40); 525 | } else { 526 | # Use Hadoop-specific syntax to update job counters. Smaller units are 527 | # better because Hadoop counters are integers, so they'll suffer from 528 | # truncation for fractional quantities. 529 | $verbose_command_formatted =~ s/[,\n]/_/g; 530 | hc $verbose_command_formatted, @$_ 531 | for (['lines', $n_lines], 532 | ['runtime ms', $runtime * 1000], 533 | ['bytes', $n_bytes]); 534 | 535 | # Reset variables because Hadoop treats them as incremental 536 | $n_lines = 0; 537 | $n_bytes = 0; 538 | $start_time = $now; 539 | } 540 | } 541 | 542 | END { 543 | be_verbose_as_appropriate 0; 544 | print STDERR "\n" if $empirical_verbosity; 545 | } 546 | 547 | # This variable will keep track of any state accumulated from --use or --run 548 | # arguments. This is required for --pmap to work correctly. 549 | my @evaled_code; 550 | 551 | sub shell_quote {join ' ', map /[^-\/\w]/ ? "'" . s/(['\\])/'\\$1'/gr . "'" 552 | : length $_ ? $_ 553 | : "''", @_} 554 | 555 | sub quote_self {shell_quote $0, @_} 556 | 557 | my %explosions = ( 558 | a => '--average', 559 | A => '--aggregate', 560 | b => '--branch', 561 | c => '--count', 562 | C => '--uncount', 563 | D => '--drop', 564 | e => '--each', 565 | E => '--every', 566 | f => '--fields', 567 | F => '--fieldsplit', 568 | g => '--group', 569 | G => '--rgroup', 570 | h => '--hadoopc', 571 | H => '--hadoop', 572 | i => '--index', 573 | I => '--indexouter', 574 | j => '--join', 575 | J => '--joinouter', 576 | k => '--keep', 577 | K => '--remove', 578 | l => '--log', 579 | L => '--exp', 580 | m => '--map', 581 | M => '--pmap', 582 | n => '--number', 583 | N => '--ntiles', 584 | o => '--order', 585 | O => '--rorder', 586 | p => '--plot', 587 | P => '--poll', 588 | q => '--quant', 589 | Q => '--sql', 590 | r => '--read', 591 | R => '--buffer', 592 | s => '--sum', 593 | S => '--delta', 594 | T => '--take', 595 | # v => '--verbose' # handled during option parsing 596 | V => '--variance', 597 | w => '--with', 598 | z => '--intify', 599 | ); 600 | 601 | my %implosions; 602 | $implosions{$explosions{$_}} = $_ for keys %explosions; 603 | 604 | # Minimum number of required arguments for each function. 
Numeric arguments are 605 | # automatically forwarded, so are always optional. 606 | my %arity = ( 607 | average => 0, 608 | aggregate => 1, 609 | branch => 1, 610 | count => 0, 611 | uncount => 0, 612 | delta => 0, 613 | drop => 0, 614 | each => 1, 615 | every => 1, 616 | fields => 0, 617 | fieldsplit => 1, 618 | fold => 1, 619 | group => 0, 620 | rgroup => 0, 621 | hadoop => 3, 622 | hadoopc => 4, 623 | index => 2, 624 | indexouter => 2, 625 | join => 2, 626 | joinouter => 2, 627 | keep => 1, 628 | log => 0, 629 | exp => 0, 630 | map => 1, 631 | pmap => 1, 632 | number => 0, 633 | ntiles => 1, 634 | order => 0, 635 | rorder => 0, 636 | plot => 1, 637 | poll => 2, 638 | read => 0, 639 | buffer => 1, 640 | sum => 0, 641 | quant => 1, 642 | remove => 1, 643 | sample => 1, 644 | take => 0, 645 | variance => 0, 646 | with => 1, 647 | intify => 0, 648 | 649 | # Commands with no shorthands 650 | append => 1, 651 | prepend => 1, 652 | tee => 1, 653 | duplicate => 2, 654 | partition => 2, 655 | splot => 1, 656 | sd => 0, 657 | mplot => 1, 658 | preview => 0, 659 | pipe => 1, 660 | entropy => 0, 661 | sql => 3, 662 | tcp => 1, 663 | http => 1, 664 | repeat => 2, 665 | octave => 1, 666 | numpy => 1, 667 | ); 668 | 669 | my %usages = ( 670 | average => 'window size (0 for full average) -- running average', 671 | aggregate => 'aggregator fn', 672 | branch => 'branch (takes a pattern map)', 673 | count => 'counts by first column value; like uniq -c', 674 | uncount => 'the opposite of --count; repeats each row N times', 675 | delta => 'value -> difference from last value', 676 | drop => 'number of records to drop', 677 | each => 'template; executes with {} set to each value', 678 | every => 'n (returns every nth row)', 679 | fields => 'string of digits, each a zero-indexed column selector', 680 | fieldsplit => 'regexp to use for splitting', 681 | fold => 'function that returns true when line should be folded', 682 | group => 'sorts ascending, takes optional column list', 683 | rgroup => 'sorts descending, takes optional column list', 684 | hadoop => 'hadoop streaming: outpath|.|@, mapper|:, reducer|:|_', 685 | hadoopc => 'hadoop streaming: ..., combiner|:|_, reducer|:|_', 686 | index => 'field index, unsorted pseudofile to join against', 687 | indexouter => 'field index, unsorted pseudofile to join against', 688 | join => 'field index, sorted pseudofile to join against', 689 | joinouter => 'field index, sorted pseudofile to join against', 690 | keep => 'row filter fn', 691 | log => 'optional base (default e)', 692 | exp => 'optional base (default e)', 693 | map => 'row map fn', 694 | pmap => 'row map fn (executed multiple times in parallel)', 695 | number => 'prepends line number to each line', 696 | ntiles => 'takes N, produces ntiles of numbers', 697 | order => 'sorts ascending by general numeric value', 698 | rorder => 'sorts descending by general numeric value', 699 | plot => 'gnuplot arguments', 700 | poll => 'interval in seconds, command whose output to collect', 701 | sum => 'value -> total += value', 702 | quant => 'number to round to', 703 | read => 'reads pseudofiles from the data stream', 704 | buffer => 'creates a pseudofile from the data stream', 705 | remove => 'inverted row filter fn', 706 | sample => 'row selection probability in [0, 1]', 707 | take => 'n to take first n, +n to take last n', 708 | variance => 'running variance', 709 | with => 'pseudofile to join column-wise onto input', 710 | intify => 'convert column to dense integers (linear space)', 711 | 712 | append => 'pseudofile; 
appends its contents to current stream', 713 | prepend => 'pseudofile; prepends its contents to current stream', 714 | tee => 'shell command; duplicates data to stdin of command', 715 | duplicate => 'two shell commands as separate arguments', 716 | partition => 'partition id fn, shell command (using {})', 717 | splot => 'gnuplot arguments', 718 | sd => 'running standard deviation', 719 | mplot => 'gnuplot arguments per column, separated by ;', 720 | preview => '', 721 | pipe => 'shell command to pipe through', 722 | entropy => 'running entropy of relative probabilities/frequencies', 723 | sql => 'create/query SQL table: db[:[+]table], schema|_, query|_', 724 | tcp => 'TCP server (emits fifo filenames)', 725 | http => 'HTTP adapter for TCP server output', 726 | repeat => 'repeat count, pseudofile to repeat', 727 | octave => 'pipe through octave; vector is called xs', 728 | numpy => 'pipe through numpy; vector is called xs', 729 | ); 730 | 731 | my %env_docs = ( 732 | NFU_SORT_BUFFER => 'default 64M; size of in-memory sort for -g and -o', 733 | NFU_SORT_PARALLEL => 'default 4; number of concurrent sorts to run', 734 | NFU_SORT_COMPRESS => 'default none; compression program for sort tempfiles', 735 | NFU_SORT_OPTIONS => 'override all sort options except column spec', 736 | NFU_ALWAYS_VERBOSE => 'if set, nfu will be verbose all the time', 737 | NFU_NO_PAGER => 'if set, nfu will not use "less" to preview stdout', 738 | NFU_PMAP_PARALLELISM => 'number of subprocesses for -M', 739 | NFU_MAX_FILEHANDLES => 'default 64; maximum #subprocesses for --partition', 740 | NFU_HADOOP_FILES => 'comma-separated files to include with streaming job', 741 | NFU_HADOOP_STREAMING => 'absolute location of hadoop-streaming.jar', 742 | NFU_HADOOP_OPTIONS => '-D options for hadoop streaming jobs', 743 | NFU_HADOOP_COMMAND => 'hadoop executable; e.g. hadoop jar, hadoop fs -ls', 744 | NFU_HADOOP_TMPDIR => 'default /tmp; temp dir for hadoop uploads', 745 | ); 746 | 747 | my %gnuplot_aliases = ( 748 | '%l' => ' with lines', 749 | '%d' => ' with dots', 750 | '%i' => ' with impulses', 751 | '%v' => ' with vectors ', 752 | '%u' => ' using ', 753 | '%t' => ' title ', 754 | '%p' => ' lc palette ', 755 | ); 756 | 757 | my %fieldsplit_shorthands = ( 758 | S => '\s+', 759 | W => '\W+', 760 | C => ',', 761 | ); 762 | 763 | sub expand_gnuplot_options { 764 | my @transformed_opts; 765 | for my $opt (@_) { 766 | $opt =~ s/$_/$gnuplot_aliases{$_}/g for keys %gnuplot_aliases; 767 | push @transformed_opts, $opt; 768 | } 769 | @transformed_opts; 770 | } 771 | 772 | my %sql_aliases = ( 773 | '%\*' => ' select * from ', 774 | '%c' => ' select count(1) from ', 775 | '%d' => ' select distinct * from ', 776 | '%g' => ' group by ', 777 | '%j' => ' inner join ', 778 | '%l' => ' outer left join ', 779 | '%r' => ' outer right join ', 780 | '%w' => ' where ', 781 | ); 782 | 783 | sub expand_sql_shorthands { 784 | my ($sql) = @_; 785 | $sql =~ s/$_/$sql_aliases{$_}/eg for keys %sql_aliases; 786 | $sql; 787 | } 788 | 789 | sub expand_sqlite_db { 790 | my $tempdir = tmpnam =~ s/\/[^\/]+$//r; 791 | return "$tempdir/nfu-$ENV{USER}-sqlite.db" if $_[0] eq '@'; 792 | return $_[0]; 793 | } 794 | 795 | sub expand_postgres_db { 796 | # Expands a DB descriptor into a series of properly-shellquoted options to 797 | # pass to the 'psql' command. 
798 | my ($host, $user, $db) = ('localhost', $ENV{USER}, $ENV{USER}); 799 | if ($_[0] =~ m#^(?:([^\@/:]+)@)?([^\@/:]+)/([^/]+)$#) { 800 | # NB: don't try to change this to use $1, $2, etc as arguments to 801 | # shell_quote. Perl references arguments, so this will fail horribly. 802 | ($user, $host, $db) = ($1 // $ENV{USER}, $2 // 'localhost', $3); 803 | shell_quote '-U', $user, '-h', $host, '-d', $db; 804 | } elsif ($_[0] =~ m#^(\w+)@(\w+)$#) { 805 | # Simple DB connection; assume localhost 806 | ($user, $db) = ($1, $2); 807 | shell_quote '-U', $user, '-d', $db; 808 | } elsif ($_[0] =~ m#^[^\@:/]+$#) { 809 | # Really simple connection 810 | shell_quote '-d', $_[0]; 811 | } elsif ($_[0] eq '@') { 812 | # Really simple connection; use username as DB 813 | shell_quote '-d', $ENV{USER}; 814 | } else { 815 | die "not sure how to parse postgres DB '$_[0]'"; 816 | } 817 | } 818 | 819 | sub expand_eval_shorthands { 820 | my $code = $_[0]; 821 | my @pieces = split /%\[(.*?)%\]/, $code; 822 | for my $i (0..$#pieces) { 823 | unless ($i & 1) { 824 | $pieces[$i] =~ s/%(\d+)/\$_[$1]/g; 825 | 1 while $pieces[$i] 826 | =~ s/([a-zA-Z0-9_\)\}\]?\$]) 827 | \. 828 | ([\$_a-zA-Z](?:-[0-9\w?\$]|[0-9_\w?\$])*) 829 | /$1\->{'$2'}/x; 830 | } 831 | } 832 | join '', @pieces; 833 | } 834 | 835 | sub parse_join_options { 836 | my ($f1, $f2, $file) = @_ == 3 ? @_ : ($_[0], 0, $_[1]); 837 | ($f1 + 1, $f2 + 1, $file); 838 | } 839 | 840 | sub compile_eval_into_function { 841 | my ($code, $name) = @_; 842 | $code = expand_eval_shorthands $code; 843 | eval "sub {\n$code\n}" 844 | or die "failed to compile $name function: $@\n (code was $code)"; 845 | } 846 | 847 | sub stateless_unary_fn { 848 | my ($name, $f) = @_; 849 | my $arity = $arity{$name}; 850 | ($name, sub { 851 | my @columns = split //, (@_ > $arity ? shift : undef) // '0'; 852 | while (<>) { 853 | be_verbose_as_appropriate length; 854 | chomp; 855 | my @fs = split /\t/; 856 | $fs[$_] = $f->($fs[$_], @_) for @columns; 857 | print row(@fs), "\n"; 858 | } 859 | }); 860 | } 861 | 862 | sub stateful_unary_fn { 863 | my ($name, $setup, $f) = @_; 864 | my $arity = $arity{$name}; 865 | ($name, sub { 866 | my @columns = split //, (@_ > $arity ? shift : undef) // '0'; 867 | my %states; 868 | $states{$_} = $setup->(@_) for @columns; 869 | while (<>) { 870 | be_verbose_as_appropriate length; 871 | chomp; 872 | my @fs = split /\t/; 873 | $fs[$_] = $f->($fs[$_], $states{$_}, @_) for @columns; 874 | print row(@fs), "\n"; 875 | } 876 | }); 877 | } 878 | 879 | sub exec_with_stdin { 880 | open my $fh, '|' . shell_quote @_ or die "failed to exec @_"; 881 | be_verbose_as_appropriate(length), print $fh $_ while <>; 882 | close $fh; 883 | } 884 | 885 | sub exec_with_diamond { 886 | if ($verbose || grep /\|/, @ARGV) { 887 | # Arguments are specified in filenames and involve processes, so use perl 888 | # to forward data. 889 | exec_with_stdin @_; 890 | } else { 891 | # Faster option: just exec the program in-place. This avoids a layer of 892 | # interprocess piping. Assume filenames follow arguments. 893 | exec @_, @ARGV or die "failed to exec @_ @ARGV"; 894 | } 895 | } 896 | 897 | sub sort_options { 898 | my ($column_spec) = @_; 899 | my @columns = split //, $column_spec // ''; 900 | my @options = exists $ENV{NFU_SORT_OPTIONS} 901 | ? split /\s+/, $ENV{NFU_SORT_OPTIONS} 902 | : ('-S', $ENV{NFU_SORT_BUFFER} || '64M', 903 | '--parallel=' . ($ENV{NFU_SORT_PARALLEL} || 4), 904 | $ENV{NFU_SORT_COMPRESS} 905 | ? 
("--compress-program=$ENV{NFU_SORT_COMPRESS}") 906 | : ()); 907 | return @options, 908 | (@columns 909 | ? ('-t', "\t", 910 | map {('-k', sprintf "%d,%d", $_ + 1, $_ + 1)} @columns) 911 | : ()); 912 | } 913 | 914 | sub sort_cmd {join ' ', 'sort', sort_options, @_} 915 | 916 | sub fifo_for { 917 | my ($file, @transforms) = @_; 918 | my $fifo_name = tmpnam; 919 | 920 | mkfifo $fifo_name, 0700 or die "failed to create fifo: $!"; 921 | 922 | return $fifo_name if fork; 923 | 924 | my $command = expand_filename_shorthands($file, 1) 925 | . join '', map {"$_ |"} @transforms; 926 | open my $into_fifo, '>', $fifo_name 927 | or die "failed to open fifo $fifo_name for writing: $!"; 928 | open my $from_file, $command 929 | or die "failed to open file/command $command for reading: $!"; 930 | 931 | be_verbose_as_appropriate(length), $into_fifo->print($_) while <$from_file>; 932 | close $into_fifo; 933 | close $from_file; 934 | 935 | unlink $fifo_name or warn "failed to unlink temporary fifo $fifo_name: $!"; 936 | exit 0; 937 | } 938 | 939 | sub hadoop { 940 | # Generates a hadoop command string, quoting all args as necessary 941 | join ' ', $ENV{NFU_HADOOP_COMMAND} // "hadoop", 942 | @_[0, 1], 943 | shell_quote(@_ > 2 ? @_[2..$#_] : ()); 944 | } 945 | 946 | sub hadoop_ls { 947 | # Now get the output file listing. This is a huge mess because Hadoop is a 948 | # huge mess. 949 | my $ls_command = hadoop('fs', '-ls', @_); 950 | grep /\/[^_][^\/]*$/, map +(split " ", $_, 8)[7], 951 | grep !/^Found/, 952 | split /\n/, ''.qx/$ls_command/; 953 | } 954 | 955 | sub gcd { 956 | my ($a, $b) = @_; 957 | while ($b) { 958 | if ($b > $a) { ($a, $b) = ($b, $a) } 959 | else { $a %= $b } 960 | } 961 | $a; 962 | } 963 | 964 | sub hadoop_partfile_n { $_[0] =~ /[^0-9]([0-9]+)(?:\.[^\/]+)?$/ ? $1 : 0 } 965 | sub hadoop_partsort { 966 | sort {hadoop_partfile_n($a) <=> hadoop_partfile_n($b)} @_ 967 | } 968 | 969 | sub hadoop_matched_partfiles { 970 | my ($path, $partfile) = @_; 971 | my $partfile_dirname = $partfile =~ s/\/[^\/]+$//r; 972 | my @left_files; 973 | my @possibilities; 974 | 975 | # We won't be able to do anything until both sides of the join exist. 976 | until (@left_files = hadoop_partsort hadoop_ls $partfile_dirname) { 977 | print STDERR "hadoop_matched_partfiles: waiting for input...\n"; 978 | hc qw/hadoop_matched_partfiles left_wait/; 979 | sleep 1; 980 | } 981 | 982 | until (@possibilities = hadoop_partsort hadoop_ls $path) { 983 | print STDERR "hadoop_matched_partfiles: waiting for join data...\n"; 984 | hc qw/hadoop_matched_partfiles right wait/; 985 | sleep 1; 986 | } 987 | 988 | my $n = hadoop_partfile_n $partfile; 989 | unless ($n < @left_files && $left_files[$n] eq $partfile) { 990 | my $partfile_n = $n; 991 | $n = 0; 992 | ++$n until $n >= @left_files 993 | || $partfile_n == hadoop_partfile_n $partfile; 994 | } 995 | 996 | die "hadoop_matched_partfiles couldn't find the index of the specified " 997 | . "partfile ($partfile) within the list of map inputs (@left_files)" 998 | if $n >= @left_files; 999 | 1000 | # Assume Hadoop's default partitioning strategy using hashcode modulus. This 1001 | # means that the Kth of N files contains all keys for which H(key) % N == K. 1002 | my $reduction_factor = gcd scalar(@left_files), scalar(@possibilities); 1003 | my $left_redundancy = @possibilities / $reduction_factor; 1004 | my @files_to_read = @possibilities[map $_ * $reduction_factor 1005 | + $n % $reduction_factor, 1006 | 0 .. 
$left_redundancy - 1];
1007 | 
1008 | # Log some stuff to job counters so any problems are more evident.
1009 | if ($verbose) {
1010 | printf STDERR "hdfsjoin:$path [$partfile] (inferred partfile $n): %s\n",
1011 | join(' ', @files_to_read);
1012 | 
1013 | hc "hdfsjoin $path", "joins attempted", 1;
1014 | hc "hdfsjoin $path", "left/right reads", $left_redundancy;
1015 | hc "hdfsjoin $path", "overreads", $left_redundancy - 1;
1016 | }
1017 | 
1018 | @files_to_read;
1019 | }
1020 | 
1021 | sub hadoop_tempfile {
1022 | my $randomness = random128;
1023 | my $dir = $ENV{NFU_HADOOP_TMPDIR} // '/tmp';
1024 | "$dir/nfu-hadoop-$ENV{USER}-$randomness";
1025 | }
1026 | 
1027 | sub find_hadoop_streaming {
1028 | my @files = split /\n/, ''.qx|locate "*hadoop-streaming.jar*"|;
1029 | die "failed to locate hadoop streaming jar automatically; "
1030 | . "you should set NFU_HADOOP_STREAMING" unless @files;
1031 | print STDERR "nfu found hadoop streaming jar: $files[0]\n";
1032 | $files[0];
1033 | }
1034 | 
1035 | sub hadoop_into {
1036 | my ($outfile, $mapper, $combiner, $reducer) = @_;
1037 | my $streaming_jar = $ENV{NFU_HADOOP_STREAMING} // find_hadoop_streaming;
1038 | 
1039 | # Various shorthands for common cases. Both mapper and reducer being the
1040 | # identity function is used to repartition stuff, so we want this to be easy
1041 | # to type and recognize.
1042 | $mapper = $0 if $mapper eq ':';
1043 | $reducer = $0 if $reducer eq ':';
1044 | $reducer = 'NONE' if $reducer =~ /^[-_]$/;
1045 | $combiner = $0 if $combiner eq ':';
1046 | $combiner = 'NONE' if $combiner =~ /^[-_]$/;
1047 | $combiner = 'NONE' if $reducer eq 'NONE';
1048 | 
1049 | my @input_files;
1050 | my @delete_afterwards;
1051 | 
1052 | # Figure out where our input seems to be coming from. This is a little hacky,
1053 | # but I think it's worthwhile for the flexibility we get.
1054 | #
1055 | # Hadoop jobs output their list of output partfile names because the data is
1056 | # often too large. It's possible we'll get one of these as stdin, so each
1057 | # line will look like "hdfs:/...". In that case, we want to use each as an
1058 | # input file. Otherwise we'll want to upload stdin and all non-HDFS files to
1059 | # HDFS and use those.
1060 | #
1061 | # It's actually a bit tricky to figure out which files started out as hdfs:
1062 | # locations because by now they've all been filename alias-expanded. We need
1063 | # to reverse-engineer the alias if we want to reuse stuff already on HDFS.
1064 | 
1065 | my @other_argv;
1066 | while (@ARGV) {
1067 | local $_ = unexpand_filename_shorthands shift @ARGV;
1068 | if (s/^hdfs:(?:\/\/)?//) {
1069 | push @input_files, '-input', $_;
1070 | } else {
1071 | push @other_argv, expand_filename_shorthands $_;
1072 | }
1073 | }
1074 | 
1075 | @ARGV = @other_argv;
1076 | my $line = undef;
1077 | unless (-t STDIN) {
1078 | chomp($line), push @input_files, '-input', $line
1079 | while defined($line = <STDIN>) && $line =~ s/^hdfs://;
1080 | }
1081 | 
1082 | # At this point $line is either undefined or contains a line we shouldn't
1083 | # have read from stdin (i.e. it's data). If the latter, prepend it to the
1084 | # upload to HDFS.
1085 | if (defined $line) {
1086 | my $tempfile = hadoop_tempfile;
1087 | open my $fh, "| " . hadoop('fs', '-put', '-', $tempfile) . ' 1>&2'
1088 | or die "failed to open hadoop fs -put process for uploading: $!";
1089 | print $fh $line;
1090 | be_verbose_as_appropriate(length), print $fh $_ while <STDIN>;
1091 | close $fh;
1092 | 
1093 | push @input_files, '-input', $tempfile;
1094 | push @delete_afterwards, $tempfile;
1095 | }
1096 | 
1097 | if (@other_argv) {
1098 | my $tempfile = hadoop_tempfile;
1099 | open my $fh, "| " . hadoop('fs', '-put', '-', $tempfile) . ' 1>&2'
1100 | or die "failed to open hadoop fs -put process for uploading: $!";
1101 | be_verbose_as_appropriate(length), print $fh $_ while <>;
1102 | close $fh;
1103 | 
1104 | push @input_files, '-input', $tempfile;
1105 | push @delete_afterwards, $tempfile;
1106 | }
1107 | 
1108 | # Now all the input files are in place, so we can kick off the job.
1109 | my $extra_args = $ENV{NFU_HADOOP_OPTIONS} // '';
1110 | my @file_deps = map {('-file', $_)}
1111 | ($0, split /,/, $ENV{NFU_HADOOP_FILES} // '');
1112 | my $dirname = $0 =~ s/\/nfu$/\//r;
1113 | 
1114 | # We're uploading ourselves, so we'll need a current-directory reference to
1115 | # nfu. When nfu quotes itself, it produces an absolute path instead; this
1116 | # code rewrites those into ./nfu.
1117 | my $transformed_mapper = $mapper =~ s|\Q$dirname\E|./|gr;
1118 | my $transformed_combiner = $combiner =~ s|\Q$dirname\E|./|gr;
1119 | my $transformed_reducer = $reducer =~ s|\Q$dirname\E|./|gr;
1120 | 
1121 | # Write the mapper and reducer commands into files rather than passing them
1122 | # straight to hadoop streaming. This bypasses two problems:
1123 | #
1124 | # 1. Hadoop streaming might chop long arguments sooner than bash.
1125 | # 2. It word-splits differently from bash, which breaks shell_quote.
1126 | 
1127 | my $mapper_file = tmpnam;
1128 | my $combiner_file = tmpnam;
1129 | my $reducer_file = tmpnam;
1130 | 
1131 | open my $mapper_fh, '>', $mapper_file
1132 | or die "failed to create tempfile $mapper_file for map job: $!";
1133 | open my $combiner_fh, '>', $combiner_file
1134 | or die "failed to create tempfile $combiner_file for combine job: $!";
1135 | open my $reducer_fh, '>', $reducer_file
1136 | or die "failed to create tempfile $reducer_file for reduce job: $!";
1137 | print $mapper_fh "#!/bin/bash\n$transformed_mapper\n";
1138 | print $combiner_fh "#!/bin/bash\n$transformed_combiner\n";
1139 | print $reducer_fh "#!/bin/bash\n$transformed_reducer\n";
1140 | close $mapper_fh;
1141 | close $combiner_fh;
1142 | close $reducer_fh;
1143 | 
1144 | chmod 0755, $mapper_file;
1145 | chmod 0755, $combiner_file;
1146 | chmod 0755, $reducer_file;
1147 | 
1148 | my $jobname = "nfu streaming ["
1149 | . join(' ', grep $_ ne '-input', @input_files)
1150 | . "]: map($transformed_mapper), "
1151 | . "combine($transformed_combiner), "
1152 | . "reduce($transformed_reducer), "
1153 | . "options($extra_args)"
1154 | . " > $outfile";
1155 | 
1156 | my $hadoop_command =
1157 | hadoop('jar',
1158 | shell_quote($streaming_jar)
1159 | . " -D mapred.job.name=" . shell_quote($jobname) . " $extra_args",
1160 | @file_deps,
1161 | @input_files,
1162 | '-file', $mapper_file,
1163 | '-file', $combiner_file,
1164 | '-file', $reducer_file,
1165 | '-output', $outfile,
1166 | '-mapper', "./" . ($mapper_file =~ s/^.*\///r),
1167 | $combiner ne 'NONE'
1168 | ? ('-combiner', "./" . ($combiner_file =~ s/^.*\///r))
1169 | : (),
1170 | '-reducer', $reducer eq 'NONE'
1171 | ? $reducer
1172 | : "./" . ($reducer_file =~ s/^.*\///r));
1173 | 
1174 | system $hadoop_command . 
' 1>&2' 1175 | and die "failed to execute hadoop command $hadoop_command: $!"; 1176 | 1177 | unlink $mapper_file; 1178 | unlink $combiner_file; 1179 | unlink $reducer_file; 1180 | 1181 | system hadoop('fs', '-rm', '-r', @delete_afterwards) . ' 1>&2' 1182 | if @delete_afterwards; 1183 | "hdfs:$outfile"; 1184 | } 1185 | 1186 | sub sql_infer_column_type { 1187 | # Try to figure out the right type for a column based on some values for it. 1188 | # The possibilities are 'text', 'integer', or 'real'; this is roughly the set 1189 | # of stuff supported by both postgres and sqlite. 1190 | return 'integer' unless grep length && !/^-?[0-9]+$/, @_; 1191 | return 'real' unless grep length && 1192 | !/^-?[0-9]+$ 1193 | | ^-?[0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?$ 1194 | | ^-?[0-9]* \.[0-9]+ (?:[eE][-+]?[0-9]+)?$/x, 1195 | @_; 1196 | return 'text'; 1197 | } 1198 | 1199 | sub sql_infer_schema { 1200 | # Takes a list of TSV lines and generates a table schema that will store 1201 | # them. 1202 | my $n = max map scalar(split /\t/), @_; 1203 | my @columns = map [], 1 .. $n; 1204 | for (@_) { 1205 | my @vs = split /\t/; 1206 | push @{$columns[$_]}, $vs[$_] for 0 .. $#vs; 1207 | } 1208 | 1209 | my @types = map sql_infer_column_type(@$_), @columns; 1210 | join ', ', map sprintf("f%d %s", $_, $types[$_]), 0 .. $#columns; 1211 | } 1212 | 1213 | sub sql_schema_and_buffer { 1214 | # WARNING: this function modifies its argument 1215 | my ($schema) = @_; 1216 | my @read; 1217 | if ($schema eq '_') { 1218 | # Infer the schema, which involves reading some data up front. 1219 | push @read, $_ while @read < SQL_INFER_PEEK_LINES 1220 | and $diamond_has_data &&= defined($_ = <>); 1221 | $schema = sql_infer_schema @read; 1222 | } 1223 | $_[0] = $schema; 1224 | @read; 1225 | } 1226 | 1227 | sub sql_parse_args { 1228 | my ($dbt, $schema, $query) = @_; 1229 | my @pieces = split /:/, $dbt; 1230 | 1231 | if (@pieces > 1) { 1232 | my $table = pop @pieces; 1233 | my $db = join ':', @pieces; 1234 | ($db, $table, $schema, $query); 1235 | } else { 1236 | # Use a default table called 't' with an automatic index 1237 | ($pieces[0], '+t', $schema, $query); 1238 | } 1239 | } 1240 | 1241 | sub write_buffer_and_stdin { 1242 | my ($fh, @buffer) = @_; 1243 | be_verbose_as_appropriate(length), $fh->print($_) for @buffer; 1244 | be_verbose_as_appropriate(length), $fh->print($_) 1245 | while $diamond_has_data &&= defined($_ = <>); 1246 | } 1247 | 1248 | sub sql_first_column_index { 1249 | # WARNING: this function modifies its first argument 1250 | my ($table, $schema) = @_; 1251 | my $column = $schema =~ s/\s.*$//r; 1252 | my $index = $table =~ s/^\+// 1253 | ? "CREATE INDEX $table$column ON $table($column);\n" 1254 | : ''; 1255 | $_[0] = $table; 1256 | $index; 1257 | } 1258 | 1259 | my %functions = ( 1260 | read => sub { 1261 | while (<>) { 1262 | chomp; 1263 | my $f = expand_filename_shorthands $_, 1; 1264 | open my $fh, $f or die "failed to open pseudofile $_ ($f): $!"; 1265 | be_verbose_as_appropriate(length), print while <$fh>; 1266 | close $fh; 1267 | } 1268 | }, 1269 | 1270 | buffer => sub { 1271 | my $f = $_[0] =~ /^[-_:]$/ ? 
tmpnam : $_[0]; 1272 | open my $fh, '>', $f or die "failed to open buffer file $f: $!"; 1273 | be_verbose_as_appropriate(length), $fh->print($_) while <>; 1274 | close $fh; 1275 | print $f, "\n"; 1276 | }, 1277 | 1278 | branch => sub { 1279 | my (@cases) = split /\n/, $_[0]; 1280 | my %branches; 1281 | my %branch_matchers; 1282 | my @order; 1283 | 1284 | for (@cases) { 1285 | my ($k, $v) = map unpack('u', $_), split /\t/; 1286 | $v //= ''; 1287 | open my $fh, "| $0 $v" 1288 | or die "failed to open branch subprocess '$0 $v': $!"; 1289 | push @order, $k; 1290 | $branches{$k} = $fh; 1291 | $branch_matchers{$k} = 1292 | $k =~ s/^:// ? compile_eval_into_function $k, 'branch matcher function' 1293 | : $k eq '_' ? sub { 1 } 1294 | : sub { $_[0] eq $k }; 1295 | } 1296 | 1297 | while (<>) { 1298 | be_verbose_as_appropriate length; 1299 | chomp; 1300 | my @xs = split /\t/; 1301 | my @matching = grep $branch_matchers{$_}->(@xs), @order; 1302 | 1303 | die "branch failed to match '$xs[0]' against any of ( @order )" 1304 | unless @matching; 1305 | $branches{$matching[0]}->print(row(@xs), "\n"); 1306 | } 1307 | 1308 | close for values %branches; 1309 | }, 1310 | 1311 | tcp => sub { 1312 | my ($hostport) = @_; 1313 | my ($host, $port) = $hostport =~ /:/ ? split /:/, $hostport 1314 | : ('0.0.0.0', $hostport); 1315 | 1316 | socket my($serversock), PF_INET, SOCK_STREAM, getprotobyname 'tcp' 1317 | or die "socket failed: $!"; 1318 | setsockopt $serversock, SOL_SOCKET, SO_REUSEADDR, pack 'l', 1 1319 | or die "setsockopt failed: $!"; 1320 | bind $serversock, sockaddr_in $port, INADDR_ANY 1321 | or die "bind failed: $!"; 1322 | listen $serversock, SOMAXCONN 1323 | or die "listen failed: $!"; 1324 | 1325 | while (1) { 1326 | # Fork twice per connection. This is egregious but necessary because the 1327 | # open() call blocks on a FIFO until the other end is connected, and we 1328 | # want the FIFO consumer to be at liberty to open them in either order 1329 | # without creating a deadlock. 1330 | my ($paddr, $client); 1331 | unless ($paddr = accept $client, $serversock) { 1332 | next if $!{EINTR}; 1333 | die "accept failed: $!"; 1334 | } 1335 | 1336 | my ($port, $iaddr) = sockaddr_in $paddr; 1337 | my $iname = gethostbyaddr $iaddr, AF_INET; 1338 | 1339 | my $r = tmpnam; mkfifo $r, 0700 or die "failed to create FIFO $r: $!"; 1340 | my $w = tmpnam; mkfifo $w, 0700 or die "failed to create FIFO $w: $!"; 1341 | 1342 | print join("\t", $r, $w, $iname, $port), "\n"; 1343 | 1344 | # Socket reads: write to the "reader" fifo 1345 | unless (fork) { 1346 | my $buf; 1347 | sysopen my $fh, $r, O_WRONLY or die "failed to open $r for writing: $!"; 1348 | syswrite $fh, $buf while sysread $client, $buf, 8192; 1349 | close $fh; 1350 | close $client; 1351 | unlink $r; 1352 | exit; 1353 | } 1354 | 1355 | # Socket writes: read from the "writer" fifo 1356 | unless (fork) { 1357 | my $buf; 1358 | sysopen my $fh, $w, O_RDONLY or die "failed to open $w for writing: $!"; 1359 | syswrite $client, $buf while sysread $fh, $buf, 8192; 1360 | close $fh; 1361 | close $client; 1362 | unlink $w; 1363 | exit; 1364 | } 1365 | 1366 | close $client; 1367 | } 1368 | }, 1369 | 1370 | http => sub { 1371 | # Connect this to a --tcp thing to do HTTP stuff. In practice this just 1372 | # means parsing out the headers and handing off parsed data to the command. 
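# A rough sketch of the handoff (field names here are just descriptive): for
# each connection row produced by --tcp, the command below receives one TSV
# row of the form
#
#   writer-fifo  request-url  headers-as-JSON  peer-host  peer-port
#
# so a handler can parse the request and write its HTTP response into the
# writer fifo. E.g., with a hypothetical handler script:
#
#   $ nfu --tcp 8080 --http ./serve-request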
1373 | my ($command) = @_; 1374 | open my $fh, "| $command" or die "http failed to launch $command: $!"; 1375 | select((select($fh), $| = 1)[0]); 1376 | 1377 | while (<>) { 1378 | chomp; 1379 | my ($in, $out, $port, $iaddr) = split /\t/; 1380 | open my $indata, '<', $in or die "http failed to read socket $in: $!"; 1381 | 1382 | my $url; 1383 | my %headers; 1384 | while (<$indata>) { 1385 | chomp; 1386 | last if length($_) <= 1; 1387 | $headers{$1} = $2 if defined $url && /^([^:]+):\s*(.*)$/; 1388 | $url = $1 if !defined $url && /^[A-Z]+\s+(\S+)/; 1389 | } 1390 | close $indata; 1391 | my $http_data = row($out, $url, je({%headers}), $port, $iaddr) . "\n"; 1392 | $fh->print($http_data); 1393 | } 1394 | }, 1395 | 1396 | group => sub {exec_with_diamond 'sort', sort_options @_}, 1397 | rgroup => sub {exec_with_diamond 'sort', '-r', sort_options @_}, 1398 | order => sub {exec_with_diamond 'sort', '-g', sort_options @_}, 1399 | rorder => sub {exec_with_diamond 'sort', '-rg', sort_options @_}, 1400 | 1401 | count => sub { 1402 | # Same behavior as uniq -c, but delimits counts with \t; also takes an 1403 | # optional series of columns to uniq by, rather than using the whole row. 1404 | my @columns = split //, shift // ''; 1405 | my $last; 1406 | my @last; 1407 | my $count = -1; 1408 | 1409 | while (<>) { 1410 | be_verbose_as_appropriate length; 1411 | chomp; 1412 | 1413 | my @xs = split /\t/; 1414 | @xs = @xs[@columns] if @columns; 1415 | $last = $_, @last = @xs unless ++$count; 1416 | 1417 | for (my $i = 0; $i < max scalar(@xs), scalar(@last); ++$i) { 1418 | if (!defined $xs[$i] || !defined $last[$i] || $xs[$i] ne $last[$i]) { 1419 | print "$count\t$last\n"; 1420 | $count = 0; 1421 | @last = @xs; 1422 | $last = $_; 1423 | last; 1424 | } 1425 | } 1426 | } 1427 | 1428 | ++$count; 1429 | print "$count\t$last\n" if defined $last; 1430 | }, 1431 | 1432 | uncount => sub { 1433 | while (<>) { 1434 | be_verbose_as_appropriate length; 1435 | my ($n, $line) = split /\t/, $_, 2; 1436 | $line //= "\n"; 1437 | print $line for 1..$n; 1438 | } 1439 | }, 1440 | 1441 | index => sub { 1442 | # Inner join by appending joined fields to the end. 1443 | my ($f1, $f2, $join_file) = parse_join_options @_; 1444 | 1445 | my $sorted_index = fifo_for $join_file, sort_cmd "-t '\t' -k${f2}b,$f2"; 1446 | my $command = sort_cmd "-t '\t' -k ${f1}b,$f1" . 1447 | "| join -t '\t' -1 $f1 -2 $f2 - '$sorted_index'"; 1448 | 1449 | open my $to_join, "| $command" or die "failed to exec $command: $!"; 1450 | be_verbose_as_appropriate(length), print $to_join $_ while <>; 1451 | close $to_join; 1452 | }, 1453 | 1454 | indexouter => sub { 1455 | # Outer left join by appending joined fields to the end. 1456 | my ($f1, $f2, $join_file) = parse_join_options @_; 1457 | 1458 | my $sorted_index = fifo_for $join_file, sort_cmd "-t '\t' -k ${f2}b,$f2"; 1459 | my $command = sort_cmd "-t '\t' -k ${f1}b,$f1" . 1460 | "| join -a 1 -t '\t' -1 $f1 -2 $f2 - '$sorted_index'"; 1461 | 1462 | open my $to_join, "| $command" or die "failed to exec $command: $!"; 1463 | be_verbose_as_appropriate(length), print $to_join $_ while <>; 1464 | close $to_join; 1465 | }, 1466 | 1467 | join => sub { 1468 | # Inner join against sorted data by appending joined fields to the end. 1469 | my ($f1, $f2, $join_file) = parse_join_options @_; 1470 | 1471 | my $sorted_index = fifo_for $join_file; 1472 | my $command = sort_cmd "-t '\t' -k ${f1}b,$f1" . 
1473 | "| join -t '\t' -1 $f1 -2 $f2 - '$sorted_index'"; 1474 | 1475 | open my $to_join, "| $command" or die "failed to exec $command: $!"; 1476 | be_verbose_as_appropriate(length), print $to_join $_ while <>; 1477 | close $to_join; 1478 | }, 1479 | 1480 | joinouter => sub { 1481 | # Outer left join against sorted data by appending joined fields to the 1482 | # end. 1483 | my ($f1, $f2, $join_file) = parse_join_options @_; 1484 | 1485 | my $sorted_index = fifo_for $join_file; 1486 | my $command = sort_cmd "-t '\t' -k ${f1}b,$f1" . 1487 | "| join -a 1 -t '\t' -1 $f1 -2 $f2 - '$sorted_index'"; 1488 | 1489 | open my $to_join, "| $command" or die "failed to exec $command: $!"; 1490 | be_verbose_as_appropriate(length), print $to_join $_ while <>; 1491 | close $to_join; 1492 | }, 1493 | 1494 | with => sub { 1495 | # Like 'paste'. Joins lines with \t. 1496 | my ($f) = @_; 1497 | open my $fh, expand_filename_shorthands $f, 1 1498 | or die "failed to open --with pseudofile $f: $!"; 1499 | my ($part1, $part2); 1500 | while (defined($part1 = <>) and defined($part2 = <$fh>)) { 1501 | be_verbose_as_appropriate length($part1) + length($part2); 1502 | chomp $part1; 1503 | chomp $part2; 1504 | print $part1, "\t", $part2, "\n"; 1505 | } 1506 | close $fh; 1507 | }, 1508 | 1509 | repeat => sub { 1510 | my ($n, $f) = @_; 1511 | my $count = 0; 1512 | while (!$n || $count++ < $n) { 1513 | open my $fh, expand_filename_shorthands $f, 1 1514 | or die "failed to open --repeat pseudofile $f: $!"; 1515 | be_verbose_as_appropriate(length), print while <$fh>; 1516 | close $fh; 1517 | } 1518 | }, 1519 | 1520 | octave => sub { 1521 | my ($commands) = @_; 1522 | my $temp = tmpnam; 1523 | open my $fh, '>', $temp or die $!; 1524 | be_verbose_as_appropriate(length), print $fh $_ while <>; 1525 | close $fh; 1526 | 1527 | system 'octave', 1528 | '-q', 1529 | '--eval', "xs = load(\"$temp\");" 1530 | . "unlink(\"$temp\");" 1531 | . "save_precision(48);" 1532 | . "$commands;" 1533 | . "save -text $temp xs" and die "octave command failed"; 1534 | 1535 | open $fh, '<', $temp; 1536 | /^\s*#/ or 1537 | /^\s*$/ or 1538 | print join("\t", map $_ eq "NA" ? $_ : 0 + $_, grep length, split /\s+/), 1539 | "\n" while <$fh>; 1540 | close $fh; 1541 | unlink $temp; 1542 | }, 1543 | 1544 | numpy => sub { 1545 | my ($commands) = @_; 1546 | my $temp = tmpnam; 1547 | open my $fh, '>', $temp or die $!; 1548 | be_verbose_as_appropriate(length), print $fh $_ while <>; 1549 | close $fh; 1550 | 1551 | # Fix up the indentation for cases like this: 1552 | # $ nfu ... --numpy 'xs += 1 1553 | # xs *= 4' # no real indent here 1554 | # 1555 | # $ nfu ... --numpy 'if something: 1556 | # xs += 1' # minor indent here (assume 2) 1557 | 1558 | my @lines = split /\n/, $commands; 1559 | my @indents = map length(s/\S.*$//r), @lines; 1560 | my $indent = @lines > 1 ? $indents[1] - $indents[0] : 0; 1561 | 1562 | # If we're expecting an indentation of some amount after the first line, we 1563 | # need to be careful: we don't know how much the user decided to indent the 1564 | # block, and if we get it wrong then Python will complain at the next 1565 | # outdent. (If there's no outdent, then we can use anything.) 
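# To hedge against that, the adjustment below strips one space less than the
# second line's indent relative to the first, and never more than the
# smallest indent seen on any later line, so an outdented line can't be
# pushed past column 0.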
1566 | $indent = min $indent - 1, @indents[2..$#indents] 1567 | if $lines[0] =~ /:\s*(#.*)?$/ && @lines > 2; 1568 | 1569 | my $spaces = ' ' x $indent; 1570 | $lines[$_] =~ s/^$spaces// for 1..$#lines; 1571 | $commands = join "\n", @lines; 1572 | 1573 | system 'python', 1574 | '-c', 1575 | " 1576 | import numpy as np 1577 | xs = np.loadtxt(\"$temp\") 1578 | $commands 1579 | np.savetxt(\"$temp\", xs, delimiter=\"\\t\")" and die "numpy command failed"; 1580 | 1581 | open $fh, '<', $temp; 1582 | /^\s*#/ or 1583 | /^\s*$/ or 1584 | print join("\t", map $_ eq "NA" ? $_ : 0 + $_, grep length, split /\s+/), 1585 | "\n" while <$fh>; 1586 | close $fh; 1587 | unlink $temp; 1588 | }, 1589 | 1590 | stateful_unary_fn('average', 1591 | sub {my ($size, $n, $total) = ($_[0] // 0, 0, 0); 1592 | [$size, $n, $total, []]}, 1593 | sub { 1594 | my ($x, $state) = @_; 1595 | my ($size, $n, $total, $window) = @$state; 1596 | $total += $x; 1597 | ++$n; 1598 | my $v = $total / ($n > $size && $size ? $size : $n); 1599 | $total -= shift @$window if $size and push(@$window, $x) >= $size; 1600 | $$state[1] = $n; 1601 | $$state[2] = $total; 1602 | $v; 1603 | }), 1604 | 1605 | stateful_unary_fn('intify', 1606 | sub {[{}, 0]}, 1607 | sub { 1608 | my ($x, $state) = @_; 1609 | $state->[0]->{$x} //= $state->[1]++; 1610 | }), 1611 | 1612 | aggregate => sub { 1613 | my $f = compile_eval_into_function $_[0], 'aggregate function'; 1614 | my @columns; 1615 | while (my $line = <>) { 1616 | be_verbose_as_appropriate length $line; 1617 | chomp $line; 1618 | my @fields = split /\t/, $line; 1619 | 1620 | # Two cases here. If the new record is compatible with the most recent 1621 | # existing one, or there aren't any existing ones, then group it and 1622 | # don't call the aggregator yet. 1623 | # 1624 | # If we see a change, then call the aggregator and empty out the group. 1625 | # 1626 | # Note that the aggregator function is called on columns, not rows. 1627 | 1628 | my $n = @columns && @{$columns[0]}; 1629 | if (!$n or $fields[0] eq ${$columns[0]}[0]) { 1630 | $columns[$_][$n] = $fields[$_] for 0 .. $#fields; 1631 | } else { 1632 | $_ = ${$columns[0]}[0]; 1633 | print $_, "\n" for $f->(@columns); 1634 | @columns = (); 1635 | $columns[$_][0] = $fields[$_] for 0 .. $#fields; 1636 | } 1637 | } 1638 | if (@columns) { 1639 | $_ = ${$columns[0]}[0]; 1640 | print $_, "\n" for $f->(@columns); 1641 | } 1642 | }, 1643 | 1644 | fold => sub { 1645 | my $f = compile_eval_into_function $_[0], 'fold function'; 1646 | my @saved; 1647 | while (<>) { 1648 | be_verbose_as_appropriate length; 1649 | chomp; 1650 | my $line = $_; 1651 | if ($f->(split /\t/)) { 1652 | push @saved, $line; 1653 | } else { 1654 | print row(@saved), "\n" if @saved; 1655 | @saved = ($line); 1656 | } 1657 | } 1658 | print row(@saved), "\n" if @saved; 1659 | }, 1660 | 1661 | stateless_unary_fn('log', sub { 1662 | my ($x, $base) = @_; 1663 | my $log = log $x; 1664 | $log /= log $base if defined $base; 1665 | $log; 1666 | }), 1667 | 1668 | stateless_unary_fn('exp', sub { 1669 | my ($x, $base) = @_; 1670 | defined $base ? $base ** $x : exp $x; 1671 | }), 1672 | 1673 | stateless_unary_fn('quant', sub { 1674 | my ($x, $quantum) = @_; 1675 | round_to $x, $quantum; 1676 | }), 1677 | 1678 | # Note: this needs to be stdin; otherwise "nfu -p %l filename" will fail 1679 | # (since exec_with_diamond trieds to pass filename straight into gnuplot). 1680 | plot => sub { 1681 | exec_with_stdin 'gnuplot', 1682 | '-e', 1683 | 'plot "-" ' . 
join(' ', expand_gnuplot_options @_), 1684 | '-persist'; 1685 | }, 1686 | 1687 | splot => sub { 1688 | exec_with_stdin 'gnuplot', 1689 | '-e', 1690 | 'splot "-" ' . join(' ', expand_gnuplot_options @_), 1691 | '-persist'; 1692 | }, 1693 | 1694 | mplot => sub { 1695 | my @gnuplot_options = split /;/, join ' ', expand_gnuplot_options @_; 1696 | my $fname = tmpnam; 1697 | my $cols = 0; 1698 | open my $fh, '>', $fname or die "failed to open tempfile for mplot: $!"; 1699 | while (<>) { 1700 | be_verbose_as_appropriate length; 1701 | $cols = max $cols, 1 + scalar(my @xs = /\t/g); 1702 | print $fh $_; 1703 | } 1704 | close $fh; 1705 | 1706 | # If we're requesting only one plot, assume the intent is to replicate 1707 | # those settings across every observed column. 1708 | my $plot_command = 1709 | 'plot ' . join ',', 1710 | @gnuplot_options > 1 1711 | ? map("\"$fname\" $_", @gnuplot_options) 1712 | : map("\"$fname\" using $_ $gnuplot_options[0]", 1..$cols); 1713 | 1714 | system 'gnuplot', '-e', $plot_command, '-persist'; 1715 | 1716 | # HACK: the problem is that gnuplot internally forks a subprocess for the 1717 | # plot window, which we won't be able to see from here (that I know of). If 1718 | # we delete the file before that subprocess exits, then any zoom operations 1719 | # will cause gnuplot to abruptly exit. 1720 | # 1721 | # I'm sure there's a better way to solve this, but for now this should do 1722 | # the job for now. 1723 | unless (fork) { 1724 | setsid; 1725 | close STDIN; 1726 | close STDOUT; 1727 | unless (fork) { 1728 | sleep 3600; 1729 | unlink $fname or die "failed to unlink $fname: $!"; 1730 | } 1731 | } 1732 | }, 1733 | 1734 | poll => sub { 1735 | my ($sleep, $command) = @_; 1736 | die "usage: --poll sleep-amount 'command ...'" 1737 | unless defined $sleep and defined $command; 1738 | system($command), sleep $sleep while 1; 1739 | }, 1740 | 1741 | stateful_unary_fn('delta', 1742 | sub {[0]}, 1743 | sub {my ($x, $state) = @_; 1744 | my $v = $x - $$state[0]; 1745 | $$state[0] = $x; 1746 | $v}), 1747 | 1748 | stateful_unary_fn('sum', 1749 | sub {[0]}, 1750 | sub {my ($x, $state) = @_; 1751 | $$state[0] += $x}), 1752 | 1753 | stateful_unary_fn('variance', 1754 | sub {[0, 0, 0]}, 1755 | sub {my ($x, $state) = @_; 1756 | $$state[0] += $x; 1757 | $$state[1] += $x * $x; 1758 | $$state[2]++; 1759 | my ($sx, $sx2, $count) = @$state; 1760 | ($sx2 - ($sx * $sx / $count)) / ($count - 1 || 1)}), 1761 | 1762 | stateful_unary_fn('sd', 1763 | sub {[0, 0, 0]}, 1764 | sub {my ($x, $state) = @_; 1765 | $$state[0] += $x; 1766 | $$state[1] += $x * $x; 1767 | $$state[2]++; 1768 | my ($sx, $sx2, $count) = @$state; 1769 | sqrt(($sx2 - ($sx * $sx / $count)) / ($count - 1 || 1))}), 1770 | 1771 | stateful_unary_fn('entropy', 1772 | # state contains [$total, $entropy_so_far] and uses the following 1773 | # associative combiner (where F(X) = frequency of X, unscaled probability): 1774 | # 1775 | # let t = F(A) + F(B) 1776 | # H(A + B) = F(A)/t * (-log(F(A)/t) + H(A)) 1777 | # + F(B)/t * (-log(F(B)/t) + H(B)) 1778 | 1779 | sub {[0, 0]}, 1780 | sub {my ($x, $state) = @_; 1781 | my ($f0, $h0) = @$state; 1782 | my $f = $$state[0] += $x; 1783 | my $p = $x / $f; 1784 | my $p0 = $f0 / $f; 1785 | $$state[1] = $p0 * (($p0 > 0 ? -log($p0) / LOG_2 : 0) + $h0) 1786 | + $p * ($p > 0 ? 
-log($p) / LOG_2 : 0)}), 1787 | 1788 | take => sub { 1789 | if ($_[0] =~ s/^\+//) { 1790 | # Take last n, so we need a line queue 1791 | my @q; 1792 | my $i = 0; 1793 | be_verbose_as_appropriate(length), $q[$i++ % $_[0]] = $_ while <>; 1794 | print for @q[$i % $_[0] .. $#q]; 1795 | print for @q[0 .. $i % $_[0] - 1]; 1796 | } else { 1797 | my $n = $_[0] // 1; 1798 | while (<>) { 1799 | be_verbose_as_appropriate length; 1800 | last if --$n < 0; 1801 | print; 1802 | } 1803 | } 1804 | }, 1805 | 1806 | sample => sub { 1807 | while (<>) { 1808 | be_verbose_as_appropriate length; 1809 | print if rand() < $_[0]; 1810 | } 1811 | }, 1812 | 1813 | drop => sub { 1814 | my $n = $_[0] // 1; 1815 | if ($n) { 1816 | while (<>) { 1817 | be_verbose_as_appropriate length; 1818 | last if --$n <= 0; 1819 | } 1820 | } 1821 | be_verbose_as_appropriate(length), print while <>; 1822 | }, 1823 | 1824 | map => sub { 1825 | my $f = compile_eval_into_function $_[0], 'map function'; 1826 | while (<>) { 1827 | be_verbose_as_appropriate length; 1828 | chomp; 1829 | print "$_\n" for $f->(split /\t/); 1830 | } 1831 | }, 1832 | 1833 | pmap => sub { 1834 | my @fhs; 1835 | my $wbits = ''; 1836 | my $wout = ''; 1837 | my $i = 0; 1838 | 1839 | for (1 .. $ENV{NFU_PMAP_PARALLELISM} // 16) { 1840 | my $mapper = quote_self '--child', @evaled_code, '--map', $_[0]; 1841 | open my $fh, "| $mapper" 1842 | or die "failed to open child process $mapper: $!"; 1843 | 1844 | vec($wbits, fileno($fh), 1) = 1; 1845 | push @fhs, $fh; 1846 | } 1847 | 1848 | while (<>) { 1849 | be_verbose_as_appropriate length; 1850 | select undef, $wout = $wbits, undef, undef; 1851 | ++$i until vec($wout, fileno $fhs[$i % @fhs], 1); 1852 | syswrite $fhs[$i++ % @fhs], $_; 1853 | } 1854 | close for @fhs; 1855 | }, 1856 | 1857 | keep => sub { 1858 | my $f = $_[0] =~ /^\d+$/ 1859 | ? eval "sub {" . join("&&", map "\$_[$_]", split //, $_[0]) . "}" 1860 | : compile_eval_into_function $_[0], 'keep function'; 1861 | while (<>) { 1862 | my $line = $_; 1863 | be_verbose_as_appropriate length; 1864 | chomp; 1865 | my @xs = split /\t/; 1866 | print $line if $f->(@xs); 1867 | } 1868 | }, 1869 | 1870 | remove => sub { 1871 | my $f = $_[0] =~ /^\d+$/ 1872 | ? eval "sub {" . join("&&", map "\$_[$_]", split //, $_[0]) . "}" 1873 | : compile_eval_into_function $_[0], 'remove function'; 1874 | while (<>) { 1875 | my $line = $_; 1876 | be_verbose_as_appropriate length; 1877 | chomp; 1878 | my @xs = split /\t/; 1879 | print $line unless $f->(@xs); 1880 | } 1881 | }, 1882 | 1883 | each => sub { 1884 | my ($template) = @_; 1885 | while (<>) { 1886 | be_verbose_as_appropriate length; 1887 | chomp; 1888 | my $c = $template =~ s/\{\}/$_/gr; 1889 | system $c and die "each: failed to run $c: $!"; 1890 | } 1891 | }, 1892 | 1893 | every => sub { 1894 | my ($n) = @_; 1895 | my $i = 0; 1896 | while (<>) { 1897 | be_verbose_as_appropriate length; 1898 | print unless $i++ % $n; 1899 | } 1900 | }, 1901 | 1902 | fields => sub { 1903 | my ($fields) = @_; 1904 | my $everything = $fields =~ s/\.$//; 1905 | my @fs = split //, $fields; 1906 | $everything &&= 1 + max @fs; 1907 | 1908 | while (<>) { 1909 | be_verbose_as_appropriate length; 1910 | chomp; 1911 | my @xs = split /\t/; 1912 | my @ys = @xs[@fs]; 1913 | push @ys, @xs[$everything .. 
$#xs] if $everything; 1914 | print join("\t", map $_ // '', @ys), "\n"; 1915 | } 1916 | }, 1917 | 1918 | fieldsplit => sub { 1919 | my $pattern = $implosions{$_[0]} // $_[0]; 1920 | $pattern = $fieldsplit_shorthands{$pattern} // $pattern; 1921 | my $delim = qr/$pattern/; 1922 | while (<>) { 1923 | be_verbose_as_appropriate length; 1924 | chomp; 1925 | print join("\t", split /$delim/), "\n"; 1926 | } 1927 | }, 1928 | 1929 | number => sub { 1930 | my $n = 0; 1931 | while (<>) { 1932 | be_verbose_as_appropriate length; 1933 | chomp; 1934 | print row(++$n, $_), "\n"; 1935 | } 1936 | }, 1937 | 1938 | ntiles => sub { 1939 | my ($n) = @_; 1940 | my $line_count = 0; 1941 | my $fifo = tmpnam; 1942 | mkfifo $fifo, 0700 or die "failed to create fifo: $!"; 1943 | open my $sorted, sort_cmd('-g', $fifo) . " |" 1944 | or die "failed to create sort process: $!"; 1945 | open my $fifo_fh, '>', $fifo or die "failed to write to fifo: $!"; 1946 | 1947 | # Push data into the sort process, keeping track of the number of lines. 1948 | # We'll use this count later to use constant rather than linear space. 1949 | while (<>) { 1950 | ++$line_count; 1951 | be_verbose_as_appropriate(length); 1952 | $fifo_fh->print($_); 1953 | } 1954 | close $fifo_fh; 1955 | unlink $fifo; 1956 | 1957 | # Ok, now grab the data for each of the N sampling points. Here's what 1958 | # we're doing: 1959 | # 1960 | # 0 1 2 <- things we grab for 2-tiles 1961 | # 1 2 3 4 5 6 7 8 9 a b c d <- data points 1962 | # 1963 | # 0 1 2 <- things we grab for 2-tiles 1964 | # 1 2 3 4 5 6 7 8 9 a b c <- data points 1965 | # 1966 | # 0 1 2 3 4 <- quartiles 1967 | # 1 2 3 4 5 6 7 <- data points 1968 | # 1969 | # To make this work, we just keep track of the previous data point as we're 1970 | # reading. 1971 | 1972 | # Normal case 1973 | my $i = 0; 1974 | my $previous = undef; 1975 | my $max_line = $line_count - 1; 1976 | while (<$sorted>) { 1977 | chomp; 1978 | $previous = $_ unless defined $previous; 1979 | 1980 | my $break = int($i * $n / $max_line) / $n * $max_line; 1981 | if ($i >= $break && $i - $break < 1) { 1982 | # Take this row, performing a weighted average with the previous one: 1983 | # 1984 | # break 1985 | # x V y 1986 | # | | 1987 | # ----|------------ 1988 | # $w 1-$w 1989 | # 1990 | # We want x*(1-$w) + y*$w. 
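#
# For example, --ntiles 4 over the seven sorted values 1..7 places breaks at
# 0, 1.5, 3, 4.5 and 6, so it emits 1, 2.5, 4, 5.5 and 7 (minimum, quartiles,
# maximum); 2.5 and 5.5 come from the weighted average described above.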
1991 | 1992 | my $w = $break - int $break || 1; 1993 | my $v = $previous * (1 - $w) + $_ * $w; 1994 | print $v, "\n"; 1995 | } 1996 | 1997 | ++$i; 1998 | $previous = $_; 1999 | } 2000 | close $sorted; 2001 | }, 2002 | 2003 | prepend => sub { 2004 | open my $fh, expand_filename_shorthands $_[0], 1 2005 | or die "failed to open --prepend pseudofile $_[0]: $!"; 2006 | be_verbose_as_appropriate(length), print while <$fh>; 2007 | close $fh; 2008 | print while <>; 2009 | }, 2010 | 2011 | append => sub { 2012 | open my $fh, expand_filename_shorthands $_[0], 1 2013 | or die "failed to open --append pseudofile $_[0]: $!"; 2014 | print while <>; 2015 | be_verbose_as_appropriate(length), print while <$fh>; 2016 | close $fh; 2017 | }, 2018 | 2019 | pipe => sub { 2020 | open my $fh, "| $_[0]" or die "failed to launch $_[0]: $!"; 2021 | be_verbose_as_appropriate(length), print $fh $_ while <>; 2022 | close $fh; 2023 | }, 2024 | 2025 | tee => sub { 2026 | open my $fh, "| $_[0]" or die "failed to launch $_[0]: $!"; 2027 | $SIG{PIPE} = 'IGNORE'; 2028 | while (<>) { 2029 | be_verbose_as_appropriate length; 2030 | $fh->print($_); 2031 | print; 2032 | } 2033 | close $fh; 2034 | }, 2035 | 2036 | duplicate => sub { 2037 | open my $fh1, "| $_[0]" or die "failed to launch $_[0]: $!"; 2038 | open my $fh2, "| $_[1]" or die "failed to launch $_[1]: $!"; 2039 | 2040 | # Important: keep going even if a subprocess rejects data. Otherwise things 2041 | # like "nfu --duplicate ^T1 ^T+1" will produce truncated output. 2042 | $SIG{PIPE} = 'IGNORE'; 2043 | 2044 | while (<>) { 2045 | be_verbose_as_appropriate length; 2046 | $fh1->print($_); 2047 | $fh2->print($_); 2048 | } 2049 | close $fh1; 2050 | close $fh2; 2051 | }, 2052 | 2053 | partition => sub { 2054 | my ($splitter, $cmd) = @_; 2055 | my %fhs; 2056 | my $f = compile_eval_into_function $splitter, 'partition function'; 2057 | 2058 | # Important: keep going even if a subprocess rejects data. Otherwise things 2059 | # like "nfu --partition ... ^T10" will produce truncated output. 2060 | $SIG{PIPE} = 'IGNORE'; 2061 | 2062 | my @open_partitions; 2063 | while (<>) { 2064 | be_verbose_as_appropriate length; 2065 | my $line = $_; 2066 | chomp(my $cline = $line); 2067 | my $p = $f->(split /\t/, $cline); 2068 | unless (exists $fhs{$p}) { 2069 | my $cmdsub = $cmd =~ s/\{\}/$p/gr =~ s/\{\.(\.*)\}/\{$1\}/gr; 2070 | open $fhs{$p}, "| $cmdsub" or die "failed to launch $cmdsub: $!"; 2071 | push @open_partitions, $p; 2072 | } 2073 | $fhs{$p}->print($line); 2074 | close($fhs{$p = shift @open_partitions}), delete $fhs{$p} 2075 | while @open_partitions > ($ENV{NFU_MAX_FILEHANDLES} // 64); 2076 | } 2077 | close for values %fhs; 2078 | }, 2079 | 2080 | hadoopc => sub { 2081 | my ($outfile, $mapper, $combiner, $reducer) = @_; 2082 | if ($outfile eq '.') { 2083 | # Print output data to stdout, then delete outfile 2084 | my $filename = hadoop_tempfile; 2085 | my @partfiles = map s/^hdfs:(?:\/\/)?//r, 2086 | hadoop_ls hadoop_into $filename, $mapper, $combiner, $reducer; 2087 | open my $fh, "| xargs hadoop fs -text" 2088 | or die "failed to execute xargs: $!"; 2089 | $fh->print("$_\n") for @partfiles; 2090 | close $fh; 2091 | system hadoop('fs', '-rm', '-r', $filename) . 
' 1>&2'; 2092 | } else { 2093 | $outfile = hadoop_tempfile if $outfile eq '@'; 2094 | print hadoop_into($outfile, $mapper, $combiner, $reducer), "\n"; 2095 | } 2096 | }, 2097 | 2098 | hadoop => sub { 2099 | my ($outfile, $mapper, $reducer) = @_; 2100 | if ($outfile eq '.') { 2101 | # Print output data to stdout, then delete outfile 2102 | my $filename = hadoop_tempfile; 2103 | my @partfiles = map s/^hdfs:(?:\/\/)?//r, 2104 | hadoop_ls hadoop_into $filename, $mapper, 'NONE', $reducer; 2105 | open my $fh, "| xargs hadoop fs -text" 2106 | or die "failed to execute xargs: $!"; 2107 | $fh->print("$_\n") for @partfiles; 2108 | close $fh; 2109 | system hadoop('fs', '-rm', '-r', $filename) . ' 1>&2'; 2110 | } else { 2111 | $outfile = hadoop_tempfile if $outfile eq '@'; 2112 | print hadoop_into($outfile, $mapper, 'NONE', $reducer), "\n"; 2113 | } 2114 | }, 2115 | 2116 | sql => sub { 2117 | my ($db, $table, $schema, $query) = sql_parse_args @_; 2118 | my @read = sql_schema_and_buffer $schema; 2119 | my $index = sql_first_column_index $table, $schema; 2120 | $query = expand_sql_shorthands $query; 2121 | 2122 | if ($db =~ s/^P//) { 2123 | # postgres 2124 | my $edb = expand_postgres_db $db; 2125 | my $q = $query eq '_' ? '' 2126 | : "COPY ($query) TO STDOUT WITH NULL AS '';\n"; 2127 | 2128 | open my $fh, "| psql -c " 2129 | . shell_quote("DROP TABLE IF EXISTS $table;\n" 2130 | . "CREATE TABLE $table ($schema);\n" 2131 | . "COPY $table FROM STDIN;\n" 2132 | . $index 2133 | . $q) 2134 | . " $edb 1>&2" 2135 | or die "psql: failed to open psql for table $table $!"; 2136 | 2137 | write_buffer_and_stdin $fh, @read; 2138 | close $fh; 2139 | print "sql:P$db:select * from $table\n" unless length $q; 2140 | } elsif ($db =~ s/^S//) { 2141 | # sqlite3 2142 | my $fifo_name = tmpnam; 2143 | mkfifo $fifo_name, 0700 or die "failed to create fifo: $!"; 2144 | 2145 | my $child = fork; 2146 | unless ($child) { 2147 | system "echo " 2148 | . shell_quote(".mode tabs\n" 2149 | . "DROP TABLE IF EXISTS $table;\n" 2150 | . "CREATE TABLE $table ($schema);\n" 2151 | . ".import $fifo_name $table\n" 2152 | . $index 2153 | . ($query eq '_' ? '' : "$query;\n")) 2154 | . "| sqlite3 " . shell_quote expand_sqlite_db $db; 2155 | } else { 2156 | open my $fh, '>', $fifo_name or die "failed to open fifo into sqlite: $!"; 2157 | write_buffer_and_stdin $fh, @read; 2158 | close $fh; 2159 | unlink $fifo_name; 2160 | waitpid $child, 0; 2161 | print "sql:S$db:select * from $table\n" if $query eq '_'; 2162 | } 2163 | } else { 2164 | die "unknown SQL prefix: " . substr($db, 0, 1) 2165 | . " (valid prefixes are P and S)"; 2166 | } 2167 | }, 2168 | 2169 | preview => sub { 2170 | $verbose = 0; # don't print over the pager 2171 | my $have_less = !system 'which less > /dev/null'; 2172 | my $have_more = !system 'which more > /dev/null'; 2173 | 2174 | my $less_program = $have_less ? 'less' 2175 | : $have_more ? 
'more' : 'cat'; 2176 | 2177 | exec_with_stdin $less_program; 2178 | }, 2179 | ); 2180 | 2181 | my %bracket_handlers = ( 2182 | '' => sub {my $stuff = shell_quote @_; 2183 | "".qx|$0 --quote $stuff| =~ s/\s*$//r}, 2184 | '@' => sub {my $stuff = shell_quote @_; 2185 | "sh:".(qx|$0 --quote $stuff| =~ s/\s*$//r)}, 2186 | q => sub {shell_quote @_}, 2187 | ); 2188 | 2189 | my %bracket_docs = ( 2190 | '' => 'nfu as function: [ -gc ] == "$(nfu --quote -gc)"', 2191 | '@' => 'nfu as data: @[ -gc foo ] == sh:"$(nfu --quote -gc foo)"', 2192 | q => 'quote things: q[ foo bar ] == "foo bar"', 2193 | ); 2194 | 2195 | # Print usage if the user clearly doesn't know what they're doing. 2196 | if (@ARGV ? $ARGV[0] =~ /^-[h?]$/ || $ARGV[0] =~ /^--(usage|help)$/ 2197 | : -t STDIN) { 2198 | 2199 | # Some checks for me to make sure I'm keeping the code well-maintained 2200 | exists $functions{$_} or die "no function for $_" for keys %usages; 2201 | exists $usages{$_} or die "no usage for $_" for keys %functions; 2202 | exists $arity{$_} or die "no arity for $_" for keys %usages; 2203 | exists $usages{$_ =~ s/--//r} or die "no usage for $_" 2204 | for values %explosions, keys %usages; 2205 | 2206 | exists $bracket_docs{$_} or die "no bracket doc for $_" 2207 | for keys %bracket_handlers; 2208 | 2209 | print STDERR "usage: nfu [prefix-commands...] [input-files...] commands...\n"; 2210 | print STDERR "where each command is one of the following:\n\n"; 2211 | 2212 | my $len = 1 + max map length, keys %usages; 2213 | my %short_lookup; 2214 | $short_lookup{$explosions{$_} =~ s/^--//r} = $_ for keys %explosions; 2215 | 2216 | for my $cmd (sort keys %usages) { 2217 | my $short = $short_lookup{$cmd}; 2218 | $short = defined $short ? "-$short|" : ' '; 2219 | printf STDERR " %s--%-${len}s(%d) %s\n", 2220 | $short, 2221 | $cmd, 2222 | $arity{$cmd}, 2223 | $usages{$cmd} ? $arity{$cmd} ? 
"<$usages{$cmd}>" 2224 | : "-- $usages{$cmd}" : ''; 2225 | } 2226 | 2227 | print STDERR "\nand prefix commands are:\n\n"; 2228 | 2229 | print STDERR " documentation (not used with normal commands):\n"; 2230 | print STDERR " --explain \n"; 2231 | print STDERR " --expand-pseudofile \n"; 2232 | print STDERR " --expand-code \n"; 2233 | print STDERR " --expand-gnuplot \n"; 2234 | print STDERR " --expand-sql \n"; 2235 | 2236 | print STDERR "\n pipeline modifiers:\n"; 2237 | print STDERR " --quote -- quotes args: eval \$(nfu --quote ...)\n"; 2238 | print STDERR " --use \n"; 2239 | print STDERR " --run \n"; 2240 | 2241 | print STDERR "\nargument bracket preprocessing:\n\n"; 2242 | 2243 | print STDERR " ^stuff -> [ -stuff ]\n\n"; 2244 | 2245 | my $bracket_max = max map length, keys %bracket_docs; 2246 | printf STDERR " %${bracket_max}s[ ] %s\n", $_, $bracket_docs{$_} 2247 | for sort keys %bracket_docs; 2248 | 2249 | my $pseudofile_len = 1 + max map length, keys %pseudofile_docs; 2250 | print STDERR "\npseudofile patterns:\n\n"; 2251 | printf STDERR " %-${pseudofile_len}s %s\n", $_, $pseudofile_docs{$_} 2252 | for sort keys %pseudofile_docs; 2253 | 2254 | print STDERR "\ngnuplot expansions:\n\n"; 2255 | printf STDERR " %2s -> '%s'\n", $_, $gnuplot_aliases{$_} 2256 | for sort keys %gnuplot_aliases; 2257 | 2258 | print STDERR "\nSQL expansions:\n\n"; 2259 | printf STDERR " %2s -> '%s'\n", $_, $sql_aliases{$_} 2260 | for sort keys %sql_aliases; 2261 | 2262 | print STDERR "\ndatabase prefixes:\n\n"; 2263 | printf STDERR " %s = %s\n", @$_ 2264 | for ['P' => 'PostgreSQL'], 2265 | ['S' => 'SQLite 3']; 2266 | 2267 | my $env_len = 1 + max map length, keys %env_docs; 2268 | print STDERR "\nenvironment variables:\n\n"; 2269 | printf STDERR " %-${env_len}s %s\n", $_, $env_docs{$_} 2270 | for sort keys %env_docs; 2271 | 2272 | print STDERR "\n"; 2273 | print STDERR "see https://github.com/spencertipping/nfu for documentation\n"; 2274 | print STDERR "\n"; 2275 | 2276 | exit 1; 2277 | } 2278 | 2279 | if (@ARGV && $ARGV[0] =~ /^--expand/) { 2280 | my ($command, $x, @others) = @ARGV; 2281 | if ($command =~ /-pseudofile$/) { 2282 | print expand_filename_shorthands($x) // '', "\n"; 2283 | } elsif ($command =~ /-code$/) { 2284 | print expand_eval_shorthands($x), "\n"; 2285 | } elsif ($command =~ /-gnuplot$/) { 2286 | print expand_gnuplot_options($x), "\n"; 2287 | } elsif ($command =~ /-sql$/) { 2288 | print expand_sql_shorthands($x), "\n"; 2289 | } else { 2290 | print STDERR "unknown expansion command: $command\n"; 2291 | exit 1; 2292 | } 2293 | exit 0; 2294 | } 2295 | 2296 | my @args_to_parse; 2297 | 2298 | # Preprocess args to look for bracketed groups. These need to be collapsed into 2299 | # single args, which must happen before we start assigning arguments to 2300 | # commands. 2301 | while (@ARGV) { 2302 | my $x = shift @ARGV; 2303 | last if $x eq '--'; 2304 | if ($x =~ s/\[$//) { 2305 | die "unknown bracket prefix: $x" unless exists $bracket_handlers{$x}; 2306 | my @xs; 2307 | my $depth = 1; 2308 | while (@ARGV) { 2309 | my $next = shift @ARGV; 2310 | $depth-- if $next eq ']'; 2311 | last unless $depth; 2312 | push @xs, $next; 2313 | $depth++ if $next =~ /\[$/; 2314 | } 2315 | unshift @ARGV, $bracket_handlers{$x}->(@xs); 2316 | } elsif ($x =~ s/^\^//) { 2317 | # Lift the command (as a short option) into a quoted nfu instance. 2318 | $x = shell_quote $x; 2319 | unshift @ARGV, ''.qx|$0 --quote -$x|; 2320 | } elsif ($x =~ s/\{$//) { 2321 | # Parse a branching map, which has the form { 'pattern' stuff... 
, 2322 | # 'pattern' stuff... , ... }. We generate a quoted TSV of packed base-64 2323 | # values. 2324 | die "unknown brace prefix: $x" if length $x; 2325 | my @lines; 2326 | my $key; 2327 | my @value; 2328 | my $depth = 1; 2329 | while (@ARGV) { 2330 | my $next = shift @ARGV; 2331 | $depth-- if $next eq '}'; 2332 | 2333 | if (!$depth || defined $key && $next eq ',') { 2334 | push @lines, row pack('u', $key), 2335 | pack('u', shell_quote @value); 2336 | @value = (); 2337 | $key = undef; 2338 | } elsif (defined $key) { 2339 | push @value, $next; 2340 | } else { 2341 | $key = $next; 2342 | } 2343 | 2344 | last unless $depth; 2345 | $depth++ if $next =~ /\{$/; 2346 | } 2347 | 2348 | unshift @ARGV, join "\n", @lines; 2349 | } elsif ($x eq '%') { 2350 | # Everything else is a variable binding of the form 'x=y'. Then go back 2351 | # through and rewrite all %x to be y. 2352 | my %bindings = map split(/=/, $_, 2), @ARGV; 2353 | my $names = join '|', keys %bindings; 2354 | @ARGV = (); 2355 | s#%($names)#$bindings{$1} // "%$1"#ge for @args_to_parse; 2356 | } else { 2357 | push @args_to_parse, $x; 2358 | } 2359 | } 2360 | 2361 | sub explode { 2362 | return $_[0] unless $_[0] =~ s/^-([^-])/$1/; 2363 | map {$explosions{$_} // $_} grep length, split /([-+.\d]*),?/, $_[0]; 2364 | } 2365 | 2366 | my %custom_env; 2367 | my @parsed; 2368 | my $quote_self = 0; 2369 | my $explain = 0; 2370 | 2371 | while (@args_to_parse) { 2372 | unshift @args_to_parse, explode shift @args_to_parse; 2373 | (my $command = shift @args_to_parse) =~ s/^--//; 2374 | 2375 | if (defined(my $arity = $arity{$command})) { 2376 | my @args; 2377 | push @args, shift @args_to_parse 2378 | while @args_to_parse && (--$arity >= 0 2379 | || ! -e $args_to_parse[0] 2380 | && $args_to_parse[0] =~ /^[-+]?\d+/); 2381 | push @parsed, [$command, @args]; 2382 | } elsif ($command =~ /(\w+)=(.*)/) { 2383 | $ENV{$1} = $custom_env{$1} = $2; 2384 | } elsif ($command eq 'run') { 2385 | my $x = shift @args_to_parse; 2386 | push @evaled_code, '--run', $x; 2387 | eval $x; 2388 | die "failed to run '$x': $@" if $@; 2389 | } elsif ($command eq 'use') { 2390 | my $x = shift @args_to_parse; 2391 | my $s = read_file $x; # you can --use pseudofiles, woot! 2392 | push @evaled_code, '--use', $x; 2393 | eval $s; 2394 | die "failed to use '$x': $@" if $@; 2395 | } elsif ($command eq 'explain') { 2396 | $explain = 1; 2397 | } elsif ($command eq 'verbose' || $command eq 'v') { 2398 | print STDERR "\033[2J" unless $quote_self; 2399 | $verbose = 1; 2400 | } elsif ($command eq 'child') { 2401 | $is_child = 1; 2402 | } elsif ($command eq 'quote') { 2403 | $quote_self = 1; 2404 | } else { 2405 | if ($quote_self) { 2406 | # Defer pseudofile resolution. This matters for things like intermediate 2407 | # Hadoop outputs. 2408 | push @ARGV, $command; 2409 | } else { 2410 | my $f = expand_filename_shorthands $command; 2411 | die "nonexistent pseudofile: $command" unless defined $f; 2412 | push @ARGV, $f; 2413 | } 2414 | } 2415 | } 2416 | 2417 | if ($quote_self) { 2418 | # Quote all other arguments so a shell will parse them correctly. 2419 | print quote_self($verbose ? ('--verbose') : (), 2420 | map("$_=$custom_env{$_}", keys %custom_env), 2421 | @evaled_code, 2422 | map(("--$$_[0]", @$_[1..$#$_]), @parsed), 2423 | @ARGV), "\n"; 2424 | exit 0; 2425 | } 2426 | 2427 | # Open output in an interactive previewer if... 
2428 | push @parsed, ['preview'] if !$ENV{NFU_NO_PAGER} # we can page 2429 | && (!-t STDIN || @ARGV) # not interacting for input 2430 | && -t STDOUT; # interacting for output 2431 | 2432 | if ($explain) { 2433 | # Explain what we would have done with the given command line. 2434 | printf "file\t%s\n", $_ =~ s/#.*\n//gr for @ARGV; 2435 | printf "--%s\t%s\n", ${$_}[0], join "\t", @{$_}[1 .. $#$_] for @parsed; 2436 | } elsif (@parsed) { 2437 | my $reader = undef; 2438 | 2439 | # Note: the loop below uses pipe/fork/dup2 instead of a more idiomatic Open2 2440 | # call. I don't have a good reason for this other than to figure out how the 2441 | # low-level stuff worked. 2442 | for (my $i = 0; $i < @parsed; ++$i) { 2443 | my ($command, @args) = @{$parsed[$i]}; 2444 | 2445 | # Here's where things get fun. The question right now is, "do we need to 2446 | # fork, or can we run in-process?" -- i.e. are we in the middle, or at the 2447 | # end? When we're in the middle, we want to redirect STDOUT to the pipe's 2448 | # writer and fork; otherwise we run in-process and write directly to the 2449 | # existing STDOUT. 2450 | ++$verbose_row; 2451 | if ($i < @parsed - 1) { 2452 | # We're in the middle, so allocate a pipe and fork. 2453 | pipe my($new_reader), my($writer); 2454 | $verbose_command = $command; 2455 | @verbose_args = @args; 2456 | unless (fork) { 2457 | # We're the child, so do STDOUT redirection. 2458 | close $new_reader or die "failed to close pipe reader: $!"; 2459 | POSIX::close(0) or die "failed to close stdin" if defined $reader; 2460 | dup2(fileno($reader), 0) or die "failed to dup input: $!" 2461 | if defined $reader; 2462 | POSIX::close(1); 2463 | dup2(fileno($writer), 1) or die "failed to dup stdout: $!"; 2464 | 2465 | close $reader or die "failed to close reader: $!" if defined $reader; 2466 | close $writer or die "failed to close writer: $!"; 2467 | 2468 | # The function here may never return. 2469 | $functions{$command}->(@args); 2470 | exit; 2471 | } else { 2472 | close $writer or die "failed to close pipe writer: $!"; 2473 | close $reader if defined $reader; 2474 | $reader = $new_reader; 2475 | } 2476 | } else { 2477 | # We've hit the end of the chain. Preserve stdout, redirect stdin from 2478 | # current reader. 2479 | POSIX::close(0) or die "failed to close stdin" if defined $reader; 2480 | dup2(fileno($reader), 0) or die "failed to dup input: $!" 2481 | if defined $reader; 2482 | close $reader or die "failed to close reader: $!" if defined $reader; 2483 | $verbose_command = $command; 2484 | @verbose_args = @args; 2485 | $functions{$command}->(@args); 2486 | } 2487 | 2488 | # Prevent <> from reading files after the first iteration (this is such a 2489 | # hack). 2490 | @ARGV = (); 2491 | } 2492 | } else { 2493 | # Behave like cat, which is useful for auto-decompressing things. 2494 | be_verbose_as_appropriate(length), print while <>; 2495 | } 2496 | -------------------------------------------------------------------------------- /sql.md: -------------------------------------------------------------------------------- 1 | # SQL databases 2 | nfu knows how to talk to PostgreSQL and SQLite 3 using their command-line 3 | interfaces. There are two ways to do this: 4 | 5 | - `--sql` (`-Q`) command: populate a table and optionally issue a query 6 | - `sql:` pseudofile: use a database query as TSV data 7 | 8 | **WARNING:** `--sql` always first drops and recreates the table before 9 | importing data. 
The `sql:` pseudofile does not delete or modify anything 10 | (unless you specifically write that into a query). 11 | 12 | You indicate postgres vs sqlite using a `P` or `S` prefix on the database name; 13 | otherwise the commands behave identically between databases. For example: 14 | 15 | ```sh 16 | # import data into an indexed sqlite table 17 | $ nfu /usr/share/dict/words \ 18 | -m 'row %0, length %0' \ 19 | -Q S@:+wordlengths _ _ 20 | 21 | # length of all words starting with 'a' 22 | $ nfu sql:S@:"%*wordlengths %w f0 LIKE 'a%'" 23 | 24 | # nfu explains what's going on here: 25 | $ nfu --expand-sql "%*wordlengths %w f0 LIKE 'a%'" 26 | select * from wordlengths where f0 LIKE 'a%' 27 | ``` 28 | 29 | ## How this works 30 | The first command generates pairs of `word, length`, which the sqlite3 command 31 | batch-imports into an indexed table called `wordlengths`. Here's how nfu makes 32 | this happen. 33 | 34 | `--sql` takes three arguments: `{P|S}dbname:tablename`, `schema`, and `query`, 35 | with the following special cases: 36 | 37 | - if `dbname` is `@`, a "default" database is used. For postgres this is your 38 | username, and for sqlite3 this is `/tmp/nfu-$USER-sqlite.db`. 39 | - if `tablename` begins with `+`, a first-column index is created after the 40 | data is inserted. If `tablename` is omitted altogether, nfu defaults to `+t` 41 | -- a table called `t` with a first-field index. 42 | - if `schema` is `_`, nfu looks at the first 20 lines of data and infers one, 43 | generating column names `f0`, `f1`, ..., `fN-1`. 44 | - if `query` is `_`, nfu prints a `sql:` pseudofile that will read the whole 45 | table. Otherwise the query is executed and the results printed as TSV. 46 | 47 | Queries are subject to SQL shorthand expansion; you can use `--expand-sql` to 48 | see the result. (All shorthands begin with `%`.) 49 | --------------------------------------------------------------------------------