├── README.md ├── cookbook.md ├── hadoop.md ├── humongous-survival-guide.md ├── nfu └── sql.md /README.md: -------------------------------------------------------------------------------- 1 | # nfu: Numeric Fu for your shell 2 | **NOTE:** nfu is unlikely to receive any more major updates, as I'm currently 3 | working on its successor [ni](https://github.com/spencertipping/ni). 4 | 5 | `nfu` is a text data hub and transformation tool with a large set of composable 6 | functions and source/sink adapters. For example, if you wanted to do a map-side 7 | inner join between a PostgreSQL table, a CSV from the Internet, and stuff on 8 | HDFS and gather the results into a sorted/uniqued text file: 9 | 10 | ```sh 11 | $ nfu sql:P@:'%*mytable' \ 12 | -i0 @[ http://data.com/csv -F , ] \ 13 | -H@::H. [ -i0 hdfsjoin:/path/to/hdfs/data ] ^gcf1. \ 14 | -g \ 15 | > output 16 | 17 | # equivalent long version 18 | $ nfu sql:Pdbname:'select * from mytable' \ 19 | --index 0 @[ http://data.com/csv --fieldsplit , ] \ 20 | --hadoop /tmp/temp-resharded-upload-path [ ] [ ] \ 21 | --hadoop . [ --index 0 hdfsjoin:/path/to/hdfs/data ] \ 22 | [ --group --count --fields 1. ] \ 23 | --group \ 24 | > output 25 | ``` 26 | 27 | Then if you wanted to plot a cumulative histogram of the `.metadata.size` JSON 28 | field from the third column values, binned to the nearest 100: 29 | 30 | ```sh 31 | $ nfu output -m 'jd(%2).metadata.size' -q100ocOs1f10p %l 32 | 33 | # equivalent long version 34 | $ nfu output --map 'json_decode($_[2]).metadata.size' \ 35 | --quant 100 --order --count --rorder \ 36 | --sum 1 --fields 10 --plot 'with lines' 37 | ``` 38 | 39 | ## Documentation 40 | - [The Humongous Survival Guide](humongous-survival-guide.md) 41 | - [The nfu Cookbook](cookbook.md) 42 | - [nfu and Hadoop Streaming](hadoop.md) 43 | - [nfu and SQL databases](sql.md) 44 | 45 | ## Contributors 46 | - [Spencer Tipping](https://github.com/spencertipping) 47 | - [Factual, Inc](https://github.com/Factual) 48 | 49 | MIT license as usual. 50 | 51 | ## Options and stuff 52 | If you invoke `nfu` with no arguments, it will give you the following summary: 53 | 54 | ```sh 55 | usage: nfu [prefix-commands...] [input-files...] commands... 
56 | where each command is one of the following: 57 | 58 | -A|--aggregate (1) 59 | --append (1) 60 | -a|--average (0) -- window size (0 for full average) -- running average 61 | -b|--branch (1) 62 | -R|--buffer (1) 63 | -c|--count (0) -- counts by first column value; like uniq -c 64 | -S|--delta (0) -- value -> difference from last value 65 | -D|--drop (0) -- number of records to drop 66 | --duplicate (2) 67 | -e|--each (1) 68 | --entropy (0) -- running entropy of relative probabilities/frequencies 69 | -E|--every (1) 70 | -L|--exp (0) -- optional base (default e) 71 | -f|--fields (0) -- string of digits, each a zero-indexed column selector 72 | -F|--fieldsplit (1) 73 | --fold (1) 74 | -g|--group (0) -- sorts ascending, takes optional column list 75 | -H|--hadoop (3) 76 | --http (1) 77 | -i|--index (2) 78 | -I|--indexouter (2) 79 | -z|--intify (0) -- convert column to dense integers (linear space) 80 | -j|--join (2) 81 | -J|--joinouter (2) 82 | -k|--keep (1) 83 | -l|--log (0) -- optional base (default e) 84 | -m|--map (1) 85 | --mplot (1) 86 | -N|--ntiles (1) 87 | -n|--number (0) -- prepends line number to each line 88 | --octave (1) 89 | -o|--order (0) -- sorts ascending by general numeric value 90 | --partition (2) 91 | --pipe (1) 92 | -p|--plot (1) 93 | -M|--pmap (1) 94 | -P|--poll (2) 95 | --prepend (1) 96 | --preview (0) 97 | -q|--quant (1) 98 | -r|--read (0) -- reads pseudofiles from the data stream 99 | -K|--remove (1) 100 | --repeat (2) 101 | -G|--rgroup (0) -- sorts descending, takes optional column list 102 | -O|--rorder (0) -- sorts descending by general numeric value 103 | --sample (1) 104 | --sd (0) -- running standard deviation 105 | --splot (1) 106 | -Q|--sql (3) 107 | -s|--sum (0) -- value -> total += value 108 | -T|--take (0) -- n to take first n, +n to take last n 109 | --tcp (1) 110 | --tee (1) 111 | -C|--uncount (0) -- the opposite of --count; repeats each row N times 112 | -V|--variance (0) -- running variance 113 | -w|--with (1) 114 | 115 | and prefix commands are: 116 | 117 | documentation (not used with normal commands): 118 | --explain 119 | --expand-pseudofile 120 | --expand-code 121 | --expand-gnuplot 122 | --expand-sql 123 | 124 | pipeline modifiers: 125 | --quote -- quotes args: eval $(nfu --quote ...) 
126 | --use 127 | --run 128 | 129 | argument bracket preprocessing: 130 | 131 | ^stuff -> [ -stuff ] 132 | 133 | [ ] nfu as function: [ -gc ] == "$(nfu --quote -gc)" 134 | @[ ] nfu as data: @[ -gc foo ] == sh:"$(nfu --quote -gc foo)" 135 | q[ ] quote things: q[ foo bar ] == "foo bar" 136 | 137 | pseudofile patterns: 138 | 139 | file.bz2 decompress file with bzip2 -dc 140 | file.gz decompress file with gzip -dc 141 | file.lzo decompress file with lzop -dc 142 | file.xz decompress file with xz -dc 143 | hdfs:path read HDFS file(s) with hadoop fs -text 144 | hdfsjoin:path mapside join pseudofile (a subset of hdfs:path) 145 | http[s]://url retrieve url with curl 146 | id:X verbatim text X 147 | n:number numbers from 1 to n, inclusive 148 | perl:expr perl -e 'print "$_\n" for (expr)' 149 | s3://url access S3 using s3cmd 150 | sh:stuff run sh -c "stuff", take stdout 151 | sql:db:query results of query as TSV 152 | user@host:x remote data access (x can be a pseudofile) 153 | 154 | gnuplot expansions: 155 | 156 | %d -> ' with dots' 157 | %i -> ' with impulses' 158 | %l -> ' with lines' 159 | %p -> ' lc palette ' 160 | %t -> ' title ' 161 | %u -> ' using ' 162 | %v -> ' with vectors ' 163 | 164 | SQL expansions: 165 | 166 | %\* -> ' select * from ' 167 | %c -> ' select count(1) from ' 168 | %d -> ' select distinct * from ' 169 | %g -> ' group by ' 170 | %j -> ' inner join ' 171 | %l -> ' outer left join ' 172 | %r -> ' outer right join ' 173 | %w -> ' where ' 174 | 175 | database prefixes: 176 | 177 | P = PostgreSQL 178 | S = SQLite 3 179 | 180 | environment variables: 181 | 182 | NFU_ALWAYS_VERBOSE if set, nfu will be verbose all the time 183 | NFU_HADOOP_COMMAND hadoop executable; e.g. hadoop jar, hadoop fs -ls 184 | NFU_HADOOP_OPTIONS -D options for hadoop streaming jobs 185 | NFU_HADOOP_STREAMING absolute location of hadoop-streaming.jar 186 | NFU_HADOOP_TMPDIR default /tmp; temp dir for hadoop uploads 187 | NFU_MAX_FILEHANDLES default 64; maximum #subprocesses for --partition 188 | NFU_NO_PAGER if set, nfu will not use "less" to preview stdout 189 | NFU_PMAP_PARALLELISM number of subprocesses for -M 190 | NFU_SORT_BUFFER default 256M; size of in-memory sort for -g and -o 191 | NFU_SORT_COMPRESS default none; compression program for sort tempfiles 192 | NFU_SORT_PARALLEL default 4; number of concurrent sorts to run 193 | 194 | see https://github.com/spencertipping/nfu for documentation 195 | ``` 196 | -------------------------------------------------------------------------------- /cookbook.md: -------------------------------------------------------------------------------- 1 | # The nfu Cookbook 2 | An ongoing collection of real-world tasks I ended up using nfu to solve. I 3 | recommend using `nfu --explain ...` and `nfu --expand-code '...'` on the 4 | examples below if you're new to nfu. 5 | 6 | ## First record within each of N categories 7 | I had a series of JSON records, each of which had a "category" field. I wanted 8 | to get 10000 output records, each in a different category. 
9 | 10 | ```sh 11 | $ nfu records -m 'my $j = jd(%0); row $j.metadata.category, %0' \ 12 | -gA '${%1}[0]' \ 13 | -T10000 14 | ``` 15 | 16 | Initially I also wanted to make sure none of the categories were bogus, so I 17 | previewed with the category keys still in the first column like this: 18 | 19 | ```sh 20 | $ nfu records -m 'my $j = jd(%0); row $j.metadata.category, %0' \ 21 | -gA 'row $_, ${%1}[0]' \ 22 | -T10000 23 | ``` 24 | 25 | ## Comparing field values across different JSON formats 26 | There were two files, one with lines in this format: 27 | 28 | ``` 29 | $ head -n1 file1 30 | {"metadata": {"category": "foo", ...}, "id": 1, "name": "bar", ...} 31 | ``` 32 | 33 | and the other, derived from the first, with lines in this format: 34 | 35 | ``` 36 | $ head -n1 file2 37 | {"category": "foo", "data": [{"id": 1, "record": {"name": "bar", ...}, ...}]} 38 | ``` 39 | 40 | I wanted to count up the number of names that had changed. The second file's 41 | rows weren't in the same order as the first, so I needed to join by ID. 42 | 43 | ```sh 44 | $ nfu file1 -m 'my $j = jd(%0); row $j.id, $j.name' > file1-by-id 45 | $ nfu file2 \ 46 | -m 'my $j = jd(%0); row $j.data->[0].id, $j.data->[0].record.name' \ 47 | -i0 file1-by-id \ 48 | -m 'row %1 ne %2' \ 49 | -sT+1 50 | ``` 51 | 52 | `file1-by-id` contains a TSV of `id, name`, and we end up doing the same thing 53 | to the contents of `file2` before joining (`-i0 file1-by-id`) to get the 54 | combined inner join, `id, file1-name, file2-name`. Perl's `ne` operator returns 55 | zero for equal strings and 1 for unequal strings, so we use `-s` to sum the 1's 56 | up, taking just the last record to get the total. 57 | 58 | ## Compressing field values into a dense integer range 59 | I had a bunch of rows, each of the form `UUID, lat, lng` (with repeated UUIDs), 60 | and I wanted the UUIDs to be integers so I could 3D-plot the coordinates. 61 | 62 | ```sh 63 | $ nfu data -gm 'row $::n{%0} //= $::i++, %1, %2' --splot 64 | ``` 65 | 66 | ## Removing outliers from plotted data 67 | The 3D plot above was scaled wrong due to a few outliers. Ideally I'd just be 68 | looking at stuff between the 5th and 95th percentiles. 69 | 70 | ```sh 71 | $ nfu data --run '($::a, $::b) = (read_lines "sh:nfu data -f1N100")[5, 95]; 72 | ($::c, $::d) = (read_lines "sh:nfu data -f2N100")[5, 95]' \ 73 | -k '%1 > $::a && %1 < $::b && 74 | %2 > $::c && %2 < $::d' \ 75 | > clipped 76 | ``` 77 | 78 | I preferred to leave it all as a single command so I could tweak stuff, so I 79 | used variable substitution to eliminate the duplication that would otherwise 80 | result: 81 | 82 | ```sh 83 | $ nfu --run '($::a, $::b) = (rl "%data -f1N100")[%lo, %hi]; 84 | ($::c, $::d) = (rl "%data -f2N100")[%lo, %hi]' \ 85 | -k '%1 > $::a && %1 < $::b && 86 | %2 > $::c && %2 < $::d' \ 87 | %data \ 88 | -gm 'row $::n{%0} //= $::i++, %1, %2' --splot \ 89 | % data='sh:nfu data' lo=5 hi=95 90 | ``` 91 | -------------------------------------------------------------------------------- /hadoop.md: -------------------------------------------------------------------------------- 1 | # nfu and Hadoop Streaming 2 | nfu provides tight integration with Hadoop Streaming, preserving its usual 3 | pipe-chain semantics while leveraging HDFS for intermediate data storage. 
It 4 | does this by providing the `--hadoop` (`-H`) function: 5 | 6 | - `--hadoop 'outpath' 'mapper' 'reducer'`: emit data to specified output path, 7 | printing the output path name as a pseudofile (this lets you say things like 8 | `nfu $(nfu --hadoop ....) ...`. 9 | - `--hadoop . 'mapper' 'reducer'`: if `outpath` is `.`, then print output data. 10 | - `--hadoop @ 'mapper' 'reducer'`: if `outpath` is `@`, create a temporary path 11 | and print its pseudofile name. 12 | 13 | The "mapper" and "reducer" arguments are arbitrary shell commands that may or 14 | may not involve nfu. "reducer" can be set to `NONE`, `-`, or `_` to run a 15 | map-only job. The mapper, reducer, or both can be `:` as a shorthand for the 16 | identity job. As a result of this and the way nfu handles short options, the 17 | following idioms are supported: 18 | 19 | - `-H@`: hadoop and output filename 20 | - `-H.`: hadoop and cat output 21 | - `-H@::`: reshard data, as in preparation for a mapside join 22 | - `-H.: ^gcf1.`: a distributed version of `sort | uniq` 23 | 24 | Normally you'd write the mapper and reducer either as external commands, or by 25 | using `nfu --quote ...`. However, nfu provides two shorthand notations for 26 | quoted forms: 27 | 28 | - `[ -gc ]` is the same as `"$(nfu --quote -gc)"` (NB: spaces around brackets 29 | are required) 30 | - `^gc` is the same as `[ -gc ]` 31 | 32 | Quoted nfu jobs can involve `--use` clauses, which are turned into `--run` 33 | before hadoop sends the command to the workers. 34 | 35 | Hadoop jobs support some magic to simplify data transfer: 36 | 37 | ```sh 38 | # upload from stdin 39 | $ seq 100 | nfu --hadoop . "$(nfu --quote -m 'row %0, %0 + 1')" _ 40 | $ seq 100 | nfu --hadoop . [ -m 'row %0, %0 + 1' ] _ 41 | 42 | # upload from pseudofiles 43 | $ nfu sh:'seq 100' --hadoop ... 44 | 45 | # use data already on HDFS 46 | $ nfu hdfs:/path/to/data --hadoop ... 47 | ``` 48 | 49 | As a special case, nfu looks for cases where stdin contains lines beginning 50 | with `hdfs:/` and interprets this as a list of HDFS files to process (rather 51 | than being verbatim data). This allows you to chain `--hadoop` jobs without 52 | downloading/uploading all of the intermediate results: 53 | 54 | ```sh 55 | # two hadoop jobs; intermediate results stay on HDFS and are never downloaded 56 | $ seq 100 | nfu -H@ [ -m 'row %0 % 10, %0 + 1' ] ^gc \ 57 | -H. ^C _ 58 | ``` 59 | 60 | nfu detects when it's being run as a hadoop streaming job and changes its 61 | verbose behavior to create hadoop counters. This means you can get the same 62 | kind of throughput statistics by using the `-v` option: 63 | 64 | ```sh 65 | $ seq 10000 | nfu --hadoop . ^vgc _ 66 | ``` 67 | 68 | Because `hdfs:` is a pseudofile prefix, you can also transparently download 69 | HDFS data to process locally: 70 | 71 | ```sh 72 | $ nfu hdfs:/path/to/data -gc 73 | ``` 74 | 75 | ## Mapside joins 76 | You can use nfu's `--index`, `--indexouter`, `--join`, and `--joinouter` 77 | functions to join arbitrary data on HDFS. Because HDFS data is often large and 78 | consistently partitioned, nfu provides a `hdfsjoin:path` pseudofile that 79 | assumes Hadoop default partitioning and expands into a list of partfiles 80 | sufficient to cover all keys that coincide with the current mapper's data. 
81 | Here's an example of how you might use it: 82 | 83 | ```sh 84 | # take 10000 words at random and generate [word, length] in /tmp/nfu-jointest 85 | # NB: you need a reducer here (even though it's a no-op); otherwise Hadoop 86 | # won't partition your mapper outputs. 87 | $ nfu sh:'shuf /usr/share/dict/words' \ 88 | --take 10000 \ 89 | --hadoop /tmp/nfu-jointest [ -vm 'row %0, length %0' ] ^v 90 | 91 | # now inner-join against that data 92 | $ nfu /usr/share/dict/words \ 93 | --hadoop . [ -vi0 hdfsjoin:/tmp/nfu-jointest ] _ 94 | ``` 95 | 96 | ## Examples 97 | ### Word count 98 | ```sh 99 | # local version: 100 | $ nfu hadoop.md -m 'map row($_, 1), split /\s+/, %0' \ 101 | -gA 'row $_, sum @{%1}' 102 | 103 | # process on hadoop, download outputs: 104 | $ nfu hadoop.md -H. [ -m 'map row($_, 1), split /\s+/, %0' ] \ 105 | [ -A 'row $_, sum @{%1}' ] 106 | 107 | # leave on HDFS, download separately using hdfs: pseudofile 108 | $ nfu hadoop.md -H /tmp/nfu-wordcount-outputs \ 109 | [ -m 'map row($_, 1), split /\s+/, %0' ] \ 110 | [ -A 'row $_, sum @{%1}' ] 111 | $ nfu hdfs:/tmp/nfu-wordcount-outputs -g 112 | ``` 113 | -------------------------------------------------------------------------------- /humongous-survival-guide.md: -------------------------------------------------------------------------------- 1 | # The Humongous nfu Survival Guide 2 | ## Introduction 3 | nfu is all about tab-delimited text data. It does a number of things to make 4 | this data easier to work with; for example: 5 | 6 | ```sh 7 | $ git clone git://github.com/spencertipping/nfu 8 | $ cd nfu 9 | $ ./nfu README.md # behaves like 'less' 10 | $ gzip README.md 11 | $ ./nfu README.md.gz # transparent decompression (+ xz, bz2, lzo) 12 | ``` 13 | 14 | Now let's do some basic word counting. We can get a word list by using nfu's 15 | `-m` operator, which takes a snippet of Perl code and executes it once for each 16 | line. Then we sort (`-g`, or `--group`), count-distinct (`-c`), and 17 | reverse-numeric-sort (`-O`, or `--rorder`) to get a histogram descending by 18 | frequency: 19 | 20 | ```sh 21 | $ nfu README.md -m 'split /\W+/, %0' -gcO 22 | 48 23 | 28 nfu 24 | 20 seq 25 | 19 100 26 | ... 27 | $ 28 | ``` 29 | 30 | `%0` is shorthand for `$_[0]`, which is how you access the first element of 31 | Perl's function-arguments (`@_`) variable. Any Perl code you give to nfu will 32 | be run inside a subroutine, and the arguments are usually tab-separated field 33 | values. 34 | 35 | Commands you issue to nfu are chained together using shell pipes. This means 36 | that the following are equivalent: 37 | 38 | ```sh 39 | $ nfu README.md -m 'split /\W+/, %0' -gcO 40 | $ nfu README.md | nfu -m 'split /\W+/, %0' \ 41 | | nfu -g \ 42 | | nfu -c \ 43 | | nfu -O 44 | ``` 45 | 46 | nfu uses a number of shorthands whose semantics may become confusing. To see 47 | what's going on, you can use its documentation options: 48 | 49 | ```sh 50 | $ nfu --expand-code 'split /\W+/, %0' 51 | split /\W+/, $_[0] 52 | $ nfu --explain README.md -m 'split /\W+/, %0' -gcO 53 | file README.md 54 | --map 'split /\W+/, %0' 55 | --group 56 | --count 57 | --rorder 58 | --preview 59 | $ 60 | ``` 61 | 62 | You can also run nfu with no arguments to see a usage summary. 
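If the `%N` shorthand is new, a quick sanity check is to hand nfu a couple of
fabricated tab-separated rows (a minimal sketch; `printf` just manufactures two
TSV lines, and `row` joins its arguments back together with tabs):

```sh
$ printf 'a\t1\nb\t2\n' | nfu -m 'row %1, %0'    # swap the two columns
```

Running `nfu --expand-code 'row %1, %0'` confirms that the shorthands become
`$_[1]` and `$_[0]` before the snippet is compiled.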
63 | 64 | ## Basic idioms 65 | ### Extracting data 66 | - `-m 'split /\W+/, %0'`: convert text file to one word per line 67 | - `-m 'map {split /\W+/} @_'`: same thing for text files with tabs 68 | - `-F '\W+'`: convert file to one word per column, preserving lines 69 | - `-m '@_'`: reshape to a single column, flattening into rows 70 | - `seq 10 | tr '\n' '\t'`: reshape to a single row, flattening into columns 71 | 72 | The `-F` operator resplits lines by the regexp you provide. So to parse 73 | /etc/passwd, for example, you'd say `nfu -F : /etc/passwd ...`. 74 | 75 | ### Generating data 76 | - `-P 5 'cat /proc/loadavg'`: run 'cat /proc/loadavg' every five seconds, 77 | collecting stdout 78 | - `--repeat 10 README.md`: read README.md 10 times in a row (this is more 79 | useful than it looks; see "Pipelines, Combination, and Quotation" below) 80 | 81 | ### Basic transformations 82 | - `-n`: prepend line numbers as first column 83 | - `-m 'row @_, %0 * 2'`: keep all existing columns, appending `%0 * 2` as a new 84 | one 85 | - `-m '%1 =~ s/foo/bar/g; row @_'`: transform second column by replacing 'foo' 86 | with 'bar' 87 | - `-m 'row %0, %1 =~ s/foo/bar/gr, @_[2..$#_]'`: same thing, but without 88 | in-place modification of `%1` 89 | 90 | `-M` is a variant of `-m` that runs a pool of parallel subprocesses (by default 91 | 16). This doesn't preserve row ordering, but can be useful if you're doing 92 | something latency-bound like fetching web documents: 93 | 94 | ```sh 95 | $ nfu url-list -M 'row %0, qx(curl %0)' 96 | ``` 97 | 98 | In this example, Perl's `qx()` operator could easily produce a string 99 | containing newlines; in fact most shell commands are written this way. Because 100 | of this, nfu's `row()` function strips the newlines from each of its input 101 | strings. This guarantees that `row()` will produce exactly one line of output. 
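If sixteen workers is the wrong number for whatever you're fetching, the pool
size comes from an environment variable rather than an option. A hedged sketch,
reusing the `url-list` file from the example above:

```sh
# NFU_PMAP_PARALLELISM controls the number of -M subprocesses
# (it's listed under "environment variables" in the usage summary)
$ NFU_PMAP_PARALLELISM=4 nfu url-list -M 'row %0, length qx(curl -s %0)'
```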
102 | 103 | ### Filtering 104 | - `-k '%2 eq "nfu"'`: keep any row whose third column is the text "nfu" 105 | - `-k '%0 < 10'`: keep any row whose first column parses to a number < 10 106 | - `-k '@_ < 5'`: keep any row with fewer than five columns 107 | - `-K '@_ < 5'`: reject any row with fewer than five columns (`-K` vs `-k`) 108 | - `-k 'length %0 < 10'` 109 | - `-k '%0 eq -+-%0'`: keep every row whose first column is numeric 110 | 111 | ### Row slicing 112 | - `-T5`: take the first 5 lines 113 | - `-T+5`: take the last 5 lines (drop all others) 114 | - `-D5`: drop the first 5 lines 115 | - `--sample 0.01`: take 1% of rows randomly 116 | - `-E100`: take every 100th row deterministically 117 | 118 | ### Column slicing 119 | - `-f012`: keep the first three columns (fields) in their original order 120 | - `-f10`: swap the first two columns, drop the others 121 | - `-f00.`: duplicate the first column, pushing others to the right 122 | - `-f10.`: swap the first two columns, keep the others in their original order 123 | - `-m 'row(reverse @_)'`: reverse the fields within each row (`row()` is a 124 | function that keeps an array on one row; otherwise you'd flatten the columns 125 | across multiple rows) 126 | - `-m 'row(grep /^-/, @_)'`: keep fields beginning with `-` 127 | 128 | ### Histograms (group, count) 129 | - `-gcO`: descending histogram of most frequent values 130 | - `-gcOl`: descending histogram of most frequent values, log-scaled 131 | - `-gcOs`: cumulative histogram, largest values first 132 | - `-gcf1.`: list of unique values (group, count, fields 1..n) 133 | 134 | Sorting and counting operators support field selection: 135 | 136 | - `-g1`: sort by second column 137 | - `-c0`: count unique values of field 0 138 | - `-c01`: count unique combinations of fields 0 and 1 jointly 139 | 140 | ### Common numeric operations 141 | - `-q0.05`: round (quantize) each number to the nearest 0.05 142 | - `-q10`: quantize each number to the nearest 10 143 | - `-s`: running sum 144 | - `-S`: delta (inverse of `-s`) 145 | - `-l`: log-transform each number, base e 146 | - `-L`: inverse log-transform (exponentiate) each number 147 | - `-a`: running average 148 | - `-V`: running variance 149 | - `--sd`: running sample standard deviation 150 | 151 | Each of these operations can be applied to a specified set of columns. For 152 | example: 153 | 154 | - `seq 10 | nfu -f00s1`: first column is 1..10, second is running sum of first 155 | - `seq 10 | nfu -f00a1`: first column is 1..10, second is running mean of first 156 | 157 | Some of these commands take an optional argument; for example, you can get a 158 | windowed average if you specify a second argument to `-a`: 159 | 160 | - `seq 10 | nfu -f00a1,5`: second column is a 5-value sliding average 161 | - `seq 10 | nfu -f00q1,5`: second column quantized to 5 162 | - `seq 10 | nfu -f00l1,5`: second column log base-5 163 | - `seq 10 | nfu -f00L1,5`: second column 5x 164 | 165 | Multiple-digit fields are interpreted as multiple single-digit fields: 166 | 167 | - `seq 10 | nfu -f00a01,5`: calculate 5-average of fields 0 and 1 independently 168 | 169 | The only ambiguous case happens when you specify only one argument: should it 170 | be interpreted as a column selector, or as a numeric parameter? nfu resolves 171 | this by using it as a parameter if the function requires an argument (e.g. 172 | `-q`), otherwise treating it as a column selector. 173 | 174 | ### Plotting 175 | Note: all plotting requires that `gnuplot` be in your `$PATH`. 
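A quick way to check that before starting a long pipeline (plain shell, nothing
nfu-specific):

```sh
$ command -v gnuplot >/dev/null || echo 'gnuplot not found in $PATH'
```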
176 | 177 | - `seq 100 | nfu -p`: 2D plot; input values are Y coordinates 178 | - `seq 100 | nfu -m 'row @_, %0 * %0' -p`: 2D plot; first column is X, second 179 | is Y 180 | - `seq 100 | nfu -p %l`: plot with lines 181 | - `seq 100 | nfu -m 'row %0, sin(%0), cos(%0)' --splot`: 3D plot 182 | 183 | ```sh 184 | $ seq 1000 | nfu -m '%0 * 0.1' \ 185 | -m 'row %0, sin(%0), cos(%0)' \ 186 | --splot %l 187 | ``` 188 | 189 | You can use `nfu --expand-gnuplot '%l'`, for example, to see how nfu is 190 | transforming your gnuplot options. (There's also a list of these shorthands in 191 | nfu's usage documentation.) 192 | 193 | ### Progress reporting 194 | If you're doing something with a large amount of data, it's sometimes hard to 195 | know whether it's worth hitting `^C` and optimizing stuff. To help with this, 196 | nfu has a `--verbose` (`-v`) option that activates throughput metrics for each 197 | operation in the pipeline. For example: 198 | 199 | ```sh 200 | $ seq 100000000 | nfu -o # this might take a while 201 | $ seq 100000000 | nfu -v -o # keep track of lines and kb 202 | ``` 203 | 204 | ## Advanced usage (assumes some Perl knowledge) 205 | ### JSON 206 | nfu provides two functions, `jd` (or `json_decode`) and `je`/`json_encode`, 207 | that are available within any code you write: 208 | 209 | ```sh 210 | $ ip_addrs=$(seq 10 | tr '\n' '\r' | nfu -m 'join ",", map "%0.4.4.4", @_') 211 | $ query_url="www.datasciencetoolkit.org/ip2coordinates/$ip_addrs" 212 | $ curl "$query_url" \ 213 | | nfu -m 'my $json = jd(%0); 214 | map row($_, ${$json}{$_}.locality), keys %$json' 215 | ``` 216 | 217 | This code uses another shorthand, `.locality`, which expands to a Perl hash 218 | dereference `->{"locality"}`. There isn't a similar shorthand for arrays, which 219 | means you need to explicitly dereference those: 220 | 221 | ```sh 222 | $ echo '[1,2,3]' | nfu -m 'jd(%0)[0]' # won't work! 223 | $ echo '[1,2,3]' | nfu -m '${jd(%0)}[0]' 224 | ``` 225 | 226 | ### Multi-plotting 227 | You can setup a multiplot by creating multiple columns of data. gnuplot then 228 | lets you refer to these with its `using N` construct, which nfu lets you write 229 | as `%uN`: 230 | 231 | ```sh 232 | $ seq 1000 | nfu -m '%0 * 0.01' | gzip > numbers.gz 233 | $ nfu numbers.gz -m 'row sin(%0), cos(%0)' \ 234 | --mplot '%u1%l%t"sin(x)"; %u2%l%t"cos(x)"' 235 | $ nfu numbers.gz -m 'sin %0' \ 236 | -f00a1 \ 237 | --mplot '%u1%l%t"sin(x)"; %u2%l%t"average(sin(x))"' 238 | $ nfu numbers.gz -m 'sin %0' \ 239 | -f00a1 \ 240 | -m 'row @_, %1-%0' \ 241 | --mplot '%u1%l%t"sin(x)"; 242 | %u2%l%t"average(sin(x))"; 243 | %u3%l%t"difference"' 244 | ``` 245 | 246 | The semicolon notation is something nfu requires. It works this way because 247 | internally nfu scripts gnuplot like this: 248 | 249 | ``` 250 | plot "tempfile-name" using 1 with lines title "sin(x)" 251 | plot "tempfile-name" using 2 with lines title "average(sin(x))" 252 | plot "tempfile-name" using 3 with lines title "difference" 253 | ``` 254 | 255 | ### Local map-reduce 256 | nfu provides an aggregation operator for sorted data. This groups adjacent rows 257 | by their first column and hands you a series of array references, one for each 258 | column's values within that group. For example, here's word-frequency again, 259 | this time using `-A`: 260 | 261 | ```sh 262 | $ nfu README.md -m 'split /\W+/, %0' \ 263 | -m 'row %0, 1' \ 264 | -gA 'row $_, sum @{%1}' 265 | ``` 266 | 267 | A couple of things are happening here. 
First, the current group key is stored 268 | in `$_`; this allows you to avoid the more cumbersome (but equivalent) 269 | `${%0}[0]`. Second, `%1` is now an array reference containing the second field 270 | of all grouped rows. `sum` is provided by nfu and does what you'd expect. 271 | 272 | In addition to map/reduce functions, nfu also gives you `--partition`, which 273 | you can use to send groups of records to different files. For example: 274 | 275 | ```sh 276 | $ nfu README.md -m 'split /\W+/, %0' \ 277 | --partition 'substr(%0, 0, 1)' \ 278 | 'cat > words-starting-with-{}' 279 | ``` 280 | 281 | `--partition` will keep up to 256 subprocesses running; if you have more groups 282 | than that, it will close and reopen pipes as necessary, which will cause your 283 | subprocesses to be restarted. (For this reason, `cat > ...` isn't a great 284 | subprocess; `cat >> ...` is better.) 285 | 286 | ### Loading Perl code 287 | nfu provides a few utility functions: 288 | 289 | - `sum @array` 290 | - `mean @array` 291 | - `uniq @array` 292 | - `frequencies @array` 293 | - `read_file "filename"`: returns a string 294 | - `read_lines "filename"`: returns an array of chomped strings 295 | 296 | But sometimes you'll need more definitions to write application-specific code. 297 | For this nfu gives you two options, `--use` and `--run`: 298 | 299 | ```sh 300 | $ nfu --use myfile.pl ... 301 | $ nfu --run 'sub foo {...}' ... 302 | ``` 303 | 304 | Any definitions will be available inside `-m`, `-A`, and other code-evaluating 305 | operators. 306 | 307 | A common case where you'd use `--run` is to precompute some kind of data 308 | structure before using it within a row function. For example, to count up all 309 | words that never appear at the beginning of a line: 310 | 311 | ```sh 312 | $ nfu README.md -F '\s+' -f0 > first-words 313 | $ nfu --run '$::seen{$_} = 1 for read_lines "first-words"' \ 314 | -m 'split /\W+/, %0' \ 315 | -K '$::seen{%0}' 316 | ``` 317 | 318 | Notice that we're package-scoping `%::seen`. This is required because while row 319 | functions reside in the same package as `--run` and `--use` code, they're in a 320 | different lexical scope. This means that any `my` or `our` variables are 321 | invisible and will trigger compile-time errors if you try to refer to them from 322 | other compiled code. 323 | 324 | ### Pseudofiles 325 | Gzipped data is uncompressed automatically by an abstraction that nfu calls a 326 | pseudofile. In addition to uncompressing things, several other pseudofile forms 327 | are recognized: 328 | 329 | ```sh 330 | $ nfu http://factual.com # uses stdout from curl 331 | $ nfu sh:ls # uses stdout from a command 332 | $ nfu [user@]host:other-file # pipe file over ssh -C 333 | $ nfu hdfs:/path/to/data # uses hadoop fs -text 334 | $ nfu psql:'query' # uses psql -c and exports as TSV 335 | ``` 336 | 337 | nfu supports pseudofiles everywhere it expects a filename, including in 338 | `read_file` and `read_lines`. 339 | 340 | ### Pipelines, combination, and quotation 341 | nfu gives you several commands that let you gather data from other sources. 
For 342 | example: 343 | 344 | ```sh 345 | $ nfu README.md -m 'split /\W+/, %0' --prepend README.md 346 | $ nfu README.md -m 'split /\W+/, %0' --append README.md 347 | $ nfu README.md --with sh:'tac README.md' 348 | $ nfu --repeat 10 README.md 349 | $ nfu README.md --pipe tac 350 | $ nfu README.md --tee 'cat > README2.md' 351 | $ nfu README.md --duplicate 'cat > README2.md' 'tac > README-reverse.md' 352 | ``` 353 | 354 | Here's what these things do: 355 | 356 | - `--prepend`: prepends a pseudofile's contents to the current data 357 | - `--append`: appends a pseudofile 358 | - `--with`: joins a pseudofile column-wise, ending when either side runs out of 359 | rows 360 | - `--repeat`: repeats a pseudofile the specified number of times, forever if n 361 | = 0; ignores any prior data 362 | - `--pipe`: same thing as a shell pipe, but doesn't lose nfu state 363 | - `--tee`: duplicates data to a shell process, collecting its _stdout into your 364 | data stream_ (you can avoid this by using `> /dev/null`) 365 | - `--duplicate`: sends your data to two shell processes, combining their 366 | stdouts 367 | 368 | Sometimes you'll want to use nfu itself as a shell command, but this can become 369 | difficult due to nested quotation. To get around this, nfu provides the 370 | `--quote` operator, which generates a properly quoted command line: 371 | 372 | ```sh 373 | $ nfu --repeat 10 sh:"$(nfu --quote README.md -m 'split /\W+/, %0')" 374 | ``` 375 | 376 | This is clunky, so nfu goes a step further and provides bracket syntax: 377 | 378 | ```sh 379 | $ nfu --repeat 10 nfu[ README.md -m 'split /\W+/, %0' ] 380 | ``` 381 | 382 | ### Keyed joins 383 | This works on sorted data, and behaves like SQL's JOIN construct. Under the 384 | hood, nfu takes care of the sorting and the voodoo associated with getting 385 | `sort` and `join` to work together, so you can write something simple like 386 | this: 387 | 388 | ```sh 389 | $ nfu /usr/share/dict/words -m 'row %0, length %0' > bytes-per-word 390 | $ nfu README.md -m 'split /\W+/, %0' \ 391 | -I0 bytes-per-word \ 392 | -m 'row %0, %1 // 0' \ 393 | -gA 'row $_, sum @{%1}' 394 | ``` 395 | 396 | Here's what's going on: 397 | 398 | - `-I0 bytes-per-word`: outer left join using field 0 from the data, adjoining 399 | all columns after the key field from the pseudofile 'bytes-per-word' 400 | - `-m 'row %0, %1 // 0'`: when we didn't get any join data, default to 0 (`//` 401 | is Perl's defined-or-else operator) 402 | - `-gA 'row $_, sum @{%1}'`: reduce by word, summing total bytes 403 | 404 | We could sidestep all nonexistent words by using `-i0` for an inner join 405 | instead. This drops all rows with no corresponding entry in the lookup table. 
406 | -------------------------------------------------------------------------------- /nfu: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # nfu: Command-line numeric fu | Spencer Tipping 3 | # Licensed under the terms of the MIT source code license 4 | 5 | use v5.14; 6 | use strict; 7 | use warnings; 8 | use utf8; 9 | 10 | use Fcntl; 11 | use Socket; 12 | use Time::HiRes qw/time/; 13 | use POSIX qw/dup2 mkfifo setsid :sys_wait_h/; 14 | use File::Temp qw/tmpnam/; 15 | use Math::Trig; 16 | 17 | use constant VERBOSE_INTERVAL => 50; 18 | use constant HADOOP_VERBOSE_INTERVAL => 10000; 19 | use constant DELAY_BEFORE_VERBOSE => 500; 20 | 21 | use constant SQL_INFER_PEEK_LINES => 20; 22 | 23 | use constant LOG_2 => log(2); 24 | 25 | # 64-bit hex constants in geohash encoder won't work on 32-bit architectures 26 | no warnings 'portable'; 27 | 28 | our $diamond_has_data = 1; 29 | 30 | ++$|; 31 | 32 | # Setup child capture. All we need to do is wait for child pids; there's no 33 | # formal teardown. 34 | $SIG{CHLD} = sub { 35 | local ($!, $?); 36 | 1 while waitpid(-1, WNOHANG) > 0; 37 | }; 38 | 39 | # NB: This import is not used in nfu directly; it's here so you can use these 40 | # functions inside aggregators. 41 | use List::Util qw(first max maxstr min minstr reduce shuffle sum); 42 | 43 | sub prod { 44 | my $p = 1; 45 | $p *= $_ for @_; 46 | $p; 47 | } 48 | 49 | # Same for this, which is especially useful from aggregators because multiple 50 | # values create multiple output rows, not multiple columns on the same output 51 | # row. 52 | sub row {join "\t", map s/\n//gr, @_} 53 | 54 | # Order-preserving unique values for strings. This is just too useful not to 55 | # provide. 56 | sub uniq { 57 | local $_; 58 | my %seen; 59 | my @order; 60 | $seen{$_}++ or push @order, $_ for @_; 61 | @order; 62 | } 63 | 64 | sub frequencies { 65 | local $_; 66 | my %freqs; 67 | ++$freqs{$_} for @_; 68 | %freqs; 69 | } 70 | 71 | sub reductions(&$@) { 72 | my ($f, $x, @xs) = @_; 73 | my @ys; 74 | push @ys, $x = $f->($x, $_) for @xs; 75 | @ys; 76 | } 77 | 78 | sub cp { 79 | # Cartesian product of N arrays, each passed in as a ref 80 | return () if @_ == 0; 81 | return map [$_], @{$_[0]} if @_ == 1; 82 | 83 | my @ns = map scalar(@$_), @_; 84 | my @shifts = reverse reductions {$_[0] * $_[1]} 1 / $ns[0], reverse @ns; 85 | map { 86 | my $i = $_; 87 | [map $_[$_][int($i / $shifts[$_]) % $ns[$_]], 0..$#_]; 88 | } 0..prod(@ns) - 1; 89 | } 90 | 91 | sub round_to { 92 | my ($x, $quantum) = @_; 93 | $quantum ||= 1; 94 | my $sign = $x < 0 ? -1 : 1; 95 | int(abs($x) / $quantum + 0.5) * $quantum * $sign; 96 | } 97 | 98 | sub mean {scalar @_ && sum(@_) / @_} 99 | sub log2 {log($_[0]) / LOG_2} 100 | 101 | sub entropy { 102 | local $_; 103 | my $s = sum(@_) || 1; 104 | my $t = 0; 105 | $t -= ($_ / $s) * log($_ / $s) for @_; 106 | $t / LOG_2; 107 | } 108 | 109 | sub dot { 110 | local $_; 111 | my ($u, $v) = @_; 112 | my $s = 0; 113 | $s += $$u[$_] * $$v[$_] for 0 .. min($#{$u}, $#{$v}); 114 | $s; 115 | } 116 | 117 | sub dist { 118 | # Euclidean distance (you specify the deltas) 119 | local $_; 120 | my $s = 0; 121 | $s += $_*$_ for @_; 122 | sqrt($s); 123 | } 124 | 125 | sub sdist { 126 | # Spherical-coordinate great circle distance; you specify theta1, phi1, 127 | # theta2, phi2, each in degrees; radius is assumed to be 1. 
Math from 128 | # http://stackoverflow.com/questions/27928/how-do-i-calculate-distance-between-two-latitude-longitude-points 129 | local $_; 130 | my ($t1, $p1, $t2, $p2) = map $_ / 180 * pi, @_; 131 | my $dt = $t2 - $t1; 132 | my $dp = $p2 - $p1; 133 | my $a = sin($dp / 2) * sin($dp / 2) 134 | + cos($p1) * cos($p2) * sin($dt / 2) * sin($dt / 2); 135 | 2 * atan2(sqrt($a), sqrt(1 - $a)); 136 | } 137 | 138 | sub edist { 139 | # Earth distance between two latitudes/longitudes, in km; up to 0.5% error 140 | my ($lat1, $lng1, $lat2, $lng2) = @_; 141 | 6371 * sdist $lng1, $lat1, $lng2, $lat2; 142 | } 143 | 144 | sub line_opposite { 145 | # Returns true if two points are on opposite sides of the line starting at 146 | # (x0, y0) and whose direction is (dx, dy). 147 | my ($x0, $y0, $dx, $dy, $x1, $y1, $x2, $y2) = @_; 148 | return (($x1 - $x0) * $dy - ($y1 - $y0) * $dx) 149 | * (($x2 - $x0) * $dy - ($y2 - $y0) * $dx) < 0; 150 | } 151 | 152 | sub evens(@) {local $_; @_[map $_ * 2, 0 .. $#_ >> 1]} 153 | sub odds(@) {local $_; @_[map $_ * 2 + 1, 0 .. $#_ >> 1]} 154 | 155 | sub parse_wkt { 156 | my @rings = map [/([-0-9.]+)\s+([-0-9.]+)/g], split /\)\s*,\s*\(/, $_[0]; 157 | { 158 | rings => [map [map $_ + 0, @$_], @rings], 159 | ylimit => 1 + max(map max(@$_), @rings), 160 | bounds => [min(map evens(@$_), @rings), max(map evens(@$_), @rings), 161 | min(map odds(@$_), @rings), max(map odds(@$_), @rings)], 162 | } 163 | } 164 | 165 | sub in_poly { 166 | # Returns true if a point resides in the given parsed polygon. 167 | my ($x, $y, $parsed) = @_; 168 | my $ylimit = $parsed->{ylimit}; 169 | my @bounds = @{$parsed->{bounds}}; 170 | return 0 if $x < $bounds[0] || $x > $bounds[1] 171 | || $y < $bounds[2] || $y > $bounds[3]; 172 | 173 | my $hits = 0; 174 | for my $r (@{$parsed->{rings}}) { 175 | my ($lx, $ly) = @$r[0, 1]; 176 | for (my $i = 2; $i < @$r; $i += 2) { 177 | my $cx = $$r[$i]; 178 | my $cy = $$r[$i + 1]; 179 | ++$hits if $lx <= $x && $x < $cx || $lx >= $x && $x > $cx 180 | and line_opposite $lx, $ly, $cx - $lx, $cy - $ly, 181 | $x, $y, $x, $ylimit; 182 | $lx = $cx; 183 | $ly = $cy; 184 | } 185 | } 186 | $hits & 1; 187 | } 188 | 189 | sub rect_polar { (dist(@_), atan2($_[0], $_[1]) / pi * 180) } 190 | sub polar_rect { ($_[0] * sin($_[1] / 180 * pi), 191 | $_[0] * cos($_[1] / 180 * pi)) } 192 | 193 | sub degrees_radians { $_[0] / 180 * pi } 194 | sub radians_degrees { $_[0] / 180 * pi } 195 | 196 | # JSON support (if available) 197 | our $json; 198 | if (eval {require JSON}) { 199 | JSON->import; 200 | no warnings qw(uninitialized); 201 | $json = JSON->new->allow_nonref->utf8(1); 202 | } elsif (eval {require JSON::PP}) { 203 | JSON::PP->import; 204 | no warnings qw(uninitialized); 205 | $json = JSON::PP->new->allow_nonref->utf8(1); 206 | } else { 207 | print STDERR "note: no JSON support detected (try 'cpan install JSON')\n"; 208 | print STDERR "nfu will soon have its own JSON parser rather than using "; 209 | print STDERR "a native library for this. 
Sorry for the inconvenience."; 210 | } 211 | 212 | # These are callable from evaled code 213 | sub expand_filename_shorthands; 214 | sub read_file { 215 | open my $fh, expand_filename_shorthands $_[0], 1; 216 | my $result = join '', <$fh>; 217 | close $fh; 218 | $result; 219 | } 220 | 221 | sub write_file { 222 | open my $fh, '>', $_[0]; 223 | $fh->print($_[1]); 224 | close $fh; 225 | $_[0]; 226 | } 227 | 228 | sub read_lines { 229 | local $_; 230 | open my $fh, expand_filename_shorthands $_[0], 1; 231 | my @result; 232 | chomp, push @result, $_ for <$fh>; 233 | close $fh; 234 | @result; 235 | } 236 | 237 | sub write_lines { 238 | local $_; 239 | my $filename = shift @_; 240 | open my $fh, '>', $filename; 241 | $fh->print($_, "\n") for @_; 242 | close $fh; 243 | $filename; 244 | } 245 | 246 | sub open_file { 247 | open my $fh, expand_filename_shorthands $_[0], 1; 248 | $fh; 249 | } 250 | 251 | sub json_encode {$json->encode(@_)} 252 | sub json_decode {$json->decode(@_)} 253 | 254 | sub hadoop_counter { 255 | printf STDERR "reporter:counter:nfu,%s_%s,%d\n", $_[0] =~ y/,/_/r, 256 | $_[1] =~ y/,/_/r, 257 | $_[2] // 1; 258 | } 259 | 260 | sub F {my $l = shift @_; (split /\t/, $l)[@_]} 261 | 262 | our @gh_alphabet = split //, '0123456789bcdefghjkmnpqrstuvwxyz'; 263 | our %gh_decode = map(($gh_alphabet[$_], $_), 0..$#gh_alphabet); 264 | 265 | sub gap_bits { 266 | my ($x) = @_; 267 | $x |= $x << 16; $x &= 0x0000ffff0000ffff; 268 | $x |= $x << 8; $x &= 0x00ff00ff00ff00ff; 269 | $x |= $x << 4; $x &= 0x0f0f0f0f0f0f0f0f; 270 | $x |= $x << 2; $x &= 0x3333333333333333; 271 | return ($x | $x << 1) & 0x5555555555555555; 272 | } 273 | 274 | sub ungap_bits { 275 | my ($x) = @_; $x &= 0x5555555555555555; 276 | $x ^= $x >> 1; $x &= 0x3333333333333333; 277 | $x ^= $x >> 2; $x &= 0x0f0f0f0f0f0f0f0f; 278 | $x ^= $x >> 4; $x &= 0x00ff00ff00ff00ff; 279 | $x ^= $x >> 8; $x &= 0x0000ffff0000ffff; 280 | return ($x ^ $x >> 16) & 0x00000000ffffffff; 281 | } 282 | 283 | sub geohash_encode { 284 | local $_; 285 | my ($lat, $lng, $precision) = @_; 286 | $precision //= 12; 287 | my $bits = $precision > 0 ? $precision * 5 : -$precision; 288 | my $gh = (gap_bits(int(($lat + 90) / 180 * 0x40000000)) | 289 | gap_bits(int(($lng + 180) / 360 * 0x40000000)) << 1) 290 | >> 60 - $bits; 291 | 292 | $precision > 0 ? join '', reverse map $gh_alphabet[$gh >> $_ * 5 & 31], 293 | 0 .. 
$precision - 1 294 | : $gh; 295 | } 296 | 297 | sub geohash_decode { 298 | local $_; 299 | my ($gh, $bits) = @_; 300 | unless (defined $bits) { 301 | # Decode gh from base-32 302 | $bits = length($gh) * 5; 303 | my $n = 0; 304 | $n = $n << 5 | $gh_decode{lc $_} for split //, $gh; 305 | $gh = $n; 306 | } 307 | $gh <<= 60 - $bits; 308 | return (ungap_bits($gh) / 0x40000000 * 180 - 90, 309 | ungap_bits($gh >> 1) / 0x40000000 * 360 - 180); 310 | } 311 | 312 | # HTTP 313 | sub http_send { sprintf "HTTP/1.0 %s\nContent-Length: %d\n\n%s\n", 314 | $_[0], 315 | length($_[1]), 316 | $_[1] } 317 | 318 | sub httpok { sprintf "HTTP/1.0 200 OK\nContent-Length: %d\n\n%s\n", 319 | length($_[0]), 320 | $_[0] } 321 | 322 | sub chunk { sprintf "%x\r\n%s\r\n", length($_[0]), $_[0] } 323 | sub endchunk { chunk '' } 324 | 325 | # Function shorthands 326 | BEGIN { 327 | *dr = \°rees_radians; 328 | *rd = \&radians_degrees; 329 | *je = \&json_encode; 330 | *jd = \&json_decode; 331 | 332 | *hc = \&hadoop_counter; 333 | 334 | *rf = \&read_file; 335 | *rl = \&read_lines; 336 | *wf = \&write_file; 337 | *wl = \&write_lines; 338 | 339 | *of = \&open_file; 340 | 341 | *ghe = \&geohash_encode; 342 | *ghd = \&geohash_decode; 343 | } 344 | 345 | # File functions 346 | sub random128 {join '', map sprintf('%04x', rand 65536), 0 .. 8} 347 | sub hadoop_ls; 348 | sub hadoop_matched_partfiles; 349 | sub shell_quote; 350 | sub expand_sql_shorthands; 351 | sub expand_sqlite_db; 352 | sub expand_postgres_db; 353 | 354 | sub expand_filename_shorthands { 355 | # NB: we prepend a shell comment containing the original $f so anyone 356 | # downstream can get it back. This is currently used by Hadoop to reuse data 357 | # stored on HDFS if you write it on the command-line. (Otherwise nfu would 358 | # hadoop fs -text to download it, then re-upload to a tempfile.) 359 | 360 | my ($f, $always_make_a_command) = @_; 361 | my $result; 362 | my $original = ($f =~ s/^/#/mgr) . "\n"; 363 | 364 | no warnings 'newline'; 365 | 366 | if (-e $f || $f =~ s/^file://) { 367 | # It's really a filename, so push it onto @ARGV. If it's compressed, run it 368 | # through the appropriate decompressor first. 369 | my $piped = $f =~ s/^(.*\.gz)/cat '$1' | gzip -d/ri 370 | =~ s/^(.*\.bz2)/cat '$1' | bzip2 -d/ri 371 | =~ s/^(.*\.xz)/cat '$1' | xz -d/ri 372 | =~ s/^(.*\.lzo)/cat '$1' | lzop -d/ri; 373 | $result = $piped =~ /\|/ ? "$original$piped |" : $piped; 374 | } elsif ($f =~ /^https?:\/\//) { 375 | # Assume a URL and curl it 376 | $f = shell_quote $f; 377 | $result = "${original}curl $f |"; 378 | } elsif ($f =~ s/^sh://) { 379 | # Execute a command and capture stdout 380 | $result = "$original$f |"; 381 | } elsif ($f =~ /^s3:/) { 382 | # Use s3cmd and cat to stdout 383 | $result = "${original}s3cmd get '$f' - |"; 384 | } elsif ($f =~ s/^hdfs-ls:(?:\/\/)?//) { 385 | # Just list a directory. This is used by hdfs: below to provide 386 | # indirection. 387 | $result = "${original}$0 n:1 -m '" 388 | . q{map "hdfs:" . shell_quote($_), hadoop_ls %[q[} . $f . q{]%]} 389 | . "' |"; 390 | } elsif ($f =~ s/^hdfs:(?:\/\/)?//) { 391 | # Use Hadoop commands to read stuff. The command itself needs to be 392 | # deferred because the hdfs path might not exist when the pseudofile is 393 | # resolved. Indirected through $0 hdfs-ls:X because some HDFS directories 394 | # contain enough files that we'll get arg-list-too-long errors if we invoke 395 | # hadoop fs -text directly. This workaround lets us stream the output from 396 | # hadoop fs -ls. 
397 | # 398 | # Redirecting errors to /dev/null because hadoop fs -text complains loudly 399 | # if you close its output stream, and this is really annoying. 400 | $f = shell_quote "hdfs-ls:$f"; 401 | $result = "${original}$0 $f | xargs hadoop fs -text 2>/dev/null |"; 402 | } elsif ($f =~ s/^hdfsjoin://) { 403 | $result = "${original}$0 " 404 | . shell_quote(map "hdfs:$_", 405 | hadoop_matched_partfiles $f, $ENV{map_input_file}) 406 | . " |"; 407 | } elsif ($f =~ s/^perl://) { 408 | # Evaluate a Perl expression 409 | $f =~ s/'/'"'"'/g; 410 | $result = "${original}perl -e 'print \$_, \"\\n\" for ($f)' |"; 411 | } elsif ($f =~ s/^n://) { 412 | $result = "${original}perl -e 'print \$_, \"\\n\" for (1..($f))' |"; 413 | } elsif ($f =~ s/^id://) { 414 | $result = "${original}echo " . shell_quote($f) . " |"; 415 | } elsif ($f =~ s/^sql://) { 416 | # Run a postgres or sqlite3 query, exporting results as TSV 417 | my ($db, $query) = split /:/, $f, 2; 418 | $query = expand_sql_shorthands $query; 419 | 420 | if ($db =~ s/^P//) { 421 | # postgres 422 | $db = expand_postgres_db $db; 423 | $query = shell_quote "COPY ($query) TO STDOUT WITH NULL AS ''"; 424 | $result = "${original}psql -c $query $db |"; 425 | } elsif ($db =~ s/^S//) { 426 | # sqlite3 427 | $db = shell_quote expand_sqlite_db $db; 428 | $result = "${original}echo " . shell_quote(".mode tabs\n$query;") 429 | . "| sqlite3 $db |"; 430 | } else { 431 | die "unknown database prefix " . substr($db, 0, 1) 432 | . " for pseudofile $original (valid prefixes are P and S)"; 433 | } 434 | } elsif ($f =~ /(\w*@?[^:]+):(.*)$/) { 435 | # Access file over SSH. We need to make sure nfu is running on the remote 436 | # end, since the remote file might be gzipped or some such. Because the 437 | # remote machine might not have nfu, this involves us piping ourselves over 438 | # to that machine. 439 | $result = "${original}cat '$0' | ssh -C '$1' perl - '$2' |"; 440 | } else { 441 | return undef; 442 | } 443 | 444 | $always_make_a_command && $result !~ /\|/ ? "${original}cat '$result' |" 445 | : $result; 446 | } 447 | 448 | sub unexpand_filename_shorthands { 449 | my ($f) = @_; 450 | $f =~ /^#(.*)/ ? $1 : $f; 451 | } 452 | 453 | my %pseudofile_docs = ( 454 | 'file.gz' => 'decompress file with gzip -dc', 455 | 'file.bz2' => 'decompress file with bzip2 -dc', 456 | 'file.xz' => 'decompress file with xz -dc', 457 | 'file.lzo' => 'decompress file with lzop -dc', 458 | 'http[s]://url' => 'retrieve url with curl', 459 | 'sh:stuff' => 'run sh -c "stuff", take stdout', 460 | 's3://url' => 'access S3 using s3cmd', 461 | 'hdfs:path' => 'read HDFS file(s) with hadoop fs -text', 462 | 'hdfs-ls:path' => 'pseudofiles parsed from hadoop fs -ls', 463 | 'hdfsjoin:path' => 'mapside join pseudofile (a subset of hdfs:path)', 464 | 'sql:db:query' => 'results of query as TSV', 465 | 'perl:expr' => 'perl -e \'print "$_\n" for (expr)\'', 466 | 'n:number' => 'numbers from 1 to n, inclusive', 467 | 'id:X' => 'verbatim text X', 468 | 'user@host:x' => 'remote data access (x can be a pseudofile)', 469 | ); 470 | 471 | # Flags 472 | our $is_child = 0; 473 | our $verbose = 0; 474 | our $n_lines = 0; 475 | our $n_bytes = 0; 476 | our $start_time = undef; 477 | 478 | our $verbose_command = ''; 479 | our @verbose_args; 480 | our $verbose_command_formatted = undef; 481 | our $inside_hadoop_job = length $ENV{mapred_job_id}; 482 | our $verbose_interval = $inside_hadoop_job ? 
HADOOP_VERBOSE_INTERVAL 483 | : VERBOSE_INTERVAL; 484 | our $empirical_verbosity = 0; 485 | 486 | $verbose ||= $inside_hadoop_job; 487 | $verbose ||= length $ENV{NFU_ALWAYS_VERBOSE}; 488 | 489 | our $last_verbose_report = 0; 490 | our $verbose_row = 0; 491 | 492 | # Call it like this: 493 | # while (<>) { 494 | # be_verbose_as_appropriate length; 495 | # ... 496 | # } 497 | sub be_verbose_as_appropriate { 498 | return if $is_child; 499 | return unless $verbose; 500 | local $_; 501 | my ($record_length) = @_; 502 | $n_lines += !!$record_length; 503 | $n_bytes += $record_length; 504 | my $now = time; 505 | return unless $record_length == 0 506 | || ($now - $last_verbose_report) * 1000 > $verbose_interval; 507 | 508 | $last_verbose_report = $now; 509 | $verbose_command_formatted //= join ' ', $verbose_command, @verbose_args; 510 | $start_time //= $now; 511 | my $runtime = $now - $start_time || 0.001; 512 | 513 | return if $runtime * 1000 < DELAY_BEFORE_VERBOSE; 514 | ++$empirical_verbosity; 515 | 516 | unless ($inside_hadoop_job) { 517 | # Print status updates straight to the terminal 518 | printf STDERR "\033[%d;1H\033[K%10dl %8.1fl/s %10dk %8.1fkB/s %s", 519 | $verbose_row, 520 | $n_lines, 521 | $n_lines / $runtime, 522 | $n_bytes / 1024, 523 | $n_bytes / 1024 / $runtime, 524 | substr($verbose_command_formatted, 0, 40); 525 | } else { 526 | # Use Hadoop-specific syntax to update job counters. Smaller units are 527 | # better because Hadoop counters are integers, so they'll suffer from 528 | # truncation for fractional quantities. 529 | $verbose_command_formatted =~ s/[,\n]/_/g; 530 | hc $verbose_command_formatted, @$_ 531 | for (['lines', $n_lines], 532 | ['runtime ms', $runtime * 1000], 533 | ['bytes', $n_bytes]); 534 | 535 | # Reset variables because Hadoop treats them as incremental 536 | $n_lines = 0; 537 | $n_bytes = 0; 538 | $start_time = $now; 539 | } 540 | } 541 | 542 | END { 543 | be_verbose_as_appropriate 0; 544 | print STDERR "\n" if $empirical_verbosity; 545 | } 546 | 547 | # This variable will keep track of any state accumulated from --use or --run 548 | # arguments. This is required for --pmap to work correctly. 549 | my @evaled_code; 550 | 551 | sub shell_quote {join ' ', map /[^-\/\w]/ ? "'" . s/(['\\])/'\\$1'/gr . "'" 552 | : length $_ ? $_ 553 | : "''", @_} 554 | 555 | sub quote_self {shell_quote $0, @_} 556 | 557 | my %explosions = ( 558 | a => '--average', 559 | A => '--aggregate', 560 | b => '--branch', 561 | c => '--count', 562 | C => '--uncount', 563 | D => '--drop', 564 | e => '--each', 565 | E => '--every', 566 | f => '--fields', 567 | F => '--fieldsplit', 568 | g => '--group', 569 | G => '--rgroup', 570 | h => '--hadoopc', 571 | H => '--hadoop', 572 | i => '--index', 573 | I => '--indexouter', 574 | j => '--join', 575 | J => '--joinouter', 576 | k => '--keep', 577 | K => '--remove', 578 | l => '--log', 579 | L => '--exp', 580 | m => '--map', 581 | M => '--pmap', 582 | n => '--number', 583 | N => '--ntiles', 584 | o => '--order', 585 | O => '--rorder', 586 | p => '--plot', 587 | P => '--poll', 588 | q => '--quant', 589 | Q => '--sql', 590 | r => '--read', 591 | R => '--buffer', 592 | s => '--sum', 593 | S => '--delta', 594 | T => '--take', 595 | # v => '--verbose' # handled during option parsing 596 | V => '--variance', 597 | w => '--with', 598 | z => '--intify', 599 | ); 600 | 601 | my %implosions; 602 | $implosions{$explosions{$_}} = $_ for keys %explosions; 603 | 604 | # Minimum number of required arguments for each function. 
Numeric arguments are 605 | # automatically forwarded, so are always optional. 606 | my %arity = ( 607 | average => 0, 608 | aggregate => 1, 609 | branch => 1, 610 | count => 0, 611 | uncount => 0, 612 | delta => 0, 613 | drop => 0, 614 | each => 1, 615 | every => 1, 616 | fields => 0, 617 | fieldsplit => 1, 618 | fold => 1, 619 | group => 0, 620 | rgroup => 0, 621 | hadoop => 3, 622 | hadoopc => 4, 623 | index => 2, 624 | indexouter => 2, 625 | join => 2, 626 | joinouter => 2, 627 | keep => 1, 628 | log => 0, 629 | exp => 0, 630 | map => 1, 631 | pmap => 1, 632 | number => 0, 633 | ntiles => 1, 634 | order => 0, 635 | rorder => 0, 636 | plot => 1, 637 | poll => 2, 638 | read => 0, 639 | buffer => 1, 640 | sum => 0, 641 | quant => 1, 642 | remove => 1, 643 | sample => 1, 644 | take => 0, 645 | variance => 0, 646 | with => 1, 647 | intify => 0, 648 | 649 | # Commands with no shorthands 650 | append => 1, 651 | prepend => 1, 652 | tee => 1, 653 | duplicate => 2, 654 | partition => 2, 655 | splot => 1, 656 | sd => 0, 657 | mplot => 1, 658 | preview => 0, 659 | pipe => 1, 660 | entropy => 0, 661 | sql => 3, 662 | tcp => 1, 663 | http => 1, 664 | repeat => 2, 665 | octave => 1, 666 | numpy => 1, 667 | ); 668 | 669 | my %usages = ( 670 | average => 'window size (0 for full average) -- running average', 671 | aggregate => 'aggregator fn', 672 | branch => 'branch (takes a pattern map)', 673 | count => 'counts by first column value; like uniq -c', 674 | uncount => 'the opposite of --count; repeats each row N times', 675 | delta => 'value -> difference from last value', 676 | drop => 'number of records to drop', 677 | each => 'template; executes with {} set to each value', 678 | every => 'n (returns every nth row)', 679 | fields => 'string of digits, each a zero-indexed column selector', 680 | fieldsplit => 'regexp to use for splitting', 681 | fold => 'function that returns true when line should be folded', 682 | group => 'sorts ascending, takes optional column list', 683 | rgroup => 'sorts descending, takes optional column list', 684 | hadoop => 'hadoop streaming: outpath|.|@, mapper|:, reducer|:|_', 685 | hadoopc => 'hadoop streaming: ..., combiner|:|_, reducer|:|_', 686 | index => 'field index, unsorted pseudofile to join against', 687 | indexouter => 'field index, unsorted pseudofile to join against', 688 | join => 'field index, sorted pseudofile to join against', 689 | joinouter => 'field index, sorted pseudofile to join against', 690 | keep => 'row filter fn', 691 | log => 'optional base (default e)', 692 | exp => 'optional base (default e)', 693 | map => 'row map fn', 694 | pmap => 'row map fn (executed multiple times in parallel)', 695 | number => 'prepends line number to each line', 696 | ntiles => 'takes N, produces ntiles of numbers', 697 | order => 'sorts ascending by general numeric value', 698 | rorder => 'sorts descending by general numeric value', 699 | plot => 'gnuplot arguments', 700 | poll => 'interval in seconds, command whose output to collect', 701 | sum => 'value -> total += value', 702 | quant => 'number to round to', 703 | read => 'reads pseudofiles from the data stream', 704 | buffer => 'creates a pseudofile from the data stream', 705 | remove => 'inverted row filter fn', 706 | sample => 'row selection probability in [0, 1]', 707 | take => 'n to take first n, +n to take last n', 708 | variance => 'running variance', 709 | with => 'pseudofile to join column-wise onto input', 710 | intify => 'convert column to dense integers (linear space)', 711 | 712 | append => 'pseudofile; 
appends its contents to current stream', 713 | prepend => 'pseudofile; prepends its contents to current stream', 714 | tee => 'shell command; duplicates data to stdin of command', 715 | duplicate => 'two shell commands as separate arguments', 716 | partition => 'partition id fn, shell command (using {})', 717 | splot => 'gnuplot arguments', 718 | sd => 'running standard deviation', 719 | mplot => 'gnuplot arguments per column, separated by ;', 720 | preview => '', 721 | pipe => 'shell command to pipe through', 722 | entropy => 'running entropy of relative probabilities/frequencies', 723 | sql => 'create/query SQL table: db[:[+]table], schema|_, query|_', 724 | tcp => 'TCP server (emits fifo filenames)', 725 | http => 'HTTP adapter for TCP server output', 726 | repeat => 'repeat count, pseudofile to repeat', 727 | octave => 'pipe through octave; vector is called xs', 728 | numpy => 'pipe through numpy; vector is called xs', 729 | ); 730 | 731 | my %env_docs = ( 732 | NFU_SORT_BUFFER => 'default 64M; size of in-memory sort for -g and -o', 733 | NFU_SORT_PARALLEL => 'default 4; number of concurrent sorts to run', 734 | NFU_SORT_COMPRESS => 'default none; compression program for sort tempfiles', 735 | NFU_SORT_OPTIONS => 'override all sort options except column spec', 736 | NFU_ALWAYS_VERBOSE => 'if set, nfu will be verbose all the time', 737 | NFU_NO_PAGER => 'if set, nfu will not use "less" to preview stdout', 738 | NFU_PMAP_PARALLELISM => 'number of subprocesses for -M', 739 | NFU_MAX_FILEHANDLES => 'default 64; maximum #subprocesses for --partition', 740 | NFU_HADOOP_FILES => 'comma-separated files to include with streaming job', 741 | NFU_HADOOP_STREAMING => 'absolute location of hadoop-streaming.jar', 742 | NFU_HADOOP_OPTIONS => '-D options for hadoop streaming jobs', 743 | NFU_HADOOP_COMMAND => 'hadoop executable; e.g. hadoop jar, hadoop fs -ls', 744 | NFU_HADOOP_TMPDIR => 'default /tmp; temp dir for hadoop uploads', 745 | ); 746 | 747 | my %gnuplot_aliases = ( 748 | '%l' => ' with lines', 749 | '%d' => ' with dots', 750 | '%i' => ' with impulses', 751 | '%v' => ' with vectors ', 752 | '%u' => ' using ', 753 | '%t' => ' title ', 754 | '%p' => ' lc palette ', 755 | ); 756 | 757 | my %fieldsplit_shorthands = ( 758 | S => '\s+', 759 | W => '\W+', 760 | C => ',', 761 | ); 762 | 763 | sub expand_gnuplot_options { 764 | my @transformed_opts; 765 | for my $opt (@_) { 766 | $opt =~ s/$_/$gnuplot_aliases{$_}/g for keys %gnuplot_aliases; 767 | push @transformed_opts, $opt; 768 | } 769 | @transformed_opts; 770 | } 771 | 772 | my %sql_aliases = ( 773 | '%\*' => ' select * from ', 774 | '%c' => ' select count(1) from ', 775 | '%d' => ' select distinct * from ', 776 | '%g' => ' group by ', 777 | '%j' => ' inner join ', 778 | '%l' => ' outer left join ', 779 | '%r' => ' outer right join ', 780 | '%w' => ' where ', 781 | ); 782 | 783 | sub expand_sql_shorthands { 784 | my ($sql) = @_; 785 | $sql =~ s/$_/$sql_aliases{$_}/eg for keys %sql_aliases; 786 | $sql; 787 | } 788 | 789 | sub expand_sqlite_db { 790 | my $tempdir = tmpnam =~ s/\/[^\/]+$//r; 791 | return "$tempdir/nfu-$ENV{USER}-sqlite.db" if $_[0] eq '@'; 792 | return $_[0]; 793 | } 794 | 795 | sub expand_postgres_db { 796 | # Expands a DB descriptor into a series of properly-shellquoted options to 797 | # pass to the 'psql' command. 
798 | my ($host, $user, $db) = ('localhost', $ENV{USER}, $ENV{USER}); 799 | if ($_[0] =~ m#^(?:([^\@/:]+)@)?([^\@/:]+)/([^/]+)$#) { 800 | # NB: don't try to change this to use $1, $2, etc as arguments to 801 | # shell_quote. Perl references arguments, so this will fail horribly. 802 | ($user, $host, $db) = ($1 // $ENV{USER}, $2 // 'localhost', $3); 803 | shell_quote '-U', $user, '-h', $host, '-d', $db; 804 | } elsif ($_[0] =~ m#^(\w+)@(\w+)$#) { 805 | # Simple DB connection; assume localhost 806 | ($user, $db) = ($1, $2); 807 | shell_quote '-U', $user, '-d', $db; 808 | } elsif ($_[0] =~ m#^[^\@:/]+$#) { 809 | # Really simple connection 810 | shell_quote '-d', $_[0]; 811 | } elsif ($_[0] eq '@') { 812 | # Really simple connection; use username as DB 813 | shell_quote '-d', $ENV{USER}; 814 | } else { 815 | die "not sure how to parse postgres DB '$_[0]'"; 816 | } 817 | } 818 | 819 | sub expand_eval_shorthands { 820 | my $code = $_[0]; 821 | my @pieces = split /%\[(.*?)%\]/, $code; 822 | for my $i (0..$#pieces) { 823 | unless ($i & 1) { 824 | $pieces[$i] =~ s/%(\d+)/\$_[$1]/g; 825 | 1 while $pieces[$i] 826 | =~ s/([a-zA-Z0-9_\)\}\]?\$]) 827 | \. 828 | ([\$_a-zA-Z](?:-[0-9\w?\$]|[0-9_\w?\$])*) 829 | /$1\->{'$2'}/x; 830 | } 831 | } 832 | join '', @pieces; 833 | } 834 | 835 | sub parse_join_options { 836 | my ($f1, $f2, $file) = @_ == 3 ? @_ : ($_[0], 0, $_[1]); 837 | ($f1 + 1, $f2 + 1, $file); 838 | } 839 | 840 | sub compile_eval_into_function { 841 | my ($code, $name) = @_; 842 | $code = expand_eval_shorthands $code; 843 | eval "sub {\n$code\n}" 844 | or die "failed to compile $name function: $@\n (code was $code)"; 845 | } 846 | 847 | sub stateless_unary_fn { 848 | my ($name, $f) = @_; 849 | my $arity = $arity{$name}; 850 | ($name, sub { 851 | my @columns = split //, (@_ > $arity ? shift : undef) // '0'; 852 | while (<>) { 853 | be_verbose_as_appropriate length; 854 | chomp; 855 | my @fs = split /\t/; 856 | $fs[$_] = $f->($fs[$_], @_) for @columns; 857 | print row(@fs), "\n"; 858 | } 859 | }); 860 | } 861 | 862 | sub stateful_unary_fn { 863 | my ($name, $setup, $f) = @_; 864 | my $arity = $arity{$name}; 865 | ($name, sub { 866 | my @columns = split //, (@_ > $arity ? shift : undef) // '0'; 867 | my %states; 868 | $states{$_} = $setup->(@_) for @columns; 869 | while (<>) { 870 | be_verbose_as_appropriate length; 871 | chomp; 872 | my @fs = split /\t/; 873 | $fs[$_] = $f->($fs[$_], $states{$_}, @_) for @columns; 874 | print row(@fs), "\n"; 875 | } 876 | }); 877 | } 878 | 879 | sub exec_with_stdin { 880 | open my $fh, '|' . shell_quote @_ or die "failed to exec @_"; 881 | be_verbose_as_appropriate(length), print $fh $_ while <>; 882 | close $fh; 883 | } 884 | 885 | sub exec_with_diamond { 886 | if ($verbose || grep /\|/, @ARGV) { 887 | # Arguments are specified in filenames and involve processes, so use perl 888 | # to forward data. 889 | exec_with_stdin @_; 890 | } else { 891 | # Faster option: just exec the program in-place. This avoids a layer of 892 | # interprocess piping. Assume filenames follow arguments. 893 | exec @_, @ARGV or die "failed to exec @_ @ARGV"; 894 | } 895 | } 896 | 897 | sub sort_options { 898 | my ($column_spec) = @_; 899 | my @columns = split //, $column_spec // ''; 900 | my @options = exists $ENV{NFU_SORT_OPTIONS} 901 | ? split /\s+/, $ENV{NFU_SORT_OPTIONS} 902 | : ('-S', $ENV{NFU_SORT_BUFFER} || '64M', 903 | '--parallel=' . ($ENV{NFU_SORT_PARALLEL} || 4), 904 | $ENV{NFU_SORT_COMPRESS} 905 | ? 
("--compress-program=$ENV{NFU_SORT_COMPRESS}") 906 | : ()); 907 | return @options, 908 | (@columns 909 | ? ('-t', "\t", 910 | map {('-k', sprintf "%d,%d", $_ + 1, $_ + 1)} @columns) 911 | : ()); 912 | } 913 | 914 | sub sort_cmd {join ' ', 'sort', sort_options, @_} 915 | 916 | sub fifo_for { 917 | my ($file, @transforms) = @_; 918 | my $fifo_name = tmpnam; 919 | 920 | mkfifo $fifo_name, 0700 or die "failed to create fifo: $!"; 921 | 922 | return $fifo_name if fork; 923 | 924 | my $command = expand_filename_shorthands($file, 1) 925 | . join '', map {"$_ |"} @transforms; 926 | open my $into_fifo, '>', $fifo_name 927 | or die "failed to open fifo $fifo_name for writing: $!"; 928 | open my $from_file, $command 929 | or die "failed to open file/command $command for reading: $!"; 930 | 931 | be_verbose_as_appropriate(length), $into_fifo->print($_) while <$from_file>; 932 | close $into_fifo; 933 | close $from_file; 934 | 935 | unlink $fifo_name or warn "failed to unlink temporary fifo $fifo_name: $!"; 936 | exit 0; 937 | } 938 | 939 | sub hadoop { 940 | # Generates a hadoop command string, quoting all args as necessary 941 | join ' ', $ENV{NFU_HADOOP_COMMAND} // "hadoop", 942 | @_[0, 1], 943 | shell_quote(@_ > 2 ? @_[2..$#_] : ()); 944 | } 945 | 946 | sub hadoop_ls { 947 | # Now get the output file listing. This is a huge mess because Hadoop is a 948 | # huge mess. 949 | my $ls_command = hadoop('fs', '-ls', @_); 950 | grep /\/[^_][^\/]*$/, map +(split " ", $_, 8)[7], 951 | grep !/^Found/, 952 | split /\n/, ''.qx/$ls_command/; 953 | } 954 | 955 | sub gcd { 956 | my ($a, $b) = @_; 957 | while ($b) { 958 | if ($b > $a) { ($a, $b) = ($b, $a) } 959 | else { $a %= $b } 960 | } 961 | $a; 962 | } 963 | 964 | sub hadoop_partfile_n { $_[0] =~ /[^0-9]([0-9]+)(?:\.[^\/]+)?$/ ? $1 : 0 } 965 | sub hadoop_partsort { 966 | sort {hadoop_partfile_n($a) <=> hadoop_partfile_n($b)} @_ 967 | } 968 | 969 | sub hadoop_matched_partfiles { 970 | my ($path, $partfile) = @_; 971 | my $partfile_dirname = $partfile =~ s/\/[^\/]+$//r; 972 | my @left_files; 973 | my @possibilities; 974 | 975 | # We won't be able to do anything until both sides of the join exist. 976 | until (@left_files = hadoop_partsort hadoop_ls $partfile_dirname) { 977 | print STDERR "hadoop_matched_partfiles: waiting for input...\n"; 978 | hc qw/hadoop_matched_partfiles left_wait/; 979 | sleep 1; 980 | } 981 | 982 | until (@possibilities = hadoop_partsort hadoop_ls $path) { 983 | print STDERR "hadoop_matched_partfiles: waiting for join data...\n"; 984 | hc qw/hadoop_matched_partfiles right wait/; 985 | sleep 1; 986 | } 987 | 988 | my $n = hadoop_partfile_n $partfile; 989 | unless ($n < @left_files && $left_files[$n] eq $partfile) { 990 | my $partfile_n = $n; 991 | $n = 0; 992 | ++$n until $n >= @left_files 993 | || $partfile_n == hadoop_partfile_n $partfile; 994 | } 995 | 996 | die "hadoop_matched_partfiles couldn't find the index of the specified " 997 | . "partfile ($partfile) within the list of map inputs (@left_files)" 998 | if $n >= @left_files; 999 | 1000 | # Assume Hadoop's default partitioning strategy using hashcode modulus. This 1001 | # means that the Kth of N files contains all keys for which H(key) % N == K. 1002 | my $reduction_factor = gcd scalar(@left_files), scalar(@possibilities); 1003 | my $left_redundancy = @possibilities / $reduction_factor; 1004 | my @files_to_read = @possibilities[map $_ * $reduction_factor 1005 | + $n % $reduction_factor, 1006 | 0 .. 
$left_redundancy - 1];
1007 | 
1008 | # Log some stuff to job counters so any problems are more evident.
1009 | if ($verbose) {
1010 | printf STDERR "hdfsjoin:$path [$partfile] (inferred partfile $n): %s\n",
1011 | join(' ', @files_to_read);
1012 | 
1013 | hc "hdfsjoin $path", "joins attempted", 1;
1014 | hc "hdfsjoin $path", "left/right reads", $left_redundancy;
1015 | hc "hdfsjoin $path", "overreads", $left_redundancy - 1;
1016 | }
1017 | 
1018 | @files_to_read;
1019 | }
1020 | 
1021 | sub hadoop_tempfile {
1022 | my $randomness = random128;
1023 | my $dir = $ENV{NFU_HADOOP_TMPDIR} // '/tmp';
1024 | "$dir/nfu-hadoop-$ENV{USER}-$randomness";
1025 | }
1026 | 
1027 | sub find_hadoop_streaming {
1028 | my @files = split /\n/, ''.qx|locate "*hadoop-streaming.jar*"|;
1029 | die "failed to locate hadoop streaming jar automatically; "
1030 | . "you should set NFU_HADOOP_STREAMING" unless @files;
1031 | print STDERR "nfu found hadoop streaming jar: $files[0]\n";
1032 | $files[0];
1033 | }
1034 | 
1035 | sub hadoop_into {
1036 | my ($outfile, $mapper, $combiner, $reducer) = @_;
1037 | my $streaming_jar = $ENV{NFU_HADOOP_STREAMING} // find_hadoop_streaming;
1038 | 
1039 | # Various shorthands for common cases. Both mapper and reducer being the
1040 | # identity function is used to repartition stuff, so we want this to be easy
1041 | # to type and recognize.
1042 | $mapper = $0 if $mapper eq ':';
1043 | $reducer = $0 if $reducer eq ':';
1044 | $reducer = 'NONE' if $reducer =~ /^[-_]$/;
1045 | $combiner = $0 if $combiner eq ':';
1046 | $combiner = 'NONE' if $combiner =~ /^[-_]$/;
1047 | $combiner = 'NONE' if $reducer eq 'NONE';
1048 | 
1049 | my @input_files;
1050 | my @delete_afterwards;
1051 | 
1052 | # Figure out where our input seems to be coming from. This is a little hacky,
1053 | # but I think it's worthwhile for the flexibility we get.
1054 | #
1055 | # Hadoop jobs output their list of output partfile names because the data is
1056 | # often too large. It's possible we'll get one of these as stdin, so each
1057 | # line will look like "hdfs:/...". In that case, we want to use each as an
1058 | # input file. Otherwise we'll want to upload stdin and all non-HDFS files to
1059 | # HDFS and use those.
1060 | #
1061 | # It's actually a bit tricky to figure out which files started out as hdfs:
1062 | # locations because by now they've all been filename alias-expanded. We need
1063 | # to reverse-engineer the alias if we want to reuse stuff already on HDFS.
1064 | 
1065 | my @other_argv;
1066 | while (@ARGV) {
1067 | local $_ = unexpand_filename_shorthands shift @ARGV;
1068 | if (s/^hdfs:(?:\/\/)?//) {
1069 | push @input_files, '-input', $_;
1070 | } else {
1071 | push @other_argv, expand_filename_shorthands $_;
1072 | }
1073 | }
1074 | 
1075 | @ARGV = @other_argv;
1076 | my $line = undef;
1077 | unless (-t STDIN) {
1078 | chomp($line), push @input_files, '-input', $line
1079 | while defined($line = <STDIN>) && $line =~ s/^hdfs://;
1080 | }
1081 | 
1082 | # At this point $line is either undefined or contains a line we shouldn't
1083 | # have read from stdin (i.e. it's data). If the latter, prepend it to the
1084 | # upload to HDFS.
1085 | if (defined $line) {
1086 | my $tempfile = hadoop_tempfile;
1087 | open my $fh, "| " . hadoop('fs', '-put', '-', $tempfile) . ' 1>&2'
1088 | or die "failed to open hadoop fs -put process for uploading: $!";
1089 | print $fh $line;
1090 | be_verbose_as_appropriate(length), print $fh $_ while <STDIN>;
1091 | close $fh;
1092 | 
1093 | push @input_files, '-input', $tempfile;
1094 | push @delete_afterwards, $tempfile;
1095 | }
1096 | 
1097 | if (@other_argv) {
1098 | my $tempfile = hadoop_tempfile;
1099 | open my $fh, "| " . hadoop('fs', '-put', '-', $tempfile) . ' 1>&2'
1100 | or die "failed to open hadoop fs -put process for uploading: $!";
1101 | be_verbose_as_appropriate(length), print $fh $_ while <>;
1102 | close $fh;
1103 | 
1104 | push @input_files, '-input', $tempfile;
1105 | push @delete_afterwards, $tempfile;
1106 | }
1107 | 
1108 | # Now all the input files are in place, so we can kick off the job.
1109 | my $extra_args = $ENV{NFU_HADOOP_OPTIONS} // '';
1110 | my @file_deps = map {('-file', $_)}
1111 | ($0, split /,/, $ENV{NFU_HADOOP_FILES} // '');
1112 | my $dirname = $0 =~ s/\/nfu$/\//r;
1113 | 
1114 | # We're uploading ourselves, so we'll need a current-directory reference to
1115 | # nfu. When nfu quotes itself, it produces an absolute path instead; this
1116 | # code rewrites those into ./nfu.
1117 | my $transformed_mapper = $mapper =~ s|\Q$dirname\E|./|gr;
1118 | my $transformed_combiner = $combiner =~ s|\Q$dirname\E|./|gr;
1119 | my $transformed_reducer = $reducer =~ s|\Q$dirname\E|./|gr;
1120 | 
1121 | # Write the mapper and reducer commands into files rather than passing them
1122 | # straight to hadoop streaming. This bypasses two problems:
1123 | #
1124 | # 1. Hadoop streaming might chop long arguments sooner than bash.
1125 | # 2. It word-splits differently from bash, which breaks shell_quote.
1126 | 
1127 | my $mapper_file = tmpnam;
1128 | my $combiner_file = tmpnam;
1129 | my $reducer_file = tmpnam;
1130 | 
1131 | open my $mapper_fh, '>', $mapper_file
1132 | or die "failed to create tempfile $mapper_file for map job: $!";
1133 | open my $combiner_fh, '>', $combiner_file
1134 | or die "failed to create tempfile $combiner_file for combine job: $!";
1135 | open my $reducer_fh, '>', $reducer_file
1136 | or die "failed to create tempfile $reducer_file for reduce job: $!";
1137 | print $mapper_fh "#!/bin/bash\n$transformed_mapper\n";
1138 | print $combiner_fh "#!/bin/bash\n$transformed_combiner\n";
1139 | print $reducer_fh "#!/bin/bash\n$transformed_reducer\n";
1140 | close $mapper_fh;
1141 | close $combiner_fh;
1142 | close $reducer_fh;
1143 | 
1144 | chmod 0755, $mapper_file;
1145 | chmod 0755, $combiner_file;
1146 | chmod 0755, $reducer_file;
1147 | 
1148 | my $jobname = "nfu streaming ["
1149 | . join(' ', grep $_ ne '-input', @input_files)
1150 | . "]: map($transformed_mapper), "
1151 | . "combine($transformed_combiner), "
1152 | . "reduce($transformed_reducer), "
1153 | . "options($extra_args)"
1154 | . " > $outfile";
1155 | 
1156 | my $hadoop_command =
1157 | hadoop('jar',
1158 | shell_quote($streaming_jar)
1159 | . " -D mapred.job.name=" . shell_quote($jobname) . " $extra_args",
1160 | @file_deps,
1161 | @input_files,
1162 | '-file', $mapper_file,
1163 | '-file', $combiner_file,
1164 | '-file', $reducer_file,
1165 | '-output', $outfile,
1166 | '-mapper', "./" . ($mapper_file =~ s/^.*\///r),
1167 | $combiner ne 'NONE'
1168 | ? ('-combiner', "./" . ($combiner_file =~ s/^.*\///r))
1169 | : (),
1170 | '-reducer', $reducer eq 'NONE'
1171 | ? $reducer
1172 | : "./" . ($reducer_file =~ s/^.*\///r));
1173 | 
1174 | system $hadoop_command . 
' 1>&2' 1175 | and die "failed to execute hadoop command $hadoop_command: $!"; 1176 | 1177 | unlink $mapper_file; 1178 | unlink $combiner_file; 1179 | unlink $reducer_file; 1180 | 1181 | system hadoop('fs', '-rm', '-r', @delete_afterwards) . ' 1>&2' 1182 | if @delete_afterwards; 1183 | "hdfs:$outfile"; 1184 | } 1185 | 1186 | sub sql_infer_column_type { 1187 | # Try to figure out the right type for a column based on some values for it. 1188 | # The possibilities are 'text', 'integer', or 'real'; this is roughly the set 1189 | # of stuff supported by both postgres and sqlite. 1190 | return 'integer' unless grep length && !/^-?[0-9]+$/, @_; 1191 | return 'real' unless grep length && 1192 | !/^-?[0-9]+$ 1193 | | ^-?[0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?$ 1194 | | ^-?[0-9]* \.[0-9]+ (?:[eE][-+]?[0-9]+)?$/x, 1195 | @_; 1196 | return 'text'; 1197 | } 1198 | 1199 | sub sql_infer_schema { 1200 | # Takes a list of TSV lines and generates a table schema that will store 1201 | # them. 1202 | my $n = max map scalar(split /\t/), @_; 1203 | my @columns = map [], 1 .. $n; 1204 | for (@_) { 1205 | my @vs = split /\t/; 1206 | push @{$columns[$_]}, $vs[$_] for 0 .. $#vs; 1207 | } 1208 | 1209 | my @types = map sql_infer_column_type(@$_), @columns; 1210 | join ', ', map sprintf("f%d %s", $_, $types[$_]), 0 .. $#columns; 1211 | } 1212 | 1213 | sub sql_schema_and_buffer { 1214 | # WARNING: this function modifies its argument 1215 | my ($schema) = @_; 1216 | my @read; 1217 | if ($schema eq '_') { 1218 | # Infer the schema, which involves reading some data up front. 1219 | push @read, $_ while @read < SQL_INFER_PEEK_LINES 1220 | and $diamond_has_data &&= defined($_ = <>); 1221 | $schema = sql_infer_schema @read; 1222 | } 1223 | $_[0] = $schema; 1224 | @read; 1225 | } 1226 | 1227 | sub sql_parse_args { 1228 | my ($dbt, $schema, $query) = @_; 1229 | my @pieces = split /:/, $dbt; 1230 | 1231 | if (@pieces > 1) { 1232 | my $table = pop @pieces; 1233 | my $db = join ':', @pieces; 1234 | ($db, $table, $schema, $query); 1235 | } else { 1236 | # Use a default table called 't' with an automatic index 1237 | ($pieces[0], '+t', $schema, $query); 1238 | } 1239 | } 1240 | 1241 | sub write_buffer_and_stdin { 1242 | my ($fh, @buffer) = @_; 1243 | be_verbose_as_appropriate(length), $fh->print($_) for @buffer; 1244 | be_verbose_as_appropriate(length), $fh->print($_) 1245 | while $diamond_has_data &&= defined($_ = <>); 1246 | } 1247 | 1248 | sub sql_first_column_index { 1249 | # WARNING: this function modifies its first argument 1250 | my ($table, $schema) = @_; 1251 | my $column = $schema =~ s/\s.*$//r; 1252 | my $index = $table =~ s/^\+// 1253 | ? "CREATE INDEX $table$column ON $table($column);\n" 1254 | : ''; 1255 | $_[0] = $table; 1256 | $index; 1257 | } 1258 | 1259 | my %functions = ( 1260 | read => sub { 1261 | while (<>) { 1262 | chomp; 1263 | my $f = expand_filename_shorthands $_, 1; 1264 | open my $fh, $f or die "failed to open pseudofile $_ ($f): $!"; 1265 | be_verbose_as_appropriate(length), print while <$fh>; 1266 | close $fh; 1267 | } 1268 | }, 1269 | 1270 | buffer => sub { 1271 | my $f = $_[0] =~ /^[-_:]$/ ? 
tmpnam : $_[0]; 1272 | open my $fh, '>', $f or die "failed to open buffer file $f: $!"; 1273 | be_verbose_as_appropriate(length), $fh->print($_) while <>; 1274 | close $fh; 1275 | print $f, "\n"; 1276 | }, 1277 | 1278 | branch => sub { 1279 | my (@cases) = split /\n/, $_[0]; 1280 | my %branches; 1281 | my %branch_matchers; 1282 | my @order; 1283 | 1284 | for (@cases) { 1285 | my ($k, $v) = map unpack('u', $_), split /\t/; 1286 | $v //= ''; 1287 | open my $fh, "| $0 $v" 1288 | or die "failed to open branch subprocess '$0 $v': $!"; 1289 | push @order, $k; 1290 | $branches{$k} = $fh; 1291 | $branch_matchers{$k} = 1292 | $k =~ s/^:// ? compile_eval_into_function $k, 'branch matcher function' 1293 | : $k eq '_' ? sub { 1 } 1294 | : sub { $_[0] eq $k }; 1295 | } 1296 | 1297 | while (<>) { 1298 | be_verbose_as_appropriate length; 1299 | chomp; 1300 | my @xs = split /\t/; 1301 | my @matching = grep $branch_matchers{$_}->(@xs), @order; 1302 | 1303 | die "branch failed to match '$xs[0]' against any of ( @order )" 1304 | unless @matching; 1305 | $branches{$matching[0]}->print(row(@xs), "\n"); 1306 | } 1307 | 1308 | close for values %branches; 1309 | }, 1310 | 1311 | tcp => sub { 1312 | my ($hostport) = @_; 1313 | my ($host, $port) = $hostport =~ /:/ ? split /:/, $hostport 1314 | : ('0.0.0.0', $hostport); 1315 | 1316 | socket my($serversock), PF_INET, SOCK_STREAM, getprotobyname 'tcp' 1317 | or die "socket failed: $!"; 1318 | setsockopt $serversock, SOL_SOCKET, SO_REUSEADDR, pack 'l', 1 1319 | or die "setsockopt failed: $!"; 1320 | bind $serversock, sockaddr_in $port, INADDR_ANY 1321 | or die "bind failed: $!"; 1322 | listen $serversock, SOMAXCONN 1323 | or die "listen failed: $!"; 1324 | 1325 | while (1) { 1326 | # Fork twice per connection. This is egregious but necessary because the 1327 | # open() call blocks on a FIFO until the other end is connected, and we 1328 | # want the FIFO consumer to be at liberty to open them in either order 1329 | # without creating a deadlock. 1330 | my ($paddr, $client); 1331 | unless ($paddr = accept $client, $serversock) { 1332 | next if $!{EINTR}; 1333 | die "accept failed: $!"; 1334 | } 1335 | 1336 | my ($port, $iaddr) = sockaddr_in $paddr; 1337 | my $iname = gethostbyaddr $iaddr, AF_INET; 1338 | 1339 | my $r = tmpnam; mkfifo $r, 0700 or die "failed to create FIFO $r: $!"; 1340 | my $w = tmpnam; mkfifo $w, 0700 or die "failed to create FIFO $w: $!"; 1341 | 1342 | print join("\t", $r, $w, $iname, $port), "\n"; 1343 | 1344 | # Socket reads: write to the "reader" fifo 1345 | unless (fork) { 1346 | my $buf; 1347 | sysopen my $fh, $r, O_WRONLY or die "failed to open $r for writing: $!"; 1348 | syswrite $fh, $buf while sysread $client, $buf, 8192; 1349 | close $fh; 1350 | close $client; 1351 | unlink $r; 1352 | exit; 1353 | } 1354 | 1355 | # Socket writes: read from the "writer" fifo 1356 | unless (fork) { 1357 | my $buf; 1358 | sysopen my $fh, $w, O_RDONLY or die "failed to open $w for writing: $!"; 1359 | syswrite $client, $buf while sysread $fh, $buf, 8192; 1360 | close $fh; 1361 | close $client; 1362 | unlink $w; 1363 | exit; 1364 | } 1365 | 1366 | close $client; 1367 | } 1368 | }, 1369 | 1370 | http => sub { 1371 | # Connect this to a --tcp thing to do HTTP stuff. In practice this just 1372 | # means parsing out the headers and handing off parsed data to the command. 
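# A rough sketch of the handoff (field names here are just descriptive): for
# each connection row produced by --tcp, the command below receives one TSV
# row of the form
#
#   writer-fifo  request-url  headers-as-JSON  peer-host  peer-port
#
# so a handler can parse the request and write its HTTP response into the
# writer fifo. E.g., with a hypothetical handler script:
#
#   $ nfu --tcp 8080 --http ./serve-request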
1373 | my ($command) = @_; 1374 | open my $fh, "| $command" or die "http failed to launch $command: $!"; 1375 | select((select($fh), $| = 1)[0]); 1376 | 1377 | while (<>) { 1378 | chomp; 1379 | my ($in, $out, $port, $iaddr) = split /\t/; 1380 | open my $indata, '<', $in or die "http failed to read socket $in: $!"; 1381 | 1382 | my $url; 1383 | my %headers; 1384 | while (<$indata>) { 1385 | chomp; 1386 | last if length($_) <= 1; 1387 | $headers{$1} = $2 if defined $url && /^([^:]+):\s*(.*)$/; 1388 | $url = $1 if !defined $url && /^[A-Z]+\s+(\S+)/; 1389 | } 1390 | close $indata; 1391 | my $http_data = row($out, $url, je({%headers}), $port, $iaddr) . "\n"; 1392 | $fh->print($http_data); 1393 | } 1394 | }, 1395 | 1396 | group => sub {exec_with_diamond 'sort', sort_options @_}, 1397 | rgroup => sub {exec_with_diamond 'sort', '-r', sort_options @_}, 1398 | order => sub {exec_with_diamond 'sort', '-g', sort_options @_}, 1399 | rorder => sub {exec_with_diamond 'sort', '-rg', sort_options @_}, 1400 | 1401 | count => sub { 1402 | # Same behavior as uniq -c, but delimits counts with \t; also takes an 1403 | # optional series of columns to uniq by, rather than using the whole row. 1404 | my @columns = split //, shift // ''; 1405 | my $last; 1406 | my @last; 1407 | my $count = -1; 1408 | 1409 | while (<>) { 1410 | be_verbose_as_appropriate length; 1411 | chomp; 1412 | 1413 | my @xs = split /\t/; 1414 | @xs = @xs[@columns] if @columns; 1415 | $last = $_, @last = @xs unless ++$count; 1416 | 1417 | for (my $i = 0; $i < max scalar(@xs), scalar(@last); ++$i) { 1418 | if (!defined $xs[$i] || !defined $last[$i] || $xs[$i] ne $last[$i]) { 1419 | print "$count\t$last\n"; 1420 | $count = 0; 1421 | @last = @xs; 1422 | $last = $_; 1423 | last; 1424 | } 1425 | } 1426 | } 1427 | 1428 | ++$count; 1429 | print "$count\t$last\n" if defined $last; 1430 | }, 1431 | 1432 | uncount => sub { 1433 | while (<>) { 1434 | be_verbose_as_appropriate length; 1435 | my ($n, $line) = split /\t/, $_, 2; 1436 | $line //= "\n"; 1437 | print $line for 1..$n; 1438 | } 1439 | }, 1440 | 1441 | index => sub { 1442 | # Inner join by appending joined fields to the end. 1443 | my ($f1, $f2, $join_file) = parse_join_options @_; 1444 | 1445 | my $sorted_index = fifo_for $join_file, sort_cmd "-t '\t' -k${f2}b,$f2"; 1446 | my $command = sort_cmd "-t '\t' -k ${f1}b,$f1" . 1447 | "| join -t '\t' -1 $f1 -2 $f2 - '$sorted_index'"; 1448 | 1449 | open my $to_join, "| $command" or die "failed to exec $command: $!"; 1450 | be_verbose_as_appropriate(length), print $to_join $_ while <>; 1451 | close $to_join; 1452 | }, 1453 | 1454 | indexouter => sub { 1455 | # Outer left join by appending joined fields to the end. 1456 | my ($f1, $f2, $join_file) = parse_join_options @_; 1457 | 1458 | my $sorted_index = fifo_for $join_file, sort_cmd "-t '\t' -k ${f2}b,$f2"; 1459 | my $command = sort_cmd "-t '\t' -k ${f1}b,$f1" . 1460 | "| join -a 1 -t '\t' -1 $f1 -2 $f2 - '$sorted_index'"; 1461 | 1462 | open my $to_join, "| $command" or die "failed to exec $command: $!"; 1463 | be_verbose_as_appropriate(length), print $to_join $_ while <>; 1464 | close $to_join; 1465 | }, 1466 | 1467 | join => sub { 1468 | # Inner join against sorted data by appending joined fields to the end. 1469 | my ($f1, $f2, $join_file) = parse_join_options @_; 1470 | 1471 | my $sorted_index = fifo_for $join_file; 1472 | my $command = sort_cmd "-t '\t' -k ${f1}b,$f1" . 
1473 | "| join -t '\t' -1 $f1 -2 $f2 - '$sorted_index'"; 1474 | 1475 | open my $to_join, "| $command" or die "failed to exec $command: $!"; 1476 | be_verbose_as_appropriate(length), print $to_join $_ while <>; 1477 | close $to_join; 1478 | }, 1479 | 1480 | joinouter => sub { 1481 | # Outer left join against sorted data by appending joined fields to the 1482 | # end. 1483 | my ($f1, $f2, $join_file) = parse_join_options @_; 1484 | 1485 | my $sorted_index = fifo_for $join_file; 1486 | my $command = sort_cmd "-t '\t' -k ${f1}b,$f1" . 1487 | "| join -a 1 -t '\t' -1 $f1 -2 $f2 - '$sorted_index'"; 1488 | 1489 | open my $to_join, "| $command" or die "failed to exec $command: $!"; 1490 | be_verbose_as_appropriate(length), print $to_join $_ while <>; 1491 | close $to_join; 1492 | }, 1493 | 1494 | with => sub { 1495 | # Like 'paste'. Joins lines with \t. 1496 | my ($f) = @_; 1497 | open my $fh, expand_filename_shorthands $f, 1 1498 | or die "failed to open --with pseudofile $f: $!"; 1499 | my ($part1, $part2); 1500 | while (defined($part1 = <>) and defined($part2 = <$fh>)) { 1501 | be_verbose_as_appropriate length($part1) + length($part2); 1502 | chomp $part1; 1503 | chomp $part2; 1504 | print $part1, "\t", $part2, "\n"; 1505 | } 1506 | close $fh; 1507 | }, 1508 | 1509 | repeat => sub { 1510 | my ($n, $f) = @_; 1511 | my $count = 0; 1512 | while (!$n || $count++ < $n) { 1513 | open my $fh, expand_filename_shorthands $f, 1 1514 | or die "failed to open --repeat pseudofile $f: $!"; 1515 | be_verbose_as_appropriate(length), print while <$fh>; 1516 | close $fh; 1517 | } 1518 | }, 1519 | 1520 | octave => sub { 1521 | my ($commands) = @_; 1522 | my $temp = tmpnam; 1523 | open my $fh, '>', $temp or die $!; 1524 | be_verbose_as_appropriate(length), print $fh $_ while <>; 1525 | close $fh; 1526 | 1527 | system 'octave', 1528 | '-q', 1529 | '--eval', "xs = load(\"$temp\");" 1530 | . "unlink(\"$temp\");" 1531 | . "save_precision(48);" 1532 | . "$commands;" 1533 | . "save -text $temp xs" and die "octave command failed"; 1534 | 1535 | open $fh, '<', $temp; 1536 | /^\s*#/ or 1537 | /^\s*$/ or 1538 | print join("\t", map $_ eq "NA" ? $_ : 0 + $_, grep length, split /\s+/), 1539 | "\n" while <$fh>; 1540 | close $fh; 1541 | unlink $temp; 1542 | }, 1543 | 1544 | numpy => sub { 1545 | my ($commands) = @_; 1546 | my $temp = tmpnam; 1547 | open my $fh, '>', $temp or die $!; 1548 | be_verbose_as_appropriate(length), print $fh $_ while <>; 1549 | close $fh; 1550 | 1551 | # Fix up the indentation for cases like this: 1552 | # $ nfu ... --numpy 'xs += 1 1553 | # xs *= 4' # no real indent here 1554 | # 1555 | # $ nfu ... --numpy 'if something: 1556 | # xs += 1' # minor indent here (assume 2) 1557 | 1558 | my @lines = split /\n/, $commands; 1559 | my @indents = map length(s/\S.*$//r), @lines; 1560 | my $indent = @lines > 1 ? $indents[1] - $indents[0] : 0; 1561 | 1562 | # If we're expecting an indentation of some amount after the first line, we 1563 | # need to be careful: we don't know how much the user decided to indent the 1564 | # block, and if we get it wrong then Python will complain at the next 1565 | # outdent. (If there's no outdent, then we can use anything.) 
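# To hedge against that, the adjustment below strips one space less than the
# second line's indent relative to the first, and never more than the
# smallest indent seen on any later line, so an outdented line can't be
# pushed past column 0.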
1566 | $indent = min $indent - 1, @indents[2..$#indents] 1567 | if $lines[0] =~ /:\s*(#.*)?$/ && @lines > 2; 1568 | 1569 | my $spaces = ' ' x $indent; 1570 | $lines[$_] =~ s/^$spaces// for 1..$#lines; 1571 | $commands = join "\n", @lines; 1572 | 1573 | system 'python', 1574 | '-c', 1575 | " 1576 | import numpy as np 1577 | xs = np.loadtxt(\"$temp\") 1578 | $commands 1579 | np.savetxt(\"$temp\", xs, delimiter=\"\\t\")" and die "numpy command failed"; 1580 | 1581 | open $fh, '<', $temp; 1582 | /^\s*#/ or 1583 | /^\s*$/ or 1584 | print join("\t", map $_ eq "NA" ? $_ : 0 + $_, grep length, split /\s+/), 1585 | "\n" while <$fh>; 1586 | close $fh; 1587 | unlink $temp; 1588 | }, 1589 | 1590 | stateful_unary_fn('average', 1591 | sub {my ($size, $n, $total) = ($_[0] // 0, 0, 0); 1592 | [$size, $n, $total, []]}, 1593 | sub { 1594 | my ($x, $state) = @_; 1595 | my ($size, $n, $total, $window) = @$state; 1596 | $total += $x; 1597 | ++$n; 1598 | my $v = $total / ($n > $size && $size ? $size : $n); 1599 | $total -= shift @$window if $size and push(@$window, $x) >= $size; 1600 | $$state[1] = $n; 1601 | $$state[2] = $total; 1602 | $v; 1603 | }), 1604 | 1605 | stateful_unary_fn('intify', 1606 | sub {[{}, 0]}, 1607 | sub { 1608 | my ($x, $state) = @_; 1609 | $state->[0]->{$x} //= $state->[1]++; 1610 | }), 1611 | 1612 | aggregate => sub { 1613 | my $f = compile_eval_into_function $_[0], 'aggregate function'; 1614 | my @columns; 1615 | while (my $line = <>) { 1616 | be_verbose_as_appropriate length $line; 1617 | chomp $line; 1618 | my @fields = split /\t/, $line; 1619 | 1620 | # Two cases here. If the new record is compatible with the most recent 1621 | # existing one, or there aren't any existing ones, then group it and 1622 | # don't call the aggregator yet. 1623 | # 1624 | # If we see a change, then call the aggregator and empty out the group. 1625 | # 1626 | # Note that the aggregator function is called on columns, not rows. 1627 | 1628 | my $n = @columns && @{$columns[0]}; 1629 | if (!$n or $fields[0] eq ${$columns[0]}[0]) { 1630 | $columns[$_][$n] = $fields[$_] for 0 .. $#fields; 1631 | } else { 1632 | $_ = ${$columns[0]}[0]; 1633 | print $_, "\n" for $f->(@columns); 1634 | @columns = (); 1635 | $columns[$_][0] = $fields[$_] for 0 .. $#fields; 1636 | } 1637 | } 1638 | if (@columns) { 1639 | $_ = ${$columns[0]}[0]; 1640 | print $_, "\n" for $f->(@columns); 1641 | } 1642 | }, 1643 | 1644 | fold => sub { 1645 | my $f = compile_eval_into_function $_[0], 'fold function'; 1646 | my @saved; 1647 | while (<>) { 1648 | be_verbose_as_appropriate length; 1649 | chomp; 1650 | my $line = $_; 1651 | if ($f->(split /\t/)) { 1652 | push @saved, $line; 1653 | } else { 1654 | print row(@saved), "\n" if @saved; 1655 | @saved = ($line); 1656 | } 1657 | } 1658 | print row(@saved), "\n" if @saved; 1659 | }, 1660 | 1661 | stateless_unary_fn('log', sub { 1662 | my ($x, $base) = @_; 1663 | my $log = log $x; 1664 | $log /= log $base if defined $base; 1665 | $log; 1666 | }), 1667 | 1668 | stateless_unary_fn('exp', sub { 1669 | my ($x, $base) = @_; 1670 | defined $base ? $base ** $x : exp $x; 1671 | }), 1672 | 1673 | stateless_unary_fn('quant', sub { 1674 | my ($x, $quantum) = @_; 1675 | round_to $x, $quantum; 1676 | }), 1677 | 1678 | # Note: this needs to be stdin; otherwise "nfu -p %l filename" will fail 1679 | # (since exec_with_diamond trieds to pass filename straight into gnuplot). 1680 | plot => sub { 1681 | exec_with_stdin 'gnuplot', 1682 | '-e', 1683 | 'plot "-" ' . 
join(' ', expand_gnuplot_options @_), 1684 | '-persist'; 1685 | }, 1686 | 1687 | splot => sub { 1688 | exec_with_stdin 'gnuplot', 1689 | '-e', 1690 | 'splot "-" ' . join(' ', expand_gnuplot_options @_), 1691 | '-persist'; 1692 | }, 1693 | 1694 | mplot => sub { 1695 | my @gnuplot_options = split /;/, join ' ', expand_gnuplot_options @_; 1696 | my $fname = tmpnam; 1697 | my $cols = 0; 1698 | open my $fh, '>', $fname or die "failed to open tempfile for mplot: $!"; 1699 | while (<>) { 1700 | be_verbose_as_appropriate length; 1701 | $cols = max $cols, 1 + scalar(my @xs = /\t/g); 1702 | print $fh $_; 1703 | } 1704 | close $fh; 1705 | 1706 | # If we're requesting only one plot, assume the intent is to replicate 1707 | # those settings across every observed column. 1708 | my $plot_command = 1709 | 'plot ' . join ',', 1710 | @gnuplot_options > 1 1711 | ? map("\"$fname\" $_", @gnuplot_options) 1712 | : map("\"$fname\" using $_ $gnuplot_options[0]", 1..$cols); 1713 | 1714 | system 'gnuplot', '-e', $plot_command, '-persist'; 1715 | 1716 | # HACK: the problem is that gnuplot internally forks a subprocess for the 1717 | # plot window, which we won't be able to see from here (that I know of). If 1718 | # we delete the file before that subprocess exits, then any zoom operations 1719 | # will cause gnuplot to abruptly exit. 1720 | # 1721 | # I'm sure there's a better way to solve this, but for now this should do 1722 | # the job for now. 1723 | unless (fork) { 1724 | setsid; 1725 | close STDIN; 1726 | close STDOUT; 1727 | unless (fork) { 1728 | sleep 3600; 1729 | unlink $fname or die "failed to unlink $fname: $!"; 1730 | } 1731 | } 1732 | }, 1733 | 1734 | poll => sub { 1735 | my ($sleep, $command) = @_; 1736 | die "usage: --poll sleep-amount 'command ...'" 1737 | unless defined $sleep and defined $command; 1738 | system($command), sleep $sleep while 1; 1739 | }, 1740 | 1741 | stateful_unary_fn('delta', 1742 | sub {[0]}, 1743 | sub {my ($x, $state) = @_; 1744 | my $v = $x - $$state[0]; 1745 | $$state[0] = $x; 1746 | $v}), 1747 | 1748 | stateful_unary_fn('sum', 1749 | sub {[0]}, 1750 | sub {my ($x, $state) = @_; 1751 | $$state[0] += $x}), 1752 | 1753 | stateful_unary_fn('variance', 1754 | sub {[0, 0, 0]}, 1755 | sub {my ($x, $state) = @_; 1756 | $$state[0] += $x; 1757 | $$state[1] += $x * $x; 1758 | $$state[2]++; 1759 | my ($sx, $sx2, $count) = @$state; 1760 | ($sx2 - ($sx * $sx / $count)) / ($count - 1 || 1)}), 1761 | 1762 | stateful_unary_fn('sd', 1763 | sub {[0, 0, 0]}, 1764 | sub {my ($x, $state) = @_; 1765 | $$state[0] += $x; 1766 | $$state[1] += $x * $x; 1767 | $$state[2]++; 1768 | my ($sx, $sx2, $count) = @$state; 1769 | sqrt(($sx2 - ($sx * $sx / $count)) / ($count - 1 || 1))}), 1770 | 1771 | stateful_unary_fn('entropy', 1772 | # state contains [$total, $entropy_so_far] and uses the following 1773 | # associative combiner (where F(X) = frequency of X, unscaled probability): 1774 | # 1775 | # let t = F(A) + F(B) 1776 | # H(A + B) = F(A)/t * (-log(F(A)/t) + H(A)) 1777 | # + F(B)/t * (-log(F(B)/t) + H(B)) 1778 | 1779 | sub {[0, 0]}, 1780 | sub {my ($x, $state) = @_; 1781 | my ($f0, $h0) = @$state; 1782 | my $f = $$state[0] += $x; 1783 | my $p = $x / $f; 1784 | my $p0 = $f0 / $f; 1785 | $$state[1] = $p0 * (($p0 > 0 ? -log($p0) / LOG_2 : 0) + $h0) 1786 | + $p * ($p > 0 ? 
-log($p) / LOG_2 : 0)}), 1787 | 1788 | take => sub { 1789 | if ($_[0] =~ s/^\+//) { 1790 | # Take last n, so we need a line queue 1791 | my @q; 1792 | my $i = 0; 1793 | be_verbose_as_appropriate(length), $q[$i++ % $_[0]] = $_ while <>; 1794 | print for @q[$i % $_[0] .. $#q]; 1795 | print for @q[0 .. $i % $_[0] - 1]; 1796 | } else { 1797 | my $n = $_[0] // 1; 1798 | while (<>) { 1799 | be_verbose_as_appropriate length; 1800 | last if --$n < 0; 1801 | print; 1802 | } 1803 | } 1804 | }, 1805 | 1806 | sample => sub { 1807 | while (<>) { 1808 | be_verbose_as_appropriate length; 1809 | print if rand() < $_[0]; 1810 | } 1811 | }, 1812 | 1813 | drop => sub { 1814 | my $n = $_[0] // 1; 1815 | if ($n) { 1816 | while (<>) { 1817 | be_verbose_as_appropriate length; 1818 | last if --$n <= 0; 1819 | } 1820 | } 1821 | be_verbose_as_appropriate(length), print while <>; 1822 | }, 1823 | 1824 | map => sub { 1825 | my $f = compile_eval_into_function $_[0], 'map function'; 1826 | while (<>) { 1827 | be_verbose_as_appropriate length; 1828 | chomp; 1829 | print "$_\n" for $f->(split /\t/); 1830 | } 1831 | }, 1832 | 1833 | pmap => sub { 1834 | my @fhs; 1835 | my $wbits = ''; 1836 | my $wout = ''; 1837 | my $i = 0; 1838 | 1839 | for (1 .. $ENV{NFU_PMAP_PARALLELISM} // 16) { 1840 | my $mapper = quote_self '--child', @evaled_code, '--map', $_[0]; 1841 | open my $fh, "| $mapper" 1842 | or die "failed to open child process $mapper: $!"; 1843 | 1844 | vec($wbits, fileno($fh), 1) = 1; 1845 | push @fhs, $fh; 1846 | } 1847 | 1848 | while (<>) { 1849 | be_verbose_as_appropriate length; 1850 | select undef, $wout = $wbits, undef, undef; 1851 | ++$i until vec($wout, fileno $fhs[$i % @fhs], 1); 1852 | syswrite $fhs[$i++ % @fhs], $_; 1853 | } 1854 | close for @fhs; 1855 | }, 1856 | 1857 | keep => sub { 1858 | my $f = $_[0] =~ /^\d+$/ 1859 | ? eval "sub {" . join("&&", map "\$_[$_]", split //, $_[0]) . "}" 1860 | : compile_eval_into_function $_[0], 'keep function'; 1861 | while (<>) { 1862 | my $line = $_; 1863 | be_verbose_as_appropriate length; 1864 | chomp; 1865 | my @xs = split /\t/; 1866 | print $line if $f->(@xs); 1867 | } 1868 | }, 1869 | 1870 | remove => sub { 1871 | my $f = $_[0] =~ /^\d+$/ 1872 | ? eval "sub {" . join("&&", map "\$_[$_]", split //, $_[0]) . "}" 1873 | : compile_eval_into_function $_[0], 'remove function'; 1874 | while (<>) { 1875 | my $line = $_; 1876 | be_verbose_as_appropriate length; 1877 | chomp; 1878 | my @xs = split /\t/; 1879 | print $line unless $f->(@xs); 1880 | } 1881 | }, 1882 | 1883 | each => sub { 1884 | my ($template) = @_; 1885 | while (<>) { 1886 | be_verbose_as_appropriate length; 1887 | chomp; 1888 | my $c = $template =~ s/\{\}/$_/gr; 1889 | system $c and die "each: failed to run $c: $!"; 1890 | } 1891 | }, 1892 | 1893 | every => sub { 1894 | my ($n) = @_; 1895 | my $i = 0; 1896 | while (<>) { 1897 | be_verbose_as_appropriate length; 1898 | print unless $i++ % $n; 1899 | } 1900 | }, 1901 | 1902 | fields => sub { 1903 | my ($fields) = @_; 1904 | my $everything = $fields =~ s/\.$//; 1905 | my @fs = split //, $fields; 1906 | $everything &&= 1 + max @fs; 1907 | 1908 | while (<>) { 1909 | be_verbose_as_appropriate length; 1910 | chomp; 1911 | my @xs = split /\t/; 1912 | my @ys = @xs[@fs]; 1913 | push @ys, @xs[$everything .. 
$#xs] if $everything; 1914 | print join("\t", map $_ // '', @ys), "\n"; 1915 | } 1916 | }, 1917 | 1918 | fieldsplit => sub { 1919 | my $pattern = $implosions{$_[0]} // $_[0]; 1920 | $pattern = $fieldsplit_shorthands{$pattern} // $pattern; 1921 | my $delim = qr/$pattern/; 1922 | while (<>) { 1923 | be_verbose_as_appropriate length; 1924 | chomp; 1925 | print join("\t", split /$delim/), "\n"; 1926 | } 1927 | }, 1928 | 1929 | number => sub { 1930 | my $n = 0; 1931 | while (<>) { 1932 | be_verbose_as_appropriate length; 1933 | chomp; 1934 | print row(++$n, $_), "\n"; 1935 | } 1936 | }, 1937 | 1938 | ntiles => sub { 1939 | my ($n) = @_; 1940 | my $line_count = 0; 1941 | my $fifo = tmpnam; 1942 | mkfifo $fifo, 0700 or die "failed to create fifo: $!"; 1943 | open my $sorted, sort_cmd('-g', $fifo) . " |" 1944 | or die "failed to create sort process: $!"; 1945 | open my $fifo_fh, '>', $fifo or die "failed to write to fifo: $!"; 1946 | 1947 | # Push data into the sort process, keeping track of the number of lines. 1948 | # We'll use this count later to use constant rather than linear space. 1949 | while (<>) { 1950 | ++$line_count; 1951 | be_verbose_as_appropriate(length); 1952 | $fifo_fh->print($_); 1953 | } 1954 | close $fifo_fh; 1955 | unlink $fifo; 1956 | 1957 | # Ok, now grab the data for each of the N sampling points. Here's what 1958 | # we're doing: 1959 | # 1960 | # 0 1 2 <- things we grab for 2-tiles 1961 | # 1 2 3 4 5 6 7 8 9 a b c d <- data points 1962 | # 1963 | # 0 1 2 <- things we grab for 2-tiles 1964 | # 1 2 3 4 5 6 7 8 9 a b c <- data points 1965 | # 1966 | # 0 1 2 3 4 <- quartiles 1967 | # 1 2 3 4 5 6 7 <- data points 1968 | # 1969 | # To make this work, we just keep track of the previous data point as we're 1970 | # reading. 1971 | 1972 | # Normal case 1973 | my $i = 0; 1974 | my $previous = undef; 1975 | my $max_line = $line_count - 1; 1976 | while (<$sorted>) { 1977 | chomp; 1978 | $previous = $_ unless defined $previous; 1979 | 1980 | my $break = int($i * $n / $max_line) / $n * $max_line; 1981 | if ($i >= $break && $i - $break < 1) { 1982 | # Take this row, performing a weighted average with the previous one: 1983 | # 1984 | # break 1985 | # x V y 1986 | # | | 1987 | # ----|------------ 1988 | # $w 1-$w 1989 | # 1990 | # We want x*(1-$w) + y*$w. 
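#
# For example, --ntiles 4 over the seven sorted values 1..7 places breaks at
# 0, 1.5, 3, 4.5 and 6, so it emits 1, 2.5, 4, 5.5 and 7 (minimum, quartiles,
# maximum); 2.5 and 5.5 come from the weighted average described above.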
1991 | 1992 | my $w = $break - int $break || 1; 1993 | my $v = $previous * (1 - $w) + $_ * $w; 1994 | print $v, "\n"; 1995 | } 1996 | 1997 | ++$i; 1998 | $previous = $_; 1999 | } 2000 | close $sorted; 2001 | }, 2002 | 2003 | prepend => sub { 2004 | open my $fh, expand_filename_shorthands $_[0], 1 2005 | or die "failed to open --prepend pseudofile $_[0]: $!"; 2006 | be_verbose_as_appropriate(length), print while <$fh>; 2007 | close $fh; 2008 | print while <>; 2009 | }, 2010 | 2011 | append => sub { 2012 | open my $fh, expand_filename_shorthands $_[0], 1 2013 | or die "failed to open --append pseudofile $_[0]: $!"; 2014 | print while <>; 2015 | be_verbose_as_appropriate(length), print while <$fh>; 2016 | close $fh; 2017 | }, 2018 | 2019 | pipe => sub { 2020 | open my $fh, "| $_[0]" or die "failed to launch $_[0]: $!"; 2021 | be_verbose_as_appropriate(length), print $fh $_ while <>; 2022 | close $fh; 2023 | }, 2024 | 2025 | tee => sub { 2026 | open my $fh, "| $_[0]" or die "failed to launch $_[0]: $!"; 2027 | $SIG{PIPE} = 'IGNORE'; 2028 | while (<>) { 2029 | be_verbose_as_appropriate length; 2030 | $fh->print($_); 2031 | print; 2032 | } 2033 | close $fh; 2034 | }, 2035 | 2036 | duplicate => sub { 2037 | open my $fh1, "| $_[0]" or die "failed to launch $_[0]: $!"; 2038 | open my $fh2, "| $_[1]" or die "failed to launch $_[1]: $!"; 2039 | 2040 | # Important: keep going even if a subprocess rejects data. Otherwise things 2041 | # like "nfu --duplicate ^T1 ^T+1" will produce truncated output. 2042 | $SIG{PIPE} = 'IGNORE'; 2043 | 2044 | while (<>) { 2045 | be_verbose_as_appropriate length; 2046 | $fh1->print($_); 2047 | $fh2->print($_); 2048 | } 2049 | close $fh1; 2050 | close $fh2; 2051 | }, 2052 | 2053 | partition => sub { 2054 | my ($splitter, $cmd) = @_; 2055 | my %fhs; 2056 | my $f = compile_eval_into_function $splitter, 'partition function'; 2057 | 2058 | # Important: keep going even if a subprocess rejects data. Otherwise things 2059 | # like "nfu --partition ... ^T10" will produce truncated output. 2060 | $SIG{PIPE} = 'IGNORE'; 2061 | 2062 | my @open_partitions; 2063 | while (<>) { 2064 | be_verbose_as_appropriate length; 2065 | my $line = $_; 2066 | chomp(my $cline = $line); 2067 | my $p = $f->(split /\t/, $cline); 2068 | unless (exists $fhs{$p}) { 2069 | my $cmdsub = $cmd =~ s/\{\}/$p/gr =~ s/\{\.(\.*)\}/\{$1\}/gr; 2070 | open $fhs{$p}, "| $cmdsub" or die "failed to launch $cmdsub: $!"; 2071 | push @open_partitions, $p; 2072 | } 2073 | $fhs{$p}->print($line); 2074 | close($fhs{$p = shift @open_partitions}), delete $fhs{$p} 2075 | while @open_partitions > ($ENV{NFU_MAX_FILEHANDLES} // 64); 2076 | } 2077 | close for values %fhs; 2078 | }, 2079 | 2080 | hadoopc => sub { 2081 | my ($outfile, $mapper, $combiner, $reducer) = @_; 2082 | if ($outfile eq '.') { 2083 | # Print output data to stdout, then delete outfile 2084 | my $filename = hadoop_tempfile; 2085 | my @partfiles = map s/^hdfs:(?:\/\/)?//r, 2086 | hadoop_ls hadoop_into $filename, $mapper, $combiner, $reducer; 2087 | open my $fh, "| xargs hadoop fs -text" 2088 | or die "failed to execute xargs: $!"; 2089 | $fh->print("$_\n") for @partfiles; 2090 | close $fh; 2091 | system hadoop('fs', '-rm', '-r', $filename) . 
' 1>&2'; 2092 | } else { 2093 | $outfile = hadoop_tempfile if $outfile eq '@'; 2094 | print hadoop_into($outfile, $mapper, $combiner, $reducer), "\n"; 2095 | } 2096 | }, 2097 | 2098 | hadoop => sub { 2099 | my ($outfile, $mapper, $reducer) = @_; 2100 | if ($outfile eq '.') { 2101 | # Print output data to stdout, then delete outfile 2102 | my $filename = hadoop_tempfile; 2103 | my @partfiles = map s/^hdfs:(?:\/\/)?//r, 2104 | hadoop_ls hadoop_into $filename, $mapper, 'NONE', $reducer; 2105 | open my $fh, "| xargs hadoop fs -text" 2106 | or die "failed to execute xargs: $!"; 2107 | $fh->print("$_\n") for @partfiles; 2108 | close $fh; 2109 | system hadoop('fs', '-rm', '-r', $filename) . ' 1>&2'; 2110 | } else { 2111 | $outfile = hadoop_tempfile if $outfile eq '@'; 2112 | print hadoop_into($outfile, $mapper, 'NONE', $reducer), "\n"; 2113 | } 2114 | }, 2115 | 2116 | sql => sub { 2117 | my ($db, $table, $schema, $query) = sql_parse_args @_; 2118 | my @read = sql_schema_and_buffer $schema; 2119 | my $index = sql_first_column_index $table, $schema; 2120 | $query = expand_sql_shorthands $query; 2121 | 2122 | if ($db =~ s/^P//) { 2123 | # postgres 2124 | my $edb = expand_postgres_db $db; 2125 | my $q = $query eq '_' ? '' 2126 | : "COPY ($query) TO STDOUT WITH NULL AS '';\n"; 2127 | 2128 | open my $fh, "| psql -c " 2129 | . shell_quote("DROP TABLE IF EXISTS $table;\n" 2130 | . "CREATE TABLE $table ($schema);\n" 2131 | . "COPY $table FROM STDIN;\n" 2132 | . $index 2133 | . $q) 2134 | . " $edb 1>&2" 2135 | or die "psql: failed to open psql for table $table $!"; 2136 | 2137 | write_buffer_and_stdin $fh, @read; 2138 | close $fh; 2139 | print "sql:P$db:select * from $table\n" unless length $q; 2140 | } elsif ($db =~ s/^S//) { 2141 | # sqlite3 2142 | my $fifo_name = tmpnam; 2143 | mkfifo $fifo_name, 0700 or die "failed to create fifo: $!"; 2144 | 2145 | my $child = fork; 2146 | unless ($child) { 2147 | system "echo " 2148 | . shell_quote(".mode tabs\n" 2149 | . "DROP TABLE IF EXISTS $table;\n" 2150 | . "CREATE TABLE $table ($schema);\n" 2151 | . ".import $fifo_name $table\n" 2152 | . $index 2153 | . ($query eq '_' ? '' : "$query;\n")) 2154 | . "| sqlite3 " . shell_quote expand_sqlite_db $db; 2155 | } else { 2156 | open my $fh, '>', $fifo_name or die "failed to open fifo into sqlite: $!"; 2157 | write_buffer_and_stdin $fh, @read; 2158 | close $fh; 2159 | unlink $fifo_name; 2160 | waitpid $child, 0; 2161 | print "sql:S$db:select * from $table\n" if $query eq '_'; 2162 | } 2163 | } else { 2164 | die "unknown SQL prefix: " . substr($db, 0, 1) 2165 | . " (valid prefixes are P and S)"; 2166 | } 2167 | }, 2168 | 2169 | preview => sub { 2170 | $verbose = 0; # don't print over the pager 2171 | my $have_less = !system 'which less > /dev/null'; 2172 | my $have_more = !system 'which more > /dev/null'; 2173 | 2174 | my $less_program = $have_less ? 'less' 2175 | : $have_more ? 
'more' : 'cat'; 2176 | 2177 | exec_with_stdin $less_program; 2178 | }, 2179 | ); 2180 | 2181 | my %bracket_handlers = ( 2182 | '' => sub {my $stuff = shell_quote @_; 2183 | "".qx|$0 --quote $stuff| =~ s/\s*$//r}, 2184 | '@' => sub {my $stuff = shell_quote @_; 2185 | "sh:".(qx|$0 --quote $stuff| =~ s/\s*$//r)}, 2186 | q => sub {shell_quote @_}, 2187 | ); 2188 | 2189 | my %bracket_docs = ( 2190 | '' => 'nfu as function: [ -gc ] == "$(nfu --quote -gc)"', 2191 | '@' => 'nfu as data: @[ -gc foo ] == sh:"$(nfu --quote -gc foo)"', 2192 | q => 'quote things: q[ foo bar ] == "foo bar"', 2193 | ); 2194 | 2195 | # Print usage if the user clearly doesn't know what they're doing. 2196 | if (@ARGV ? $ARGV[0] =~ /^-[h?]$/ || $ARGV[0] =~ /^--(usage|help)$/ 2197 | : -t STDIN) { 2198 | 2199 | # Some checks for me to make sure I'm keeping the code well-maintained 2200 | exists $functions{$_} or die "no function for $_" for keys %usages; 2201 | exists $usages{$_} or die "no usage for $_" for keys %functions; 2202 | exists $arity{$_} or die "no arity for $_" for keys %usages; 2203 | exists $usages{$_ =~ s/--//r} or die "no usage for $_" 2204 | for values %explosions, keys %usages; 2205 | 2206 | exists $bracket_docs{$_} or die "no bracket doc for $_" 2207 | for keys %bracket_handlers; 2208 | 2209 | print STDERR "usage: nfu [prefix-commands...] [input-files...] commands...\n"; 2210 | print STDERR "where each command is one of the following:\n\n"; 2211 | 2212 | my $len = 1 + max map length, keys %usages; 2213 | my %short_lookup; 2214 | $short_lookup{$explosions{$_} =~ s/^--//r} = $_ for keys %explosions; 2215 | 2216 | for my $cmd (sort keys %usages) { 2217 | my $short = $short_lookup{$cmd}; 2218 | $short = defined $short ? "-$short|" : ' '; 2219 | printf STDERR " %s--%-${len}s(%d) %s\n", 2220 | $short, 2221 | $cmd, 2222 | $arity{$cmd}, 2223 | $usages{$cmd} ? $arity{$cmd} ? 
"<$usages{$cmd}>" 2224 | : "-- $usages{$cmd}" : ''; 2225 | } 2226 | 2227 | print STDERR "\nand prefix commands are:\n\n"; 2228 | 2229 | print STDERR " documentation (not used with normal commands):\n"; 2230 | print STDERR " --explain \n"; 2231 | print STDERR " --expand-pseudofile \n"; 2232 | print STDERR " --expand-code \n"; 2233 | print STDERR " --expand-gnuplot \n"; 2234 | print STDERR " --expand-sql \n"; 2235 | 2236 | print STDERR "\n pipeline modifiers:\n"; 2237 | print STDERR " --quote -- quotes args: eval \$(nfu --quote ...)\n"; 2238 | print STDERR " --use \n"; 2239 | print STDERR " --run \n"; 2240 | 2241 | print STDERR "\nargument bracket preprocessing:\n\n"; 2242 | 2243 | print STDERR " ^stuff -> [ -stuff ]\n\n"; 2244 | 2245 | my $bracket_max = max map length, keys %bracket_docs; 2246 | printf STDERR " %${bracket_max}s[ ] %s\n", $_, $bracket_docs{$_} 2247 | for sort keys %bracket_docs; 2248 | 2249 | my $pseudofile_len = 1 + max map length, keys %pseudofile_docs; 2250 | print STDERR "\npseudofile patterns:\n\n"; 2251 | printf STDERR " %-${pseudofile_len}s %s\n", $_, $pseudofile_docs{$_} 2252 | for sort keys %pseudofile_docs; 2253 | 2254 | print STDERR "\ngnuplot expansions:\n\n"; 2255 | printf STDERR " %2s -> '%s'\n", $_, $gnuplot_aliases{$_} 2256 | for sort keys %gnuplot_aliases; 2257 | 2258 | print STDERR "\nSQL expansions:\n\n"; 2259 | printf STDERR " %2s -> '%s'\n", $_, $sql_aliases{$_} 2260 | for sort keys %sql_aliases; 2261 | 2262 | print STDERR "\ndatabase prefixes:\n\n"; 2263 | printf STDERR " %s = %s\n", @$_ 2264 | for ['P' => 'PostgreSQL'], 2265 | ['S' => 'SQLite 3']; 2266 | 2267 | my $env_len = 1 + max map length, keys %env_docs; 2268 | print STDERR "\nenvironment variables:\n\n"; 2269 | printf STDERR " %-${env_len}s %s\n", $_, $env_docs{$_} 2270 | for sort keys %env_docs; 2271 | 2272 | print STDERR "\n"; 2273 | print STDERR "see https://github.com/spencertipping/nfu for documentation\n"; 2274 | print STDERR "\n"; 2275 | 2276 | exit 1; 2277 | } 2278 | 2279 | if (@ARGV && $ARGV[0] =~ /^--expand/) { 2280 | my ($command, $x, @others) = @ARGV; 2281 | if ($command =~ /-pseudofile$/) { 2282 | print expand_filename_shorthands($x) // '', "\n"; 2283 | } elsif ($command =~ /-code$/) { 2284 | print expand_eval_shorthands($x), "\n"; 2285 | } elsif ($command =~ /-gnuplot$/) { 2286 | print expand_gnuplot_options($x), "\n"; 2287 | } elsif ($command =~ /-sql$/) { 2288 | print expand_sql_shorthands($x), "\n"; 2289 | } else { 2290 | print STDERR "unknown expansion command: $command\n"; 2291 | exit 1; 2292 | } 2293 | exit 0; 2294 | } 2295 | 2296 | my @args_to_parse; 2297 | 2298 | # Preprocess args to look for bracketed groups. These need to be collapsed into 2299 | # single args, which must happen before we start assigning arguments to 2300 | # commands. 2301 | while (@ARGV) { 2302 | my $x = shift @ARGV; 2303 | last if $x eq '--'; 2304 | if ($x =~ s/\[$//) { 2305 | die "unknown bracket prefix: $x" unless exists $bracket_handlers{$x}; 2306 | my @xs; 2307 | my $depth = 1; 2308 | while (@ARGV) { 2309 | my $next = shift @ARGV; 2310 | $depth-- if $next eq ']'; 2311 | last unless $depth; 2312 | push @xs, $next; 2313 | $depth++ if $next =~ /\[$/; 2314 | } 2315 | unshift @ARGV, $bracket_handlers{$x}->(@xs); 2316 | } elsif ($x =~ s/^\^//) { 2317 | # Lift the command (as a short option) into a quoted nfu instance. 2318 | $x = shell_quote $x; 2319 | unshift @ARGV, ''.qx|$0 --quote -$x|; 2320 | } elsif ($x =~ s/\{$//) { 2321 | # Parse a branching map, which has the form { 'pattern' stuff... 
, 2322 | # 'pattern' stuff... , ... }. We generate a quoted TSV of packed base-64 2323 | # values. 2324 | die "unknown brace prefix: $x" if length $x; 2325 | my @lines; 2326 | my $key; 2327 | my @value; 2328 | my $depth = 1; 2329 | while (@ARGV) { 2330 | my $next = shift @ARGV; 2331 | $depth-- if $next eq '}'; 2332 | 2333 | if (!$depth || defined $key && $next eq ',') { 2334 | push @lines, row pack('u', $key), 2335 | pack('u', shell_quote @value); 2336 | @value = (); 2337 | $key = undef; 2338 | } elsif (defined $key) { 2339 | push @value, $next; 2340 | } else { 2341 | $key = $next; 2342 | } 2343 | 2344 | last unless $depth; 2345 | $depth++ if $next =~ /\{$/; 2346 | } 2347 | 2348 | unshift @ARGV, join "\n", @lines; 2349 | } elsif ($x eq '%') { 2350 | # Everything else is a variable binding of the form 'x=y'. Then go back 2351 | # through and rewrite all %x to be y. 2352 | my %bindings = map split(/=/, $_, 2), @ARGV; 2353 | my $names = join '|', keys %bindings; 2354 | @ARGV = (); 2355 | s#%($names)#$bindings{$1} // "%$1"#ge for @args_to_parse; 2356 | } else { 2357 | push @args_to_parse, $x; 2358 | } 2359 | } 2360 | 2361 | sub explode { 2362 | return $_[0] unless $_[0] =~ s/^-([^-])/$1/; 2363 | map {$explosions{$_} // $_} grep length, split /([-+.\d]*),?/, $_[0]; 2364 | } 2365 | 2366 | my %custom_env; 2367 | my @parsed; 2368 | my $quote_self = 0; 2369 | my $explain = 0; 2370 | 2371 | while (@args_to_parse) { 2372 | unshift @args_to_parse, explode shift @args_to_parse; 2373 | (my $command = shift @args_to_parse) =~ s/^--//; 2374 | 2375 | if (defined(my $arity = $arity{$command})) { 2376 | my @args; 2377 | push @args, shift @args_to_parse 2378 | while @args_to_parse && (--$arity >= 0 2379 | || ! -e $args_to_parse[0] 2380 | && $args_to_parse[0] =~ /^[-+]?\d+/); 2381 | push @parsed, [$command, @args]; 2382 | } elsif ($command =~ /(\w+)=(.*)/) { 2383 | $ENV{$1} = $custom_env{$1} = $2; 2384 | } elsif ($command eq 'run') { 2385 | my $x = shift @args_to_parse; 2386 | push @evaled_code, '--run', $x; 2387 | eval $x; 2388 | die "failed to run '$x': $@" if $@; 2389 | } elsif ($command eq 'use') { 2390 | my $x = shift @args_to_parse; 2391 | my $s = read_file $x; # you can --use pseudofiles, woot! 2392 | push @evaled_code, '--use', $x; 2393 | eval $s; 2394 | die "failed to use '$x': $@" if $@; 2395 | } elsif ($command eq 'explain') { 2396 | $explain = 1; 2397 | } elsif ($command eq 'verbose' || $command eq 'v') { 2398 | print STDERR "\033[2J" unless $quote_self; 2399 | $verbose = 1; 2400 | } elsif ($command eq 'child') { 2401 | $is_child = 1; 2402 | } elsif ($command eq 'quote') { 2403 | $quote_self = 1; 2404 | } else { 2405 | if ($quote_self) { 2406 | # Defer pseudofile resolution. This matters for things like intermediate 2407 | # Hadoop outputs. 2408 | push @ARGV, $command; 2409 | } else { 2410 | my $f = expand_filename_shorthands $command; 2411 | die "nonexistent pseudofile: $command" unless defined $f; 2412 | push @ARGV, $f; 2413 | } 2414 | } 2415 | } 2416 | 2417 | if ($quote_self) { 2418 | # Quote all other arguments so a shell will parse them correctly. 2419 | print quote_self($verbose ? ('--verbose') : (), 2420 | map("$_=$custom_env{$_}", keys %custom_env), 2421 | @evaled_code, 2422 | map(("--$$_[0]", @$_[1..$#$_]), @parsed), 2423 | @ARGV), "\n"; 2424 | exit 0; 2425 | } 2426 | 2427 | # Open output in an interactive previewer if... 
2428 | push @parsed, ['preview'] if !$ENV{NFU_NO_PAGER} # we can page 2429 | && (!-t STDIN || @ARGV) # not interacting for input 2430 | && -t STDOUT; # interacting for output 2431 | 2432 | if ($explain) { 2433 | # Explain what we would have done with the given command line. 2434 | printf "file\t%s\n", $_ =~ s/#.*\n//gr for @ARGV; 2435 | printf "--%s\t%s\n", ${$_}[0], join "\t", @{$_}[1 .. $#$_] for @parsed; 2436 | } elsif (@parsed) { 2437 | my $reader = undef; 2438 | 2439 | # Note: the loop below uses pipe/fork/dup2 instead of a more idiomatic Open2 2440 | # call. I don't have a good reason for this other than to figure out how the 2441 | # low-level stuff worked. 2442 | for (my $i = 0; $i < @parsed; ++$i) { 2443 | my ($command, @args) = @{$parsed[$i]}; 2444 | 2445 | # Here's where things get fun. The question right now is, "do we need to 2446 | # fork, or can we run in-process?" -- i.e. are we in the middle, or at the 2447 | # end? When we're in the middle, we want to redirect STDOUT to the pipe's 2448 | # writer and fork; otherwise we run in-process and write directly to the 2449 | # existing STDOUT. 2450 | ++$verbose_row; 2451 | if ($i < @parsed - 1) { 2452 | # We're in the middle, so allocate a pipe and fork. 2453 | pipe my($new_reader), my($writer); 2454 | $verbose_command = $command; 2455 | @verbose_args = @args; 2456 | unless (fork) { 2457 | # We're the child, so do STDOUT redirection. 2458 | close $new_reader or die "failed to close pipe reader: $!"; 2459 | POSIX::close(0) or die "failed to close stdin" if defined $reader; 2460 | dup2(fileno($reader), 0) or die "failed to dup input: $!" 2461 | if defined $reader; 2462 | POSIX::close(1); 2463 | dup2(fileno($writer), 1) or die "failed to dup stdout: $!"; 2464 | 2465 | close $reader or die "failed to close reader: $!" if defined $reader; 2466 | close $writer or die "failed to close writer: $!"; 2467 | 2468 | # The function here may never return. 2469 | $functions{$command}->(@args); 2470 | exit; 2471 | } else { 2472 | close $writer or die "failed to close pipe writer: $!"; 2473 | close $reader if defined $reader; 2474 | $reader = $new_reader; 2475 | } 2476 | } else { 2477 | # We've hit the end of the chain. Preserve stdout, redirect stdin from 2478 | # current reader. 2479 | POSIX::close(0) or die "failed to close stdin" if defined $reader; 2480 | dup2(fileno($reader), 0) or die "failed to dup input: $!" 2481 | if defined $reader; 2482 | close $reader or die "failed to close reader: $!" if defined $reader; 2483 | $verbose_command = $command; 2484 | @verbose_args = @args; 2485 | $functions{$command}->(@args); 2486 | } 2487 | 2488 | # Prevent <> from reading files after the first iteration (this is such a 2489 | # hack). 2490 | @ARGV = (); 2491 | } 2492 | } else { 2493 | # Behave like cat, which is useful for auto-decompressing things. 2494 | be_verbose_as_appropriate(length), print while <>; 2495 | } 2496 | -------------------------------------------------------------------------------- /sql.md: -------------------------------------------------------------------------------- 1 | # SQL databases 2 | nfu knows how to talk to PostgreSQL and SQLite 3 using their command-line 3 | interfaces. There are two ways to do this: 4 | 5 | - `--sql` (`-Q`) command: populate a table and optionally issue a query 6 | - `sql:` pseudofile: use a database query as TSV data 7 | 8 | **WARNING:** `--sql` always first drops and recreates the table before 9 | importing data. 
The `sql:` pseudofile does not delete or modify anything 10 | (unless you specifically write that into a query). 11 | 12 | You indicate postgres vs sqlite using a `P` or `S` prefix on the database name; 13 | otherwise the commands behave identically between databases. For example: 14 | 15 | ```sh 16 | # import data into an indexed sqlite table 17 | $ nfu /usr/share/dict/words \ 18 | -m 'row %0, length %0' \ 19 | -Q S@:+wordlengths _ _ 20 | 21 | # length of all words starting with 'a' 22 | $ nfu sql:S@:"%*wordlengths %w f0 LIKE 'a%'" 23 | 24 | # nfu explains what's going on here: 25 | $ nfu --expand-sql "%*wordlengths %w f0 LIKE 'a%'" 26 | select * from wordlengths where f0 LIKE 'a%' 27 | ``` 28 | 29 | ## How this works 30 | The first command generates pairs of `word, length`, which the sqlite3 command 31 | batch-imports into an indexed table called `wordlengths`. Here's how nfu makes 32 | this happen. 33 | 34 | `--sql` takes three arguments: `{P|S}dbname:tablename`, `schema`, and `query`, 35 | with the following special cases: 36 | 37 | - if `dbname` is `@`, a "default" database is used. For postgres this is your 38 | username, and for sqlite3 this is `/tmp/nfu-$USER-sqlite.db`. 39 | - if `tablename` begins with `+`, a first-column index is created after the 40 | data is inserted. If `tablename` is omitted altogether, nfu defaults to `+t` 41 | -- a table called `t` with a first-field index. 42 | - if `schema` is `_`, nfu looks at the first 20 lines of data and infers one, 43 | generating column names `f0`, `f1`, ..., `fN-1`. 44 | - if `query` is `_`, nfu prints a `sql:` pseudofile that will read the whole 45 | table. Otherwise the query is executed and the results printed as TSV. 46 | 47 | Queries are subject to SQL shorthand expansion; you can use `--expand-sql` to 48 | see the result. (All shorthands begin with `%`.) 49 | --------------------------------------------------------------------------------