├── LICENSE
├── README.textile
├── br
└── brutils
    ├── Makefile
    ├── README
    ├── brm.c
    ├── brp.c
    └── brutils.h


/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2015 Erik Frey

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

--------------------------------------------------------------------------------
/README.textile:
--------------------------------------------------------------------------------
h2. bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There's no installation, administration, or distributed filesystem. You'll need:

* "br":http://github.com/erikfrey/bashreduce/blob/master/br somewhere handy in your path
* vanilla unix tools: sort, awk, ssh, netcat, pv
* password-less ssh to each machine you plan to use

h2. Configuration

Edit @/etc/br.hosts@ and enter the machines you wish to use as workers. Or specify your machines at runtime:

br -m "host1 host2 host3"

To take advantage of multiple cores, repeat the host name, e.g. @-m "host1 host1 host2"@ to run two workers on host1.

h2. Examples

h3. sorting

br < input > output
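
For intuition, the three stages behind this command (partition by key, sort each partition, merge) can be imitated on a single machine. This is only a rough sketch under stated assumptions: @local_br_sort@ is a made-up name, the key-length bucket choice is a toy stand-in for br's real partitioner, and no ssh or netcat is involved.

```shell
# Hypothetical local imitation of a br sort job; not part of bashreduce.
# Usage: local_br_sort [nparts] < input > output
local_br_sort() {
  nparts=${1:-2}
  work=$(mktemp -d) || return 1

  # Partition: pick a bucket from column 1 so equal keys share a bucket.
  # Key length modulo nparts is a toy stand-in for a real hash.
  awk -v n="$nparts" -v dir="$work" '
    { print >> (dir "/part_" (length($1) % n)) }
    END {
      # touch every bucket so later stages see a file even if it is empty
      for (i = 0; i < n; ++i)
        printf "" >> (dir "/part_" i)
    }'

  # "Workers": sort each bucket on column 1.
  for f in "$work"/part_*; do
    LC_ALL=C sort -k1,1 "$f" >"$f.sorted"
  done

  # Merge the already-sorted buckets back into one stream.
  LC_ALL=C sort -m -k1,1 "$work"/part_*.sorted
  rm -rf "$work"
}
```

Because every bucket is sorted before the merge, @sort -m@ only has to interleave the streams, which is the same job the final merge step does for the real workers.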

h3. word count

br -r "uniq -c" < input > output
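
@uniq -c@ only counts adjacent duplicates, so this relies on each worker's partition being sorted before the reduce step runs, and on partitioning by column 1 sending every copy of a word to the same worker. The input therefore needs one word per line. A minimal sketch of that preprocessing, where @one_word_per_line@ and @count_words_locally@ are illustrative helpers and not something br ships:

```shell
# Hypothetical helper: split free text into one word per line.
one_word_per_line() {
  tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Local stand-in for a single worker: sort, then count adjacent duplicates.
count_words_locally() {
  one_word_per_line | LC_ALL=C sort | uniq -c
}

# With br itself the pipeline would look like this (needs configured hosts):
#   one_word_per_line < text | br -r "uniq -c" > output
```

Running @count_words_locally@ on a small file first is a cheap way to check the reducer before involving any remote machines.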

h3. great big join

LC_ALL='C' br -r "join - /tmp/join_data" < input > output
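
@join@ expects both inputs sorted on the join field under the same collation, which is presumably why the example pins @LC_ALL='C'@. Since the reduce command runs on each worker, @/tmp/join_data@ would also have to exist there, already sorted. A small local sketch of what one worker would compute, assuming a hypothetical helper @join_sorted@ and made-up sample data:

```shell
# Hypothetical local stand-in for one worker's sort + join step.
# Usage: join_sorted side_file < input
join_sorted() {
  # side_file must already be sorted on column 1 in the C locale
  LC_ALL=C sort -k1,1 | LC_ALL=C join - "$1"
}

# Made-up sample data for illustration only.
side=$(mktemp) || exit 1
printf 'alice admin\nbob guest\n' >"$side"
joined=$(printf 'bob 7\ncarol 3\nalice 5\n' | join_sorted "$side")
printf '%s\n' "$joined"
rm -f "$side"
```

Rows whose key is missing from the side file (here @carol@) are dropped, which is @join@'s default inner-join behaviour.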

h2. Performance

h3. big honkin' local machine

Let's start with a simpler scenario: I have a machine with multiple cores and with normal unix tools I'm relegated to using just one core. How does br help us here? Here's br on an 8-core machine, essentially operating as a poor man's multi-core sort:

|_. command |_. using |_. time |_. rate |
| sort -k1,1 -S2G 4gb_file > 4gb_file_sorted | coreutils | 30m32.078s | 2.24 MBps |
| br -i 4gb_file -o 4gb_file_sorted | coreutils | 11m3.111s | 6.18 MBps |
| br -i 4gb_file -o 4gb_file_sorted | brp/brm | 7m13.695s | 9.44 MBps |

The job completely i/o saturates, but still a reasonable gain!

h3. many cheap machines

Here lies the promise of mapreduce: rather than use my big honkin' machine, I have a bunch of cheaper machines lying around that I can distribute my work to. How does br behave when I add four cheaper 4-core machines into the mix?

|_. command |_. using |_. time |_. rate |
| sort -k1,1 -S2G 4gb_file > 4gb_file_sorted | coreutils | 30m32.078s | 2.24 MBps |
| br -i 4gb_file -o 4gb_file_sorted | coreutils | 8m30.652s | 8.02 MBps |
| br -i 4gb_file -o 4gb_file_sorted | brp/brm | 4m7.596s | 16.54 MBps |

We have a new bottleneck: we're limited by how quickly we can partition/pump our dataset out to the nodes. awk and sort begin to show their limitations (our clever awk script is a bit cpu bound, and @sort -m@ can only merge so many files at once). So we use two little helper programs written in C (yes, I know! it's cheating! if you can think of a better partition/merge using core unix tools, contact me) to partition the data and merge it back.

h3. Future work

I've tested this on ubuntu/debian, but not on other distros. According to Daniel Einspanjer, netcat has different parameters on Redhat.

br has a poor man's dfs like so:

br -r "cat > /tmp/myfile" < input

But this breaks if you specify the same host multiple times. Maybe some kind of very basic virtualization is in order. Maybe.

Other niceties would be to more closely mimic the options presented in sort (numeric, reverse, etc).

--------------------------------------------------------------------------------
/br:
--------------------------------------------------------------------------------
#!/bin/bash
# bashreduce: mapreduce in bash
# erik@fawx.com

usage() {
  local prog="`basename $1`"
  echo "Usage: $prog [-m host1 [host2...]] [-c column] [-r reduce] [-i input] [-o output]"
  echo "       $prog -h for help."
  exit 2
}

showhelp() {
  echo "Usage: `basename $1`: [-m host1 [host2...]] [-c column] [-r reduce] [-i input] [-o output]"
  echo "bashreduce.  Map an input file to many hosts, sort/reduce, merge"
  echo "  -m: hosts to use, can repeat hosts for multiple cores"
  echo "      default hosts from /etc/br.hosts"
  echo "  -c: column to partition, default = 1 (1-based)"
  echo "  -r: reduce function, default = identity"
  echo "  -i: input file, default = stdin"
  echo "  -o: output file, default = stdout"
  echo "  -t: tmp dir to use, default = /tmp"
  echo "  -S: memory to use for sort, default = 256M"
  echo "  -h: this help message"
  exit 2
}

hosts=
mapcolumn=1
reduce=
input=
output=
tmp_dir=/tmp
sort_mem=256M

while getopts "m:c:r:i:o:t:S:h" name; do
  case $name in
    m)   hosts=$OPTARG;;
    c)   mapcolumn=$OPTARG;;
    r)   reduce=$OPTARG;;
    i)   input=$OPTARG;;
    o)   output=$OPTARG;;
    t)   tmp_dir=$OPTARG;;
    S)   sort_mem=$OPTARG;;
    h)   showhelp $0;;
    [?]) usage $0;;
  esac
done

if [ -z "$hosts" ]; then
  if [ -e /etc/br.hosts ]; then
    hosts=`cat /etc/br.hosts`
  else
    echo "`basename $0`: must specify hosts with -m or provide /etc/br.hosts"
    usage $0
  fi
fi

# if we have a reduce, add the pipe explicitly
[ -n "$reduce" ] && reduce="| $reduce 2>/dev/null"

# okay let's get started! first we need a name for our job
jobid="`uuidgen`"
jobpath="$tmp_dir/br_job_$jobid"
nodepath="$tmp_dir/br_node_$jobid"
mkdir -p $jobpath/{in,out}

# now, for each host, set up in and out fifos (and a netcat for each), and ssh to each host to set up workers listening on netcat

port_in=8192
port_out=$(($port_in + 1))
host_idx=0
out_files=

for host in $hosts; do
  # our named pipes
  mkfifo $jobpath/{in,out}/$host_idx
  # let's get the pid of our listener
  ssh -n $host "mkdir -p $nodepath"
  # NOTE: parts of the next two commands were lost in extraction (the stdin
  # redirections and the remote sort stage). They are reconstructed here from
  # the otherwise unused $pid, $sort_mem and $tmp_dir; treat them as a best guess.
  pid=$(ssh -n $host "nc -l -p $port_out >$nodepath/in_$host_idx 2>/dev/null </dev/null & jobs -l" | awk '{print $2}')
  ssh $host -n "tail -s0.1 -f --pid=$pid $nodepath/in_$host_idx 2>/dev/null </dev/null | LC_ALL='$LC_ALL' sort -S$sort_mem -T$tmp_dir -k$mapcolumn,$mapcolumn 2>/dev/null $reduce | nc -q0 -l -p $port_in >&/dev/null &"
  # our local forwarders
  nc $host $port_in >$jobpath/in/$host_idx &
  nc -q0 $host $port_out <$jobpath/out/$host_idx &
  # our vars
  out_files="$out_files $jobpath/out/$host_idx"
  port_in=$(($port_in + 2))
  port_out=$(($port_in + 1))
  host_idx=$(($host_idx + 1))
done

# okay, time to map
if which brp >/dev/null; then
  eval "${input:+pv $input |} brp - $(($mapcolumn - 1)) $out_files"
else
  # use awk if we don't have brp
  # we're taking advantage of a special property that awk leaves its file handles open until it's done
  # i think this is universal
  # we're also sending a zero length string to all the handles at the end, in case some pipe got no love
  eval "${input:+pv $input |} awk '{
      srand(\$$mapcolumn);
      print \$0 >>\"$jobpath/out/\"int(rand() * $host_idx);
    }
    END {
      for (i = 0; i != $host_idx; ++i)
        printf \"\" >>\"$jobpath/out/\"i;
    }'"
fi

# save it somewhere
if which brm >/dev/null; then
  eval "brm - $(($mapcolumn - 1)) `find $jobpath/in/ -type p | xargs` ${output:+| pv >$output}"
else
  # use sort -m if we don't have brm
  # sort -m creates tmp files if too many input files are specified
  # brm doesn't do this
  eval "sort -k$mapcolumn,$mapcolumn -m $jobpath/in/* ${output:+| pv >$output}"
fi

# finally, clean up after ourselves
rm -rf $jobpath
for host in $hosts; do
  ssh $host "rm -rf $nodepath"
done

# TODO: is there a safe way to kill subprocesses upon fail?
# this seems to work: /bin/kill -- -$$

--------------------------------------------------------------------------------
/brutils/Makefile:
--------------------------------------------------------------------------------
CFLAGS = -O3 -Wall
OBJS_BRP = brp.o
OBJS_BRM = brm.o
HEADERS = brutils.h
LIBS =
TARGET_BRP = brp
TARGET_BRM = brm
BINDIR=/usr/local/bin

all: $(TARGET_BRP) $(TARGET_BRM)

$(TARGET_BRP): $(OBJS_BRP) $(HEADERS)
	$(CC) -o $(TARGET_BRP) $(OBJS_BRP) $(LIBS)

$(TARGET_BRM): $(OBJS_BRM) $(HEADERS)
	$(CC) -o $(TARGET_BRM) $(OBJS_BRM) $(LIBS)

clean:
	rm -f $(OBJS_BRP) $(OBJS_BRM) $(TARGET_BRP) $(TARGET_BRM)

install: all
	install -c brp $(BINDIR)
	install -c brm $(BINDIR)


--------------------------------------------------------------------------------
/brutils/README:
--------------------------------------------------------------------------------
Too bad that partitioning using awk is fairly cpu bound. Here's a little c cheat. If someone can think of a way to partition text that's much faster than the awk script in br, email me: erik@fawx.com.

--------------------------------------------------------------------------------
/brutils/brm.c:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <ctype.h>

#include "brutils.h"

void showusage() {
  fprintf(stderr, "usage: brm output column-index input1 input2 input3...\n");
  fprintf(stderr, "       can specify '-' for output, to write to stdout\n");
  exit(1);
}

int main(int argc, char * argv[])
{
  FILE * pout = stdout;
  int i, col_index;

  if (argc < 4)
    showusage();
  if (strcmp(argv[1], "-") != 0)
    pout = try_open(argv[1], "wb");
  col_index = atoi(argv[2]);

  // one line_t per input; the array is kept sorted with the smallest key at the back
  int lines_len = argc - 3;
  line_t ** lines = (line_t **) malloc( lines_len * sizeof(line_t *) );
  line_t ** lines_end = lines;
  for (i = 0; i != lines_len; ++i) {
    *lines_end = (line_t *) malloc( sizeof(line_t) );
    (*lines_end)->pin = try_open(argv[i + 3], "rb");
    if (read_parse(col_index, *lines_end)) {
      ++lines_end;
      lower_bound_move(lines, lines_end);
    }
    else {
      // input had no usable line, drop it right away
      fclose((*lines_end)->pin);
      free(*lines_end);
    }
  }

  // okay, merge!
  line_t * back;
  while (lines != lines_end) {
    // write to out
    back = *(lines_end - 1);
    *back->col_end = back->col_end_val; // restore the char read_parse overwrote
    fputs(back->buf, pout);
    if (read_parse(col_index, back))
      lower_bound_move(lines, lines_end);
    else {
      // this input is exhausted
      fclose(back->pin);
      free(back);
      --lines_end;
    }
  }

  free(lines);
  if (pout != stdout)
    fclose(pout);

  return 0;
}

--------------------------------------------------------------------------------
/brutils/brp.c:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <ctype.h>

#include "brutils.h"

void showusage() {
  fprintf(stderr, "usage: brp input column-index output1 output2 output3...\n");
  fprintf(stderr, "       can specify '-' for input, to read from stdin\n");
  exit(1);
}

int main(int argc, char * argv[])
{
  line_t line;
  int i, col_index;

  if (argc < 4)
    showusage();
  if (strcmp(argv[1], "-") != 0)
    line.pin = try_open(argv[1], "rb");
  else
    line.pin = stdin;
  col_index = atoi(argv[2]);

  int pouts_len = argc - 3;
  FILE ** pouts = (FILE **) malloc( pouts_len * sizeof(FILE *) );
  for (i = 0; i != pouts_len; ++i)
    pouts[i] = try_open(argv[i + 3], "wb");

  while (fgets(line.buf, sizeof(line.buf), line.pin)) {
    if ( find_col(col_index, &line) ) // if this string has the requisite number of columns
      fputs(line.buf, pouts[fnv_hash(line.col_beg, line.col_end) % pouts_len]); // write it to the correct file
  }

  if (line.pin != stdin)
    fclose(line.pin);

  for (i = 0; i != pouts_len; ++i)
    fclose(pouts[i]);
  free(pouts);

  return 0;
}


--------------------------------------------------------------------------------
/brutils/brutils.h:
--------------------------------------------------------------------------------
#ifndef __BR_UTILS_H__
#define __BR_UTILS_H__

#include <stdio.h>
#include <string.h>

// open a file or exit with a message; callers never see NULL
FILE * try_open(const char * path, const char * flags) {
  FILE * p = fopen(path, flags);
  if (!p) {
    fprintf(stderr, "could not open %s: %s\n", path, strerror(errno));
    exit(1);
  }
  return p;
}

// multiply-then-xor hash over the bytes in [p, end), using the 32-bit FNV constants
unsigned int fnv_hash(const char *p, const char *end) {
  unsigned int h = 2166136261UL;
  for (; p != end; ++p)
    h = (h * 16777619) ^ *p;
  return h;
}

typedef struct
{
  char buf[8192];
  char * col_beg;
  char * col_end;
  char col_end_val;
  FILE * pin;
} line_t;

// locate the 0-based column col in line->buf; returns 0 if the line has too few columns
int find_col(int col, line_t * line) {
  for (line->col_beg = line->buf; col != 0 && *line->col_beg != 0; ++line->col_beg) {
    if ( isspace((unsigned char) *line->col_beg) )
      --col;
  }
  if (*line->col_beg == 0)
    return 0;
  // also stop at the terminating NUL, in case the last line has no trailing newline
  for (line->col_end = line->col_beg;
       *line->col_end != 0 && !isspace((unsigned char) *line->col_end);
       ++line->col_end) {}
  return 1;
}

// read the next line that has the column, and NUL-terminate the key in place
int read_parse(int col, line_t * line) {
  while (fgets(line->buf, sizeof(line->buf), line->pin)) {
    if (find_col(col, line)) {
      line->col_end_val = *line->col_end;
      *line->col_end = 0;
      return 1;
    }
  }
  return 0;
}

// move end - 1 to the proper position in beg..end
void lower_bound_move(line_t ** beg, line_t ** end)
{
  if (beg == end)
    return;

  int len = end - beg - 1;
  int half;
  line_t ** mid;

  // [ * * * * x ]
  // we need to move x to its correct position in the otherwise sorted array
  while (len > 0) {
    half = len >> 1;
    mid = beg + half;
    if ( strcoll( (*mid)->col_beg, (*(end - 1))->col_beg) > 0 ) {
      beg = mid + 1;
      len = len - half - 1;
    }
    else
      len = half;
  }

  // if beg < end - 1, we need to move beg up
  if (beg < end - 1) {
    line_t * tmp = *(end - 1);
    memmove(beg + 1, beg, (end - beg - 1) * sizeof(line_t *));
    *beg = tmp;
  }
}

#endif // __BR_UTILS_H__

--------------------------------------------------------------------------------