├── LICENSE
├── README.textile
├── br
└── brutils
    ├── Makefile
    ├── README
    ├── brm.c
    ├── brp.c
    └── brutils.h

/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2015 Erik Frey

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

--------------------------------------------------------------------------------
/README.textile:
--------------------------------------------------------------------------------
h2. bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There's no installation, administration, or distributed filesystem. You'll need:

* "br":http://github.com/erikfrey/bashreduce/blob/master/br somewhere handy in your path
* vanilla unix tools: sort, awk, ssh, netcat, pv
* password-less ssh to each machine you plan to use

h2. Configuration

Edit @/etc/br.hosts@ and enter the machines you wish to use as workers. Or specify your machines at runtime:

br -m "host1 host2 host3"14 | 15 | To take advantage of multiple cores, repeat the host name. 16 | 17 | h2. Examples 18 | 19 | h3. sorting 20 | 21 |
h2. Examples

h3. sorting

<pre>br < input > output</pre>

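By default br partitions and sorts on the first column. If your key lives in another column, the @-c@ flag (1-based, per @br -h@) selects it; a hypothetical run keyed on the second column:

<pre>br -c 2 < input > output</pre>
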
br -r "uniq -c" < input > output26 | 27 | h3. great big join 28 | 29 |
h3. great big join

<pre>LC_ALL='C' br -r "join - /tmp/join_data" < input > output</pre>

(@LC_ALL='C'@ keeps @sort@ and @join@ agreeing on collation order; @/tmp/join_data@ needs to be present on each worker, since the reduce runs remotely.)

h2. Performance

h3. big honkin' local machine

Let's start with a simpler scenario: I have a machine with multiple cores, and with normal unix tools I'm relegated to using just one core. How does br help us here? Here's br on an 8-core machine, essentially operating as a poor man's multi-core sort:

|_. command |_. using |_. time |_. rate |
| sort -k1,1 -S2G 4gb_file > 4gb_file_sorted | coreutils | 30m32.078s | 2.24 MBps |
| br -i 4gb_file -o 4gb_file_sorted | coreutils | 11m3.111s | 6.18 MBps |
| br -i 4gb_file -o 4gb_file_sorted | brp/brm | 7m13.695s | 9.44 MBps |

The job completely saturates I/O, but it's still a reasonable gain!

h3. many cheap machines

Here lies the promise of mapreduce: rather than use my big honkin' machine, I have a bunch of cheaper machines lying around that I can distribute my work to. How does br behave when I add four cheaper 4-core machines into the mix?

|_. command |_. using |_. time |_. rate |
| sort -k1,1 -S2G 4gb_file > 4gb_file_sorted | coreutils | 30m32.078s | 2.24 MBps |
| br -i 4gb_file -o 4gb_file_sorted | coreutils | 8m30.652s | 8.02 MBps |
| br -i 4gb_file -o 4gb_file_sorted | brp/brm | 4m7.596s | 16.54 MBps |

We have a new bottleneck: we're limited by how quickly we can partition/pump our dataset out to the nodes. awk and sort begin to show their limitations (our clever awk script is a bit CPU-bound, and @sort -m@ can only merge so many files at once). So we use two little helper programs written in C (yes, I know! it's cheating! if you can think of a better partition/merge using core unix tools, contact me) to partition the data and merge it back.

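The helpers live in the @brutils@ directory of this repo. A minimal build-and-install sketch (the Makefile's default @BINDIR@ is @/usr/local/bin@, so the install step will typically need root):

<pre>cd brutils && make && sudo make install</pre>
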
br -r "cat > /tmp/myfile" < input62 | 63 | But this breaks if you specify the same host multiple times. Maybe some kind of very basic virtualization is in order. Maybe. 64 | 65 | Other niceties would be to more closely mimic the options presented in sort (numeric, reverse, etc). 66 | -------------------------------------------------------------------------------- /br: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # bashreduce: mapreduce in bash 3 | # erik@fawx.com 4 | 5 | usage() { 6 | local prog="`basename $1`" 7 | echo "Usage: $prog [-m host1 [host2...]] [-c column] [-r reduce] [-i input] [-o output]" 8 | echo " $prog -h for help." 9 | exit 2 10 | } 11 | 12 | showhelp() { 13 | echo "Usage: `basename $1`: [-m host1 [host2...]] [-c column] [-r reduce] [-i input] [-o output]" 14 | echo "bashreduce. Map an input file to many hosts, sort/reduce, merge" 15 | echo " -m: hosts to use, can repeat hosts for multiple cores" 16 | echo " default hosts from /etc/br.hosts" 17 | echo " -c: column to partition, default = 1 (1-based)" 18 | echo " -r: reduce function, default = identity" 19 | echo " -i: input file, default = stdin" 20 | echo " -o: output file, default = stdout" 21 | echo " -t: tmp dir to use, default = /tmp" 22 | echo " -S: memory to use for sort, default = 256M" 23 | echo " -h: this help message" 24 | exit 2 25 | } 26 | 27 | hosts= 28 | mapcolumn=1 29 | reduce= 30 | input= 31 | output= 32 | tmp_dir=/tmp 33 | sort_mem=256M 34 | 35 | while getopts "m:c:r:i:o:t:S:h" name; do 36 | case $name in 37 | m) hosts=$OPTARG;; 38 | c) mapcolumn=$OPTARG;; 39 | r) reduce=$OPTARG;; 40 | i) input=$OPTARG;; 41 | o) output=$OPTARG;; 42 | t) tmp_dir=$OPTARG;; 43 | S) sort_mem=$OPTARG;; 44 | h) showhelp $0;; 45 | [?]) usage $0;; 46 | esac 47 | done 48 | 49 | if [ -z "$hosts" ]; then 50 | if [ -e /etc/br.hosts ]; then 51 | hosts=`cat /etc/br.hosts` 52 | else 53 | echo "`basename $0`: must specify hosts with -m or provide /etc/br.hosts" 54 | usage $0 55 | fi 56 | fi 57 | 58 | # if we have a reduce, add the pipe explicitly 59 | [ -n "$reduce" ] && reduce="| $reduce 2>/dev/null" 60 | 61 | # okay let's get started! 
# okay let's get started!  first we need a name for our job
jobid="`uuidgen`"
jobpath="$tmp_dir/br_job_$jobid"
nodepath="$tmp_dir/br_node_$jobid"
mkdir -p $jobpath/{in,out}

# now, for each host, set up in and out fifos (and a netcat for each), and ssh to each host to set up workers listening on netcat

port_in=8192
port_out=$(($port_in + 1))
host_idx=0
out_files=

for host in $hosts; do
  # our named pipes
  mkfifo $jobpath/{in,out}/$host_idx
  # lets get the pid of our listener
  ssh -n $host "mkdir -p $nodepath"
  pid=$(ssh -n $host "nc -l -p $port_out >$nodepath/in_$host_idx 2>/dev/null </dev/null & jobs -l" | awk '{print $2}')
  # follow what the listener receives, run it through sort and the (optional) reduce,
  # then serve the result back on port_in
  ssh $host -n "tail -s0.1 -f --pid=$pid $nodepath/in_$host_idx 2>/dev/null </dev/null | sort -S$sort_mem -T$tmp_dir -k$mapcolumn,$mapcolumn 2>/dev/null $reduce | nc -q0 -l -p $port_in >&/dev/null &"
  # our local forwarders
  nc $host $port_in >$jobpath/in/$host_idx &
  nc -q0 $host $port_out <$jobpath/out/$host_idx &
  # our vars
  out_files="$out_files $jobpath/out/$host_idx"
  port_in=$(($port_in + 2))
  port_out=$(($port_in + 1))
  host_idx=$(($host_idx + 1))
done

# okay, time to map
if which brp >/dev/null; then
  eval "${input:+pv $input |} brp - $(($mapcolumn - 1)) $out_files"
else
  # use awk if we don't have brp
  # we're taking advantage of a special property that awk leaves its file handles open until it's done
  # i think this is universal
  # we're also sending a zero length string to all the handles at the end, in case some pipe got no love
  eval "${input:+pv $input |} awk '{
      srand(\$$mapcolumn);
      print \$0 >>\"$jobpath/out/\"int(rand() * $host_idx);
    }
    END {
      for (i = 0; i != $host_idx; ++i)
        printf \"\" >>\"$jobpath/out/\"i;
    }'"
fi

# save it somewhere
if which brm >/dev/null; then
  eval "brm - $(($mapcolumn - 1)) `find $jobpath/in/ -type p | xargs` ${output:+| pv >$output}"
else
  # use sort -m if we don't have brm
  # sort -m creates tmp files if too many input files are specified
  # brm doesn't do this
  eval "sort -k$mapcolumn,$mapcolumn -m $jobpath/in/* ${output:+| pv >$output}"
fi

# finally, clean up after ourselves
rm -rf $jobpath
for host in $hosts; do
  ssh $host "rm -rf $nodepath"
done

# TODO: is there a safe way to kill subprocesses upon fail?
# this seems to work: /bin/kill -- -$$

--------------------------------------------------------------------------------
/brutils/Makefile:
--------------------------------------------------------------------------------
CFLAGS = -O3 -Wall
OBJS_BRP = brp.o
OBJS_BRM = brm.o
HEADERS = brutils.h
LIBS =
TARGET_BRP = brp
TARGET_BRM = brm
BINDIR=/usr/local/bin

all: $(TARGET_BRP) $(TARGET_BRM)

$(TARGET_BRP): $(OBJS_BRP) $(HEADERS)
	$(CC) -o $(TARGET_BRP) $(OBJS_BRP) $(LIBS)

$(TARGET_BRM): $(OBJS_BRM) $(HEADERS)
	$(CC) -o $(TARGET_BRM) $(OBJS_BRM) $(LIBS)

clean:
	rm -f $(OBJS_BRP) $(OBJS_BRM) $(TARGET_BRP) $(TARGET_BRM)

install: all
	install -c brp $(BINDIR)
	install -c brm $(BINDIR)

--------------------------------------------------------------------------------
/brutils/README:
--------------------------------------------------------------------------------
Too bad that partitioning using awk is fairly CPU-bound. Here's a little C cheat. If someone can think of a way to partition text that's much faster than the awk script in br, email me: erik@fawx.com.
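For reference, br calls these helpers roughly like this (a sketch lifted from the
br script, not a spec; the part_* file names are made up, and the column argument
is 0-based here, unlike br's -c flag):

    brp - 0 part_0 part_1 part_2   # read stdin, partition on column 0 into the named files
    brm - 0 part_0 part_1 part_2   # merge the already-sorted named inputs on column 0 to stdout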

--------------------------------------------------------------------------------
/brutils/brm.c:
--------------------------------------------------------------------------------
#include