├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── example.py ├── mincemeat.py ├── setup.py └── smoke.sh /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.pyc 3 | mincemeatpy-*.tar.gz 4 | output -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "2.6" 4 | - "2.7" 5 | script: ./smoke.sh 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2010 Michael Fairley 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | mincemeat.py: MapReduce on Python 2 | ================================= 3 | 4 | Introduction 5 | ------------ 6 | mincemeat.py is a Python implementation of the [MapReduce](http://en.wikipedia.org/wiki/Mapreduce) distributed computing framework. 7 | 8 | mincemeat.py is: 9 | 10 | * Lightweight - All of the code is contained in a single Python file (currently weighing in at <13kB) that depends only on the Python Standard Library. Any computer with Python and mincemeat.py can be a part of your cluster. 11 | * Fault tolerant - Workers (clients) can join and leave the cluster at any time without affecting the entire process. 12 | * Secure - mincemeat.py authenticates both ends of every connection, ensuring that only authorized code is executed. 13 | * Open source - mincemeat.py is distributed under the [MIT License](http://en.wikipedia.org/wiki/Mit_license), and consequently is free for all use, including commercial, personal, and academic, and can be modified and redistributed without restriction. 
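The "Secure" point above refers to the challenge-response handshake in `mincemeat.py`: each side sends a random hex challenge, and the peer answers with an HMAC-SHA1 of that challenge keyed by the shared password (`send_challenge`, `respond_to_challenge`, and `verify_auth` in `Protocol`). A minimal standalone sketch of that exchange follows. The helper names `make_challenge`, `answer_challenge`, and `verify_answer` are hypothetical and not part of mincemeat.py, and the constant-time comparison is a hardening choice of this sketch (the library compares with `==`).

```python
import binascii
import hashlib
import hmac
import os


def make_challenge():
    # Hypothetical helper: 20 random bytes, hex-encoded,
    # mirroring what Protocol.send_challenge generates.
    return binascii.hexlify(os.urandom(20))


def answer_challenge(password, challenge):
    # Hypothetical helper: HMAC-SHA1 of the challenge, keyed by the
    # shared password, returned as a hex string.
    return hmac.new(password, challenge, hashlib.sha1).hexdigest()


def verify_answer(password, challenge, answer):
    # Recompute the expected answer and compare it to what the peer sent.
    # compare_digest is an assumption of this sketch, not the library's code.
    expected = answer_challenge(password, challenge)
    return hmac.compare_digest(expected, answer)


password = b"changeme"
challenge = make_challenge()
answer = answer_challenge(password, challenge)
print(verify_answer(password, challenge, answer))   # True
print(verify_answer(b"wrong", challenge, answer))   # False
```

Because both the server and the client issue a challenge, each end proves knowledge of the password without sending the password itself over the connection.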
14 | 15 | 16 | Download 17 | -------- 18 | 19 | * Just [mincemeat.py](https://raw.github.com/michaelfairley/mincemeatpy/v0.1.4/mincemeat.py) (v 0.1.4) 20 | * The [full 0.1.4 release](https://github.com/michaelfairley/mincemeatpy/zipball/v0.1.4) (includes documentation and examples) 21 | * Clone this git repository: `git clone https://github.com/michaelfairley/mincemeatpy.git` 22 | 23 | Example 24 | ------- 25 | 26 | Let's look at the canonical MapReduce example, word counting: 27 | 28 | example.py: 29 | 30 | ```python 31 | #!/usr/bin/env python 32 | import mincemeat 33 | 34 | data = ["Humpty Dumpty sat on a wall", 35 | "Humpty Dumpty had a great fall", 36 | "All the King's horses and all the King's men", 37 | "Couldn't put Humpty together again", 38 | ] 39 | # The data source can be any dictionary-like object 40 | datasource = dict(enumerate(data)) 41 | 42 | def mapfn(k, v): 43 | for w in v.split(): 44 | yield w, 1 45 | 46 | def reducefn(k, vs): 47 | result = sum(vs) 48 | return result 49 | 50 | s = mincemeat.Server() 51 | s.datasource = datasource 52 | s.mapfn = mapfn 53 | s.reducefn = reducefn 54 | 55 | results = s.run_server(password="changeme") 56 | print results 57 | ``` 58 | 59 | Execute this script on the server: 60 | 61 | ```bash 62 | python example.py 63 | ``` 64 | 65 | Run mincemeat.py as a worker on a client: 66 | 67 | ```bash 68 | python mincemeat.py -p changeme [server address] 69 | ``` 70 | And the server will print out: 71 | 72 | ```python 73 | {'a': 2, 'on': 1, 'great': 1, 'Humpty': 3, 'again': 1, 'wall': 1, 'Dumpty': 2, 'men': 1, 'had': 1, 'all': 1, 'together': 1, "King's": 2, 'horses': 1, 'All': 1, "Couldn't": 1, 'fall': 1, 'and': 1, 'the': 2, 'put': 1, 'sat': 1} 74 | ``` 75 | 76 | This example was overly simplistic, but changing the datasource to be a collection of large files and running the client on multiple machines will work just as well. 
In fact, mincemeat.py has been used to produce word frequency lists for many gigabytes of text using a slightly modified version of this code. 77 | 78 | Clients 79 | ------- 80 | 81 | You can run the client manually from within other Python scripts (rather than running mincemeat.py directly): 82 | 83 | ```python 84 | import mincemeat 85 | 86 | client = mincemeat.Client() 87 | client.password = "changeme" 88 | client.conn("localhost", mincemeat.DEFAULT_PORT) 89 | ``` 90 | 91 | [Shepherd.py](https://github.com/jpmec/shepherdpy) provides more sophisticated ways to run clients, including having clients that poll or are forked on the same machine. 92 | 93 | Imports 94 | ------- 95 | 96 | One potential gotcha when using mincemeat.py: Your `mapfn` and `reducefn` functions don't have access to their enclosing environment, including imported modules. If you need to use an imported module in one of these functions, be sure to include `import whatever` in the functions themselves. 97 | 98 | 99 | Python 3 support 100 | ------- 101 | [ziyuang](https://github.com/ziyuang/mincemeatpy) has a fork of mincemeat.py that's compatible with Python 3: [ziyuang/mincemeatpy](https://github.com/ziyuang/mincemeatpy) 102 | -------------------------------------------------------------------------------- /example.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import mincemeat 3 | 4 | data = ["Humpty Dumpty sat on a wall", 5 | "Humpty Dumpty had a great fall", 6 | "All the King's horses and all the King's men", 7 | "Couldn't put Humpty together again", 8 | ] 9 | # The data source can be any dictionary-like object 10 | datasource = dict(enumerate(data)) 11 | 12 | def mapfn(k, v): 13 | for w in v.split(): 14 | yield w, 1 15 | 16 | def reducefn(k, vs): 17 | result = sum(vs) 18 | return result 19 | 20 | s = mincemeat.Server() 21 | s.datasource = datasource 22 | s.mapfn = mapfn 23 | s.reducefn = reducefn 24 | 25 | results = 
s.run_server(password="changeme") 26 | print results 27 | -------------------------------------------------------------------------------- /mincemeat.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | 4 | ################################################################################ 5 | # Copyright (c) 2010 Michael Fairley 6 | # 7 | # Permission is hereby granted, free of charge, to any person obtaining a copy 8 | # of this software and associated documentation files (the "Software"), to deal 9 | # in the Software without restriction, including without limitation the rights 10 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 11 | # copies of the Software, and to permit persons to whom the Software is 12 | # furnished to do so, subject to the following conditions: 13 | # 14 | # The above copyright notice and this permission notice shall be included in 15 | # all copies or substantial portions of the Software. 16 | # 17 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 18 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 19 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 20 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 21 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 22 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 23 | # THE SOFTWARE. 
24 | ################################################################################ 25 | 26 | import os 27 | import sys 28 | import hmac 29 | import types 30 | import random 31 | import socket 32 | import hashlib 33 | import marshal 34 | import logging 35 | import argparse 36 | import asynchat 37 | import asyncore 38 | import cPickle as pickle 39 | 40 | 41 | VERSION = "0.1.4" 42 | 43 | 44 | DEFAULT_PORT = 11235 45 | 46 | 47 | class Protocol(asynchat.async_chat): 48 | def __init__(self, conn=None, map=None): 49 | if conn: 50 | asynchat.async_chat.__init__(self, conn, map=map) 51 | else: 52 | asynchat.async_chat.__init__(self, map=map) 53 | 54 | self.set_terminator("\n") 55 | self.buffer = [] 56 | self.auth = None 57 | self.mid_command = False 58 | 59 | def collect_incoming_data(self, data): 60 | self.buffer.append(data) 61 | 62 | def send_command(self, command, data=None): 63 | if not ":" in command: 64 | command += ":" 65 | if data: 66 | pdata = pickle.dumps(data) 67 | command += str(len(pdata)) 68 | logging.debug( "<- %s" % command) 69 | self.push(command + "\n" + pdata) 70 | else: 71 | logging.debug( "<- %s" % command) 72 | self.push(command + "\n") 73 | 74 | def found_terminator(self): 75 | if not self.auth == "Done": 76 | command, data = (''.join(self.buffer).split(":",1)) 77 | self.process_unauthed_command(command, data) 78 | elif not self.mid_command: 79 | logging.debug("-> %s" % ''.join(self.buffer)) 80 | command, length = (''.join(self.buffer)).split(":", 1) 81 | if command == "challenge": 82 | self.process_command(command, length) 83 | elif length: 84 | self.set_terminator(int(length)) 85 | self.mid_command = command 86 | else: 87 | self.process_command(command) 88 | else: # Read the data segment from the previous command 89 | if not self.auth == "Done": 90 | logging.fatal("Received pickled data from unauthed source") 91 | sys.exit(1) 92 | data = pickle.loads(''.join(self.buffer)) 93 | self.set_terminator("\n") 94 | command = self.mid_command 95 | 
self.mid_command = None 96 | self.process_command(command, data) 97 | self.buffer = [] 98 | 99 | def send_challenge(self): 100 | self.auth = os.urandom(20).encode("hex") 101 | self.send_command(":".join(["challenge", self.auth])) 102 | 103 | def respond_to_challenge(self, command, data): 104 | mac = hmac.new(self.password, data, hashlib.sha1) 105 | self.send_command(":".join(["auth", mac.digest().encode("hex")])) 106 | self.post_auth_init() 107 | 108 | def verify_auth(self, command, data): 109 | mac = hmac.new(self.password, self.auth, hashlib.sha1) 110 | if data == mac.digest().encode("hex"): 111 | self.auth = "Done" 112 | logging.info("Authenticated other end") 113 | else: 114 | self.handle_close() 115 | 116 | def process_command(self, command, data=None): 117 | commands = { 118 | 'challenge': self.respond_to_challenge, 119 | 'disconnect': lambda x, y: self.handle_close(), 120 | } 121 | 122 | if command in commands: 123 | commands[command](command, data) 124 | else: 125 | logging.critical("Unknown command received: %s" % (command,)) 126 | self.handle_close() 127 | 128 | def process_unauthed_command(self, command, data=None): 129 | commands = { 130 | 'challenge': self.respond_to_challenge, 131 | 'auth': self.verify_auth, 132 | 'disconnect': lambda x, y: self.handle_close(), 133 | } 134 | 135 | if command in commands: 136 | commands[command](command, data) 137 | else: 138 | logging.critical("Unknown unauthed command received: %s" % (command,)) 139 | self.handle_close() 140 | 141 | 142 | class Client(Protocol): 143 | def __init__(self): 144 | Protocol.__init__(self) 145 | self.mapfn = self.reducefn = self.collectfn = None 146 | 147 | def conn(self, server, port): 148 | self.create_socket(socket.AF_INET, socket.SOCK_STREAM) 149 | self.connect((server, port)) 150 | asyncore.loop() 151 | 152 | def handle_connect(self): 153 | pass 154 | 155 | def handle_close(self): 156 | self.close() 157 | 158 | def set_mapfn(self, command, mapfn): 159 | self.mapfn = 
types.FunctionType(marshal.loads(mapfn), globals(), 'mapfn') 160 | 161 | def set_collectfn(self, command, collectfn): 162 | self.collectfn = types.FunctionType(marshal.loads(collectfn), globals(), 'collectfn') 163 | 164 | def set_reducefn(self, command, reducefn): 165 | self.reducefn = types.FunctionType(marshal.loads(reducefn), globals(), 'reducefn') 166 | 167 | def call_mapfn(self, command, data): 168 | logging.info("Mapping %s" % str(data[0])) 169 | results = {} 170 | for k, v in self.mapfn(data[0], data[1]): 171 | if k not in results: 172 | results[k] = [] 173 | results[k].append(v) 174 | if self.collectfn: 175 | for k in results: 176 | results[k] = [self.collectfn(k, results[k])] 177 | self.send_command('mapdone', (data[0], results)) 178 | 179 | def call_reducefn(self, command, data): 180 | logging.info("Reducing %s" % str(data[0])) 181 | results = self.reducefn(data[0], data[1]) 182 | self.send_command('reducedone', (data[0], results)) 183 | 184 | def process_command(self, command, data=None): 185 | commands = { 186 | 'mapfn': self.set_mapfn, 187 | 'collectfn': self.set_collectfn, 188 | 'reducefn': self.set_reducefn, 189 | 'map': self.call_mapfn, 190 | 'reduce': self.call_reducefn, 191 | } 192 | 193 | if command in commands: 194 | commands[command](command, data) 195 | else: 196 | Protocol.process_command(self, command, data) 197 | 198 | def post_auth_init(self): 199 | if not self.auth: 200 | self.send_challenge() 201 | 202 | 203 | class Server(asyncore.dispatcher, object): 204 | def __init__(self): 205 | self.socket_map = {} 206 | asyncore.dispatcher.__init__(self, map=self.socket_map) 207 | self.mapfn = None 208 | self.reducefn = None 209 | self.collectfn = None 210 | self.datasource = None 211 | self.password = None 212 | 213 | def run_server(self, password="", port=DEFAULT_PORT): 214 | self.password = password 215 | self.create_socket(socket.AF_INET, socket.SOCK_STREAM) 216 | self.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) 217 | self.bind(("", 
port)) 218 | self.listen(1) 219 | try: 220 | asyncore.loop(map=self.socket_map) 221 | except: 222 | asyncore.close_all() 223 | raise 224 | 225 | return self.taskmanager.results 226 | 227 | def handle_accept(self): 228 | conn, addr = self.accept() 229 | sc = ServerChannel(conn, self.socket_map, self) 230 | sc.password = self.password 231 | 232 | def handle_close(self): 233 | self.close() 234 | 235 | def set_datasource(self, ds): 236 | self._datasource = ds 237 | self.taskmanager = TaskManager(self._datasource, self) 238 | 239 | def get_datasource(self): 240 | return self._datasource 241 | 242 | datasource = property(get_datasource, set_datasource) 243 | 244 | 245 | class ServerChannel(Protocol): 246 | def __init__(self, conn, map, server): 247 | Protocol.__init__(self, conn, map=map) 248 | self.server = server 249 | 250 | self.start_auth() 251 | 252 | def handle_close(self): 253 | logging.info("Client disconnected") 254 | self.close() 255 | 256 | def start_auth(self): 257 | self.send_challenge() 258 | 259 | def start_new_task(self): 260 | command, data = self.server.taskmanager.next_task(self) 261 | if command == None: 262 | return 263 | self.send_command(command, data) 264 | 265 | def map_done(self, command, data): 266 | self.server.taskmanager.map_done(data) 267 | self.start_new_task() 268 | 269 | def reduce_done(self, command, data): 270 | self.server.taskmanager.reduce_done(data) 271 | self.start_new_task() 272 | 273 | def process_command(self, command, data=None): 274 | commands = { 275 | 'mapdone': self.map_done, 276 | 'reducedone': self.reduce_done, 277 | } 278 | 279 | if command in commands: 280 | commands[command](command, data) 281 | else: 282 | Protocol.process_command(self, command, data) 283 | 284 | def post_auth_init(self): 285 | if self.server.mapfn: 286 | self.send_command('mapfn', marshal.dumps(self.server.mapfn.func_code)) 287 | if self.server.reducefn: 288 | self.send_command('reducefn', marshal.dumps(self.server.reducefn.func_code)) 289 | if 
self.server.collectfn: 290 | self.send_command('collectfn', marshal.dumps(self.server.collectfn.func_code)) 291 | self.start_new_task() 292 | 293 | class TaskManager: 294 | START = 0 295 | MAPPING = 1 296 | REDUCING = 2 297 | FINISHED = 3 298 | 299 | def __init__(self, datasource, server): 300 | self.datasource = datasource 301 | self.server = server 302 | self.state = TaskManager.START 303 | 304 | def next_task(self, channel): 305 | if self.state == TaskManager.START: 306 | self.map_iter = iter(self.datasource) 307 | self.working_maps = {} 308 | self.map_results = {} 309 | #self.waiting_for_maps = [] 310 | self.state = TaskManager.MAPPING 311 | if self.state == TaskManager.MAPPING: 312 | try: 313 | map_key = self.map_iter.next() 314 | map_item = map_key, self.datasource[map_key] 315 | self.working_maps[map_item[0]] = map_item[1] 316 | return ('map', map_item) 317 | except StopIteration: 318 | if len(self.working_maps) > 0: 319 | key = random.choice(self.working_maps.keys()) 320 | return ('map', (key, self.working_maps[key])) 321 | self.state = TaskManager.REDUCING 322 | self.reduce_iter = self.map_results.iteritems() 323 | self.working_reduces = {} 324 | self.results = {} 325 | if self.state == TaskManager.REDUCING: 326 | try: 327 | reduce_item = self.reduce_iter.next() 328 | self.working_reduces[reduce_item[0]] = reduce_item[1] 329 | return ('reduce', reduce_item) 330 | except StopIteration: 331 | if len(self.working_reduces) > 0: 332 | key = random.choice(self.working_reduces.keys()) 333 | return ('reduce', (key, self.working_reduces[key])) 334 | self.state = TaskManager.FINISHED 335 | if self.state == TaskManager.FINISHED: 336 | self.server.handle_close() 337 | return ('disconnect', None) 338 | 339 | def map_done(self, data): 340 | # Don't use the results if they've already been counted 341 | if not data[0] in self.working_maps: 342 | return 343 | 344 | for (key, values) in data[1].iteritems(): 345 | if key not in self.map_results: 346 | self.map_results[key] = 
[] 347 | self.map_results[key].extend(values) 348 | del self.working_maps[data[0]] 349 | 350 | def reduce_done(self, data): 351 | # Don't use the results if they've already been counted 352 | if not data[0] in self.working_reduces: 353 | return 354 | 355 | self.results[data[0]] = data[1] 356 | del self.working_reduces[data[0]] 357 | 358 | def run_client(): 359 | parser = argparse.ArgumentParser(usage="%(prog)s [options] server_name") 360 | parser.add_argument("-p", "--password", dest="password", default="", help="password") 361 | parser.add_argument("-P", "--port", dest="port", type=int, default=DEFAULT_PORT, help="port") 362 | parser.add_argument("-v", "--verbose", dest="verbose", action="store_true") 363 | parser.add_argument("-V", "--loud", dest="loud", action="store_true") 364 | parser.add_argument("--version", action="version", version="%(prog)s {0}".format(VERSION)) 365 | parser.add_argument("server_name", default="localhost", nargs="?", help="server name") 366 | 367 | options = parser.parse_args() 368 | 369 | if options.verbose: 370 | logging.basicConfig(level=logging.INFO) 371 | if options.loud: 372 | logging.basicConfig(level=logging.DEBUG) 373 | 374 | client = Client() 375 | client.password = options.password 376 | client.conn(options.server_name, options.port) 377 | 378 | 379 | if __name__ == '__main__': 380 | run_client() 381 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from distutils.core import setup 2 | setup( 3 | name='mincemeat', 4 | version='0.1.4', 5 | author='Michael Fairley', 6 | py_modules=['mincemeat'], 7 | scripts=['mincemeat.py'], 8 | install_requires=[], 9 | ) 10 | -------------------------------------------------------------------------------- /smoke.sh: -------------------------------------------------------------------------------- 1 | EXPECTED="{'a': 2, 'on': 1, 'great': 1, 'Humpty': 3, 'again': 1, 
'wall': 1, 'Dumpty': 2, 'men': 1, 'had': 1, 'all': 1, 'together': 1, \"King's\": 2, 'horses': 1, 'All': 1, \"Couldn't\": 1, 'fall': 1, 'and': 1, 'the': 2, 'put': 1, 'sat': 1}" 2 | 3 | python example.py > output & 4 | SERVER_PID=$! 5 | sleep 1 6 | python mincemeat.py -p changeme localhost 7 | sleep 1 8 | kill $! 2>/dev/null 9 | 10 | diff <(cat output) <(echo $EXPECTED) 11 | --------------------------------------------------------------------------------
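smoke.sh above compares the server's printed dictionary with `EXPECTED` as text, so the check passes only when the keys are printed in exactly the same order. As a hedged alternative, the two strings could be parsed back into dictionaries and compared by value. The sketch below assumes the output is a plain dict literal, as in the example; `same_results` is a hypothetical helper, not part of this repository.

```python
import ast


def same_results(printed, expected):
    # Hypothetical helper: parse both dict literals and compare them
    # as dictionaries, so key order in the printed text does not matter.
    return ast.literal_eval(printed.strip()) == ast.literal_eval(expected.strip())


expected = "{'a': 2, 'Humpty': 3, \"King's\": 2}"
printed = "{\"King's\": 2, 'Humpty': 3, 'a': 2}\n"
print(same_results(printed, expected))  # True
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for reading back printed results.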