├── .gitignore ├── LICENSE ├── README.md ├── bench └── benchMaster.py ├── bin ├── simhash-master └── simhash-slave ├── example-config.yaml ├── setup.py ├── smhcluster ├── __init__.py ├── adapters │ ├── __init__.py │ ├── http.py │ └── zrpc.py ├── master.py ├── slave.py └── util.py └── test ├── testMaster.py └── testRangeMap.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.o 2 | **.dSYM 3 | driver 4 | build/* 5 | *.cpp 6 | *.so 7 | *.pyc -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013-2014 SEOmoz, Inc. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining 4 | a copy of this software and associated documentation files (the 5 | "Software"), to deal in the Software without restriction, including 6 | without limitation the rights to use, copy, modify, merge, publish, 7 | distribute, sublicense, and/or sell copies of the Software, and to 8 | permit persons to whom the Software is furnished to do so, subject to 9 | the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be 12 | included in all copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 15 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 17 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 18 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 19 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 20 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Simhash Cluster 2 | =============== 3 | ![Status: Deprecated](https://img.shields.io/badge/status-deprecated-red.svg?style=flat) 4 | ![Team: Big Data](https://img.shields.io/badge/team-big_data-green.svg?style=flat) 5 | ![Scope: External](https://img.shields.io/badge/scope-external-green.svg?style=flat) 6 | ![Open Source: MIT](https://img.shields.io/badge/open_source-MIT-green.svg?style=flat) 7 | ![Critical: No](https://img.shields.io/badge/critical-no-red.svg?style=flat) 8 | 9 | __This is obviously unfinished work and we also have no intention of finishing 10 | it. Instead, we've elected to use a real database backing a simhash corpus 11 | through [simhash-db-py](https://github.com/seomoz/simhash-db-py).__ 12 | 13 | Simhash takes an input vector of integers, and produces a single integer output 14 | that's representative of that vector in the sense that _similar_ vectors yield 15 | _similar_ hashes -- their resultant hashes are expected to differ by only a few 16 | bits. With this in mind, simhash is often used in conjunction with a rolling 17 | hash function on text to generate the input vector, and thus yield a hash that 18 | corresponds to that block of text. In this way, you can quickly identify all the 19 | documents that would be considered near-duplicates. 20 | 21 | You can even construct tables to perform these queries very quickly indeed. 22 | Sadly, it can consume a fair amount of RAM, especially when you insert several 23 | hundred million or several billion hashes into the corpus of known hashes. 
And
24 | so, a distributed form is necessary. This is that distributed form.
25 | 
26 | Architecture
27 | ============
28 | There's one master node, with which slave nodes register; each slave is then
29 | assigned shards to serve, and all queries against those shards are directed
30 | to it. The master and slaves communicate with zerorpc.
31 | 
32 | Adapters
33 | ========
34 | Adapters are the mechanism by which the cluster is accessed; `simhash-cluster`
35 | comes with two by default (one HTTP, and one zerorpc). All queries are directed
36 | at the master node.
37 | 
38 | Storage
39 | =======
40 | There's an assumption that you'd like to persist your corpus of known hashes as
41 | it might have developed over time. Like adapters, storage backends are pluggable
42 | and simply must support a few methods like `save` and `load`.
43 | 
44 | Starting
45 | ========
46 | The master node requires a yaml configuration file (an example file is included)
47 | that describes the adapters and storage to use, as well as the simhash
48 | configuration. With the configuration in place:
49 | 
50 |     simhash-master --config example-config.yaml
51 | 
52 | This starts the master daemon (and adapters) running, with the master listening
53 | on port 1234. Slaves should then be started (on any node), pointed at the
54 | master:
55 | 
56 |     simhash-slave <master-hostname>:1234
57 | 
58 | Querying
59 | ========
60 | Once the master node is running, you can begin querying. Assuming the master
61 | daemon is running on `localhost`:
62 | 
63 |     # Using the http interface
64 |     import requests
65 |     import simplejson as json
66 |     # Add a bunch of hashes
67 |     requests.put('http://localhost:8080/hashes', json.dumps(range(10000)))
68 |     # Find the first similar hash
69 |     requests.get('http://localhost:8080/first/12345').content
70 |     # Find all similar hashes
71 |     requests.get('http://localhost:8080/all/12345').content
72 |     # Remove a particular hash
73 |     requests.delete('http://localhost:8080/hashes/12345')
74 | 
75 | And now using the `zerorpc` interface:
76 | 
77 |     import zerorpc
78 |     c = zerorpc.Client('tcp://localhost:5678')
79 |     # Insert hashes
80 |     c.insert(*range(10000))
81 |     # And find first and all
82 |     c.find_first(*range(10000))
83 |     c.find_all(*range(10000))
84 |     # And remove all of them if you'd like
85 |     c.remove(*range(10000))
--------------------------------------------------------------------------------
/bench/benchMaster.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | 
3 | import os
4 | import sys
5 | base, name = os.path.split(os.path.abspath(__file__))
6 | sys.path = [os.path.split(base)[0]] + sys.path
7 | 
8 | from smhcluster import master
9 | 
10 | m = master.Master()
11 | for i in range(4):
12 |     slave = master.Slave('slave-%i' % i, None)
13 |     m.accept(slave)
14 | 
15 | class timer(object):
16 |     def __init__(self, name):
17 |         self.name = name
18 | 
19 |     def __enter__(self):
20 |         self.start = -time.time()
21 |         print 'Starting %s...' % self.name
22 | 
23 |     def __exit__(self, t, v, tb):
24 |         self.start += time.time()
25 |         print '    %s: %fs' % (self.name, self.start)
26 | 
27 | import time
28 | import random
29 | 
30 | with timer('hashes and queries'):
31 |     hashes  = [random.randint(0, 1 << 63) for i in range(100000)]
32 |     queries = [random.randint(0, 1 << 63) for i in range(100000)]
33 | 
34 | with timer('insert %i' % len(hashes)):
35 |     for h in hashes:
36 |         m.insert(h)
37 | 
38 | with timer('find_first %i' % len(queries)):
39 |     for q in queries:
40 |         m.find_first(q)
41 | 
42 | with timer('find_all %i' % len(queries)):
43 |     for q in queries:
44 |         m.find_all(q)
45 | 
46 | with timer('removes %i' % len(hashes)):
47 |     for h in hashes:
48 |         m.remove(h)
49 | 
--------------------------------------------------------------------------------
/bin/simhash-master:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | 
3 | import argparse
4 | 
5 | parser = argparse.ArgumentParser(description='Run a near-duplicates master')
6 | parser.add_argument('--config', dest='config', type=str,
7 |     help='Path to configuration')
8 | 
9 | args = parser.parse_args()
10 | 
11 | from smhcluster import master
12 | from smhcluster.util import klass
13 | 
14 | import yaml
15 | 
16 | with open(args.config) as f:
17 |     args.config = yaml.safe_load(f.read())
18 | 
19 | # We'll create a cluster, start it, and then check our configuration for the
20 | # various adapters we're going to use.
21 | m = master.Master()
22 | 
23 | import gevent
24 | # Now let's set up each of our adapters
25 | adapters = []
26 | for k, conf in args.config['adapters'].items():
27 |     # Now add an adapter to our cluster
28 |     adapter = klass(k)(m)
29 |     adapter.config(conf)
30 |     adapters.append(adapter)
31 |     gevent.spawn(adapter.listen)
32 | 
33 | m.config(args.config)
34 | # This just gets the cluster listening for slave servers
35 | m.listen()
--------------------------------------------------------------------------------
/bin/simhash-slave:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | 
3 | import argparse
4 | 
5 | parser = argparse.ArgumentParser(description='Run a near-duplicates slave')
6 | parser.add_argument('master', type=str,
7 |     help='The master to connect to')
8 | parser.add_argument('--port', dest='port', type=int, default=4242,
9 |     help='Explicitly specify a port to run on')
10 | 
11 | args = parser.parse_args()
12 | 
13 | import gevent
14 | import zerorpc
15 | from socket import gethostname
16 | from smhcluster.slave import Slave
17 | 
18 | hostname = gethostname() + ':' + str(args.port)
19 | slave = Slave(hostname)
20 | 
21 | # Tell the slave it should register with the master...
22 | gevent.spawn(slave.register, args.master)
23 | 
24 | # And start serving...
25 | s = zerorpc.Server(slave)
26 | s.bind('tcp://0.0.0.0:%i' % args.port)
27 | try:
28 |     s.run()
29 | except KeyboardInterrupt:
30 |     slave.deregister(args.master)
--------------------------------------------------------------------------------
/example-config.yaml:
--------------------------------------------------------------------------------
1 | # Top-level configuration of the tables themselves. Into how many blocks should
2 | # we divide the 64-bit integer? And by how many bits may two hashes differ and
3 | # still be considered near-duplicates?
4 | blocks   : 6
5 | diff_bits: 3
6 | 
7 | # How should we serve the API?
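8 | # Each key below names an adapter class by dotted path; bin/simhash-master
9 | # loads it with smhcluster.util.klass, instantiates it with the master, and
10 | # hands it the nested mapping via its config() method.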
11 | adapters:
12 |     smhcluster.adapters.http.Server:
13 |         port: 8080
14 |     smhcluster.adapters.zrpc.Server:
15 |         port: 5678
16 | 
17 | # And how should we do permanent storage?
18 | storage:
19 |     smhcluster.storage.disk.Disk:
20 |         path: /some/path/to/somewhere/
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | 
3 | from setuptools import setup
4 | 
5 | setup(name = 'smhcluster',
6 |     version = '0.1.0',
7 |     description = 'Cluster for Near-Duplicate Detection with Simhash',
8 |     url = 'http://github.com/seomoz/simhash-cluster',
9 |     author = 'Dan Lecocq',
10 |     author_email = 'dan@seomoz.org',
11 |     packages = ['smhcluster', 'smhcluster.adapters'],
12 |     package_dir = {
13 |         'smhcluster': 'smhcluster',
14 |         'smhcluster.adapters': 'smhcluster/adapters'
15 |     },
16 |     scripts = [
17 |         'bin/simhash-master',
18 |         'bin/simhash-slave'
19 |     ],
20 |     install_requires = [
21 |         'simhash',  # For obvious reasons
22 |         'boto',     # For persistence to S3
23 |         'bottle',   # For the HTTP adapter
24 |         'gevent',   # For non-blocking goodness
25 |         'requests', # For making real http requests
26 |         'zerorpc'   # For RPC with gevent, zeromq
27 |     ],
28 |     classifiers = [
29 |         'Programming Language :: Python',
30 |         'Intended Audience :: Developers',
31 |         'Operating System :: OS Independent',
32 |         'Topic :: Internet :: WWW/HTTP'
33 |     ],
34 | )
--------------------------------------------------------------------------------
/smhcluster/__init__.py:
--------------------------------------------------------------------------------
1 | # Need an __init__, so I thought I'd put the logger here
2 | 
3 | import logging
4 | logger = logging.getLogger('cluster')
5 | handler = logging.StreamHandler()
6 | handler.setFormatter(
7 |     logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
8 | handler.setLevel(logging.DEBUG)
9 | logger.addHandler(handler)
10 | logger.setLevel(logging.INFO)
--------------------------------------------------------------------------------
/smhcluster/adapters/__init__.py:
--------------------------------------------------------------------------------
1 | # All adapters must implement the Server interface. This determines how clients
2 | # will access the master server (via HTTP, etc.). Each adapter should also
3 | # provide at least a reference client implementation.
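4 | #
5 | # For illustration, bin/simhash-master wires an adapter up roughly like so:
6 | #
7 | #     adapter = klass('smhcluster.adapters.http.Server')(master)
8 | #     adapter.config({'port': 8080})
9 | #     gevent.spawn(adapter.listen)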
10 | 
11 | class Server(object):
12 |     # Accepts a cluster, which contains all the python objects needed to make
13 |     # queries
14 |     def __init__(self, cluster):
15 |         self.cluster = cluster
16 | 
17 |     # Idempotently accept new configurations, raising exceptions when
18 |     # misconfigured
19 |     def config(self, config):
20 |         pass
21 | 
22 |     # Serve until we say stop
23 |     def listen(self):
24 |         pass
25 | 
26 |     # Stop serving
27 |     def stop(self):
28 |         pass
29 | 
30 | class Client(object):
31 |     # Accepts a host to which to speak
32 |     def __init__(self, host):
33 |         self.host = host
34 | 
35 |     # Check for /any/ near-duplicate documents
36 |     def find_first(self, query):
37 |         pass
38 | 
39 |     # Check for /all/ near-duplicates
40 |     def find_all(self, query):
41 |         pass
42 | 
43 |     # Bulk form of find_first
44 |     def find_first_bulk(self, queries):
45 |         pass
46 | 
47 |     # Bulk form of find_all
48 |     def find_all_bulk(self, queries):
49 |         pass
50 | 
51 |     # Insert a hash
52 |     def insert(self, h):
53 |         pass
54 | 
55 |     # Bulk form of insert
56 |     def insert_bulk(self, hashes):
57 |         pass
58 | 
59 |     # Remove a hash
60 |     def remove(self, h):
61 |         pass
62 | 
63 |     # Bulk form of remove
64 |     def remove_bulk(self, hashes):
65 |         pass
--------------------------------------------------------------------------------
/smhcluster/adapters/http.py:
--------------------------------------------------------------------------------
1 | # Provides a JSON interface to access the simhash cluster
2 | 
3 | # We need bottle for the server, and requests for the client
4 | import gevent.monkey
5 | gevent.monkey.patch_all()
6 | import bottle
7 | import requests
8 | from bottle import run, request, abort, Bottle
9 | 
10 | try:
11 |     import simplejson as json
12 | except ImportError:
13 |     import json
14 | 
15 | from . import Server as _Server
16 | from . import Client as _Client
17 | 
18 | class Server(_Server):
19 |     # Accepts a cluster, which contains all the python objects needed to make
20 |     # queries
21 |     def __init__(self, cluster):
22 |         self.cluster = cluster
23 |         self.root = Bottle()
24 | 
25 |     # Idempotently accept new configurations, raising exceptions when
26 |     # misconfigured
27 |     def config(self, config):
28 |         for key in config.keys():
29 |             if key not in ('port',):
30 |                 raise KeyError('Unknown configuration option %s' % key)
31 | 
32 |         self.port = config.get('port', 8080)
33 | 
34 |     def first(self, query=None):
35 |         if query:
36 |             return json.dumps(self.cluster.find_first(int(query)))
37 |         return json.dumps(
38 |             self.cluster.find_first(*json.load(request.body)))
39 | 
40 |     def all(self, query=None):
41 |         if query:
42 |             return json.dumps(self.cluster.find_all(int(query)))
43 |         return json.dumps(
44 |             self.cluster.find_all(*json.load(request.body)))
45 | 
46 |     def insert(self, h=None):
47 |         if h:
48 |             return json.dumps(self.cluster.insert(int(h)))
49 |         return json.dumps(
50 |             self.cluster.insert(*json.load(request.body)))
51 | 
52 |     def remove(self, h=None):
53 |         if h:
54 |             return json.dumps(self.cluster.remove(int(h)))
55 |         return json.dumps(
56 |             self.cluster.remove(*json.load(request.body)))
57 | 
58 |     # Serve forever
59 |     def listen(self):
60 |         # We're doing this a little oddly, because the server is an instance, so
61 |         # we have to wait until we have an object so that we can attach the
62 |         # route to a method bound to this instance.
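63 |         # The resulting URL scheme: GET /first/<query> and GET /all/<query>
64 |         # for single lookups; POST /first and POST /all with a JSON array body
65 |         # for bulk lookups; and PUT/DELETE on /hashes/<h> for single inserts
66 |         # and removals, with /hashes for their bulk forms.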
67 |         self.root.get(   '/first/<query>')(self.first)  # Single
68 |         self.root.post(  '/first'        )(self.first)  # Bulk
69 |         self.root.get(   '/all/<query>'  )(self.all)    # Single
70 |         self.root.post(  '/all'          )(self.all)    # Bulk
71 |         self.root.put(   '/hashes/<h>'   )(self.insert) # Single
72 |         self.root.put(   '/hashes'       )(self.insert) # Bulk
73 |         self.root.delete('/hashes/<h>'   )(self.remove) # Single
74 |         self.root.delete('/hashes'       )(self.remove) # Bulk
75 | 
76 |         # And run it!
77 |         print 'HTTP listening...'
78 |         run(self.root, host='localhost', port=self.port, server='gevent')
79 | 
80 |     # Stop
81 |     def stop(self):
82 |         pass
83 | 
84 | class Client(_Client):
85 |     # Accepts a host to which to speak
86 |     def __init__(self, host):
87 |         self.host = host
88 | 
89 |     # Check for /any/ near-duplicate documents
90 |     def find_first(self, query):
91 |         return json.loads(requests.get(self.host + '/first/' + str(query)).content)
92 | 
93 |     # Check for /all/ near-duplicates
94 |     def find_all(self, query):
95 |         return json.loads(requests.get(self.host + '/all/' + str(query)).content)
96 | 
97 |     # Bulk form of find_first
98 |     def find_first_bulk(self, queries):
99 |         r = requests.post(self.host + '/first', data=json.dumps(queries))
100 |         return json.loads(r.content)
101 | 
102 |     # Bulk form of find_all
103 |     def find_all_bulk(self, queries):
104 |         r = requests.post(self.host + '/all', data=json.dumps(queries))
105 |         return json.loads(r.content)
106 | 
107 |     # Insert a hash
108 |     def insert(self, h):
109 |         return json.loads(requests.put(self.host + '/hashes/' + str(h)).content)
110 | 
111 |     # Bulk form of insert
112 |     def insert_bulk(self, hashes):
113 |         r = requests.put(self.host + '/hashes', data=json.dumps(hashes))
114 |         return json.loads(r.content)
115 | 
116 |     # Remove a hash
117 |     def remove(self, h):
118 |         return json.loads(requests.delete(self.host + '/hashes/' + str(h)).content)
119 | 
120 |     # Bulk form of remove
121 |     def remove_bulk(self, hashes):
122 |         r = requests.delete(self.host + '/hashes', data=json.dumps(hashes))
123 |         return json.loads(r.content)
124 | 
--------------------------------------------------------------------------------
/smhcluster/adapters/zrpc.py:
--------------------------------------------------------------------------------
1 | # Provides a zerorpc interface to the cluster
2 | 
3 | import zerorpc
4 | 
5 | from . import Server as _Server
6 | 
7 | class Server(_Server):
8 |     # Accepts a cluster, which contains all the python objects needed to make
9 |     # queries
10 |     def __init__(self, cluster):
11 |         self.cluster = cluster
12 | 
13 |     # Idempotently accept new configurations, raising exceptions when
14 |     # misconfigured
15 |     def config(self, config):
16 |         for key in config.keys():
17 |             if key not in ('port',):
18 |                 raise KeyError('Unknown configuration option %s' % key)
19 | 
20 |         self.port = config.get('port', 1234)
21 |         if hasattr(self, 'server'):
22 |             self.server.stop()
23 | 
24 |         self.server = zerorpc.Server(self.cluster)
25 |         self.server.bind('tcp://0.0.0.0:%i' % self.port)
26 | 
27 |     # Serve forever
28 |     def listen(self):
29 |         self.server.run()
30 | 
31 |     # Stop
32 |     def stop(self):
33 |         self.server.stop()
34 |         del self.server
--------------------------------------------------------------------------------
/smhcluster/master.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | 
3 | from . import logger
4 | from .util import RangeMap
5 | 
6 | from collections import defaultdict
7 | 
8 | # This is the master node object. It talks to slave nodes to determine
9 | # their availability and health, and to answer queries.
10 | class Master(object):
11 |     # The number of shards
12 |     shards = 1024
13 |     max_node_shards = 256
14 |     differing_bits = 3
15 |     blocks = 6
16 | 
17 |     class RangeUnassigned(Exception):
18 |         def __init__(self, value):
19 |             Exception.__init__(self, value)
20 | 
21 |     def __init__(self):
22 |         self.rangemap = RangeMap()
23 |         for start, end in self.ranges():
24 |             self.rangemap.insert(start, end, None)
25 | 
26 |         # A mapping of hostnames to slave objects
27 |         self.slaves = {}
28 |         # Our current configuration
29 |         self._config = {}
30 | 
31 |     def ranges(self):
32 |         # Return a list of tuples (start, end) that we need
33 |         results = []
34 |         for i in range(self.shards):
35 |             start = i * (1 << 64) / self.shards
36 |             end = (i + 1) * (1 << 64) / self.shards
37 |             results.append((start, end-1))
38 |         return results
39 | 
40 |     def unassigned(self):
41 |         # Get a list of tuples (start, end) of ranges that are unassigned
42 |         return [(s, e) for s, e, i in self.rangemap if i == None]
43 | 
44 |     def register(self, hostname):
45 |         # Accept a new slave. First, determine how many shards we're going to
46 |         # give to each node once we add this new one
47 |         count = min(self.max_node_shards, self.shards / (len(self.slaves) + 1))
48 |         assign = self.unassigned()[0:count]
49 |         if (len(assign) < count):
50 |             # We need to actually steal shards from some of the existing slaves,
51 |             # so we should figure out where to take them from
52 |             slaves = defaultdict(list)
53 |             for s, e, i in self.rangemap:
54 |                 if i != None:
55 |                     slaves[i].append((s, e))
56 | 
57 |             # Now make a list of tuples based on this
58 |             slaves = [(len(v), s, v) for s, v in slaves.items()]
59 |             slaves.sort()
60 |             slaves.reverse()
61 |             for l, slave, shards in slaves:
62 |                 # Take up to l-count from this slave
63 |                 ct = min(l - count, count - len(assign))
64 |                 nw = shards[0:ct]
65 |                 assign.extend(nw)
66 |                 for start, end in nw:
67 |                     slave.unload(start, end)
68 |                     logger.info('Reassigning [%i, %i) from %s to %s' % (start, end, repr(slave), hostname))
69 | 
70 |         import zerorpc
71 |         logger.info('Assigning %i to %s' % (count, hostname))
72 |         slave = zerorpc.Client('tcp://%s' % hostname)
73 |         self.slaves[hostname] = slave
74 |         for start, end in assign:
75 |             slave.load(start, end)
76 |             self.rangemap.insert(start, end, slave)
77 | 
78 |         # Send it its updated configuration
79 |         slave.config(self._config)
80 | 
81 |     def deregister(self, hostname):
82 |         # When deregistering a node, we should redistribute all the keys
83 |         # associated with this particular host
84 |         if isinstance(hostname, basestring):
85 |             slave = self.slaves.pop(hostname)
86 |         else:
87 |             slave = hostname
88 |             for k, v in self.slaves.items():
89 |                 if slave == v:
90 |                     self.slaves.pop(k)
91 | 
92 |         assign = [(s, e) for s, e, i in self.rangemap if i == slave]
93 |         count = min(self.max_node_shards, self.shards / max(len(self.slaves), 1))
94 |         # Alright, assign these ranges to the remaining slaves. Keep filling up
95 |         # slaves until they're full
96 |         for slave in self.slaves.values():
97 |             ct = len([(s, e) for s, e, i in self.rangemap if i == slave])
98 |             for i in range(count - ct):
99 |                 if not assign:
100 |                     break
101 |                 s, e = assign.pop(0)
102 |                 slave.load(s, e)
103 |                 self.rangemap.insert(s, e, slave)
104 |                 logger.info('Reassigning [%i, %i) to %s' % (s, e, repr(slave)))
105 | 
106 |         # Unassign all the remaining shards
107 |         for s, e in assign:
108 |             self.rangemap.insert(s, e, None)
109 | 
110 |     def stats(self):
111 |         # Return the distribution of the shards
112 |         slaves = defaultdict(list)
113 |         for s, e, i in self.rangemap:
114 |             if i != None:
115 |                 slaves[i].append((s, e))
116 | 
117 |         return dict(((repr(s), len(shards)) for s, shards in slaves.items()))
118 | 
119 |     def config(self, config):
120 |         self._config = config
121 |         # Propagate the configuration to all the slaves
122 |         for slave in self.slaves.values():
123 |             slave.config(config)
124 | 
125 |     def listen(self):
126 |         # Listen for nodes trying to connect
127 |         import zerorpc
128 |         self.server = zerorpc.Server(self)
129 |         self.server.bind('tcp://0.0.0.0:1234')
130 |         self.server.run()
131 | 
132 |     def find(self, h):
133 |         slave = self.rangemap.find(h)
134 |         if not slave:
135 |             raise Master.RangeUnassigned('%i unavailable' % h)
136 |         return slave
137 | 
138 |     def find_first(self, *hashes):
139 |         destinations = defaultdict(list)
140 |         for h in hashes:
141 |             destinations[self.find(h)].append(h)
142 | 
143 |         results = []
144 |         for k, queries in destinations.items():
145 |             results.extend(zip(queries, k.find_first(*queries)))
146 |             logger.info('Finished querying %s' % k)
147 |         return results
148 | 
149 |     def find_all(self, *hashes):
150 |         destinations = defaultdict(list)
151 |         for h in hashes:
152 |             destinations[self.find(h)].append(h)
153 | 
154 |         results = []
155 |         for k, queries in destinations.items():
156 |             results.extend(zip(queries, k.find_all(*queries)))
157 |             logger.info('Finished querying %s' % k)
158 |         return results
159 | 
160 |     def insert(self, *hashes):
161 |         # Because we map each query onto a range, we have to make sure that each
162 |         # conceivable match for any query in that range is available. So for a
163 |         # query for 0110100101101101, we'd map it onto a range, and return the
164 |         # results from that range. However, it's more than likely that a number
165 |         # 1110100101101101 would not be mapped to the same range. So, it doesn't
166 |         # suffice to simply /insert/ based on ranges alone. We actually have to
167 |         # insert it into a number of ranges.
168 |         #
169 |         # In particular, if we are configured to use 3 differing bits, then we
170 |         # need to insert it into each of the ranges indicated by XORing the
171 |         # 3 MSBs of the hash with each of 0, 1, 2, 3, 4, 5, 6 and 7.
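172 |         #
173 |         # For example, with differing_bits = 3 the loop below computes
174 |         # q = h ^ (i << 61) for each i in 0..7, and h is filed under each of
175 |         # the eight ranges those q values fall in.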
176 |         #
177 |         # This is a map of destinations to the queries
178 |         destinations = defaultdict(list)
179 |         for h in hashes:
180 |             for i in range(1 << self.differing_bits):
181 |                 q = h ^ (i << (64 - self.differing_bits))
182 |                 logger.debug('Looking up %s (%i)' % (bin(q), q))
183 |                 destinations[self.find(q)].append((q, h))
184 | 
185 |         for k, insertions in destinations.items():
186 |             k.insert(*insertions)
187 |             logger.info('Finished inserting %s' % k)
188 |         return True
189 | 
190 |     def remove(self, *hashes):
191 |         # See the note in `insert`
192 |         destinations = defaultdict(list)
193 |         for h in hashes:
194 |             for i in range(1 << self.differing_bits):
195 |                 q = h ^ (i << (64 - self.differing_bits))
196 |                 logger.debug('Looking up %s (%i)' % (bin(q), q))
197 |                 destinations[self.find(q)].append((q, h))
198 | 
199 |         for k, removals in destinations.items():
200 |             k.remove(*removals)
201 |             logger.info('Finished removing %s' % k)
202 |         return True
--------------------------------------------------------------------------------
/smhcluster/slave.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | 
3 | from . import logger
4 | from .util import RangeMap, klass
5 | 
6 | class Slave(object):
7 |     def __init__(self, hostname):
8 |         self.hostname = hostname
9 |         self.rangemap = RangeMap()
10 |         self._config = {}
11 | 
12 |     # Send configuration to this node
13 |     def config(self, config):
14 |         logger.info('Received configuration %s' % (repr(config)))
15 |         self._config = config
16 | 
17 |     def load(self, start, end):
18 |         '''Load and start serving an interval'''
19 |         from simhash import Corpus
20 |         logger.info('%s loading range [%i, %i)' % (self.hostname, start, end))
21 |         self.rangemap.insert(start, end, Corpus(6, 3))
22 | 
23 |     def unload(self, start, end):
24 |         '''Stop serving the provided interval'''
25 |         logger.info('%s unloading range [%i, %i)' % (self.hostname, start, end))
26 |         self.rangemap.remove(start, end)
27 | 
28 |     def save(self, start, end):
29 |         '''Save the provided interval to permanent storage'''
30 |         for name, conf in self._config.get('emitters', {}).items():
31 |             emitter = klass(name)(conf)
32 |             logger.info('Loaded emitter %s' % name)
33 | 
34 |     def find(self, h):
35 |         '''Find the shard associated with the provided hash'''
36 |         corpus = self.rangemap.find(h)
37 |         if not corpus:
38 |             return None
39 |         return corpus
40 | 
41 |     def find_first(self, *hashes):
42 |         '''Find the first near-duplicate of the provided hashes'''
43 |         return [self.find(h).find_first(h) for h in hashes]
44 | 
45 |     def find_all(self, *hashes):
46 |         '''Find all near-duplicates of the provided hashes'''
47 |         return [self.find(h).find_all(h) for h in hashes]
48 | 
49 |     def insert(self, *insertions):
50 |         '''Insert h into the shard for q'''
51 |         for q, h in insertions:
52 |             self.find(q).insert(h)
53 | 
54 |     def remove(self, *removals):
55 |         '''Remove h from the shard for q'''
56 |         for q, h in removals:
57 |             self.find(q).remove(h)
58 | 
59 |     def register(self, host):
60 |         import zerorpc
61 |         c = zerorpc.Client('tcp://%s' % host)
62 |         logger.info('Registering...')
63 |         logger.info('Registered: %s' % repr(c.register(self.hostname)))
64 |         c.close()
65 | 
66 |     def deregister(self, host):
67 |         import zerorpc
68 |         c = zerorpc.Client('tcp://%s' % host)
69 |         c.deregister(self.hostname)
70 |         c.close()
--------------------------------------------------------------------------------
/smhcluster/util.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python 2 | 3 | # A few utility functions 4 | def klass(s): 5 | '''Given a string, return the class associated with it''' 6 | mod = __import__(s.rpartition('.')[0]) 7 | for m in s.split('.')[1:-1]: 8 | mod = getattr(mod, m) 9 | return getattr(mod, s.rpartition('.')[2]) 10 | 11 | import bisect 12 | 13 | # A map of a set of ranges to items. It assumes that no two ranges overlap, but 14 | # it does not enforce that constraint. This class isn't optimized for fast 15 | # insertions or deletions -- for the time being, it's standing in for an 16 | # interface, which may be improved upon in the future 17 | class RangeMap(object): 18 | class RangeMatchException(Exception): 19 | def __init__(self, val): 20 | Exception.__init__(self, val) 21 | 22 | def __init__(self): 23 | # These are the starts of each of the ranges this holds 24 | self.starts = [] 25 | # A map from the start range to a tuple (end, item), which corresponds 26 | # to the start of the range 27 | self.items = {} 28 | 29 | def __len__(self): 30 | return len(self.starts) 31 | 32 | def __iter__(self): 33 | results = [] 34 | for start in self.starts: 35 | end, item = self.items[start] 36 | results.append((start, end, item)) 37 | return iter(results) 38 | 39 | # Find the item responsible for this range, remove and return it 40 | def remove(self, start, end): 41 | if start in self.items: 42 | oend, item = self.items.pop(start) 43 | if oend == end: 44 | self.starts = [i for i in self.starts if i != start] 45 | return item 46 | self.items[start] = (oend, item) 47 | raise RangeMap.RangeMatchException('%i != %i' % (oend, end)) 48 | return None 49 | 50 | # Insert a new item that is responsible for the provided range 51 | def insert(self, start, end, item): 52 | if start not in self.items: 53 | self.starts.append(start) 54 | self.starts.sort() 55 | self.items[start] = (end, item) 56 | 57 | # Get the item responsible for the provided range. If no suitable item can 58 | # be found, then it returns None 59 | def find(self, index): 60 | # If we don't have any items, then we're SOL 61 | i = bisect.bisect(self.starts, index) 62 | if (i == 0) or (i > len(self.starts)): 63 | return None 64 | 65 | start = self.starts[i-1] 66 | # If the start we found is still greater than index, we don't have one 67 | if start > index: 68 | return None 69 | 70 | end, item = self.items[start] 71 | # If it's outside of its range, then we've got a problem 72 | if end < index: 73 | return None 74 | 75 | return item 76 | 77 | # Index is based off of the start range 78 | def __getitem__(self, index): 79 | return self.items[index] 80 | -------------------------------------------------------------------------------- /test/testMaster.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | 3 | import unittest 4 | 5 | import os 6 | import sys 7 | base, name = os.path.split(os.path.abspath(__file__)) 8 | sys.path = [os.path.split(base)[0]] + sys.path 9 | 10 | from smhcluster import master 11 | 12 | class TestMaster(unittest.TestCase): 13 | def setUp(self): 14 | self.master = master.Master() 15 | for i in range(4): 16 | slave = master.Slave('slave-%i' % i, None) 17 | self.master.accept(slave) 18 | 19 | def test_insert_remove(self): 20 | # We should be able to appropriately insert and remove hashes 21 | h = int('101010101010', 2) 22 | q = int('101010101011', 2) 23 | self.master.insert(h) 24 | self.assertEqual(self.master.find_first(q), h) 25 | self.master.remove(h) 26 | self.assertEqual(self.master.find_first(q), 0) 27 | 28 | def test_find_first(self): 29 | # We should be able to find first 30 | h = int('101010101010', 2) 31 | q = int('101010101011', 2) 32 | self.master.insert(h) 33 | self.assertEqual(self.master.find_first(q), h) 34 | 35 | def test_find_all(self): 36 | # We should be able to find all 37 | hashes = [ 38 | int('101010101010', 2), 39 | int('101010101011', 2), 40 | int('101010101111', 2) 41 | ] 42 | q = int('101010111010', 2) 43 | for h in hashes: 44 | self.master.insert(h) 45 | 46 | self.assertEqual(set(self.master.find_all(q)), set(hashes)) 47 | 48 | def test_find_multiple(self): 49 | # We should be able to find queries even if the similar item wouldn't 50 | # normally map to the same shard as the original 51 | hashes = [ 52 | int('101010101010', 2), 53 | int('101010101011', 2), 54 | int('101010101111', 2) 55 | ] 56 | q = int('101010101010', 2) 57 | 58 | for i in range(len(hashes)): 59 | hashes[i] = hashes[i] | (1 << 63) 60 | self.master.insert(hashes[i]) 61 | 62 | self.assertEqual(set(self.master.find_all(q)), set(hashes)) 63 | 64 | if __name__ == '__main__': 65 | unittest.main() -------------------------------------------------------------------------------- /test/testRangeMap.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python
2 | 
3 | import unittest
4 | 
5 | import os
6 | import sys
7 | base, name = os.path.split(os.path.abspath(__file__))
8 | sys.path = [os.path.split(base)[0]] + sys.path
9 | 
10 | from smhcluster.util import RangeMap
11 | 
12 | class TestRangeMap(unittest.TestCase):
13 |     def setUp(self):
14 |         self.rm = RangeMap()
15 | 
16 |     def test_insert_remove(self):
17 |         # We should be able to insert new ranges and find them, and remove them
18 |         self.assertEqual(self.rm.find(100), None)
19 |         self.rm.insert(0, 200, 'testing')
20 |         self.assertEqual(self.rm.find(100), 'testing')
21 |         self.rm.remove(0, 200)
22 |         self.assertEqual(self.rm.find(100), None)
23 | 
24 |     def test_too_low(self):
25 |         # If an index is too low for our ranges, then we should find None
26 |         self.rm.insert(100, 200, 'testing')
27 |         self.assertEqual(self.rm.find(99), None)
28 | 
29 |     def test_too_high(self):
30 |         # If it's not inside one of our ranges, we should find None
31 |         self.rm.insert(0, 100, 'testing')
32 |         self.assertEqual(self.rm.find(101), None)
33 | 
34 |     def test_boundary(self):
35 |         # When items are on the boundary, we should still find something
36 |         self.rm.insert(100, 200, 'testing')
37 |         self.assertEqual(self.rm.find(200), 'testing')
38 |         self.assertEqual(self.rm.find(100), 'testing')
39 | 
40 |     def test_multiple(self):
41 |         # When we have multiple ranges, we should be able to find the right one
42 |         self.rm.insert(100, 200, 'cheese')
43 |         self.rm.insert(300, 400, 'shop')
44 |         self.rm.insert(500, 600, 'sketch')
45 |         self.assertEqual(self.rm.find(150), 'cheese')
46 |         self.assertEqual(self.rm.find(350), 'shop')
47 |         self.assertEqual(self.rm.find(550), 'sketch')
48 | 
49 |     def test_remove_nonexistent(self):
50 |         # When we ask to get rid of a range where we have the start, but the
51 |         # ends don't match, we should get an exception
52 |         self.rm.insert(100, 200, 'testing')
53 |         self.assertRaises(Exception, self.rm.remove, 100, 300)
54 | 
55 | if __name__ == '__main__':
56 |     unittest.main()
--------------------------------------------------------------------------------