├── bruteforce.py
├── utils.py
├── LICENSE
├── main.py
├── dolphinn.py
└── README.md


/bruteforce.py:
--------------------------------------------------------------------------------
import sys
sys.path.append('/usr/local/lib/python3.4/dist-packages/')
import numpy as np

def bruteforce(P, Q):
    solQ=[]
    for q in Q:
        md=np.linalg.norm(P[0]-q)
        mi=0
        for i in range(len(P)):
            if np.linalg.norm(np.subtract(P[i],q))
[...]
    1000:
        s=int(len(P)/r)
    else:
        s=len(P)
    #randomly sample pointset
    J=np.random.choice(range(len(P)),s)
    m=np.zeros(D)
    for i in J:
        m=np.add(m,P[i])
    m=np.divide(m,s)
    return m
def isotropize(P, D, m):
    #find mean in order to isotropize
    for i in range(len(P)):
        P[i]=np.subtract(P[i],m)
    return P


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
BSD 2-Clause License

Copyright (c) 2017, Ioannis Psarros
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
#Ioannis Psarros
#
import time
import utils as fr
import numpy as np
import bruteforce as bf
from dolphinn import *
num_of_probes=20 ###########################
M=1 ##########################

#READ FILES
#D1: data dimension, P: dataset
#D2: query dimension, Q: queryset
(D1,P)=fr.fvecs_read("siftsmall/siftsmall_base.fvecs")
(D2,Q)=fr.fvecs_read("siftsmall/siftsmall_query.fvecs")
if D1!=D2:
    raise IOError("Data points and query points are of different dimension")
D=D1

#CHANGE OF ORIGIN
#find the mean of randomly sampled points
m=fr.findmean(P,D,10)
#then consider this mean as the origin
P=fr.isotropize(P,D,m)
Q=fr.isotropize(Q,D,m)
K=int(np.log2(len(P)))-2 ##########################
print "New dimension K=",K

#PREPROCESSING
tic = time.clock()
dol=Dolphinn(P, D, K)
toc=time.clock()
print "Preprocessing time: ",toc-tic

#QUERIES
tic= time.clock()
#assign keys to queries
solQ=dol.queries(Q, M, num_of_probes)
toc=time.clock()
print "Average query time (Dolphinn): ",(toc-tic)/len(Q)

#BRUTEFORCE
tic= time.clock()
solQQ=bf.bruteforce(P, Q)
toc=time.clock()
print "Average query time (Bruteforce): ",(toc-tic)/len(Q)

#COMPUTE ACCURACY: max ratio (found distance)/(NN distance), number of exact NNs found
n=0
mmax=0
for i in range(len(solQ)):
    if mmax
[...]
                =num_of_probes:
                    flag=True
                    break
            if flag:
                break
        #Queries
        #assign keys to queries
        A=np.sign(Q.dot(self.h))
        b=np.array([2**j for j in range(self.K)])
        solQ=[]
        for j in range(len(A)):
            cands=[]
            N=np.multiply(combs,A[j])
            N=N.dot(b)
            for k in N:
                if self.cube.get(int(k))!= None:
                    cands.extend(self.cube[int(k)])
            if len(cands)>M:
                args=np.argpartition([np.linalg.norm(np.subtract(self.P[i],Q[j])) for i in cands],M)
                sols=[]
                for i in range(M):
                    sols.append(cands[args[i]])
                solQ.append(sols)
            else:
                solQ.append([-1])
        return solQ

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
DolphinnPy

Python 2.7

Numpy is required: numpy.org
for instance: pip install numpy

DolphinnPy provides a simple yet efficient method for computing an (approximate) nearest neighbor in high dimensions. The algorithm is based on https://arxiv.org/abs/1612.07405, where we show linear space and sublinear query time for a specific setting of parameters.

First, N points are randomly mapped to keys in {0,1}^K, for K<=logN, using the Hyperplane LSH family. Then, for a given query, the candidate nearest neighbors are the points whose keys lie within a small Hamming radius of the query's key. Our approach resembles multi-probe LSH, but it differs in how the list of candidates is computed.

Files:

main.py: reads files, builds data structure, executes queries.
dolphinn.py: data structure constructor, queries method.
utils.py: various useful functions.
bruteforce.py: linear scan for validation purposes.

Hardcoded parameters (in main.py):

K: new dimension - key bit length.
num_of_probes: how many buckets are allowed to be visited.
M: how many candidate points are allowed to be examined.

Dataset and queryset file paths are set in the script; the files must be in fvecs format.
Requires input from http://corpus-texmex.irisa.fr/

How to run: python main.py

Preprocesses the dataset, then runs Dolphinn and brute-force search on all queries.
Prints K, the preprocessing time and the average query times.
Prints the multiplicative approximation and the number of exact answers.


Some tasks:

1) Fix K, change num_of_probes and M: try to increase the number of exact answers/decrease the multiplicative approximation.

2) Fix num_of_probes and M, change K: try to increase the number of exact answers/decrease the multiplicative approximation.

3) After reading the files, the script calls an isotropize function for both sets. Run the script after commenting out these two lines.
--------------------------------------------------------------------------------
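The key-and-probe scheme described in README.md (hyperplane LSH keys in {0,1}^K, then a search over buckets within a small Hamming radius) can be illustrated with a minimal, standard-library-only sketch. It is not the code in dolphinn.py: the function names (`make_hyperplanes`, `key_of`, `build_cube`, `probe_keys`, `query`) are hypothetical, keys are stored as plain integers, and only the single best candidate is returned. The `num_of_probes` parameter is assumed to play the same role as in main.py, capping the number of buckets visited.

```python
import itertools
import random

def make_hyperplanes(dim, k, rng):
    # One random Gaussian normal vector per key bit (hypothetical helper).
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def key_of(point, hyperplanes):
    # Bit j is set when the point lies on the non-negative side of hyperplane j.
    key = 0
    for j, h in enumerate(hyperplanes):
        if sum(x * y for x, y in zip(point, h)) >= 0:
            key |= 1 << j
    return key

def build_cube(points, hyperplanes):
    # Map each key (a vertex of the K-dimensional hypercube) to point indices.
    cube = {}
    for idx, p in enumerate(points):
        cube.setdefault(key_of(p, hyperplanes), []).append(idx)
    return cube

def probe_keys(key, k, num_of_probes):
    # Yield keys in order of increasing Hamming distance from `key`,
    # stopping once num_of_probes keys have been produced.
    count = 0
    for radius in range(k + 1):
        for bits in itertools.combinations(range(k), radius):
            if count >= num_of_probes:
                return
            probe = key
            for b in bits:
                probe ^= 1 << b  # flip bit b
            yield probe
            count += 1

def query(q, points, cube, hyperplanes, num_of_probes):
    # Scan the probed buckets and keep the closest candidate (-1 if none found).
    k = len(hyperplanes)
    best, best_d = -1, float("inf")
    for probe in probe_keys(key_of(q, hyperplanes), k, num_of_probes):
        for idx in cube.get(probe, []):
            d = sum((a - b) ** 2 for a, b in zip(points[idx], q))
            if d < best_d:
                best, best_d = idx, d
    return best

rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
planes = make_hyperplanes(8, 5, rng)
cube = build_cube(points, planes)
# A stored point always falls in its own bucket, so it is found at radius 0.
print(query(points[3], points, cube, planes, num_of_probes=6))
```

Raising `num_of_probes` visits more buckets and tends to improve accuracy at the cost of query time, which is the trade-off explored in tasks 1 and 2 of the README.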