├── LICENSE
├── README.md
├── TODO
├── example.rb
├── example2.rb
├── redimension.rb
└── test.rb


/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2015, Salvatore Sanfilippo

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Redimension
===

Redimension is a Redis multi-dimensional indexing and querying library.
It indexes items in N dimensions and then lets you ask for the elements
where each dimension is within a specified range.

This library was written for multiple reasons:

1. In order to show the technique described in the [Redis indexing documentation](http://redis.io/topics/indexes) in actual working code.
2. Because it's useful for actual workloads.
3. Since I wanted to experiment with an actual API and implementation of this problem to understand if it's a good idea to add new commands to Redis implementing exactly this use case, but inside the server.

The technique used to implement this library is to use an ordered
set of keys to represent a multi-dimensional index, interleaving
the bits of each dimension's value into a single large number. This way
it is possible to request squares (in 2D), cubes (in 3D) and in general
N-dimensional ranges with the same side length in an efficient way as
lexicographical ranges. In Redis, this is implemented using the sorted set
data type, together with the [ZRANGEBYLEX command](http://redis.io/commands/zrangebylex).

Usage
===

Currently the library can index only unsigned integers of the specified
precision. There are no precision limits, you can index integers composed
of as many bits as you like: you specify the number of bits for each dimension
in the constructor when creating a Redimension object.

An example usage in 2D is the following. Imagine you want to index persons
by salary and age:

    redis = Redis.new()
    myindex = Redimension.new(redis,"people-by-salary",2,64)

We created a Redimension object specifying a Redis object that must respond
to the Redis commands. We specified we want 2D indexing, and 64 bits of
precision for each dimension. The second argument is the key name that will
represent the index as a sorted set.

Now we can add elements to our index.

    myindex.index([45,120000],"Josh")
    myindex.index([50,110000],"Pamela")
    myindex.index([30,125000],"Angela")

The `index` method takes an array of integers representing the value of each
dimension for the item, and an item name that will be returned when asking
for ranges during the query stage.

Querying is simple. In the following query we ask for all the people with
age between 40 and 50, and salary between 100000 and 115000.

    results = myindex.query([[40,50],[100000,115000]])
    Output: [50, 110000, "Pamela"]

Ranges are **always** inclusive. Not a big problem since currently we can
only index integers, so just increment/decrement to exclude a given value.

If you want to play with the library, the above example is shipped with
the source code, the file is called `example.rb`.

Unindexing
===

There are two ways to remove indexed data from the index. One
is to specify again the coordinates and the ID, using the `unindex` method:

    myindex.unindex([45,120000],"Josh")

However, sometimes the old coordinates are no longer available and we
just want to unindex an element or update its coordinates. In this
case we may enable a feature of the library called *Hash mapping*. We
enable it by setting a key which will represent, using a Hash type, a
map between the item ID and the current indexed representation:

    myindex.hashkey = "people-by-salary-map"

Once this is enabled, each time we use the `index` method, a hash entry
will be created at the same time. We can now use two additional methods.
One will simply remove an item from the index just by ID:

    myindex.unindex_by_id("Josh")

The other is a variant of `index` that removes and re-adds the element with
the updated coordinates:

    myindex.update([46,120000],"Josh")

It is important to enable this feature after the object is created, and
consistently for all the queries, so that the Hash and the sorted set
are in sync. When this feature is enabled, using `index` is not a good
idea (it does not remove a previously indexed entry for the same ID);
`update` should be used instead, regardless of whether the added element
already exists inside the index. Please refer to `example2.rb` for
an example.

Tests
===

There is a fuzzy tester called `test.rb` that tests the library in 2D, 3D
and 4D against Ruby-side filtering of elements within the ranges.
In order to run the test just execute:

    ruby ./test.rb

License
===

The code is released under the BSD 2 clause license.

--------------------------------------------------------------------------------
/TODO:
--------------------------------------------------------------------------------
TODO
====

* Some way to support pipelined add of multiple points.
  Perhaps just r.pipelined { idex.add(...) } works?

* Select a better exponent for the query by using ZLEXCOUNT to estimate
  if filtering work would be too big or not to perform compared to the
  number of additional queries.

* Binary encoding to save space instead of using hex to represent each byte as two.

* Convert update & unindex_by_id method to Lua scripts.

--------------------------------------------------------------------------------
/example.rb:
--------------------------------------------------------------------------------
require 'rubygems'
require 'redis'
require "./redimension.rb"

redis = Redis.new
redis.del("people-by-salary")
myindex = Redimension.new(redis,"people-by-salary",2,64)
myindex.index([45,120000],"Josh")
myindex.index([50,110000],"Pamela")
myindex.index([30,125000],"Angela")
results = myindex.query([[40,50],[100000,115000]])
results.each{|r|
    puts r.inspect
}

--------------------------------------------------------------------------------
/example2.rb:
--------------------------------------------------------------------------------
require 'rubygems'
require 'redis'
require "./redimension.rb"

redis = Redis.new
redis.del("people-by-salary")
redis.del("people-by-salary-map")
myindex = Redimension.new(redis,"people-by-salary",2,64)
myindex.hashkey = "people-by-salary-map"
myindex.update([45,120000],"Josh")
myindex.update([50,110000],"Pamela")
myindex.update([41,100000],"George")
myindex.update([30,125000],"Angela")

results = myindex.query([[40,50],[100000,115000]])
results.each{|r|
    puts r.inspect
}

myindex.unindex_by_id("Pamela")
puts "After unindexing:"
results = myindex.query([[40,50],[100000,115000]])
results.each{|r|
    puts r.inspect
}

myindex.update([42,100000],"George")
puts "After updating:"
results = myindex.query([[40,50],[100000,115000]])
results.each{|r|
    puts r.inspect
}

--------------------------------------------------------------------------------
/redimension.rb:
--------------------------------------------------------------------------------
class Redimension
    attr_accessor :debug, :hashkey
    attr_reader :redis, :key, :dim, :prec

    def initialize(redis,key,dim,prec=64)
        @debug = false
        @redis = redis
        @dim = dim
        @key = key
        @prec = prec
        @hashkey = false
        @binary = false # Default is hex encoding
    end

    def check_dim(vars)
        if vars.length != @dim
            raise "Please always use #{@dim} vars with this index."
        end
    end

    # Encode N variables into the bits-interleaved representation.
    def encode(vars)
        comb = false
        vars.each{|v|
            vbin = v.to_s(2).rjust(@prec,'0')
            comb = comb ? comb.zip(vbin.split("")) : vbin.split("")
        }
        comb = comb.flatten.compact.join("")
        # comb is a string of binary digits, so it must be parsed in base 2
        # before being converted to fixed-width hex.
        comb.to_i(2).to_s(16).rjust(@prec*@dim/4,'0')
    end

    # Encode an element coordinates and ID as the whole string to add
    # into the sorted set.
    def elestring(vars,id)
        check_dim(vars)
        ele = encode(vars)
        vars.each{|v| ele << ":#{v}"}
        ele << ":#{id}"
    end

    # Add a variable with associated data 'id'
    def index(vars,id)
        ele = elestring(vars,id)
        @redis.multi {|tx|
            tx.zadd(@key,0,ele)
            # Maintain the ID -> element map only when hash mapping is enabled.
            tx.hset(@hashkey,id,ele) if @hashkey
        }
    end

    # ZREM according to current position in the space and ID.
    def unindex(vars,id)
        @redis.zrem(@key,elestring(vars,id))
    end

    # Unindex by just ID in case @hashkey is set in order to keep
    # an associated Redis hash with ID -> current indexed representation,
    # so that the user can unindex easily.
    def unindex_by_id(id)
        raise "Please specify a hash key with #hashkey to enable mapping" if !@hashkey
        ele = @redis.hget(@hashkey,id)
        @redis.multi {|tx|
            # Skip the ZREM when the ID is not in the map (nothing indexed).
            tx.zrem(@key,ele) if ele
            tx.hdel(@hashkey,id)
        }
    end

    # Like #index but makes sure to remove the old index for the specified
    # id. Requires hash mapping enabled.
    def update(vars,id)
        raise "Please specify a hash key with #hashkey to enable mapping" if !@hashkey
        ele = elestring(vars,id)
        oldele = @redis.hget(@hashkey,id)
        @redis.multi {|tx|
            # Skip the ZREM when the ID was never indexed (no old entry).
            tx.zrem(@key,oldele) if oldele
            tx.hdel(@hashkey,id)
            tx.zadd(@key,0,ele)
            tx.hset(@hashkey,id,ele)
        }
    end

    # exp is the exponent of two that gives the size of the squares
    # we use in the range query. N times the exponent is the number
    # of bits we unset and set to get the start and end points of the range.
    def query_raw(vrange,exp)
        vstart = []
        vend = []
        # We start scaling our indexes in order to iterate all areas, so
        # that to move between N-dimensional areas we can just increment
        # vars.
        vrange.each{|r|
            vstart << r[0]/(2**exp)
            vend << r[1]/(2**exp)
        }

        # Visit all the sub-areas to cover our N-dim search region.
        ranges = []
        vcurrent = vstart.dup
        notdone = true
        while notdone
            # For each sub-region, encode all the start-end ranges
            # for each dimension.
            vrange_start = []
            vrange_end = []
            (0...@dim).each{|i|
                vrange_start << vcurrent[i]*(2**exp)
                vrange_end << (vrange_start[i] | ((2**exp)-1))
            }

            puts "Logical square #{vcurrent.inspect} from #{vrange_start.inspect} to #{vrange_end.inspect}" if @debug

            # Now we need to combine the ranges for each dimension
            # into a single lexicographical query, so we turn
            # the ranges into interleaved form.
            s = encode(vrange_start)
            # Now that we have the start of the range, calculate the end
            # by replacing the specified number of bits from 0 to 1.
            e = encode(vrange_end)
            ranges << ["[#{s}:","[#{e}:\xff"]
            puts "Lex query: #{ranges[-1]}" if @debug

            # Increment to loop in N dimensions in order to visit
            # all the sub-areas representing the N dimensional area to
            # query.
            (0...@dim).each{|i|
                if vcurrent[i] != vend[i]
                    vcurrent[i] += 1
                    break
                elsif i == @dim-1
                    notdone = false # Visited everything!
                else
                    vcurrent[i] = vstart[i]
                end
            }
        end

        # Perform the ZRANGEBYLEX queries to collect the results from the
        # defined ranges. Use pipelining to speedup.
        allres = @redis.pipelined {|pipe|
            ranges.each{|range|
                pipe.zrangebylex(@key,range[0],range[1])
            }
        }

        # Filter items according to the requested limits. This is needed
        # since our sub-areas used to cover the whole search area are not
        # perfectly aligned with boundaries, so we also retrieve elements
        # outside the searched ranges.
        items = []
        allres.each{|res|
            res.each{|item|
                fields = item.split(":")
                skip = false
                (0...@dim).each{|i|
                    if fields[i+1].to_i < vrange[i][0] ||
                       fields[i+1].to_i > vrange[i][1]
                    then
                        skip = true
                        break
                    end
                }
                items << fields[1..-2].map{|f| f.to_i} + [fields[-1]] if !skip
            }
        }
        items
    end

    # Like query_raw, but before performing the query makes sure to order
    # parameters so that x0 < x1 and y0 < y1 and so forth.
    # Also calculates the exponent for the query_raw masking.
    def query(vrange)
        check_dim(vrange)
        vrange = vrange.map{|vr|
            vr[0] < vr[1] ? vr : [vr[1],vr[0]]
        }
        deltas = vrange.map{|vr| (vr[1]-vr[0])+1}
        delta = deltas.min
        exp = 1
        while delta > 2
            delta /= 2
            exp += 1
        end
        # If ranges for different dimensions are extremely different in span,
        # we may end with a too small exponent which will result in a very
        # big number of queries in order to be very selective. This is most
        # of the times not a good idea, so at the cost of querying larger
        # areas and filtering more, we scale 'exp' until we can serve this
        # request with fewer than 20 ZRANGEBYLEX commands.
        #
        # Note: the magic "20" depends on the number of items inside the
        # requested range, since it's a tradeoff with filtering items outside
        # the searched area. It is possible to improve the algorithm by using
        # ZLEXCOUNT to get the number of items.
        while true
            deltas = vrange.map{|vr|
                (vr[1]/(2**exp))-(vr[0]/(2**exp))+1
            }
            ranges = deltas.reduce{|a,b| a*b}
            break if ranges < 20
            exp += 1
        end
        query_raw(vrange,exp)
    end

    # Similar to #query but takes just the center of the query area and a
    # radius, and automatically filters away all the elements outside the
    # specified circular area.
    def query_radius(x,y,exp,radius)
        # TODO
    end
end


--------------------------------------------------------------------------------
/test.rb:
--------------------------------------------------------------------------------
require 'rubygems'
require 'redis'
require "./redimension.rb"

# Index `items` random points in `dim` dimensions, then run `queries` random
# range queries and compare each result with plain Ruby-side filtering.
def fuzzy_test(dim,items,queries)
    redis = Redis.new()
    redis.del("redim-fuzzy")
    rn = Redimension.new(redis,"redim-fuzzy",dim,64)
    id = 0
    dataset = []
    items.times {
        vars = []
        dim.times {vars << rand(1000)}
        dataset << vars+[id.to_s]
        rn.index(vars,id)
        puts "Adding #{dataset[-1].inspect}"
        id += 1
    }

    queries.times {
        random = []
        dim.times {
            s = rand(1000)
            e = rand(1000)
            # Sort the range for the test itself, the library can take
            # arguments in the wrong order without issues.
            s,e=e,s if s > e
            random << [s,e]
        }
        print "TESTING #{random.inspect}:"
        STDOUT.flush

        start_t = Time.now
        res1 = rn.query(random)
        end_t = Time.now
        print "#{res1.length} result in #{(end_t-start_t).to_f} seconds\n"
        res2 = dataset.select{|i|
            included = true
            (0...dim).each{|j|
                included = false if i[j] < random[j][0] ||
                                    i[j] > random[j][1]
            }
            included
        }
        if res1.sort != res2.sort
            puts "ERROR #{res1.length} VS #{res2.length}:"
            puts res1.sort.inspect
            puts res2.sort.inspect
            exit
        end
    }
    puts "#{dim}D test passed"
    redis.del("redim-fuzzy")
end

fuzzy_test(4,100,1000)
fuzzy_test(3,100,1000)
fuzzy_test(2,1000,1000)

--------------------------------------------------------------------------------
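
The bit interleaving that `encode` in redimension.rb relies on can be tried without a Redis server. What follows is only a minimal sketch and not part of the repository: the helper name `interleave_hex` and the 8-bit precision are assumptions chosen for illustration. It shows why an aligned square maps to one contiguous lexicographic range: every point whose coordinates differ only in the low 2 bits shares the same hex prefix.

```ruby
# Hypothetical helper (not part of redimension.rb): interleave the bits of
# each coordinate, most significant bit first, and return a fixed-width hex
# string, in the spirit of Redimension#encode.
def interleave_hex(vars, prec)
  bits = vars.map { |v| v.to_s(2).rjust(prec, "0").chars }
  # transpose turns [[x7..x0],[y7..y0]] into [[x7,y7],...,[x0,y0]]
  comb = bits.transpose.flatten.join
  comb.to_i(2).to_s(16).rjust(prec * vars.length / 4, "0")
end

# Aligned 4x4 square: x in 4..7 and y in 8..11, so only the low 2 bits
# of each coordinate vary between the two corners.
lo = interleave_hex([4, 8], 8)
hi = interleave_hex([7, 11], 8)

inside  = interleave_hex([5, 10], 8)  # a point within the square
outside = interleave_hex([3, 10], 8)  # x is outside 4..7

puts "range   #{lo}..#{hi}"
puts "inside  #{inside} in range: #{(lo..hi).cover?(inside)}"
puts "outside #{outside} in range: #{(lo..hi).cover?(outside)}"
```

In the library itself the same start and end strings become the `[start:` and `[end:\xff` bounds of a ZRANGEBYLEX call, and query areas that are not aligned to a power of two are covered by several such squares followed by client-side filtering.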