├── .gitignore
├── .rspec
├── Gemfile
├── README.md
├── Rakefile
├── bin
│   └── mneme
├── config.rb
├── lib
│   ├── mneme.rb
│   └── mneme
│       ├── helper.rb
│       └── sweeper.rb
├── mneme.gemspec
└── spec
    └── mneme_spec.rb

/.gitignore:
--------------------------------------------------------------------------------
*.gem
.bundle
Gemfile.lock
pkg/*

--------------------------------------------------------------------------------
/.rspec:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/igrigorik/mneme/e912a1eabb9c6f226df085626f8147e4a96cb85c/.rspec
--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
source "http://rubygems.org"

gemspec
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Mneme

    mneme (n.) mne·me
    1. Psychology: the retentive basis or basic principle in a mind or organism accounting for memory, persisting effect of memory of past events.
    2. Mythology: the Muse of memory, one of the original three Muses. Cf. "Aoede, Melete."

Mneme is an HTTP web-service for recording and identifying previously seen records - aka, duplicate detection. To achieve this goal in a scalable and zero-maintenance manner, it is implemented via a collection of automatically rotated bloomfilters. By using a collection of bloomfilters, you can customize your false-positive error rate, as well as the amount of time you want your memory to persist (ex: remember all keys for the last 6 hours).

To minimize the required memory footprint, mneme does not store the actual key names; instead, each specified key is hashed and mapped onto the bloomfilter.
For data storage, we use Redis getbit/setbit to efficiently store and retrieve bit-level data for each key. Couple this with the Goliath app server, and you have an out-of-the-box, high-performance, customizable duplicate filter.

For more details: [Mneme: Scalable Duplicate Filtering Service](http://www.igvita.com/2011/03/24/mneme-scalable-duplicate-filtering-service)

## Sample configuration

    # example_config.rb

    config['namespace'] = 'default' # namespace for your app (if you're sharing a redis instance)
    config['periods'] = 3           # number of periods to store data for
    config['length'] = 60           # length of a period in seconds (length = 60, periods = 3.. 180s worth of data)

    config['size'] = 1000           # desired size of the bloomfilter
    config['bits'] = 10             # number of bits allocated per key
    config['hashes'] = 7            # number of times each key will be hashed
    config['seed'] = 30             # seed value for the hash function

    config['pool'] = 2              # number of concurrent Redis connections

To learn more about Bloom filter configuration: [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)

## Launching mneme

    $> redis-server
    $> gem install mneme
    $> mneme -p 9000 -sv -c config.rb # run with -h to see all options

That's it! You now have a mneme web service running on port 9000.
Let's try querying and inserting some data:

    $> curl "http://127.0.0.1:9000?key=abcd"
    {"found":[],"missing":["abcd"]}

    # -d creates a POST request with key=abcd, aka insert into filter
    $> curl "http://127.0.0.1:9000?key=abcd" -d' '

    $> curl "http://127.0.0.1:9000?key=abcd"
    {"found":["abcd"],"missing":[]}

## Performance & Memory requirements

- [Redis](http://redis.io/) is used as the in-memory datastore for the bloomfilters
- [Goliath](https://github.com/postrank-labs/goliath) provides the high-performance HTTP frontend
- The speed of storing a new key is: *O(number of BF hashes) - aka, O(1)*
- The speed of retrieving a key is: *O(number of filters * number of BF hashes) - aka, O(1)*

- Sample ab benchmarks for single key lookup: [https://gist.github.com/895326](https://gist.github.com/895326)

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Because we are using Redis as the in-memory backend store for the filters, there is some extra overhead. Sample memory requirements:

- 1.0% error rate for 1M items, 10 bits per item: 2.5 MB
- 1.0% error rate for 150M items, 10 bits per item: 358.52 MB
- 0.1% error rate for 150M items, 15 bits per item: 537.33 MB

Ex: If you wanted to store up to 24 hours (with 1 hour = 1 bloom filter) of keys, where each hour can have up to 1M keys, and you are willing to accept a 1.0% error rate, then your memory footprint is: 24 * 2.5 MB = 60 MB of memory. The footprint will not change after 24 hours, because Mneme will automatically rotate and delete old filters for you!
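
As a rough cross-check of the numbers above, here is a minimal Ruby sketch (not part of mneme) that estimates the theoretical false-positive rate and the raw bit-array size for a given configuration. The helper names `false_positive_rate`, `raw_filter_mb` and `total_footprint_mb` are hypothetical and exist only for this illustration. The error rate uses the standard Bloom filter approximation `(1 - e^(-k/b))^k` for `b` bits per key and `k` hashes, and the raw size ignores the Redis overhead mentioned above, so expect real usage to be higher than the raw figure:

```ruby
# Illustrative helpers only: these names are assumptions, not part of mneme.

# Approximate false-positive rate of a Bloom filter that is filled to its
# design capacity, given bits allocated per key and number of hash functions.
def false_positive_rate(bits_per_key, hashes)
  (1 - Math.exp(-(hashes.to_f / bits_per_key)))**hashes
end

# Raw size of one filter's bit array in megabytes (no Redis overhead).
def raw_filter_mb(items, bits_per_key)
  items * bits_per_key / 8.0 / (1024 * 1024)
end

# Total footprint when one filter is kept per period.
def total_footprint_mb(periods, mb_per_filter)
  periods * mb_per_filter
end

puts false_positive_rate(10, 7)      # about 0.008, i.e. roughly 1%
puts false_positive_rate(15, 7)      # about 0.001, i.e. roughly 0.1%
puts raw_filter_mb(1_000_000, 10)    # about 1.19 MB before Redis overhead
puts total_footprint_mb(24, 2.5)     # 60.0, the 24-hour example above
```

The first two calls line up with the 1.0% and 0.1% rows in the list above, and the gap between the raw 1.19 MB and the listed 2.5 MB is the kind of backend overhead the paragraph refers to.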

### License

(MIT License) - Copyright (c) 2011 Ilya Grigorik
--------------------------------------------------------------------------------
/Rakefile:
--------------------------------------------------------------------------------
require 'bundler'
Bundler::GemHelper.install_tasks

require 'rspec/core/rake_task'

desc "Run all RSpec tests"
RSpec::Core::RakeTask.new(:spec)

task :default => :spec
--------------------------------------------------------------------------------
/bin/mneme:
--------------------------------------------------------------------------------
#!/usr/bin/env ruby

begin
  # Resolve the config path to an absolute path before handing off to Goliath
  config_index = (ARGV.index('-c') || ARGV.index('--config')) + 1
  ARGV[config_index] = File.absolute_path(ARGV[config_index])
rescue
  puts "Please specify a valid mneme configuration file (ex: -c config.rb)"
  # Only continue without a config when help was requested
  exit if !(ARGV.index('-h') || ARGV.index('--help'))
end

system("/usr/bin/env ruby " + File.dirname(__FILE__) + '/../lib/mneme.rb' + ' ' + ARGV.join(" "))
--------------------------------------------------------------------------------
/config.rb:
--------------------------------------------------------------------------------
config['namespace'] = 'default'
config['periods'] = 3
config['length'] = 60

config['size'] = 1000
config['bits'] = 10
config['hashes'] = 7
config['seed'] = 30

config['pool'] = 2
--------------------------------------------------------------------------------
/lib/mneme.rb:
--------------------------------------------------------------------------------
require 'goliath'
require 'yajl'

require 'redis'
require 'redis/connection/synchrony'
require 'bloomfilter-rb'

require 'mneme/helper'
require 'mneme/sweeper'

class Mneme < Goliath::API
  include Mnemosyne::Helper
  plugin Mnemosyne::Sweeper

  use Goliath::Rack::Params
  use Goliath::Rack::DefaultMimeType
  use Goliath::Rack::Formatters::JSON
  use Goliath::Rack::Render
  use Goliath::Rack::Heartbeat
  use Goliath::Rack::ValidationError
  use Goliath::Rack::Validation::RequestMethod, %w(GET POST)

  def response(env)
    keys = [params.delete('key') || params.delete('key[]')].flatten.compact
    return [400, {}, {error: 'no key specified'}] if keys.empty?

    logger.debug "Processing: #{keys}"
    case env[Goliath::Request::REQUEST_METHOD]
    when 'GET' then query_filters(keys)
    when 'POST' then update_filters(keys)
    end
  end

  def query_filters(keys)
    found, missing = [], []
    keys.each do |key|

      # A key counts as present if any of the active periods' filters has it
      present = false
      config['periods'].to_i.times do |n|
        if filter(n).key?(key)
          present = true
          break
        end
      end

      if present
        found << key
      else
        missing << key
      end
    end

    # 200 when every key was found, 404 when none were, 206 for a mix
    code = case keys.size
      when found.size then 200
      when missing.size then 404
      else 206
    end

    [code, {}, {found: found, missing: missing}]
  end

  def update_filters(keys)
    # Inserts always go into the filter for the current period
    keys.each do |key|
      filter(0).insert key
      logger.debug "Inserted new key: #{key}"
    end

    [201, {}, '']
  end

  private

  def filter(n)
    period = epoch_name(config['namespace'], n, config['length'])

    # Reuse the pooled filter cached for this period, otherwise build and cache it
    filter = if env[Goliath::Constants::CONFIG].key? period
      env[Goliath::Constants::CONFIG][period]
    else
      opts = {
        namespace: config['namespace'],
        size: config['size'] * config['bits'],
        seed: config['seed'],
        hashes: config['hashes']
      }

      pool = config['pool'] || 1
      env[Goliath::Constants::CONFIG][period] = EventMachine::Synchrony::ConnectionPool.new(size: pool) do
        BloomFilter::Redis.new(opts)
      end
    end

    filter
  end
end

--------------------------------------------------------------------------------
/lib/mneme/helper.rb:
--------------------------------------------------------------------------------
module Mnemosyne
  module Helper
    # Index of the period that started n periods ago (n = 0 is the current one)
    def epoch(n, length)
      (Time.now.to_i / length) - n
    end

    def epoch_name(namespace, n, length)
      "mneme-#{namespace}-#{epoch(n, length)}"
    end
  end
end

--------------------------------------------------------------------------------
/lib/mneme/sweeper.rb:
--------------------------------------------------------------------------------
module Mnemosyne
  class Sweeper
    include Helper

    def initialize(port, config, status, logger)
      @status = status
      @config = config
      @logger = logger
    end

    def run
      if @config.empty?
        puts "Please specify a valid mneme configuration file (ex: -c config.rb)"
        EM.stop
        exit
      end

      sweeper = Proc.new do
        Fiber.new do
          current = epoch_name(@config['namespace'], 0, @config['length'])
          @logger.info "Sweeping old filters, current epoch: #{current}"

          # Delete the filters that have aged out of the active window
          # (the `periods` epochs immediately before the retained ones)
          conn = Redis.new
          @config['periods'].times do |n|
            name = epoch_name(@config['namespace'], n + @config['periods'], @config['length'])

            conn.del(name)
            @logger.info "Removed: #{name}"
          end
          conn.client.disconnect
        end.resume
      end

      # Sweep once at startup, then once per period
      sweeper.call
      EM.add_periodic_timer(@config['length']) { sweeper.call }

      @logger.info "Started Mnemosyne::Sweeper with #{@config['length']}s interval"
    end
  end
end

--------------------------------------------------------------------------------
/mneme.gemspec:
--------------------------------------------------------------------------------
# -*- encoding: utf-8 -*-
$:.push File.expand_path("../lib", __FILE__)

Gem::Specification.new do |s|
  s.name = "mneme"
  s.version = "0.6.0"
  s.platform = Gem::Platform::RUBY
  s.authors = ["Ilya Grigorik"]
  s.email = ["ilya@igvita.com"]
  s.homepage = ""
  s.summary = %q{abc}
  s.description = %q{Write a gem description}

  s.rubyforge_project = "mneme"

  s.add_dependency "goliath", ["= 0.9.1"]
  s.add_dependency 'http_parser.rb', ["= 0.5.1"]
  s.add_dependency "hiredis"

  s.add_dependency "redis"
  s.add_dependency "yajl-ruby"
  s.add_dependency "bloomfilter-rb"

  s.add_development_dependency "rspec"
  s.add_development_dependency "em-http-request", "= 1.0.0.beta.3"

  s.files = `git ls-files`.split("\n")
  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
  s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
  s.require_paths = ["lib"]
end

--------------------------------------------------------------------------------
/spec/mneme_spec.rb:
--------------------------------------------------------------------------------
require 'lib/mneme'
require 'goliath/test_helper'
require 'em-http/middleware/json_response'

describe Mneme do
  include Goliath::TestHelper

  let(:err) { Proc.new { fail "API request failed" } }
  let(:api_options) { { :config => File.expand_path(File.join(File.dirname(__FILE__), '..', 'config.rb')) } }

  EventMachine::HttpRequest.use EventMachine::Middleware::JSONResponse

  it 'responds to heartbeat' do
    with_api(Mneme, api_options) do
      get_request({path: '/status'}, err) do |c|
        c.response.should match('OK')
      end
    end
  end

  it 'should return an error if no key is provided' do
    with_api(Mneme, api_options) do
      get_request({}, err) do |c|
        c.response.should include 'error'
      end
    end
  end

  context 'single key' do
    it 'should return 404 on missing key' do
      with_api(Mneme, api_options) do
        get_request({:query => {:key => 'missing'}}, err) do |c|
          c.response_header.status.should == 404
          c.response['missing'].should include 'missing'
        end
      end
    end

    it 'should insert key into filter' do
      with_api(Mneme, api_options) do
        post_request({:query => {key: 'abc'}}) do |c|
          c.response_header.status.should == 201

          get_request({:query => {:key => 'abc'}}, err) do |c|
            c.response_header.status.should == 200
            c.response['found'].should include 'abc'
          end
        end
      end
    end
  end

  context 'multiple keys' do

    it 'should return 404 on missing keys' do
      with_api(Mneme, api_options) do
        get_request({:query => {:key => ['a', 'b']}}, err) do |c|
          c.response_header.status.should == 404

          c.response['found'].should be_empty
          c.response['missing'].should include 'a'
          c.response['missing'].should include 'b'
        end
      end
    end

    it 'should return 200 on found keys' do
      with_api(Mneme, api_options) do
        post_request({:query => {key: ['abc1', 'abc2']}}) do |c|
          c.response_header.status.should == 201

          get_request({:query => {:key => ['abc1', 'abc2']}}, err) do |c|
            c.response_header.status.should == 200
          end
        end
      end
    end

    it 'should return 206 on mixed keys' do
      with_api(Mneme, api_options) do
        post_request({:query => {key: ['abc3']}}) do |c|
          c.response_header.status.should == 201

          get_request({:query => {:key => ['abc3', 'abc4']}}, err) do |c|
            c.response_header.status.should == 206
          end
        end
      end
    end

  end

end

--------------------------------------------------------------------------------