├── .gitignore
├── .rspec
├── Gemfile
├── LICENSE.txt
├── README.md
├── config.json.example
├── config.ru
├── events_rmq.rb
├── events_sqs.rb
├── optica.rb
├── spec
│   ├── optica_spec.rb
│   └── spec_helper.rb
├── store.rb
└── tools
    ├── copy.rb
    ├── fabfile.py
    ├── optica.sh
    └── reporter.rb

/.gitignore:
--------------------------------------------------------------------------------
1 | *.gem
2 | *.rbc
3 | .bundle
4 | .config
5 | coverage
6 | InstalledFiles
7 | lib/bundler/man
8 | pkg
9 | rdoc
10 | spec/reports
11 | test/tmp
12 | test/version_tmp
13 | tmp
14 | 
15 | # YARD artifacts
16 | .yardoc
17 | _yardoc
18 | doc/
19 | 
20 | # VIM
21 | .*sw?
22 | 
23 | Gemfile.lock
24 | config.json
--------------------------------------------------------------------------------
/.rspec:
--------------------------------------------------------------------------------
1 | --require spec_helper
--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
1 | source 'https://rubygems.org'
2 | 
3 | gem 'sinatra', '~> 1.4.4'
4 | gem 'zk', '~> 1.9.3'
5 | gem 'unicorn', '~> 4.8.3'
6 | gem 'unicorn-worker-killer', '~> 0.4.4'
7 | gem 'hash-deep-merge', '~> 0.1.1'
8 | gem 'oj', '= 3.13.18'
9 | gem 'stomp', '~> 1.3.2'
10 | gem 'dogstatsd-ruby', '= 3.3.0'
11 | gem 'aws-sdk-sqs', '= 1.51.1'
12 | gem 'get_process_mem', '= 0.2.1'
13 | 
14 | group :ddtrace do
15 |   gem 'ddtrace', '~> 0.45.0'
16 | end
17 | 
18 | group :test do
19 |   gem 'rspec', '~> 3.11.0'
20 |   gem 'rack-test', '~> 2.0.2'
21 | end
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Copyright (c) 2013 Airbnb, Inc.
2 | 
3 | MIT License
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining
6 | a copy of this software and associated documentation files (the
7 | "Software"), to deal in the Software without restriction, including
8 | without limitation the rights to use, copy, modify, merge, publish,
9 | distribute, sublicense, and/or sell copies of the Software, and to
10 | permit persons to whom the Software is furnished to do so, subject to
11 | the following conditions:
12 | 
13 | The above copyright notice and this permission notice shall be
14 | included in all copies or substantial portions of the Software.
15 | 
16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Optica #
2 | 
3 | Optica is a service for registering and locating nodes.
4 | It provides a simple REST API.
5 | 
6 | Nodes can POST to / to register themselves with some parameters.
7 | Humans can GET / to get a list of all registered nodes.
8 | GET also accepts some parameters to limit which of the registered nodes you see.
9 | 
10 | ## Why Optica? ##
11 | 
12 | We love the node registration features of [chef server](https://docs.chef.io/server_components.html).
13 | However, we run chef-solo here at [Airbnb](https://www.airbnb.com).
14 | We use optica as an alternate node registration system.
15 | 
16 | ## Installation ##
17 | 
18 | Use `bundler`!
19 | To install all the dependencies:
20 | 
21 | ```bash
22 | $ bundle install
23 | ```
24 | 
25 | ## Dependencies ##
26 | 
27 | ### Zookeeper ###
28 | 
29 | Optica is a front-end to a data store.
30 | At Airbnb, this data store is [Apache Zookeeper](https://zookeeper.apache.org/).
31 | 
32 | Why Zookeeper?
33 | * we consider optica information critical data, with high uptime requirements
34 | * we already rely critically on Zookeeper to connect our infrastructure; we strive to ensure maximum uptime for this system
35 | * the load patterns of optica (many reads, infrequent writes) match what zookeeper provides
36 | 
37 | ### Rabbitmq ###
38 | 
39 | Some parts of our infrastructure are asynchronous; we rely on notification of converges to know, for example, when some kinds of deploys have completed (or failed).
40 | For this reason, Optica generates events in [rabbitmq](http://www.rabbitmq.com/) for every converge.
41 | 
42 | ## Usage with Chef ##
43 | 
44 | We've included a sample notifier which reports back to optica on every chef converge.
45 | It's in this repo at `tools/reporter.rb`; just make sure to substitute the
46 | correct value for the `optica_server` option. To use it, we added the [chef-handler cookbook](https://github.com/opscode-cookbooks/chef_handler).
47 | Then, we do the following (in our common cookbook, which is applied to every role):
48 | 
49 | ```ruby
50 | directory node.common.notifier_dir
51 | 
52 | cookbook_file 'reporter.rb' do
53 |   path File.join(node.common.notifier_dir, 'reporter.rb')
54 | end
55 | 
56 | chef_handler 'notifier' do
57 |   action :enable
58 |   source File.join(node.common.notifier_dir, 'reporter.rb')
59 | end
60 | ```
61 | 
62 | If you wish to register additional key-value pairs with your node, simply add them to `node.optica.report`:
63 | 
64 | ```ruby
65 | default.optica.report['jvm_version'] = node.java.version
66 | ```
67 | 
68 | ## Usage on the command line ##
69 | 
70 | Optica has a very minimal query syntax, and errs on the side of returning more information than you need.
71 | Really, the only reason for the query parameters is to limit the amount of data transferred over the network.
72 | We can get away with it because all of the complex functionality you might wish for on the command line is provided by [JQ](http://stedolan.github.io/jq/).
73 | 
74 | ### JQ examples ###
75 | 
76 | Let's define a basic optica script:
77 | ```bash
78 | #!/bin/bash
79 | 
80 | my_optica_host='https://optica.example.com'
81 | curl --silent ${my_optica_host}/?"$1" | jq --compact-output ".nodes[] | $2"
82 | ```
83 | 
84 | With this in your `$PATH` and the right substitution for your optica endpoint, here are some examples:
85 | 
86 | ##### Getting all hostnames by role: #####
87 | 
88 | I run this, then pick a random one to ssh into when, e.g., investigating issues.
89 | 
90 | `$ optica role=myrole .hostname`
91 | 
92 | ##### How many of each role in us-east-1a or 1b? 
#### 93 | 94 | See what the impact will be of an outage in those two zones: 95 | 96 | `$ optica az=us-east 'select(.az == "us-east-1a" or .az == "us-east-1b") | .role' | sort | uniq -c | sort -n ` 97 | 98 | ##### Monitor the progress of a chef run on a role #### 99 | 100 | Useful if you've just initiated a chef run across a large number of machines, or are waiting for scheduled runs to complete to deploy your change: 101 | 102 | `$ optica role=myrole '[.last_start, .failed, .hostname]' | sort` 103 | 104 | ## Usage with Fabric ## 105 | 106 | We've included a sample `fabfile.py` to get you started. 107 | Simply replace `optica.example` with the address to your optica install. 108 | 109 | ## Cleanup ## 110 | 111 | Optica relies on you manually cleaning up expired nodes. 112 | At Airbnb, all of our nodes run in Amazon's EC2. 113 | We have a regularly scheduled task which grabs all recently terminated instances and performs cleanup, including optica cleanup, on those instances. 114 | 115 | Cleanup is accomplished by calling `DELETE` on optica. 116 | For instance: 117 | 118 | ```bash 119 | $ curl -X DELETE http://optica.example.com/i-36428351 120 | ``` 121 | 122 | ## Development ## 123 | 124 | You'll need a copy of zookeeper running locally, and it should have the correct path for optica: 125 | 126 | ```bash 127 | $ zkServer start 128 | $ zkCli 129 | [zk: localhost:2181(CONNECTED) 0] create /optica '' 130 | Created /optica 131 | [zk: localhost:2181(CONNECTED) 1] quit 132 | Quitting... 133 | ``` 134 | 135 | The example config is set up to talk to your local zookeeper: 136 | 137 | ```bash 138 | $ cd optica 139 | $ cp config.json.example config.json 140 | ``` 141 | 142 | Edit the default config and add your EC2 credentials. 143 | 144 | We run `optica` via unicorn. 145 | To spin up a test process on port 4567: 146 | 147 | ```bash 148 | $ unicorn -p 4567 149 | ``` 150 | -------------------------------------------------------------------------------- /config.json.example: -------------------------------------------------------------------------------- 1 | { 2 | "zk_path":"localhost:2181/optica", 3 | "debug":true, 4 | "ip_check":true, 5 | "rabbit_host":"127.0.0.1", 6 | "rabbit_port":5672, 7 | "events":["rabbitmq", "sqs"], 8 | "sqs_region":"us-west-2", 9 | "sqs_queue":"optica-events" 10 | } 11 | -------------------------------------------------------------------------------- /config.ru: -------------------------------------------------------------------------------- 1 | require 'oj' 2 | Oj.default_options = { :mode => :strict } 3 | opts = Oj.load( File.read('config.json') ) 4 | 5 | # prepare the logger 6 | require 'logger' 7 | log = Logger.new(STDERR) 8 | log.progname = 'optica' 9 | log.level = Logger::INFO unless opts['debug'] 10 | 11 | opts['log'] = log 12 | 13 | # Override store mode from ENV 14 | if ENV['OPTICA_SPLIT_MODE'] 15 | opts['split_mode'] = ENV['OPTICA_SPLIT_MODE'].downcase 16 | end 17 | 18 | # Enable GC stats 19 | if opts['gc_stats'] 20 | if defined? 
GC::Profiler
21 |     GC::Profiler.enable if GC::Profiler.respond_to?(:enable)
22 |   elsif GC.respond_to?(:enable_stats)
23 |     GC.enable_stats
24 |   end
25 | end
26 | 
27 | # Rack options
28 | if opts['rack']
29 |   key_space_limit = opts['rack']['key_space_limit']
30 |   Rack::Utils.key_space_limit = key_space_limit if key_space_limit
31 | end
32 | 
33 | # prepare statsd
34 | require 'datadog/statsd'
35 | STATSD = Datadog::Statsd.new(opts['statsd_host'], opts['statsd_port'])
36 | 
37 | # prepare to exit cleanly
38 | $EXIT = false
39 | 
40 | # configure unicorn-worker-killer
41 | if opts['worker_killer']
42 |   require 'unicorn/worker_killer'
43 | 
44 |   wk_opts = opts['worker_killer']
45 | 
46 |   if wk_opts['max_requests']
47 |     max_requests = wk_opts['max_requests']
48 |     # Max requests per worker
49 |     use Unicorn::WorkerKiller::MaxRequests, max_requests['min'], max_requests['max']
50 |   end
51 | 
52 |   if wk_opts['mem_limit']
53 |     mem_limit = wk_opts['mem_limit']
54 |     # Max memory size (RSS) per worker
55 |     use Unicorn::WorkerKiller::Oom, mem_limit['min'], mem_limit['max']
56 |   end
57 | end
58 | 
59 | # configure the store
60 | require './store.rb'
61 | store = Store.new(opts)
62 | store.start
63 | 
64 | EVENTS_CLASSES = {
65 |   'rabbitmq' => {
66 |     'class_name' => 'EventsRabbitMQ',
67 |     'file_name' => './events_rmq.rb',
68 |   },
69 |   'sqs' => {
70 |     'class_name' => 'EventsSQS',
71 |     'file_name' => './events_sqs.rb'
72 |   },
73 | }
74 | 
75 | events_classes = opts['events'] || ['rabbitmq']
76 | 
77 | # configure the event creator
78 | events = events_classes.map do |name|
79 |   class_opts = EVENTS_CLASSES[name]
80 |   raise "unknown value '#{name}' for events option" unless class_opts
81 |   class_name = class_opts['class_name']
82 |   file_name = class_opts['file_name']
83 |   log.info "loading #{class_name} from #{file_name}"
84 |   require file_name
85 |   class_const = Object.const_get(class_name)
86 |   class_const.new(opts).tap do |obj|
87 |     obj.start
88 |   end
89 | end
90 | 
91 | # set a signal handler
92 | ['INT', 'TERM', 'QUIT'].each do |signal|
93 |   trap(signal) do
94 |     $stderr.puts "Got signal #{signal} -- exit currently #{$EXIT}"
95 | 
96 |     exit! if $EXIT
97 |     $EXIT = true
98 | 
99 |     # stop the server
100 |     server = Rack::Handler.get(server) || Rack::Handler.default
101 |     server.shutdown if server.respond_to?(:shutdown)
102 | 
103 |     # stop the components
104 |     store.stop()
105 |     events.each { |e| e.stop }  # events is an array of event interfaces
106 |     exit!
107 |   end
108 | end
109 | 
110 | # do we check the client IP?
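# Three modes, keyed off the 'ip_check' value in config.json.example:
#   true / 'direct'  => the IP a node claims in its POST body must match the
#                       address of the directly connected peer
#   'forwarded_for'  => it must match the first X-Forwarded-For entry, for
#                       deployments behind a load balancer or proxy
#   false / absent   => skip the check entirely (see the POST handler in optica.rb)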
111 | ip_check = case opts['ip_check']
112 | when true, 'direct' then :direct
113 | when 'forwarded_for' then :forwarded_for
114 | when false, nil then false
115 | else raise 'unknown value for ip_check option'
116 | end
117 | 
118 | # load the app
119 | require './optica.rb'
120 | 
121 | # configure tracing client
122 | def datadog_config(log)
123 |   Datadog.configure do |c|
124 |     service = ENV.fetch('DD_SERVICE', 'optica')
125 |     c.use :sinatra, service_name: service
126 |     # Statsd instance used for sending runtime metrics
127 |     c.runtime_metrics.statsd = STATSD
128 |   end
129 | 
130 |   # register tracer extension
131 |   Optica.register Datadog::Contrib::Sinatra::Tracer
132 | 
133 |   # add correlation IDs to logger
134 |   log.formatter = proc do |severity, datetime, progname, msg|
135 |     correlation = Datadog.tracer.active_correlation rescue 'FAILED'
136 |     "[#{datetime}][#{progname}][#{severity}][#{correlation}] #{msg}\n"
137 |   end
138 | end
139 | 
140 | begin
141 |   require 'ddtrace/auto_instrument'
142 |   datadog_config(log)
143 | rescue LoadError
144 |   log.info "Datadog's tracing client not found, skipping..."
145 | end
146 | 
147 | Optica.set :logger, log
148 | Optica.set :store, store
149 | Optica.set :events, events
150 | Optica.set :ip_check, ip_check
151 | Optica.set :split_mode, opts['split_mode']
152 | 
153 | # start the app
154 | log.info "Starting sinatra server..."
155 | run Optica
--------------------------------------------------------------------------------
/events_rmq.rb:
--------------------------------------------------------------------------------
1 | require 'stomp'
2 | require 'oj'
3 | 
4 | class EventsRabbitMQ
5 |   def initialize(opts)
6 |     @log = opts['log']
7 | 
8 |     %w{rabbit_host rabbit_port}.each do |req|
9 |       raise ArgumentError, "missing required argument '#{req}'" unless opts[req]
10 |     end
11 | 
12 |     @connect_hash = {
13 |       :hosts => [{
14 |         :host => opts['rabbit_host'],
15 |         :port => opts['rabbit_port'],
16 |         :login => opts['rabbit_user'] || 'guest',
17 |         :passcode => opts['rabbit_pass'] || 'guest',
18 |       }],
19 |       :reliable => true,
20 |       :autoflush => true,
21 |       :connect_timeout => 10,
22 |       :logger => @log,
23 |     }
24 | 
25 |     @exchange_name = opts['exchange_name'] || 'ops'
26 |     @routing = opts['routing'] || 'events.node.converged'
27 |     @health_routing = opts['health_routing'] || 'checks.optica'
28 |   end
29 | 
30 |   def name
31 |     'rabbitmq'
32 |   end
33 | 
34 |   def start
35 |     @client = Stomp::Client.new(@connect_hash)
36 |   end
37 | 
38 |   def send(data)
39 |     @client.publish("/exchange/#{@exchange_name}/#{@routing}", Oj.dump(data), {:persistent => true})
40 |   rescue Exception => e
41 |     @log.error "unexpected error publishing to rabbitmq: #{e.inspect}"
42 |     raise e
43 |   else
44 |     @log.debug "published an event to #{@routing}"
45 |   end
46 | 
47 |   def healthy?
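    # exercise the full publish path: push an empty message to the dedicated
    # health routing key, so any broker or connection failure surfaces here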
48 | @client.publish("/exchange/#{@exchange_name}/#{@health_routing}", '') 49 | rescue StandardError => e 50 | @log.error "events interface failed health check: #{e.inspect}" 51 | false 52 | else 53 | @log.debug "events interface for RabbitMQ healthy" 54 | true 55 | end 56 | 57 | def stop 58 | @log.warn "stopping the events interface" 59 | Process.kill("TERM", Process.pid) unless $EXIT 60 | @client.close if @client 61 | end 62 | end 63 | -------------------------------------------------------------------------------- /events_sqs.rb: -------------------------------------------------------------------------------- 1 | require 'aws-sdk-sqs' 2 | require 'oj' 3 | 4 | class EventsSQS 5 | def initialize(opts) 6 | @log = opts['log'] 7 | 8 | %w{sqs_region sqs_queue}.each do |req| 9 | raise ArgumentError, "missing required argument '#{req}'" unless opts[req] 10 | end 11 | 12 | @opts = { 13 | :region => opts['sqs_region'], 14 | :queue => opts['sqs_queue'], 15 | :logger => @log, 16 | } 17 | 18 | @message_group_id = opts['routing'] || 'events.node.converged' 19 | @health_message_group_id = opts['health_routing'] || 'checks.optica' 20 | end 21 | 22 | def name 23 | 'sqs' 24 | end 25 | 26 | def start 27 | @sqs = Aws::SQS::Client.new(region: @opts[:region], logger: @opts[:logger]) 28 | resp = @sqs.get_queue_url(queue_name: @opts[:queue]) 29 | @opts[:queue_url] = resp.queue_url 30 | end 31 | 32 | def send(data) 33 | @sqs.send_message(queue_url: @opts[:queue_url], 34 | message_group_id: @message_group_id, message_body: Oj.dump(data)) 35 | @log.debug "published an event to #{@opts[:queue]}" 36 | rescue StandardError => e 37 | @log.error "unexpected error publishing to SQS: #{e.inspect}" 38 | raise e 39 | end 40 | 41 | def healthy? 42 | @sqs.send_message(queue_url: @opts[:queue_url], 43 | message_group_id: @health_message_group_id, message_body: '{}') 44 | @log.debug "events interface for SQS is healthy" 45 | true 46 | rescue StandardError => e 47 | @log.error "events interface for SQS failed health check: #{e.inspect}" 48 | false 49 | end 50 | 51 | def stop 52 | end 53 | end 54 | -------------------------------------------------------------------------------- /optica.rb: -------------------------------------------------------------------------------- 1 | require 'sinatra/base' 2 | require 'cgi' 3 | require 'oj' 4 | 5 | class Optica < Sinatra::Base 6 | before do 7 | env['rack.logger'] = settings.logger 8 | end 9 | 10 | configure :production, :development do 11 | enable :logging 12 | end 13 | 14 | get '/' do 15 | return get_nodes(request) 16 | end 17 | 18 | # endpoint for fab usage 19 | get '/roles' do 20 | fields_to_include = ['role', 'id', 'hostname'] 21 | params = CGI::parse(request.query_string) 22 | if params['_extra_fields'] 23 | values = params['_extra_fields'] 24 | # accept both _additional_fields[] and _additional_fields=1,2 syntax 25 | values.each do |value| 26 | fields_to_include += value.split(',') 27 | end 28 | end 29 | 30 | return get_nodes(request, fields_to_include) 31 | end 32 | 33 | get '/store' do 34 | content_type 'application/octet-stream' 35 | return settings.store.nodes_serialized 36 | end 37 | 38 | def get_nodes(request, fields_to_include=nil) 39 | params = CGI::parse(request.query_string).reject { |p| p[0] == '_' } 40 | 41 | # Optimization for some of the most expensive requests 42 | if fields_to_include.nil? && params.empty? 
&& settings.split_mode == 'server' 43 | content_type 'application/json', :charset => 'utf-8' 44 | return settings.store.nodes_serialized 45 | end 46 | 47 | # include only those nodes that match passed-in parameters 48 | examined = 0 49 | to_return = {} 50 | begin 51 | nodes = settings.store.lookup(params) 52 | rescue 53 | halt(503) 54 | end 55 | 56 | nodes.each do |node, properties| 57 | examined += 1 58 | included = true 59 | 60 | params.each do |param, values| 61 | values.each do |value| 62 | 63 | if not properties.include? param 64 | included = false 65 | elsif properties[param].nil? 66 | included = false 67 | elsif properties[param].is_a? String 68 | included = false unless properties[param].match value 69 | elsif properties[param].is_a? Array 70 | included = false unless properties[param].include? value 71 | elsif properties[param].class == TrueClass 72 | included = false unless ['true', 'True', '1'].include? value 73 | elsif properties[param].class == FalseClass 74 | included = false unless ['false', 'False', '0'].include? value 75 | end 76 | end 77 | end 78 | 79 | if included 80 | # return full list if fields_to_include is nil. Otherwise return only keys 81 | # listed in fields_to_include. Not using slice because not a rails app. 82 | to_return[node] = fields_to_include.nil? ? properties : properties 83 | .select { |key, _value| fields_to_include.include? key } 84 | end 85 | end 86 | 87 | content_type 'application/json', :charset => 'utf-8' 88 | result = {'examined'=>examined, 'returned'=>to_return.count, 'nodes'=>to_return} 89 | return Oj.dump(result) 90 | end 91 | 92 | post '/' do 93 | begin 94 | data = Oj.safe_load(request.body.read) 95 | rescue Oj::ParseError 96 | halt(400, 'invalid request payload') 97 | end 98 | 99 | ip = data['ip'] 100 | 101 | # check the node ip? disabled by false and nil 102 | case settings.ip_check 103 | when :direct 104 | halt(403) unless ip == request.ip 105 | when :forwarded_for 106 | header = env['HTTP_X_FORWARDED_FOR'] 107 | halt(500) unless header 108 | halt(403) unless ip == header.split(',').first 109 | end 110 | 111 | # update main store 112 | begin 113 | merged_data = settings.store.add(ip, data) 114 | rescue 115 | halt(500) 116 | end 117 | 118 | # publish update event 119 | message = 'stored' 120 | begin 121 | tags = [] 122 | event = merged_data.merge('event' => data) 123 | settings.events.each do |events| 124 | tags = ["events_queue:#{events.name}"] 125 | events.send(event) 126 | STATSD.increment('optica.events', :tags => tags + ['status:success']) 127 | end 128 | rescue => e 129 | STATSD.increment('optica.events', :tags => tags + ['status:failed']) unless tags.empty? 130 | # If event publishing failed, we treat it as a warning rather than an error. 131 | message += " -- [warning] failed to publish event: #{e.to_s}" 132 | end 133 | 134 | content_type 'text/plain', :charset => 'utf-8' 135 | 136 | return message 137 | end 138 | 139 | delete '/:id' do |id| 140 | matching = settings.store.nodes.select{ |k,v| v['id'] == id } 141 | if matching.length == 0 142 | return 204 143 | elsif matching.length == 1 144 | begin 145 | settings.store.delete(matching.flatten[1]['ip']) 146 | rescue 147 | halt(500) 148 | else 149 | return "deleted" 150 | end 151 | else 152 | return [409, "found multiple entries matching id #{id}"] 153 | end 154 | end 155 | 156 | get '/health' do 157 | if settings.store.healthy? and settings.events.all? { |events| events.healthy? 
} 158 | content_type 'text/plain', :charset => 'utf-8' 159 | return "OK" 160 | else 161 | return [503, 'not healthy!'] 162 | end 163 | end 164 | 165 | get '/ping' do 166 | content_type 'text/plain', :charset => 'utf-8' 167 | return "PONG" 168 | end 169 | end 170 | -------------------------------------------------------------------------------- /spec/optica_spec.rb: -------------------------------------------------------------------------------- 1 | require 'spec_helper' 2 | require './optica.rb' 3 | 4 | RSpec.describe Optica do 5 | def app 6 | Optica 7 | end 8 | 9 | before(:all) do 10 | Optica.set :logger, double('TestLogger') 11 | end 12 | 13 | describe '/ping' do 14 | it 'returns PONG' do 15 | get '/ping' 16 | expect(last_response.status).to eq(200) 17 | expect(last_response.body).to eq('PONG') 18 | end 19 | end 20 | 21 | describe '/' do 22 | let(:data) { 23 | { 24 | 'ip' => '127.0.0.1', 25 | 'id' => 'test', 26 | 'environment' => 'development', 27 | } 28 | } 29 | let(:object_data) { 30 | { 31 | 'ip' => '127.0.0.1', 32 | 'test' => Object.new, 33 | } 34 | } 35 | 36 | before(:all) do 37 | Optica.set :ip_check, :test 38 | end 39 | 40 | before(:each) do 41 | Optica.set :store, double('TestStore') 42 | event = double('TestEvent') 43 | allow(event).to receive(:name).and_return('TestEvent') 44 | Optica.set :events, [event] 45 | statsd = double('TestStatsd') 46 | allow(statsd).to receive(:increment) 47 | stub_const('STATSD', statsd) 48 | end 49 | 50 | it 'can post data' do 51 | expect(Optica.store).to receive(:add).with(data['ip'], data).and_return(data) 52 | expect(Optica.events.first).to receive(:send) 53 | post('/', Oj.dump(data), 'CONTENT_TYPE' => 'application/json') 54 | expect(last_response.status).to eq(200) 55 | expect(last_response.body).to eq('stored') 56 | end 57 | 58 | it 'does not load objects' do 59 | loaded_data = nil 60 | allow(Optica.store).to receive(:add) do |ip, data| 61 | loaded_data = data 62 | end 63 | post('/', Oj.dump(object_data), 'CONTENT_TYPE' => 'application/json') 64 | expect(loaded_data).not_to be_nil 65 | expect(loaded_data['test']).to be_a(Hash) 66 | end 67 | 68 | it 'rejects invalid JSON data' do 69 | post('/', '{', 'CONTENT_TYPE' => 'application/json') 70 | expect(last_response.status).to eq(400) 71 | end 72 | end 73 | end 74 | -------------------------------------------------------------------------------- /spec/spec_helper.rb: -------------------------------------------------------------------------------- 1 | # This file was generated by the `rspec --init` command. Conventionally, all 2 | # specs live under a `spec` directory, which RSpec adds to the `$LOAD_PATH`. 3 | # The generated `.rspec` file contains `--require spec_helper` which will cause 4 | # this file to always be loaded, without a need to explicitly require it in any 5 | # files. 6 | # 7 | # Given that it is always loaded, you are encouraged to keep this file as 8 | # light-weight as possible. Requiring heavyweight dependencies from this file 9 | # will add to the boot time of your test suite on EVERY test run, even for an 10 | # individual file that may not need all of that loaded. Instead, consider making 11 | # a separate helper file that requires the additional dependencies and performs 12 | # the additional setup, and require it from the spec files that actually need 13 | # it. 14 | # 15 | # See https://rubydoc.info/gems/rspec-core/RSpec/Core/Configuration 16 | 17 | require 'rspec' 18 | require 'rack/test' 19 | 20 | RSpec.configure do |config| 21 | # rspec-expectations config goes here. 
You can use an alternate 22 | # assertion/expectation library such as wrong or the stdlib/minitest 23 | # assertions if you prefer. 24 | config.expect_with :rspec do |expectations| 25 | # This option will default to `true` in RSpec 4. It makes the `description` 26 | # and `failure_message` of custom matchers include text for helper methods 27 | # defined using `chain`, e.g.: 28 | # be_bigger_than(2).and_smaller_than(4).description 29 | # # => "be bigger than 2 and smaller than 4" 30 | # ...rather than: 31 | # # => "be bigger than 2" 32 | expectations.include_chain_clauses_in_custom_matcher_descriptions = true 33 | end 34 | 35 | # rspec-mocks config goes here. You can use an alternate test double 36 | # library (such as bogus or mocha) by changing the `mock_with` option here. 37 | config.mock_with :rspec do |mocks| 38 | # Prevents you from mocking or stubbing a method that does not exist on 39 | # a real object. This is generally recommended, and will default to 40 | # `true` in RSpec 4. 41 | mocks.verify_partial_doubles = true 42 | end 43 | 44 | # This option will default to `:apply_to_host_groups` in RSpec 4 (and will 45 | # have no way to turn it off -- the option exists only for backwards 46 | # compatibility in RSpec 3). It causes shared context metadata to be 47 | # inherited by the metadata hash of host groups and examples, rather than 48 | # triggering implicit auto-inclusion in groups with matching metadata. 49 | config.shared_context_metadata_behavior = :apply_to_host_groups 50 | 51 | # This allows you to limit a spec run to individual examples or groups 52 | # you care about by tagging them with `:focus` metadata. When nothing 53 | # is tagged with `:focus`, all examples get run. RSpec also provides 54 | # aliases for `it`, `describe`, and `context` that include `:focus` 55 | # metadata: `fit`, `fdescribe` and `fcontext`, respectively. 56 | config.filter_run_when_matching :focus 57 | 58 | # Allows RSpec to persist some state between runs in order to support 59 | # the `--only-failures` and `--next-failure` CLI options. We recommend 60 | # you configure your source control system to ignore this file. 61 | config.example_status_persistence_file_path = "spec/examples.txt" 62 | 63 | # Limits the available syntax to the non-monkey patched syntax that is 64 | # recommended. For more details, see: 65 | # https://relishapp.com/rspec/rspec-core/docs/configuration/zero-monkey-patching-mode 66 | config.disable_monkey_patching! 67 | 68 | # This setting enables warnings. It's recommended, but in some cases may 69 | # be too noisy due to issues in dependencies. 70 | config.warnings = true 71 | 72 | # Many RSpec users commonly either run the entire suite or an individual 73 | # file, and it's useful to allow more verbose output when running an 74 | # individual spec file. 75 | if config.files_to_run.one? 76 | # Use the documentation formatter for detailed output, 77 | # unless a formatter has already been configured 78 | # (e.g. via a command-line flag). 79 | config.default_formatter = "doc" 80 | end 81 | 82 | # Print the 10 slowest examples and example groups at the 83 | # end of the spec run, to help surface which specs are running 84 | # particularly slow. 85 | config.profile_examples = 10 86 | 87 | # Run specs in random order to surface order dependencies. If you find an 88 | # order dependency and want to debug it, you can fix the order by providing 89 | # the seed, which is printed after each run. 
90 | # --seed 1234
91 | config.order = :random
92 | 
93 | # Seed global randomization in this process using the `--seed` CLI option.
94 | # Setting this allows you to use `--seed` to deterministically reproduce
95 | # test failures related to randomization by passing the same `--seed` value
96 | # as the one that triggered the failure.
97 | Kernel.srand config.seed
98 | 
99 | # Include testing API for Rack apps.
100 | config.include Rack::Test::Methods
101 | end
--------------------------------------------------------------------------------
/store.rb:
--------------------------------------------------------------------------------
1 | require 'zk'
2 | require 'oj'
3 | require 'hash_deep_merge'
4 | require 'open-uri'
5 | 
6 | class Store
7 | 
8 |   attr_reader :ips
9 | 
10 |   DEFAULT_CACHE_STALE_AGE = 0
11 |   DEFAULT_SPLIT_MODE = "disabled"
12 |   DEFAULT_STORE_PORT = 8001
13 |   DEFAULT_HTTP_TIMEOUT = 30
14 |   DEFAULT_HTTP_RETRY_DELAY = 5
15 | 
16 |   def initialize(opts)
17 |     @log = opts['log']
18 |     @index_fields = opts['index_fields'].to_s.split(/,\s*/)
19 | 
20 |     @opts = {
21 |       'split_mode' => DEFAULT_SPLIT_MODE,
22 |       'split_mode_store_port' => DEFAULT_STORE_PORT,
23 |       'split_mode_retry_delay' => DEFAULT_HTTP_RETRY_DELAY,
24 |       'split_mode_http_timeout' => DEFAULT_HTTP_TIMEOUT,
25 |     }.merge(opts)
26 | 
27 |     unless opts['zk_path']
28 |       raise ArgumentError, "missing required argument 'zk_path'"
29 |     else
30 |       @path = opts['zk_path']
31 |     end
32 | 
33 |     @zk = nil
34 |     setup_cache(opts)
35 |   end
36 | 
37 |   def setup_cache(opts)
38 |     # We use a daemon that refreshes the cache every N (tunable)
39 |     # seconds. In addition, we subscribe to all children joining/leaving
40 |     # events. This is less frequent because normally no one would constantly
41 |     # add/remove machines. So whenever a join/leave event happens, we immediately
42 |     # refresh the cache. This way we guarantee that whenever we add/remove
43 |     # machines, the cache will always have the right set of machines.
44 | 
45 |     @cache_enabled = !!opts['cache_enabled']
46 |     @cache_stale_age = opts['cache_stale_age'] || DEFAULT_CACHE_STALE_AGE
47 | 
48 |     # zk watcher for node joins/leaves
49 |     @cache_root_watcher = nil
50 | 
51 |     # mutex for atomically updating cached results
52 |     @cache_results_serialized = nil
53 |     @cache_results = {}
54 |     @cache_indices = {}
55 |     @cache_mutex = Mutex.new
56 | 
57 |     # daemon that'll fetch from zk periodically
58 |     @cache_fetch_thread = nil
59 |     # flag that controls if fetch daemon should run
60 |     @cache_fetch_thread_should_run = false
61 |     # how long we serve cached data
62 |     @cache_fetch_base_interval = (opts['cache_fetch_interval'] || 20).to_i
63 |     @cache_fetch_interval = @cache_fetch_base_interval
64 | 
65 |     # timestamp that prevents setting cache result with stale data
66 |     @cache_results_last_fetched_time = 0
67 |   end
68 | 
69 |   def start()
70 |     @log.info "waiting to connect to zookeeper at #{@path}"
71 |     @zk = ZK.new(@path)
72 | 
73 |     @zk.on_state_change do |event|
74 |       @log.info "zk state changed, state=#{@zk.state}, session_id=#{session_id}"
75 |     end
76 | 
77 |     @zk.ping?
78 |     @log.info "ZK connection established successfully. session_id=#{session_id}"
79 | 
80 |     # We have to re-add all watchers and refresh the cache if we reconnect to a new server.
81 |     @zk.on_connected do |event|
82 |       @log.info "ZK connection re-established. session_id=#{session_id}"
83 | 
84 |       if @cache_enabled
85 |         @log.info "Resetting watchers and re-syncing cache. 
session_id=#{session_id}" 86 | setup_watchers 87 | reload_instances 88 | end 89 | end 90 | 91 | if @cache_enabled 92 | setup_watchers 93 | reload_instances 94 | start_fetch_thread 95 | end 96 | end 97 | 98 | def session_id 99 | '0x%x' % @zk.session_id rescue nil 100 | end 101 | 102 | def stop_cache_related() 103 | @cache_root_watcher.unsubscribe if @cache_root_watcher 104 | @cache_root_watcher = nil 105 | @cache_fetch_thread_should_run = false 106 | @cache_fetch_thread.join if @cache_fetch_thread 107 | @cache_fetch_thread = nil 108 | end 109 | 110 | def stop() 111 | @log.warn "stopping the store" 112 | stop_cache_related 113 | @zk.close() if @zk 114 | @zk = nil 115 | end 116 | 117 | # get instances for a given service 118 | def nodes() 119 | STATSD.time('optica.store.get_nodes') do 120 | unless @cache_enabled 121 | inst, idx = load_instances_from_zk 122 | return inst 123 | end 124 | 125 | check_cache_age 126 | @cache_results 127 | end 128 | end 129 | 130 | def nodes_serialized 131 | @cache_results_serialized 132 | end 133 | 134 | def lookup(params) 135 | if @opts['split_mode'] != 'server' || !@cache_enabled 136 | return nodes 137 | end 138 | 139 | STATSD.time('optica.store.lookup') do 140 | 141 | # Find all suitable indices and their cardinalities 142 | cardinalities = params.reduce({}) do |res, (key, _)| 143 | res[key] = @cache_indices[key].length if @cache_indices.key? key 144 | res 145 | end 146 | 147 | unless cardinalities.empty? 148 | # Find best suitable index 149 | best_key = cardinalities.sort_by {|k,v| v}.first.first 150 | best_idx = @cache_indices.fetch(best_key, {}) 151 | 152 | # Check if index saves enough cycles, otherwise fall back to full cache 153 | if @cache_results.length > 0 && best_idx.length.to_f / @cache_results.length.to_f > 0.5 154 | return nodes 155 | end 156 | 157 | return nodes_from_index(best_idx, params[best_key]) 158 | end 159 | 160 | return nodes 161 | end 162 | end 163 | 164 | def load_instances 165 | STATSD.time('optica.store.load_instances') do 166 | @opts['split_mode'] == 'server' ? 167 | load_instances_from_leader : 168 | load_instances_from_zk 169 | end 170 | end 171 | 172 | def load_instances_from_leader 173 | begin 174 | uri = "http://localhost:%d/store" % @opts['split_mode_store_port'] 175 | res = open(uri, :read_timeout => @opts['split_mode_http_timeout']) 176 | 177 | remote_store = Oj.safe_load(res.read) 178 | [ remote_store['inst'], remote_store['idx'] ] 179 | rescue OpenURI::HTTPError, Errno::ECONNREFUSED, Net::ReadTimeout => e 180 | @log.error "Error loading store from #{uri}: #{e.inspect}; will retry after #{@opts['split_mode_retry_delay']}" 181 | 182 | sleep @opts['split_mode_retry_delay'] 183 | retry 184 | end 185 | end 186 | 187 | def load_instances_from_zk() 188 | @log.info "Reading instances from zk:" 189 | 190 | inst = {} 191 | idx = {} 192 | 193 | begin 194 | @zk.children('/', :watch => true).each do |child| 195 | node = get_node("/#{child}") 196 | update_nodes child, node, inst, idx 197 | end 198 | rescue Exception => e 199 | # ZK client library caches DNS names of ZK nodes and it resets the 200 | # cache only when the client object is initialized, or set_servers 201 | # method is called. Set_servers is not exposed in ruby library, so 202 | # we force re-init the underlying client object here to make sure 203 | # we always connect to the current IP addresses. 204 | @zk.reopen 205 | 206 | @log.error "unexpected error reading from zk! 
#{e.inspect}" 207 | raise e 208 | end 209 | 210 | [inst, idx] 211 | end 212 | 213 | def add(node, data) 214 | child = "/#{node}" 215 | 216 | # deep-merge the old and new data 217 | prev_data = get_node(child) 218 | new_data = prev_data.deep_merge(data) 219 | json_data = Oj.dump(new_data) 220 | 221 | @log.debug "writing to zk at #{child} with #{json_data}" 222 | 223 | begin 224 | STATSD.time('optica.zookeeper.set') do 225 | @zk.set(child, json_data) 226 | end 227 | new_data 228 | rescue ZK::Exceptions::NoNode => e 229 | STATSD.time('optica.zookeeper.create') do 230 | @zk.create(child, :data => json_data) 231 | end 232 | new_data 233 | rescue Exception => e 234 | @zk.reopen 235 | 236 | @log.error "unexpected error writing to zk! #{e.inspect}" 237 | raise e 238 | end 239 | end 240 | 241 | def delete(node) 242 | @log.info "deleting node #{node}" 243 | 244 | begin 245 | STATSD.time('optica.zookeeper.delete') do 246 | @zk.delete("/" + node, :ignore => :no_node) 247 | end 248 | rescue Exception => e 249 | @zk.reopen 250 | 251 | @log.error "unexpected error deleting nodes in zk! #{e.inspect}" 252 | raise e 253 | end 254 | end 255 | 256 | def healthy?() 257 | healthy = true 258 | if $EXIT 259 | @log.warn 'not healthy because stopping...' 260 | healthy = false 261 | elsif not @zk 262 | @log.warn 'not healthy because no zookeeper...' 263 | healthy = false 264 | elsif not @zk.connected? 265 | @log.warn 'not healthy because zookeeper not connected...' 266 | healthy = false 267 | end 268 | return healthy 269 | end 270 | 271 | private 272 | 273 | def nodes_from_index(idx, values) 274 | matching_keys = [] 275 | 276 | # To preserve original optica behavior we have to validate all keys 277 | # against standard rules 278 | values.each do |val| 279 | keys = idx.keys.select do |key| 280 | matched = true 281 | if key.is_a? String 282 | matched = false unless key.match val 283 | elsif key.is_a? Array 284 | matched = false unless key.include? val 285 | elsif key.class == TrueClass 286 | matched = false unless ['true', 'True', '1'].include? val 287 | elsif key.class == FalseClass 288 | matched = false unless ['false', 'False', '0'].include? val 289 | end 290 | matched 291 | end 292 | matching_keys << keys 293 | end 294 | 295 | if matching_keys.length == 1 296 | matching_keys = matching_keys.first 297 | elsif matching_keys.length > 1 298 | matching_keys = matching_keys.inject(:&) 299 | end 300 | 301 | matching_keys.reduce({}) do |res, key| 302 | res.merge idx.fetch(key, {}) 303 | end 304 | end 305 | 306 | def update_nodes(node_name, node, inst, idx) 307 | inst[node_name] = node 308 | 309 | @index_fields.each do |key| 310 | if node.key?(key) && !node[key].nil? 311 | val = node[key] 312 | idx[key] ||= {} 313 | idx[key][val] ||= {} 314 | idx[key][val][node_name] = node 315 | end 316 | end 317 | end 318 | 319 | def get_node(node) 320 | begin 321 | data, stat = STATSD.time('optica.zookeeper.get') do 322 | @zk.get(node) 323 | end 324 | STATSD.time('optica.json.parse') do 325 | Oj.safe_load(data) 326 | end 327 | rescue ZK::Exceptions::NoNode 328 | @log.info "node #{node} disappeared" 329 | {} 330 | rescue Oj::ParseError 331 | @log.warn "removing invalid node #{node}: data failed to parse (#{data.inspect})" 332 | delete(node) 333 | {} 334 | rescue Exception => e 335 | @zk.reopen 336 | 337 | @log.error "unexpected error reading from zk! #{e.inspect}" 338 | raise e 339 | end 340 | end 341 | 342 | # immediately update cache if node joins/leaves 343 | def setup_watchers 344 | return if @zk.nil? 
|| @opts['split_mode'] == 'server' 345 | 346 | @cache_root_watcher = @zk.register("/", :only => :child) do |event| 347 | @log.info "Children added/deleted" 348 | reload_instances 349 | end 350 | end 351 | 352 | def check_cache_age 353 | return unless @cache_enabled 354 | 355 | cache_age = Time.new.to_i - @cache_results_last_fetched_time.to_i 356 | STATSD.gauge 'optica.store.cache.age', cache_age 357 | 358 | if @cache_stale_age > 0 && cache_age > @cache_stale_age 359 | msg = "cache age exceeds threshold: #{cache_age} > #{@cache_stale_age}" 360 | 361 | @log.error msg 362 | raise msg 363 | end 364 | end 365 | 366 | def reload_instances() 367 | 368 | return unless @cache_mutex.try_lock 369 | 370 | begin 371 | now = Time.now.to_i 372 | 373 | if now > @cache_results_last_fetched_time + @cache_fetch_interval 374 | inst, idx = load_instances 375 | 376 | @cache_results = inst.freeze 377 | @cache_indices = idx.freeze 378 | 379 | case @opts['split_mode'] 380 | when 'store' 381 | new_store = { 382 | 'inst' => @cache_results, 383 | 'idx' => @cache_indices, 384 | } 385 | @cache_results_serialized = Oj.dump new_store 386 | when 'server' 387 | new_store = { 388 | 'examined' => 0, 389 | 'returned' => @cache_results.length, 390 | 'nodes' => @cache_results, 391 | } 392 | @cache_results_serialized = Oj.dump new_store 393 | end 394 | 395 | 396 | @cache_results_last_fetched_time = now 397 | update_cache_fetch_interval 398 | 399 | @log.info "reloaded cache. new reload interval = #{@cache_fetch_interval}" 400 | end 401 | ensure 402 | @cache_mutex.unlock 403 | end 404 | end 405 | 406 | def update_cache_fetch_interval 407 | @cache_fetch_interval = @cache_fetch_base_interval + rand(0..20) 408 | end 409 | 410 | def start_fetch_thread() 411 | @cache_fetch_thread_should_run = true 412 | @cache_fetch_thread = Thread.new do 413 | while @cache_fetch_thread_should_run do 414 | begin 415 | sleep(@cache_fetch_interval) rescue nil 416 | source = @opts['split_mode'] == 'server' ? 'remote store' : 'zookeeper' 417 | @log.info "Cache fetch thread now fetches from #{source}..." 418 | reload_instances rescue nil 419 | check_cache_age 420 | rescue => ex 421 | @log.warn "Caught exception in cache fetch thread: #{ex} #{ex.backtrace}" 422 | end 423 | end 424 | end 425 | end 426 | end 427 | -------------------------------------------------------------------------------- /tools/copy.rb: -------------------------------------------------------------------------------- 1 | #!/usr/bin/ruby 2 | 3 | # this script might come in handy for moving optica data 4 | # from one ZK cluster to another 5 | 6 | require 'zk' 7 | 8 | dest_zk = 'new-zk:2181/optica' 9 | source_zk = 'old-zk:2181/optica' 10 | 11 | puts "connecting to dest" 12 | dest = ZK.new(dest_zk) 13 | dest.ping? 14 | 15 | puts "connecting to source" 16 | source = ZK.new(source_zk) 17 | source.ping? 
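# walk every child znode under the source root and upsert it at the
# destination: try a plain set first, and create the znode if it's missing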
18 | 
19 | source.children('/').each do |child|
20 |   child = "/#{child}"
21 | 
22 |   puts "reading #{child}"
23 |   data, stat = source.get(child)
24 | 
25 |   begin
26 |     dest.set(child, data)
27 |   rescue ZK::Exceptions::NoNode => e
28 |     dest.create(child, :data => data)
29 |   end
30 |   puts "wrote #{child}"
31 | end
--------------------------------------------------------------------------------
/tools/fabfile.py:
--------------------------------------------------------------------------------
1 | from fabric.api import env, run, sudo, put
2 | 
3 | import collections
4 | import json
5 | import requests
6 | 
7 | response = requests.get('http://optica.example:8080/roles')
8 | hosts = json.loads(response.text)
9 | 
10 | # fill the role list
11 | env.roledefs = collections.defaultdict(lambda: [])
12 | for hostinfo in hosts['nodes'].values():
13 |     env.roledefs[hostinfo['role']].append(hostinfo['hostname'])
14 | 
15 | # show the role list if no role selected
16 | if not env.roles:
17 |     print "Available roles:\n"
18 |     for role in sorted(env.roledefs.keys()):
19 |         count = len(env.roledefs[role])
20 |         print "  %-30s %3d machine%s" % (role, count, "s" if count > 1 else "")
21 |     print ""
22 | 
23 | def uptime():
24 |     """Check the uptime on a node"""
25 |     run('uptime')
26 | 
27 | def restart_service(service_name):
28 |     """Restart a specified service (e.g. `fab restart_service:nginx`)"""
29 |     sudo('service %s restart' % service_name)
--------------------------------------------------------------------------------
/tools/optica.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -o xtrace
3 | 
4 | my_optica_host='https://optica.example.com'
5 | curl --silent ${my_optica_host}/?"$1" | jq --compact-output ".nodes[] | $2"
--------------------------------------------------------------------------------
/tools/reporter.rb:
--------------------------------------------------------------------------------
1 | 
2 | require 'chef/handler'
3 | require 'net/http'
4 | require 'fileutils'
5 | require 'json'
6 | require 'tempfile'
7 | 
8 | module Optica
9 |   class Reporter < Chef::Handler
10 | 
11 |     MAX_TRIES = 4
12 |     SAVED_REPORTS_DIR = '/tmp/failed_optica_reports'
13 |     SAVED_REPORTS_PREFIX = 'optica_report'
14 | 
15 |     def report
16 |       data = {}
17 | 
18 |       # include self-reported attributes if present
19 |       report = safe_node_attr(['optica', 'report'])
20 |       report.each do |k, v|
21 |         data[k] = v
22 |       end if report
23 | 
24 |       # include run info
25 |       data['failed'] = run_status.failed?
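      # run_status is provided by Chef::Handler and exposes the outcome and
      # timing of the converge that just finished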
26 | data['last_start'] = run_status.start_time 27 | data['last_reported'] = Time.now 28 | 29 | # include some node data (but fail gracefully) 30 | data['branch'] = safe_node_attr(['branch']) 31 | data['roles'] = safe_node_attr(['roles']) 32 | data['recipes'] = safe_node_attr(['recipes']) 33 | data['synapse_services'] = safe_node_attr(['synapse', 'enabled_services']) 34 | data['nerve_services'] = safe_node_attr(['nerve', 'enabled_services']) 35 | data['ownership'] = safe_node_attr(['ownership']) 36 | 37 | # ip is needed by optica for all reports 38 | data['ip'] = safe_node_attr(['ipaddress']) 39 | 40 | data['environment'] = safe_node_attr(['env']) 41 | data['role'] = safe_node_attr(['role']) 42 | data['id'] = safe_node_attr(['ec2', 'instance_id']) 43 | data['hostname'] = safe_node_attr(['hostname']) 44 | data['uptime'] = safe_node_attr(['uptime_seconds']) 45 | data['public_hostname'] = safe_node_attr(['ec2', 'public_hostname']) 46 | data['public_ip'] = safe_node_attr(['ec2', 'public_ipv4']) 47 | data['az'] = safe_node_attr(['ec2', 'placement_availability_zone']) 48 | data['security_groups'] = safe_node_attr(['ec2', 'security_groups']) 49 | data['instance_type'] = safe_node_attr(['ec2', 'instance_type']) 50 | data['ami_id'] = safe_node_attr(['ec2', 'ami_id']) 51 | data['intended_branch'] = File.read('/etc/chef/branch').strip 52 | 53 | converge_reason = ENV['CONVERGE_REASON'] 54 | data['converge_reason'] = converge_reason unless converge_reason.nil? 55 | 56 | converger = ENV['IDENTITY'] 57 | data['converger'] = converger unless converger.nil? 58 | 59 | # report the data 60 | Chef::Log.info "Sending run data to optica" 61 | tries = 0 62 | begin 63 | tries += 1 64 | 65 | connection = Net::HTTP.new('optica.d.musta.ch', 443) 66 | connection.use_ssl = true 67 | result = connection.post('/', data.to_json) 68 | 69 | if result.code.to_i >= 200 and result.code.to_i < 300 70 | Chef::Log.info "SUCCESS: optica replied: '#{result.body}'" 71 | else 72 | raise StandardError.new("optica replied with: #{result.code}:#{result.body}") 73 | end 74 | rescue => e 75 | if tries < MAX_TRIES 76 | Chef::Log.info "FAILED: error reporting to optica from #{data['hostname']} (#{e.message}); trying again..." 77 | sleep 2 ** tries 78 | retry 79 | end 80 | 81 | Chef::Log.error "FAILED: error reporting to optica from #{data['hostname']}: #{e.message} #{e.backtrace}" 82 | save_report(data) 83 | else 84 | delete_old_reports 85 | end 86 | end 87 | 88 | def safe_node_attr(attr_list) 89 | return nil unless attr_list.is_a? 
Array 90 | 91 | obj = run_status.node 92 | processed = ['node'] 93 | 94 | attr_list.each do |a| 95 | begin 96 | obj = obj[a] 97 | rescue 98 | Chef::Log.info "Failed to get attribute #{a} from #{processed.join('.')}" 99 | return nil 100 | else 101 | processed << a 102 | end 103 | end 104 | 105 | return obj 106 | end 107 | 108 | def save_report(data) 109 | FileUtils.mkdir_p(SAVED_REPORTS_DIR) 110 | 111 | filename = File.join(SAVED_REPORTS_DIR, SAVED_REPORTS_PREFIX + Time.now.to_i.to_s) 112 | File.open(filename, 'w') do |sr| 113 | sr.write(data.to_json) 114 | end 115 | 116 | Chef::Log.info "Optica backup report written to #{filename}" 117 | end 118 | 119 | def delete_old_reports 120 | to_delete = Dir.glob(File.join(SAVED_REPORTS_DIR, SAVED_REPORTS_PREFIX + "*")) 121 | unless to_delete.length == 0 122 | Chef::Log.info "Deleting #{to_delete.length} unsent optica reports" 123 | to_delete.each { |old_report| File.delete(old_report) } 124 | end 125 | end 126 | end 127 | end 128 | --------------------------------------------------------------------------------
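For a quick end-to-end check of the pieces above, here is a minimal manual session against a running optica install. This is only a sketch: `optica.example.com` stands in for your real endpoint, the `ip`/`id`/`role` values are illustrative, and the POST assumes `ip_check` is disabled or that the claimed `ip` matches your source address:

```bash
# register (or update) a node; optica deep-merges repeated POSTs for the same IP
curl -X POST https://optica.example.com/ \
     -d '{"ip": "10.0.0.12", "id": "i-36428351", "role": "myrole", "hostname": "node1"}'

# query it back, filtering server-side by role and client-side with jq
curl --silent 'https://optica.example.com/?role=myrole' | jq '.nodes[].hostname'

# deregister by instance id once the machine is terminated
curl -X DELETE https://optica.example.com/i-36428351
```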