├── .gitignore ├── Gemfile ├── LICENSE.txt ├── README.md ├── Rakefile ├── bin └── vector ├── lib ├── vector.rb └── vector │ ├── cli.rb │ ├── functions │ ├── flexible_down_scaling.rb │ └── predictive_scaling.rb │ └── version.rb └── vector.gemspec /.gitignore: -------------------------------------------------------------------------------- 1 | *.gem 2 | *.rbc 3 | .bundle 4 | .config 5 | .yardoc 6 | Gemfile.lock 7 | InstalledFiles 8 | _yardoc 9 | coverage 10 | doc/ 11 | lib/bundler/man 12 | pkg 13 | rdoc 14 | spec/reports 15 | test/tmp 16 | test/version_tmp 17 | tmp 18 | -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | source 'https://rubygems.org' 2 | 3 | # Specify your gem's dependencies in vector.gemspec 4 | gemspec 5 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013 Instructure, Inc 2 | 3 | MIT License 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining 6 | a copy of this software and associated documentation files (the 7 | "Software"), to deal in the Software without restriction, including 8 | without limitation the rights to use, copy, modify, merge, publish, 9 | distribute, sublicense, and/or sell copies of the Software, and to 10 | permit persons to whom the Software is furnished to do so, subject to 11 | the following conditions: 12 | 13 | The above copyright notice and this permission notice shall be 14 | included in all copies or substantial portions of the Software. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 19 | NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 20 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 21 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 22 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Vector 2 | 3 | Vector is a tool that augments your auto-scaling groups. The two 4 | features currently offered are Predictive Scaling and Flexible Down 5 | Scaling. 6 | 7 | ## Predictive scaling 8 | 9 | Auto Scaling groups do a good job of responding to current 10 | load conditions, but if you have a predictable load pattern, 11 | it can be nice to scale up your servers a little bit *early*. 12 | Some reasons you might want to do that are: 13 | 14 | * If it takes several minutes for an instance to fully boot 15 | and ready itself for requests. 16 | * If you have very serious (but predictable) spikes, 17 | it's nice to have the capacity in place before the spike 18 | starts. 19 | * To give yourself a buffer of time if AWS APIs start 20 | throwing errors. If scaling up is going to fail, you'd 21 | rather it start failing with a little bit of time before 22 | you actually need the capacity so you can begin evasive maneuvers. 23 | 24 | Vector examines your existing CloudWatch alarms tied to your Auto 25 | Scaling groups, and predicts if they will be triggered in the future 26 | based on what happened in the past. 27 | 28 | **Note:** This only works with metrics that are averaged across your group - 29 | like CPUUtilization or Load. If you auto-scale based on something 30 | like QueueLength, Predictive Scaling will not work right for you. 
31 | 32 | For each lookback window you specify, Vector will first check the 33 | current value of the metric * the number of nodes, and the past value of 34 | the metric * the past number of nodes. If those numbers are close enough 35 | (within the threshold specified by `--ps-valid-threshold`), then it will 36 | continue. 37 | 38 | Vector will then go back to the lookback window specified, and then 39 | forward in time based on the lookahead window (`--ps-lookahead-window`). 40 | It will compute the metric * number of nodes at that time to get a predicted 41 | aggregate metric value for the near future. It then divides that by 42 | the current number of nodes to get a predicted average value for the 43 | metric. That is then compared against the alarm's threshold. 44 | 45 | For example: 46 | 47 | > You have an alarm that checks CPUUtilization of your group, and will 48 | > trigger the alarm if that goes above 70%. Vector is configured to use a 49 | > 1 week lookback window, a 1 hour lookahead window, and a valid-threshold 50 | > of 0.8. 51 | > 52 | > The current value of CPUUtilization is 49%, and there are 2 nodes in the 53 | > group. CPUUtilization 1 week ago was 53%, and there were 2 nodes in the 54 | > group. Therefore, total current CPUUtilization is 98%, and 1 week ago was 55 | > 106%. Those are within 80% of each other (valid-threshold), so we can 56 | > continue with the prediction. 57 | > 58 | > The value of CPUUtilization 1 week ago *plus* 1 hour was 45%, and 59 | > there were 4 nodes in the group. We calculate total CPUUtilization for 60 | > that time to be 180%. Assuming no new nodes are launched, the predicted 61 | > average CPUUtilization for the group 1 hour from now is 180% / 2 = 90%. 62 | > 90% is above the alarm's 70% threshold, so we trigger the scaleup 63 | > policy.
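
The arithmetic in the example above can be sketched in a few lines of Ruby. This is just an illustration with the example's numbers hard-coded, not Vector's actual code path (though `within_threshold?` mirrors `Vector.within_threshold` from lib/vector.rb):

```ruby
# Illustrative sketch of the Predictive Scaling check, using the
# numbers from the example above.

# "Close enough" check: each value must be within threshold * the other
# (mirrors Vector.within_threshold in lib/vector.rb).
def within_threshold?(threshold, v1, v2)
  threshold * v1 < v2 && threshold * v2 < v1
end

valid_threshold = 0.8
alarm_threshold = 70.0            # alarm fires above 70% CPUUtilization

now_avg,  now_nodes  = 49.0, 2    # current average CPU and group size
then_avg, then_nodes = 53.0, 2    # 1 week ago (lookback window)
past_avg, past_nodes = 45.0, 4    # 1 week ago + 1 hour (lookahead)

now_load  = now_avg * now_nodes   # 98.0  (total current utilization)
then_load = then_avg * then_nodes # 106.0 (total past utilization)

if within_threshold?(valid_threshold, now_load, then_load)
  predicted_avg = (past_avg * past_nodes) / now_nodes # 180.0 / 2 = 90.0
  puts "trigger scaleup" if predicted_avg > alarm_threshold
end
```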
64 | 65 | 66 | If you use Predictive Scaling, you probably also want to use Flexible 67 | Down Scaling (below) so that after scaling up in anticipation of load, 68 | your scaledown policy doesn't quickly undo Vector's hard work. You 69 | probably want to set `up-to-down-cooldown` to be close to the size of 70 | your `lookahead-window`. 71 | 72 | ### Timezones 73 | 74 | If you specify a timezone (either explicitly or via the system 75 | timezone), Vector will use DST-aware time calculations when evaluating 76 | lookback windows. If you don't specify a timezone and your system time 77 | is UTC, then 8AM on Monday morning after DST begins will look back 168 78 | hours - which is 7AM on the previous Monday. Predictive scaling would 79 | be off by one hour for a whole week in that case. 80 | 81 | ## Flexible Down Scaling 82 | 83 | ### Different Cooldown Periods 84 | 85 | Auto Scaling Groups support the concept of "cooldown periods" - a window 86 | of time after a scaling activity during which no other activities should take 87 | place. This gives the group a chance to settle into the new 88 | configuration before deciding whether another action is required. 89 | 90 | However, Auto Scaling Groups only support specifying the cooldown period 91 | *after* a certain activity - you can say "After a scale up, wait 5 92 | minutes before doing anything else, and after a scale down, wait 15 93 | minutes." What you can't do is say "After a scale up, wait 5 minutes for 94 | another scale up, and 40 minutes for a scale down." 95 | 96 | Vector lets you add custom up-to-down and down-to-down cooldown periods. 97 | You create your policies and alarms in your Auto Scaling Groups like 98 | normal, and then *disable* the alarms tied to the scale down policy. 99 | Then you tell Vector what cooldown periods to use, and it does the rest.
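
A minimal sketch of the custom cooldown check, assuming we already know the timestamps of the group's last scale-up and scale-down (in Vector the real logic lives in `FlexibleDownScaling#outside_cooldown_period`, which derives those timestamps from the GroupDesiredCapacity metric; the names and durations here are illustrative):

```ruby
# Illustrative sketch: a scaledown is allowed only when we are outside
# both the up-to-down and down-to-down cooldown windows. Timestamps are
# hypothetical; Vector derives them from CloudWatch metric history.
def outside_cooldowns?(now:, last_up:, last_down:,
                       up_down_cooldown:, down_down_cooldown:)
  return false if last_up && now - last_up < up_down_cooldown
  return false if last_down && now - last_down < down_down_cooldown
  true
end

now = Time.now
# Scaled up 10 minutes ago with a 40-minute up-to-down cooldown:
# a scaledown is still blocked.
outside_cooldowns?(now: now,
                   last_up: now - 10 * 60, last_down: nil,
                   up_down_cooldown: 40 * 60,
                   down_down_cooldown: 15 * 60) # => false
```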
100 | 101 | ### Multiple Alarms 102 | 103 | Another benefit of Flexible Down Scaling is the ability to specify 104 | multiple alarms for a scaling down policy and require *all* alarms to 105 | trigger before scaling down. With Vector, you can add multiple 106 | (disabled) alarms to a policy, and Vector will trigger the policy only 107 | when *all* of the alarms are in ALARM state. This lets you do something like 108 | "only scale down when CPU utilization is < 30% and there is not a 109 | backlog of requests on any instances". 110 | 111 | ### Max Sunk Cost 112 | 113 | Vector also lets you specify a "max sunk cost" when scaling down a node. 114 | Amazon bills in hourly increments, and you pay a full hour for every 115 | partial hour used, so you want your instances to terminate as close as possible to 116 | their hourly billing renewal (without going past it). 117 | 118 | For example, if you specify `--fds-max-sunk-cost 15m` and have two nodes 119 | in your group - 47 minutes and 32 minutes away from their hourly billing 120 | renewals - the group will not be scaled down. 121 | 122 | (You should make sure to run Vector on an interval smaller than this 123 | one, or else it's possible Vector may never find nodes eligible for 124 | scaledown, and will never scale down.) 125 | 126 | ### Variable Thresholds 127 | 128 | When deciding to scale down, a static CPU utilization threshold can be 129 | inefficient. For example, if there are 3 nodes running, and you have a 130 | minimum of 2, and the average CPU is 75%, removing 1 node would 131 | theoretically result in the remaining 2 nodes running at > 100%. However, 132 | with 20 nodes running at an average CPU of 75%, removing 1 node will 133 | only result in an average CPU of 79% across the remaining 19 nodes. 134 | 135 | When there are more nodes running, you can be more aggressive about 136 | removing nodes without overloading the remaining nodes. Variable 137 | thresholds allow you to express this.
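
Vector computes these variable thresholds with a linear interpolation between two configurable points. This sketch mirrors the `variable_threshold` helper in lib/vector/functions/flexible_down_scaling.rb; the sample parameter values below are made up for illustration:

```ruby
# Mirrors FlexibleDownScaling#variable_threshold: interpolate the target
# utilization between (n_low, g_low * m) and (n_high, g_high * m), then
# scale by (1 - 1/n) to account for the load the removed node would
# shift onto the remaining n - 1 nodes.
def variable_threshold(n, n_low, n_high, m, g_low, g_high)
  m_high = g_high * m
  m_low  = g_low * m
  a = (m_high - m_low).to_f / (n_high - n_low)
  b = m_low - (n_low * a)
  (a * n + b) * (1.0 - (1.0 / n))
end

# Hypothetical settings: m = 70% target CPU, extra headroom at the
# low end (g_low = 0.75) and none at the high end (g_high = 1.0).
# The scaledown threshold grows with group size:
puts variable_threshold(3,  3, 20, 0.70, 0.75, 1.0)  # ~0.35  (scale down from 3 nodes only below ~35% CPU)
puts variable_threshold(20, 3, 20, 0.70, 0.75, 1.0)  # ~0.665 (scale down from 20 nodes below ~66.5% CPU)
```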
138 | 139 | You can enable variable thresholds with `--fds-variable-thresholds`. 140 | 141 | ### Integration with Predictive Scaling 142 | 143 | Before scaling down, and if Predictive Scaling is in effect, Vector will 144 | check to see if the size **after** scaling down would trigger Predictive 145 | Scaling. If it would, the scaling policy will not be executed. 146 | 147 | ## Requirements 148 | 149 | * Auto Scaling groups must have the GroupInServiceInstances metric 150 | enabled. 151 | * Auto Scaling groups must have at least one scaling policy with a 152 | positive adjustment, and that policy must have at least one 153 | CloudWatch alarm with a CPUUtilization metric. 154 | 155 | ## Installation 156 | 157 | ```bash 158 | $ gem install vector 159 | ``` 160 | 161 | ## Usage 162 | 163 | Typically vector will be invoked via cron periodically (every 10 minutes 164 | is a good choice.) 165 | 166 | ``` 167 | Usage: vector [options] 168 | DURATION can look like 60s, 1m, 5h, 7d, 1w 169 | --timezone TIMEZONE Timezone to use for date calculations (like America/Denver) (default: system timezone) 170 | --region REGION AWS region to operate in (default: us-east-1) 171 | --groups group1,group2 A list of Auto Scaling Groups to evaluate 172 | --fleet fleet An AWS ASG Fleet (instead of specifying --groups) 173 | -v, --[no-]verbose Run verbosely 174 | 175 | Predictive Scaling Options 176 | --[no-]ps Enable Predictive Scaling 177 | --ps-lookback-windows DURATION,DURATION 178 | List of lookback windows 179 | --ps-lookahead-window DURATION 180 | Lookahead window 181 | --ps-valid-threshold FLOAT A number from 0.0 - 1.0 specifying how closely previous load must match current load for Predictive Scaling to take effect 182 | --ps-valid-period DURATION The period to use when doing the threshold check 183 | 184 | Flexible Down Scaling Options 185 | --[no-]fds Enable Flexible Down Scaling 186 | --fds-up-to-down DURATION The cooldown period between up and down scale events 187 | 
--fds-down-to-down DURATION The cooldown period between down and down scale events 188 | --fds-max-sunk-cost DURATION Only let a scaledown occur if there is an instance this close to its hourly billing point 189 | ``` 190 | 191 | ## Questions 192 | 193 | ### Why not just predictively scale based on the past DesiredInstances? 194 | 195 | If we don't look at the actual utilization and just look 196 | at how many instances we were running in the past, we will end up 197 | scaling earlier and earlier, and we will never adjust back down when 198 | load patterns change and we no longer need as much capacity. 199 | 200 | ### What about high availability? What if the box Vector is running on dies? 201 | 202 | Luckily, Vector just provides optimizations - the critical component 203 | of scaling up based on demand is still provided by the normal Auto 204 | Scaling service. If Vector does not run, you just don't get the 205 | predictive scaling and down scaling. 206 | 207 | -------------------------------------------------------------------------------- /Rakefile: -------------------------------------------------------------------------------- 1 | require "bundler/gem_tasks" 2 | -------------------------------------------------------------------------------- /bin/vector: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env ruby 2 | 3 | if $0 == "bin/vector" 4 | $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__)) + "/../lib") 5 | end 6 | 7 | require 'vector' 8 | Vector::CLI.new(ARGV).run 9 | -------------------------------------------------------------------------------- /lib/vector.rb: -------------------------------------------------------------------------------- 1 | require 'aws-sdk' 2 | require 'active_support/time' 3 | 4 | require 'vector/version' 5 | 6 | module Vector 7 | def self.time_string_to_seconds(string) 8 | if string =~ /^(\d+)([smhdw])?$/ 9 | n = $1.to_i 10 | unit = $2 || 's' 11 | 12 | case unit 13 |
when 's' 14 | n.seconds 15 | when 'm' 16 | n.minutes 17 | when 'h' 18 | n.hours 19 | when 'd' 20 | n.days 21 | when 'w' 22 | n.weeks 23 | end 24 | else 25 | nil 26 | end 27 | end 28 | 29 | def self.within_threshold(threshold, v1, v2) 30 | threshold * v1 < v2 && threshold * v2 < v1 31 | end 32 | 33 | module HLogger 34 | @@enabled = false 35 | def self.enable(bool) 36 | @@enabled = bool 37 | end 38 | 39 | def hlog_ctx(ctx, &block) 40 | @components ||= [] 41 | @components << ctx 42 | yield 43 | ensure 44 | @components.pop 45 | end 46 | 47 | def hlog(string) 48 | return unless @@enabled 49 | puts "[#{hlog_ctx_string}] #{string}" 50 | end 51 | 52 | def hlog_ctx_string 53 | @components.join ',' 54 | end 55 | end 56 | end 57 | 58 | require 'vector/cli' 59 | require 'vector/functions/predictive_scaling' 60 | require 'vector/functions/flexible_down_scaling' 61 | -------------------------------------------------------------------------------- /lib/vector/cli.rb: -------------------------------------------------------------------------------- 1 | require 'optparse' 2 | require 'aws-sdk' 3 | require 'aws/auto_scaling/fleets' 4 | require 'vector/functions/flexible_down_scaling' 5 | require 'vector/functions/predictive_scaling' 6 | 7 | module Vector 8 | class CLI 9 | def initialize(argv) 10 | @argv = argv 11 | end 12 | 13 | def run 14 | load_config 15 | 16 | auto_scaling = AWS::AutoScaling.new(:region => @config[:region]) 17 | cloudwatch = AWS::CloudWatch.new(:region => @config[:region]) 18 | 19 | # everything we do should be fine looking at a snapshot in time, 20 | # so memoizing should be fine when acting as a CLI. 
21 | AWS.start_memoizing 22 | 23 | groups = if @config[:fleet] 24 | auto_scaling.fleets[@config[:fleet]].groups 25 | else 26 | @config[:groups].map do |group_name| 27 | auto_scaling.groups[group_name] 28 | end 29 | end 30 | 31 | ps = nil 32 | if @config[:predictive_scaling][:enabled] 33 | psconf = @config[:predictive_scaling] 34 | ps = Vector::Function::PredictiveScaling.new( 35 | { :cloudwatch => cloudwatch, :dry_run => @config[:dry_run] }.merge(psconf)) 36 | end 37 | 38 | fds = nil 39 | if @config[:flexible_down_scaling][:enabled] 40 | fdsconf = @config[:flexible_down_scaling] 41 | fds = Vector::Function::FlexibleDownScaling.new( 42 | { :cloudwatch => cloudwatch, :dry_run => @config[:dry_run] }.merge(fdsconf)) 43 | end 44 | 45 | groups.each do |group| 46 | begin 47 | ps_check_procs = nil 48 | 49 | if ps 50 | ps_result = ps.run_for(group) 51 | ps_check_procs = ps_result[:check_procs] 52 | 53 | if ps_result[:triggered] 54 | # Don't need to evaluate for scaledown if we triggered a scaleup 55 | next 56 | end 57 | end 58 | 59 | if fds 60 | fds.run_for(group, ps_check_procs) 61 | end 62 | 63 | rescue => e 64 | puts "error for #{group.name}: #{e.inspect}\n#{e.backtrace.join "\n"}" 65 | end 66 | end 67 | end 68 | 69 | protected 70 | 71 | def load_config 72 | opts = { 73 | :quiet => false, 74 | :dry_run => false, 75 | :region => 'us-east-1', 76 | :groups => [], 77 | :fleet => nil, 78 | :predictive_scaling => { 79 | :enabled => false, 80 | :lookback_windows => [], 81 | :lookahead_window => nil, 82 | :valid_threshold => nil, 83 | :valid_period => 60 * 10 84 | }, 85 | :flexible_down_scaling => { 86 | :enabled => false, 87 | :up_down_cooldown => nil, 88 | :down_down_cooldown => nil, 89 | :max_sunk_cost => nil, 90 | :variable_thresholds => false, 91 | :n_low => nil, 92 | :n_high => nil, 93 | :m => nil, 94 | :g_high => 1.0, 95 | :g_low => 1.0 96 | } 97 | } 98 | 99 | optparser = OptionParser.new do |o| 100 | o.banner = "Usage: vector [options]" 101 | o.separator "DURATION can 
look like 60s, 1m, 5h, 7d, 1w" 102 | o.set_summary_width 5 103 | o.set_summary_indent ' ' 104 | 105 | def wrap(str) 106 | str.scan(/\S.{0,#{60}}\S(?=\s|$)|\S+/).join "\n " 107 | end 108 | 109 | o.on("--timezone TIMEZONE", wrap("Timezone to use for date calculations (like America/Denver) (default: system timezone)")) do |v| 110 | Time.zone = v 111 | end 112 | 113 | o.on("--region REGION", wrap("AWS region to operate in (default: us-east-1)")) do |v| 114 | opts[:region] = v 115 | end 116 | 117 | o.on("--groups group1,group2", Array, wrap("A list of Auto Scaling Groups to evaluate")) do |v| 118 | opts[:groups] = v 119 | end 120 | 121 | o.on("--fleet fleet", wrap("An AWS ASG Fleet (instead of specifying --groups)")) do |v| 122 | opts[:fleet] = v 123 | end 124 | 125 | o.on("--[no-]dry-run", wrap("Don't actually trigger any policies")) do |v| 126 | opts[:dry_run] = v 127 | end 128 | 129 | o.on("-q", "--[no-]quiet", wrap("Run quietly")) do |v| 130 | opts[:quiet] = v 131 | end 132 | 133 | o.separator "" 134 | o.separator "Predictive Scaling Options" 135 | 136 | o.on("--[no-]ps", wrap("Enable Predictive Scaling")) do |v| 137 | opts[:predictive_scaling][:enabled] = v 138 | end 139 | 140 | o.on("--ps-lookback-windows DURATION,DURATION", Array, wrap("List of lookback windows")) do |v| 141 | opts[:predictive_scaling][:lookback_windows] = 142 | v.map {|w| Vector.time_string_to_seconds(w) } 143 | end 144 | 145 | o.on("--ps-lookahead-window DURATION", String, wrap("Lookahead window")) do |v| 146 | opts[:predictive_scaling][:lookahead_window] = 147 | Vector.time_string_to_seconds(v) 148 | end 149 | 150 | o.on("--ps-valid-threshold FLOAT", Float, wrap("A number from 0.0 - 1.0 specifying how closely previous load must match current load for Predictive Scaling to take effect")) do |v| 151 | opts[:predictive_scaling][:valid_threshold] = v 152 | end 153 | 154 | o.on("--ps-valid-period DURATION", String, wrap("The period to use when doing the threshold check")) do |v| 155 | 
opts[:predictive_scaling][:valid_period] = 156 | Vector.time_string_to_seconds v 157 | end 158 | 159 | o.separator "" 160 | o.separator "Flexible Down Scaling Options" 161 | 162 | o.on("--[no-]fds", wrap("Enable Flexible Down Scaling")) do |v| 163 | opts[:flexible_down_scaling][:enabled] = v 164 | end 165 | 166 | o.on("--fds-up-to-down DURATION", String, wrap("The cooldown period between up and down scale events")) do |v| 167 | opts[:flexible_down_scaling][:up_down_cooldown] = 168 | Vector.time_string_to_seconds v 169 | end 170 | 171 | o.on("--fds-down-to-down DURATION", String, wrap("The cooldown period between down and down scale events")) do |v| 172 | opts[:flexible_down_scaling][:down_down_cooldown] = 173 | Vector.time_string_to_seconds v 174 | end 175 | 176 | o.on("--fds-max-sunk-cost DURATION", String, wrap("Only let a scaledown occur if there is an instance this close to its hourly billing point")) do |v| 177 | time = Vector.time_string_to_seconds v 178 | if time > 1.hour 179 | puts "--fds-max-sunk-cost duration must be < 1 hour" 180 | exit 1 181 | end 182 | 183 | opts[:flexible_down_scaling][:max_sunk_cost] = time 184 | end 185 | 186 | o.separator "" 187 | o.on("--[no-]fds-variable-thresholds", wrap("Enable Variable Thresholds")) do |v| 188 | opts[:flexible_down_scaling][:variable_thresholds] = v 189 | end 190 | 191 | o.on("--fds-n-low NUM", Integer, wrap("Number of nodes corresponding to --fds-g-low. (default: 1 more than the group's minimum size)")) do |v| 192 | opts[:flexible_down_scaling][:n_low] = v 193 | end 194 | 195 | o.on("--fds-n-high NUM", Integer, wrap("Number of nodes corresponding to --fds-g-high. (default: the group's maximum size)")) do |v| 196 | opts[:flexible_down_scaling][:n_high] = v 197 | end 198 | 199 | o.on("--fds-m PERCENTAGE", Float, wrap("Maximum target utilization. 
Will default to the CPUUtilization alarm threshold.")) do |v| 200 | opts[:flexible_down_scaling][:m] = v / 100 201 | end 202 | 203 | o.on("--fds-g-high PERCENTAGE", Float, wrap("Capacity headroom to apply when scaling down from --fds-n-high nodes, as a percentage. e.g. if this is 90%, then will not scale down from --fds-n-high nodes until expected utilization on the remaining nodes is at or below 90% of --fds-m. (default: 100)")) do |v| 204 | opts[:flexible_down_scaling][:g_high] = v / 100 205 | end 206 | 207 | o.on("--fds-g-low PERCENTAGE", Float, wrap("Capacity headroom to apply when scaling down from --fds-n-low nodes, as a percentage. e.g. if this is 75%, then will not scale down from --fds-n-low nodes until expected utilization on the remaining nodes is at or below 75% of --fds-m. When scaling down from a number of nodes other than --fds-n-high or --fds-n-low, will use a capacity headroom linearly interpolated from --fds-g-high and --fds-g-low. (default: 100)")) do |v| 208 | opts[:flexible_down_scaling][:g_low] = v / 100 209 | end 210 | 211 | o.on("--fds-print-variable-thresholds", wrap("Calculates and displays the thresholds that will be used for each asg, and does not execute any downscaling policies. (For debugging).")) do |v| 212 | opts[:flexible_down_scaling][:print_variable_thresholds] = true 213 | end 214 | 215 | end.parse!(@argv) 216 | 217 | if opts[:groups].empty? && opts[:fleet].nil? 218 | puts "No groups were specified." 219 | exit 1 220 | end 221 | 222 | if !opts[:groups].empty? && !opts[:fleet].nil? 223 | puts "You can't specify --groups and --fleet." 224 | exit 1 225 | end 226 | 227 | if opts[:predictive_scaling][:enabled] 228 | ps = opts[:predictive_scaling] 229 | if ps[:lookback_windows].empty? || ps[:lookahead_window].nil? 230 | puts "You must specify lookback windows and a lookahead window for Predictive Scaling." 
231 | exit 1 232 | end 233 | end 234 | 235 | if opts[:flexible_down_scaling][:enabled] 236 | fds = opts[:flexible_down_scaling] 237 | if fds[:up_down_cooldown].nil? || 238 | fds[:down_down_cooldown].nil? 239 | puts "You must specify both up-to-down and down-to-down cooldown periods for Flexible Down Scaling." 240 | exit 1 241 | end 242 | end 243 | 244 | Vector::HLogger.enable(!opts[:quiet]) 245 | 246 | @config = opts 247 | end 248 | end 249 | end 250 | -------------------------------------------------------------------------------- /lib/vector/functions/flexible_down_scaling.rb: -------------------------------------------------------------------------------- 1 | require 'vector' 2 | 3 | module Vector 4 | module Function 5 | class FlexibleDownScaling 6 | include Vector::HLogger 7 | 8 | def initialize(options) 9 | @cloudwatch = options[:cloudwatch] 10 | @dry_run = options[:dry_run] 11 | @up_down_cooldown = options[:up_down_cooldown] 12 | @down_down_cooldown = options[:down_down_cooldown] 13 | @max_sunk_cost = options[:max_sunk_cost] 14 | @variable_thresholds = options[:variable_thresholds] 15 | @n_low = options[:n_low] 16 | @n_high = options[:n_high] 17 | @m = options[:m] 18 | @g_low = options[:g_low] 19 | @g_high = options[:g_high] 20 | @debug_variable_thresholds = options[:print_variable_thresholds] 21 | end 22 | 23 | def run_for(group, ps_check_procs) 24 | result = { :triggered => false } 25 | 26 | hlog_ctx("fds") do 27 | hlog_ctx("group:#{group.name}") do 28 | # don't check if no config was specified 29 | if @up_down_cooldown.nil? && @down_down_cooldown.nil? 30 | hlog("No cooldown periods specified, exiting") 31 | return result 32 | end 33 | 34 | # don't bother checking for a scaledown if desired capacity is 35 | # already at the minimum size... 
36 | if group.desired_capacity == group.min_size 37 | hlog("Group is already at minimum size, exiting") 38 | return result 39 | end 40 | 41 | scaledown_policies = group.scaling_policies.select do |policy| 42 | policy.scaling_adjustment < 0 43 | end 44 | 45 | scaledown_policies.each do |policy| 46 | hlog_ctx("policy:#{policy.name}") do 47 | # TODO: support adjustment types other than ChangeInCapacity here 48 | if policy.adjustment_type == "ChangeInCapacity" && 49 | ps_check_procs && 50 | ps_check_procs.any? {|ps_check_proc| 51 | ps_check_proc.call(group.desired_capacity + policy.scaling_adjustment, self) } 52 | hlog("Predictive scaleup would trigger a scaleup if group were shrunk") 53 | next 54 | end 55 | 56 | alarms = policy.alarms.keys.map do |alarm_name| 57 | @cloudwatch.alarms[alarm_name] 58 | end 59 | 60 | # only consider disabled alarms (enabled alarms will trigger 61 | # the policy automatically) 62 | disabled_alarms = alarms.select do |alarm| 63 | !alarm.enabled? 64 | end 65 | 66 | # Do this logic first in case the user is just trying to print out 67 | # the thresholds. 68 | if @variable_thresholds 69 | # variable_thresholds currently requires a CPUUtilization alarm to function 70 | vt_cpu_alarm = disabled_alarms.find {|alarm| alarm.metric_name == "CPUUtilization" } 71 | 72 | # remove this alarm from the check, since we're not checking its alarm status 73 | # below. 
74 | disabled_alarms.delete(vt_cpu_alarm) 75 | 76 | unless vt_cpu_alarm 77 | hlog("Variable thresholds requires an alarm on CPUUtilization, skipping") 78 | next 79 | end 80 | 81 | @n_low ||= group.min_size + 1 82 | @n_high ||= group.max_size 83 | @m ||= vt_cpu_alarm.threshold / 100 84 | 85 | if @g_low == @g_high 86 | hlog("g_low == g_high (#{@g_low}), not attempting to use flexible thresholds.") 87 | next 88 | end 89 | 90 | if @n_low == @n_high 91 | hlog("n_low == n_high (#{@n_low}), not attempting to use flexible thresholds.") 92 | next 93 | end 94 | 95 | if @debug_variable_thresholds 96 | puts " n_low: #{@n_low}" 97 | puts " n_high: #{@n_high}" 98 | puts " m: #{@m}" 99 | puts " g_low: #{@g_low}" 100 | puts " g_high: #{@g_high}" 101 | puts 102 | puts " N Threshold" 103 | ([@n_low, group.min_size].min + 1).upto([@n_high, group.max_size].max) do |i| 104 | puts " %2d %.1f%%" % [i, (variable_threshold(i, @n_low, @n_high, @m, @g_low, @g_high) * 100)] 105 | end 106 | next 107 | end 108 | end 109 | 110 | unless disabled_alarms.all? {|alarm| alarm.state_value == "ALARM" } 111 | hlog("Not all alarms are in ALARM state") 112 | next 113 | end 114 | 115 | if @variable_thresholds 116 | threshold = variable_threshold(group.desired_capacity, @n_low, @n_high, @m, @g_low, @g_high) 117 | 118 | stats = vt_cpu_alarm.metric.statistics( 119 | :start_time => Time.now - (vt_cpu_alarm.period * vt_cpu_alarm.evaluation_periods), 120 | :end_time => Time.now, 121 | :statistics => [ vt_cpu_alarm.statistic ], 122 | :period => vt_cpu_alarm.period) 123 | 124 | if stats.datapoints.length < vt_cpu_alarm.evaluation_periods 125 | hlog("Could not get enough datapoints for checking variable threshold"); 126 | next 127 | end 128 | 129 | if stats.datapoints.any? 
{|dp| dp[vt_cpu_alarm.statistic.downcase.to_sym] > (threshold * 100) } 130 | hlog("Not all datapoints are beneath the variable threshold #{(threshold * 100).to_i}: #{stats.datapoints}") 131 | next 132 | end 133 | 134 | hlog("Variable threshold: #{(threshold * 100).to_i}, #{group.desired_capacity} nodes") 135 | end 136 | 137 | unless outside_cooldown_period(group) 138 | hlog("Group is not outside the specified cooldown periods") 139 | next 140 | end 141 | 142 | unless has_eligible_scaledown_instance(group) 143 | hlog("Group does not have an instance eligible for scaledown due to max_sunk_cost") 144 | next 145 | end 146 | 147 | if @dry_run 148 | hlog("Executing policy (DRY RUN)") 149 | else 150 | hlog("Executing policy") 151 | policy.execute(:honor_cooldown => true) 152 | end 153 | 154 | result[:triggered] = true 155 | 156 | # no need to evaluate other scaledown policies 157 | return result 158 | end 159 | end 160 | end 161 | end 162 | 163 | result 164 | end 165 | 166 | protected 167 | 168 | def variable_threshold(n, n_low, n_high, m, g_low, g_high) 169 | m_high = g_high * m 170 | m_low = g_low * m 171 | a = (m_high - m_low).to_f / (n_high - n_low).to_f 172 | b = m_low - (n_low * a) 173 | res = (a * n + b) * (1.0 - (1.0 / n)) 174 | res 175 | end 176 | 177 | def has_eligible_scaledown_instance(group) 178 | return true if @max_sunk_cost.nil? 179 | 180 | group.ec2_instances.select {|i| i.status == :running }.each do |instance| 181 | # get amount of time until hitting the instance renewal time 182 | time_left = ((instance.launch_time.min - Time.now.min) % 60).minutes 183 | 184 | # if we're within 1 minute, assume we won't be able to terminate it 185 | # in time anyway and ignore it. 
186 | if time_left > 1.minute and time_left < @max_sunk_cost 187 | # we only care if there is at least one instance within the window 188 | # where we can scale down 189 | return true 190 | end 191 | end 192 | 193 | false 194 | end 195 | 196 | def outside_cooldown_period(group) 197 | @cached_outside_cooldown ||= {} 198 | if @cached_outside_cooldown.has_key? group 199 | return @cached_outside_cooldown[group] 200 | end 201 | 202 | activities = previous_scaling_activities(group) 203 | return nil if activities.nil? 204 | 205 | if activities[:up] 206 | hlog "Last scale up #{(Time.now - activities[:up]).minutes.inspect} ago" 207 | end 208 | if activities[:down] 209 | hlog "Last scale down #{(Time.now - activities[:down]).minutes.inspect} ago" 210 | end 211 | result = true 212 | 213 | # check up-down 214 | if @up_down_cooldown && activities[:up] && 215 | Time.now - activities[:up] < @up_down_cooldown 216 | result = false 217 | end 218 | 219 | # check down-down 220 | if @down_down_cooldown && activities[:down] && 221 | Time.now - activities[:down] < @down_down_cooldown 222 | result = false 223 | end 224 | 225 | result 226 | end 227 | 228 | # Looks at the GroupDesiredCapacity metric for the specified 229 | # group, and finds the most recent change in value. 230 | # 231 | # @returns 232 | # * nil if there was a problem getting data. There may have been 233 | # scaling events or not, we don't know. 234 | # * a hash with two keys, :up and :down, with values indicating 235 | # when the last corresponding activity happened. If the 236 | # activity was not seen in the examined time period, the value 237 | # is nil. 238 | def previous_scaling_activities(group) 239 | metric = @cloudwatch.metrics. 240 | with_namespace("AWS/AutoScaling"). 241 | with_metric_name("GroupDesiredCapacity"). 
242 | filter('dimensions', [{ 243 | :name => "AutoScalingGroupName", 244 | :value => group.name 245 | }]).first 246 | 247 | return nil unless metric 248 | 249 | start_time = Time.now - [ @up_down_cooldown, @down_down_cooldown ].max 250 | end_time = Time.now 251 | 252 | stats = metric.statistics( 253 | :start_time => start_time, 254 | :end_time => end_time, 255 | :statistics => [ "Average" ], 256 | :period => 60) 257 | 258 | # check if we got enough datapoints... if we didn't, we need to 259 | # assume bad data and inform the caller. this code is basically 260 | # checking if the # of received datapoints is within 50% of the 261 | # expected datapoints. 262 | got_datapoints = stats.datapoints.length 263 | requested_datapoints = (end_time - start_time) / 60 264 | if !Vector.within_threshold(0.5, got_datapoints, requested_datapoints) 265 | return nil 266 | end 267 | 268 | # iterate over the datapoints in reverse, looking for the first 269 | # change in value, which should be the most recent scaling 270 | # activity 271 | activities = { :down => nil, :up => nil } 272 | last_value = nil 273 | stats.datapoints.sort {|a,b| b[:timestamp] <=> a[:timestamp] }.each do |dp| 274 | next if dp[:average].nil? 275 | 276 | unless last_value.nil? 277 | if dp[:average] != last_value 278 | direction = (last_value < dp[:average]) ? :down : :up 279 | activities[direction] ||= dp[:timestamp] 280 | end 281 | end 282 | 283 | last_value = dp[:average] 284 | break unless activities.values.any? {|v| v.nil? 
}
285 |         end
286 | 
287 |         activities
288 |       end
289 |     end
290 |   end
291 | end
292 | 
--------------------------------------------------------------------------------
/lib/vector/functions/predictive_scaling.rb:
--------------------------------------------------------------------------------
1 | module Vector
2 |   module Function
3 |     class PredictiveScaling
4 |       include Vector::HLogger
5 | 
6 |       def initialize(options)
7 |         @cloudwatch = options[:cloudwatch]
8 |         @dry_run = options[:dry_run]
9 |         @lookback_windows = options[:lookback_windows]
10 |         @lookahead_window = options[:lookahead_window]
11 |         @valid_threshold = options[:valid_threshold]
12 |         @valid_period = options[:valid_period]
13 |       end
14 | 
15 |       def run_for(group)
16 |         result = { :check_procs => [], :triggered => false }
17 | 
18 |         hlog_ctx "ps" do
19 |           hlog_ctx "group:#{group.name}" do
20 |             return result if @lookback_windows.length == 0
21 | 
22 |             scaleup_policies = group.scaling_policies.select do |policy|
23 |               policy.scaling_adjustment > 0
24 |             end
25 | 
26 |             scaleup_policies.each do |policy|
27 |               hlog_ctx "policy:#{policy.name}" do
28 | 
29 |                 policy.alarms.keys.each do |alarm_name|
30 |                   alarm = @cloudwatch.alarms[alarm_name]
31 |                   hlog_ctx "alarm:#{alarm.name}" do
32 |                     hlog "Metric #{alarm.metric.name}"
33 | 
34 |                     unless alarm.enabled?
35 |                       hlog "Skipping disabled alarm"
36 |                       next
37 |                     end
38 | 
39 |                     # Note that everywhere we say "load" what we mean is
40 |                     # "metric value * number of nodes"
41 |                     now_load, now_num = load_for(group, alarm.metric,
42 |                       Time.now, @valid_period)
43 | 
44 |                     if now_load.nil?
45 |                       hlog "Could not get current total for metric"
46 |                       next
47 |                     end
48 | 
49 |                     @lookback_windows.each do |window|
50 |                       hlog_ctx "window:#{window.inspect.gsub ' ', ''}" do
51 |                         then_load, = load_for(group, alarm.metric,
52 |                           Time.now - window, @valid_period)
53 | 
54 |                         if then_load.nil?
55 |                           hlog "Could not get past total value for metric"
56 |                           next
57 |                         end
58 | 
59 |                         # check that the past total utilization is within
60 |                         # threshold% of the current total utilization
61 |                         if @valid_threshold &&
62 |                            !Vector.within_threshold(@valid_threshold, now_load, then_load)
63 |                           hlog "Past metric total value not within threshold (current #{now_load}, then #{then_load})"
64 |                           next
65 |                         end
66 | 
67 |                         past_load, = load_for(group, alarm.metric,
68 |                           Time.now - window + @lookahead_window,
69 |                           alarm.period)
70 | 
71 |                         if past_load.nil?
72 |                           hlog "Could not get past + #{@lookahead_window.inspect} total value for metric"
73 |                           next
74 |                         end
75 | 
76 |                         # now take the past total load and divide it by the
77 |                         # current number of instances to get the predicted value
78 | 
79 |                         # (we capture our original log context here in order to display
80 |                         # the source of these checks later when this proc is called by
81 |                         # scaledown stuff).
82 |                         orig_ctx = hlog_ctx_string
83 |                         check_proc = Proc.new do |num_nodes, logger|
84 |                           predicted_value = past_load.to_f / num_nodes
85 | 
86 |                           log_str = "Predicted #{alarm.metric.name}: #{predicted_value} (#{num_nodes} nodes)"
87 | 
88 |                           # Tack on the original context if we're in a different logger
89 |                           # (for the case where this is called during scaledown checks).
90 |                           if orig_ctx != logger.hlog_ctx_string
91 |                             log_str += " (from #{orig_ctx})"
92 |                           end
93 | 
94 |                           logger.hlog log_str
95 | 
96 |                           check_alarm_threshold(alarm, predicted_value)
97 |                         end
98 |                         result[:check_procs] << check_proc
99 | 
100 |                         if check_proc.call(now_num, self)
101 |                           if @dry_run
102 |                             hlog "Executing policy (DRY RUN)"
103 |                           else
104 |                             hlog "Executing policy"
105 |                             policy.execute(honor_cooldown: true)
106 |                           end
107 | 
108 |                           result[:triggered] = true
109 | 
110 |                           # don't need to evaluate further windows or policies on this group
111 |                           return result
112 |                         end
113 |                       end
114 |                     end
115 |                   end
116 |                 end
117 |               end
118 |             end
119 |           end
120 |         end
121 | 
122 |         result
123 |       end
124 | 
125 |       protected
126 | 
127 |       def check_alarm_threshold(alarm, value)
128 |         case alarm.comparison_operator
129 |         when "GreaterThanOrEqualToThreshold"
130 |           value >= alarm.threshold
131 |         when "GreaterThanThreshold"
132 |           value > alarm.threshold
133 |         when "LessThanThreshold"
134 |           value < alarm.threshold
135 |         when "LessThanOrEqualToThreshold"
136 |           value <= alarm.threshold
137 |         end
138 |       end
139 | 
140 |       def load_for(group, metric, time, window)
141 |         num_instances_metric = @cloudwatch.metrics.
142 |           with_namespace("AWS/AutoScaling").
143 |           with_metric_name("GroupInServiceInstances").
144 |           filter('dimensions', [{
145 |             :name => 'AutoScalingGroupName',
146 |             :value => group.name
147 |           }]).first
148 | 
149 |         unless num_instances_metric
150 |           raise "Could not find GroupInServiceInstances metric for #{group.name}"
151 |         end
152 | 
153 |         start_time = time - (window / 2)
154 |         end_time = time + (window / 2)
155 | 
156 |         avg = average_for_metric(metric, start_time, end_time)
157 |         num = average_for_metric(num_instances_metric, start_time, end_time)
158 | 
159 |         if avg.nil?
160 |           return [ nil, nil ]
161 |         end
162 | 
163 |         [ avg * num, num ]
164 |       end
165 | 
166 |       def average_for_metric(metric, start_time, end_time)
167 |         stats = metric.statistics(
168 |           :start_time => start_time,
169 |           :end_time => end_time,
170 |           :statistics => [ "Average" ],
171 |           :period => 60)
172 | 
173 |         return nil if stats.datapoints.length == 0
174 | 
175 |         sum = stats.datapoints.inject(0) do |r, dp|
176 |           r + dp[:average]
177 |         end
178 | 
179 |         sum.to_f / stats.datapoints.length
180 |       end
181 |     end
182 |   end
183 | end
184 | 
--------------------------------------------------------------------------------
/lib/vector/version.rb:
--------------------------------------------------------------------------------
1 | module Vector
2 |   VERSION = "0.0.6"
3 | end
4 | 
--------------------------------------------------------------------------------
/vector.gemspec:
--------------------------------------------------------------------------------
1 | # coding: utf-8
2 | lib = File.expand_path('../lib', __FILE__)
3 | $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4 | require 'vector/version'
5 | 
6 | Gem::Specification.new do |spec|
7 |   spec.name          = "vector"
8 |   spec.version       = Vector::VERSION
9 |   spec.authors       = ["Zach Wily"]
10 |   spec.email         = ["zach@zwily.com"]
11 |   spec.summary       = %q{AWS Auto-Scaling Assistant}
12 |   spec.homepage      = "http://github.com/instructure/vector"
13 |   spec.license       = "MIT"
14 | 
15 |   spec.files         = `git ls-files`.split($/)
16 |   spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
17 |   spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
18 |   spec.require_paths = ["lib"]
19 | 
20 |   spec.add_dependency "aws-sdk"
21 |   spec.add_dependency "aws-asg-fleet"
22 |   spec.add_dependency "activesupport"
23 | 
24 |   spec.add_development_dependency "bundler", "~> 1.3"
25 |   spec.add_development_dependency "rake"
26 | end
--------------------------------------------------------------------------------
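Both the datapoint-count check in `previous_scaling_activities` and the past-vs-current load check in `predictive_scaling.rb` call `Vector.within_threshold`, a helper defined in `lib/vector.rb` (not included in this excerpt). The sketch below is a hypothetical re-implementation inferred from those call sites, illustrating only the semantics they rely on; the gem's actual helper may differ:

```ruby
# Hypothetical sketch of Vector.within_threshold, inferred from its call
# sites above; the real implementation lives in lib/vector.rb.
module VectorSketch
  # True when `actual` is within `threshold` (a fraction, e.g. 0.5 = 50%)
  # of `expected`, relative to `expected`.
  def self.within_threshold(threshold, actual, expected)
    return false if expected.zero?
    (actual - expected).abs / expected.to_f <= threshold
  end
end

# 45 of 60 expected CloudWatch datapoints arrived: 25% off, acceptable.
VectorSketch.within_threshold(0.5, 45, 60)  # => true
# Only 20 of 60 arrived: ~67% off, so the data is treated as unreliable.
VectorSketch.within_threshold(0.5, 20, 60)  # => false
```

Under this reading, `within_threshold(0.5, got_datapoints, requested_datapoints)` returning false makes `previous_scaling_activities` return `nil`, signalling to the cooldown logic that the CloudWatch data was too sparse to trust.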