├── .gitignore ├── Gemfile ├── LICENSE.txt ├── README.md ├── Rakefile ├── bin └── vector ├── lib ├── vector.rb └── vector │ ├── cli.rb │ ├── functions │ ├── flexible_down_scaling.rb │ └── predictive_scaling.rb │ └── version.rb └── vector.gemspec /.gitignore: -------------------------------------------------------------------------------- 1 | *.gem 2 | *.rbc 3 | .bundle 4 | .config 5 | .yardoc 6 | Gemfile.lock 7 | InstalledFiles 8 | _yardoc 9 | coverage 10 | doc/ 11 | lib/bundler/man 12 | pkg 13 | rdoc 14 | spec/reports 15 | test/tmp 16 | test/version_tmp 17 | tmp 18 | -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | source 'https://rubygems.org' 2 | 3 | # Specify your gem's dependencies in vector.gemspec 4 | gemspec 5 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013 Instructure, Inc 2 | 3 | MIT License 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining 6 | a copy of this software and associated documentation files (the 7 | "Software"), to deal in the Software without restriction, including 8 | without limitation the rights to use, copy, modify, merge, publish, 9 | distribute, sublicense, and/or sell copies of the Software, and to 10 | permit persons to whom the Software is furnished to do so, subject to 11 | the following conditions: 12 | 13 | The above copyright notice and this permission notice shall be 14 | included in all copies or substantial portions of the Software. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 19 | NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 20 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 21 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 22 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Vector 2 | 3 | Vector is a tool that augments your auto-scaling groups. The two 4 | features currently offered are Predictive Scaling and Flexible Down 5 | Scaling. 6 | 7 | ## Predictive scaling 8 | 9 | Auto Scaling groups do a good job of responding to current 10 | load conditions, but if you have a predictable load pattern, 11 | it can be nice to scale up your servers a little bit *early*. 12 | Some reasons you might want to do that are: 13 | 14 | * If it takes several minutes for an instance to fully boot 15 | and ready itself for requests. 16 | * If you have very serious (but predictable) spikes, 17 | it's nice to have the capacity in place before the spike 18 | starts. 19 | * To give yourself a buffer of time if AWS APIs start 20 | throwing errors. If scaling up is going to fail, you'd 21 | rather it start failing with a little bit of time before 22 | you actually need the capacity so you can begin evasive maneuvers. 23 | 24 | Vector examines your existing CloudWatch alarms tied to your Auto 25 | Scaling groups, and predicts if they will be triggered in the future 26 | based on what happened in the past. 27 | 28 | **Note:** This only works with metrics that are averaged across your group - 29 | like CPUUtilization or Load. If you auto-scale based on something 30 | like QueueLength, Predictive Scaling will not work right for you. 
31 | 32 | For each lookback window you specify, Vector will first check the 33 | current value of the metric * the number of nodes, and the past value of 34 | the metric * the past number of nodes. If those numbers are close enough 35 | (within the threshold specified by `--ps-valid-threshold`), then it will 36 | continue. 37 | 38 | Vector will then go back to the lookback window specified, and then 39 | forward in time based on the lookahead window (`--ps-lookahead-window`). 40 | It will compute the metric * number of nodes at that time to get a predicted 41 | aggregate metric value for the near future. It then divides that by 42 | the current number of nodes to get a predicted average value for the 43 | metric. That is then compared against the alarm's threshold. 44 | 45 | For example: 46 | 47 | > You have an alarm that checks CPUUtilization of your group, and will 48 | > trigger the alarm if that goes above 70%. Vector is configured to use a 49 | > 1 week lookback window, a 1 hour lookahead window, and a valid-threshold 50 | > of 0.8. 51 | > 52 | > The current value of CPUUtilization is 49%, and there are 2 nodes in the 53 | > group. CPUUtilization 1 week ago was 53%, and there were 2 nodes in the 54 | > group. Therefore, total current CPUUtilization is 98%, and 1 week ago was 55 | > 106%. Those are within 80% of each other (valid-threshold), so we can 56 | > continue with the prediction. 57 | > 58 | > The value of CPUUtilization 1 week ago *plus* 1 hour was 45%, and 59 | > there were 4 nodes in the group. We calculate total CPUUtilization for 60 | > that time to be 180%. Assuming no new nodes are launched, the predicted 61 | > average CPUUtilization for the group 1 hour from now is 180% / 2 = 90%. 62 | > 90% is above the alarm's 70% threshold, so we trigger the scaleup 63 | > policy.
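
The arithmetic in the example above can be sketched in a few lines of Ruby. This is just an illustration with the example's numbers hard-coded, not Vector's actual code path (though `within_threshold?` mirrors `Vector.within_threshold` from lib/vector.rb):

```ruby
# Illustrative sketch of the Predictive Scaling check, using the
# numbers from the example above.

# "Close enough" check: each value must be within threshold * the other
# (mirrors Vector.within_threshold in lib/vector.rb).
def within_threshold?(threshold, v1, v2)
  threshold * v1 < v2 && threshold * v2 < v1
end

valid_threshold = 0.8
alarm_threshold = 70.0            # alarm fires above 70% CPUUtilization

now_avg,  now_nodes  = 49.0, 2    # current average CPU and group size
then_avg, then_nodes = 53.0, 2    # 1 week ago (lookback window)
past_avg, past_nodes = 45.0, 4    # 1 week ago + 1 hour (lookahead)

now_load  = now_avg * now_nodes   # 98.0  (total current utilization)
then_load = then_avg * then_nodes # 106.0 (total past utilization)

if within_threshold?(valid_threshold, now_load, then_load)
  predicted_avg = (past_avg * past_nodes) / now_nodes # 180.0 / 2 = 90.0
  puts "trigger scaleup" if predicted_avg > alarm_threshold
end
```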
64 | 65 | 66 | If you use Predictive Scaling, you probably also want to use Flexible 67 | Down Scaling (below) so that after scaling up in anticipation of load, 68 | your scaledown policy doesn't quickly undo Vector's hard work. You 69 | probably want to set `up-to-down-cooldown` to be close to the size of 70 | your `lookahead-window`. 71 | 72 | ### Timezones 73 | 74 | If you specify a timezone (either explicitly or via the system 75 | timezone), Vector will use DST-aware time calculations when evaluating 76 | lookback windows. If you don't specify a timezone and your system time 77 | is UTC, then 8AM on Monday morning after DST begins will look back 168 78 | hours - which is 7AM on the previous Monday. Predictive scaling would 79 | be off by one hour for a whole week in that case. 80 | 81 | ## Flexible Down Scaling 82 | 83 | ### Different Cooldown Periods 84 | 85 | Auto Scaling Groups support the concept of "cooldown periods" - a window 86 | of time after a scaling activity during which no other activities should take 87 | place. This gives the group a chance to settle into the new 88 | configuration before deciding whether another action is required. 89 | 90 | However, Auto Scaling Groups only support specifying the cooldown period 91 | *after* a certain activity - you can say "After a scale up, wait 5 92 | minutes before doing anything else, and after a scale down, wait 15 93 | minutes." What you can't do is say "After a scale up, wait 5 minutes for 94 | another scale up, and 40 minutes for a scale down." 95 | 96 | Vector lets you add custom up-to-down and down-to-down cooldown periods. 97 | You create your policies and alarms in your Auto Scaling Groups like 98 | normal, and then *disable* the alarms tied to the scale down policy. 99 | Then you tell Vector what cooldown periods to use, and it does the rest.
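
A minimal sketch of the custom cooldown check, assuming we already know the timestamps of the group's last scale-up and scale-down (in Vector the real logic lives in `FlexibleDownScaling#outside_cooldown_period`, which derives those timestamps from the GroupDesiredCapacity metric; the names and durations here are illustrative):

```ruby
# Illustrative sketch: a scaledown is allowed only when we are outside
# both the up-to-down and down-to-down cooldown windows. Timestamps are
# hypothetical; Vector derives them from CloudWatch metric history.
def outside_cooldowns?(now:, last_up:, last_down:,
                       up_down_cooldown:, down_down_cooldown:)
  return false if last_up && now - last_up < up_down_cooldown
  return false if last_down && now - last_down < down_down_cooldown
  true
end

now = Time.now
# Scaled up 10 minutes ago with a 40-minute up-to-down cooldown:
# a scaledown is still blocked.
outside_cooldowns?(now: now,
                   last_up: now - 10 * 60, last_down: nil,
                   up_down_cooldown: 40 * 60,
                   down_down_cooldown: 15 * 60) # => false
```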
100 | 101 | ### Multiple Alarms 102 | 103 | Another benefit of Flexible Down Scaling is the ability to specify 104 | multiple alarms for a scaling down policy and require *all* alarms to 105 | trigger before scaling down. With Vector, you can add multiple 106 | (disabled) alarms to a policy, and Vector will trigger the policy only 107 | when *all* of the alarms are in ALARM state. This lets you do something like 108 | "only scale down when CPU utilization is < 30% and there is not a 109 | backlog of requests on any instances". 110 | 111 | ### Max Sunk Cost 112 | 113 | Vector also lets you specify a "max sunk cost" when scaling down a node. 114 | Amazon bills in hourly increments, and you pay a full hour for every 115 | partial hour used, so you want your instances to terminate as close as possible to 116 | their hourly billing renewal (without going past it). 117 | 118 | For example, if you specify `--fds-max-sunk-cost 15m` and have two nodes 119 | in your group - 47 minutes and 32 minutes away from their hourly billing 120 | renewals - the group will not be scaled down. 121 | 122 | (You should make sure to run Vector on an interval smaller than this 123 | one, or else it's possible Vector may never find nodes eligible for 124 | scaledown, and will never scale down.) 125 | 126 | ### Variable Thresholds 127 | 128 | When deciding to scale down, a static CPU utilization threshold can be 129 | inefficient. For example, if there are 3 nodes running, and you have a 130 | minimum of 2, and the average CPU is 75%, removing 1 node would 131 | theoretically result in the remaining 2 nodes running at > 100%. However, 132 | with 20 nodes running at an average CPU of 75%, removing 1 node will 133 | only result in an average CPU of 79% across the remaining 19 nodes. 134 | 135 | When there are more nodes running, you can be more aggressive about 136 | removing nodes without overloading the remaining nodes. Variable 137 | thresholds allow you to express this.
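
Vector computes these variable thresholds with a linear interpolation between two configurable points. This sketch mirrors the `variable_threshold` helper in lib/vector/functions/flexible_down_scaling.rb; the sample parameter values below are made up for illustration:

```ruby
# Mirrors FlexibleDownScaling#variable_threshold: interpolate the target
# utilization between (n_low, g_low * m) and (n_high, g_high * m), then
# scale by (1 - 1/n) to account for the load the removed node would
# shift onto the remaining n - 1 nodes.
def variable_threshold(n, n_low, n_high, m, g_low, g_high)
  m_high = g_high * m
  m_low  = g_low * m
  a = (m_high - m_low).to_f / (n_high - n_low)
  b = m_low - (n_low * a)
  (a * n + b) * (1.0 - (1.0 / n))
end

# Hypothetical settings: m = 70% target CPU, extra headroom at the
# low end (g_low = 0.75) and none at the high end (g_high = 1.0).
# The scaledown threshold grows with group size:
puts variable_threshold(3,  3, 20, 0.70, 0.75, 1.0)  # ~0.35  (scale down from 3 nodes only below ~35% CPU)
puts variable_threshold(20, 3, 20, 0.70, 0.75, 1.0)  # ~0.665 (scale down from 20 nodes below ~66.5% CPU)
```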
138 | 139 | You can enable variable thresholds with `--fds-variable-thresholds`. 140 | 141 | ### Integration with Predictive Scaling 142 | 143 | Before scaling down, and if Predictive Scaling is in effect, Vector will 144 | check to see if the size **after** scaling down would trigger Predictive 145 | Scaling. If it would, the scaling policy will not be executed. 146 | 147 | ## Requirements 148 | 149 | * Auto Scaling groups must have the GroupInServiceInstances metric 150 | enabled. 151 | * Auto Scaling groups must have at least one scaling policy with a 152 | positive adjustment, and that policy must have at least one 153 | CloudWatch alarm with a CPUUtilization metric. 154 | 155 | ## Installation 156 | 157 | ```bash 158 | $ gem install vector 159 | ``` 160 | 161 | ## Usage 162 | 163 | Typically vector will be invoked via cron periodically (every 10 minutes 164 | is a good choice.) 165 | 166 | ``` 167 | Usage: vector [options] 168 | DURATION can look like 60s, 1m, 5h, 7d, 1w 169 | --timezone TIMEZONE Timezone to use for date calculations (like America/Denver) (default: system timezone) 170 | --region REGION AWS region to operate in (default: us-east-1) 171 | --groups group1,group2 A list of Auto Scaling Groups to evaluate 172 | --fleet fleet An AWS ASG Fleet (instead of specifying --groups) 173 | -v, --[no-]verbose Run verbosely 174 | 175 | Predictive Scaling Options 176 | --[no-]ps Enable Predictive Scaling 177 | --ps-lookback-windows DURATION,DURATION 178 | List of lookback windows 179 | --ps-lookahead-window DURATION 180 | Lookahead window 181 | --ps-valid-threshold FLOAT A number from 0.0 - 1.0 specifying how closely previous load must match current load for Predictive Scaling to take effect 182 | --ps-valid-period DURATION The period to use when doing the threshold check 183 | 184 | Flexible Down Scaling Options 185 | --[no-]fds Enable Flexible Down Scaling 186 | --fds-up-to-down DURATION The cooldown period between up and down scale events 187 | 
--fds-down-to-down DURATION The cooldown period between down and down scale events 188 | --fds-max-sunk-cost DURATION Only let a scaledown occur if there is an instance this close to its hourly billing point 189 | ``` 190 | 191 | ## Questions 192 | 193 | ### Why not just predictively scale based on the past DesiredInstances? 194 | 195 | If we don't look at the actual utilization and just look 196 | at how many instances we were running in the past, we will end up 197 | scaling earlier and earlier, and we will never adjust back down when 198 | load patterns change and we no longer need as much capacity. 199 | 200 | ### What about high availability? What if the box Vector is running on dies? 201 | 202 | Luckily, Vector just provides optimizations - the critical component 203 | of scaling up based on demand is still provided by the normal Auto 204 | Scaling service. If Vector does not run, you just don't get the 205 | predictive scaling and down scaling. 206 | 207 | -------------------------------------------------------------------------------- /Rakefile: -------------------------------------------------------------------------------- 1 | require "bundler/gem_tasks" 2 | -------------------------------------------------------------------------------- /bin/vector: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env ruby 2 | 3 | if $0 == "bin/vector" 4 | $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__)) + "/../lib") 5 | end 6 | 7 | require 'vector' 8 | Vector::CLI.new(ARGV).run 9 | -------------------------------------------------------------------------------- /lib/vector.rb: -------------------------------------------------------------------------------- 1 | require 'aws-sdk' 2 | require 'active_support/time' 3 | 4 | require 'vector/version' 5 | 6 | module Vector 7 | def self.time_string_to_seconds(string) 8 | if string =~ /^(\d+)([smhdw])?$/ 9 | n = $1.to_i 10 | unit = $2 || 's' 11 | 12 | case unit 13 |
when 's' 14 | n.seconds 15 | when 'm' 16 | n.minutes 17 | when 'h' 18 | n.hours 19 | when 'd' 20 | n.days 21 | when 'w' 22 | n.weeks 23 | end 24 | else 25 | nil 26 | end 27 | end 28 | 29 | def self.within_threshold(threshold, v1, v2) 30 | threshold * v1 < v2 && threshold * v2 < v1 31 | end 32 | 33 | module HLogger 34 | @@enabled = false 35 | def self.enable(bool) 36 | @@enabled = bool 37 | end 38 | 39 | def hlog_ctx(ctx, &block) 40 | @components ||= [] 41 | @components << ctx 42 | yield 43 | ensure 44 | @components.pop 45 | end 46 | 47 | def hlog(string) 48 | return unless @@enabled 49 | puts "[#{hlog_ctx_string}] #{string}" 50 | end 51 | 52 | def hlog_ctx_string 53 | @components.join ',' 54 | end 55 | end 56 | end 57 | 58 | require 'vector/cli' 59 | require 'vector/functions/predictive_scaling' 60 | require 'vector/functions/flexible_down_scaling' 61 | -------------------------------------------------------------------------------- /lib/vector/cli.rb: -------------------------------------------------------------------------------- 1 | require 'optparse' 2 | require 'aws-sdk' 3 | require 'aws/auto_scaling/fleets' 4 | require 'vector/functions/flexible_down_scaling' 5 | require 'vector/functions/predictive_scaling' 6 | 7 | module Vector 8 | class CLI 9 | def initialize(argv) 10 | @argv = argv 11 | end 12 | 13 | def run 14 | load_config 15 | 16 | auto_scaling = AWS::AutoScaling.new(:region => @config[:region]) 17 | cloudwatch = AWS::CloudWatch.new(:region => @config[:region]) 18 | 19 | # everything we do should be fine looking at a snapshot in time, 20 | # so memoizing should be fine when acting as a CLI. 
21 | AWS.start_memoizing 22 | 23 | groups = if @config[:fleet] 24 | auto_scaling.fleets[@config[:fleet]].groups 25 | else 26 | @config[:groups].map do |group_name| 27 | auto_scaling.groups[group_name] 28 | end 29 | end 30 | 31 | ps = nil 32 | if @config[:predictive_scaling][:enabled] 33 | psconf = @config[:predictive_scaling] 34 | ps = Vector::Function::PredictiveScaling.new( 35 | { :cloudwatch => cloudwatch, :dry_run => @config[:dry_run] }.merge(psconf)) 36 | end 37 | 38 | fds = nil 39 | if @config[:flexible_down_scaling][:enabled] 40 | fdsconf = @config[:flexible_down_scaling] 41 | fds = Vector::Function::FlexibleDownScaling.new( 42 | { :cloudwatch => cloudwatch, :dry_run => @config[:dry_run] }.merge(fdsconf)) 43 | end 44 | 45 | groups.each do |group| 46 | begin 47 | ps_check_procs = nil 48 | 49 | if ps 50 | ps_result = ps.run_for(group) 51 | ps_check_procs = ps_result[:check_procs] 52 | 53 | if ps_result[:triggered] 54 | # Don't need to evaluate for scaledown if we triggered a scaleup 55 | next 56 | end 57 | end 58 | 59 | if fds 60 | fds.run_for(group, ps_check_procs) 61 | end 62 | 63 | rescue => e 64 | puts "error for #{group.name}: #{e.inspect}\n#{e.backtrace.join "\n"}" 65 | end 66 | end 67 | end 68 | 69 | protected 70 | 71 | def load_config 72 | opts = { 73 | :quiet => false, 74 | :dry_run => false, 75 | :region => 'us-east-1', 76 | :groups => [], 77 | :fleet => nil, 78 | :predictive_scaling => { 79 | :enabled => false, 80 | :lookback_windows => [], 81 | :lookahead_window => nil, 82 | :valid_threshold => nil, 83 | :valid_period => 60 * 10 84 | }, 85 | :flexible_down_scaling => { 86 | :enabled => false, 87 | :up_down_cooldown => nil, 88 | :down_down_cooldown => nil, 89 | :max_sunk_cost => nil, 90 | :variable_thresholds => false, 91 | :n_low => nil, 92 | :n_high => nil, 93 | :m => nil, 94 | :g_high => 1.0, 95 | :g_low => 1.0 96 | } 97 | } 98 | 99 | optparser = OptionParser.new do |o| 100 | o.banner = "Usage: vector [options]" 101 | o.separator "DURATION can 
look like 60s, 1m, 5h, 7d, 1w" 102 | o.set_summary_width 5 103 | o.set_summary_indent ' ' 104 | 105 | def wrap(str) 106 | str.scan(/\S.{0,#{60}}\S(?=\s|$)|\S+/).join "\n " 107 | end 108 | 109 | o.on("--timezone TIMEZONE", wrap("Timezone to use for date calculations (like America/Denver) (default: system timezone)")) do |v| 110 | Time.zone = v 111 | end 112 | 113 | o.on("--region REGION", wrap("AWS region to operate in (default: us-east-1)")) do |v| 114 | opts[:region] = v 115 | end 116 | 117 | o.on("--groups group1,group2", Array, wrap("A list of Auto Scaling Groups to evaluate")) do |v| 118 | opts[:groups] = v 119 | end 120 | 121 | o.on("--fleet fleet", wrap("An AWS ASG Fleet (instead of specifying --groups)")) do |v| 122 | opts[:fleet] = v 123 | end 124 | 125 | o.on("--[no-]dry-run", wrap("Don't actually trigger any policies")) do |v| 126 | opts[:dry_run] = v 127 | end 128 | 129 | o.on("-q", "--[no-]quiet", wrap("Run quietly")) do |v| 130 | opts[:quiet] = v 131 | end 132 | 133 | o.separator "" 134 | o.separator "Predictive Scaling Options" 135 | 136 | o.on("--[no-]ps", wrap("Enable Predictive Scaling")) do |v| 137 | opts[:predictive_scaling][:enabled] = v 138 | end 139 | 140 | o.on("--ps-lookback-windows DURATION,DURATION", Array, wrap("List of lookback windows")) do |v| 141 | opts[:predictive_scaling][:lookback_windows] = 142 | v.map {|w| Vector.time_string_to_seconds(w) } 143 | end 144 | 145 | o.on("--ps-lookahead-window DURATION", String, wrap("Lookahead window")) do |v| 146 | opts[:predictive_scaling][:lookahead_window] = 147 | Vector.time_string_to_seconds(v) 148 | end 149 | 150 | o.on("--ps-valid-threshold FLOAT", Float, wrap("A number from 0.0 - 1.0 specifying how closely previous load must match current load for Predictive Scaling to take effect")) do |v| 151 | opts[:predictive_scaling][:valid_threshold] = v 152 | end 153 | 154 | o.on("--ps-valid-period DURATION", String, wrap("The period to use when doing the threshold check")) do |v| 155 | 
opts[:predictive_scaling][:valid_period] = 156 | Vector.time_string_to_seconds v 157 | end 158 | 159 | o.separator "" 160 | o.separator "Flexible Down Scaling Options" 161 | 162 | o.on("--[no-]fds", wrap("Enable Flexible Down Scaling")) do |v| 163 | opts[:flexible_down_scaling][:enabled] = v 164 | end 165 | 166 | o.on("--fds-up-to-down DURATION", String, wrap("The cooldown period between up and down scale events")) do |v| 167 | opts[:flexible_down_scaling][:up_down_cooldown] = 168 | Vector.time_string_to_seconds v 169 | end 170 | 171 | o.on("--fds-down-to-down DURATION", String, wrap("The cooldown period between down and down scale events")) do |v| 172 | opts[:flexible_down_scaling][:down_down_cooldown] = 173 | Vector.time_string_to_seconds v 174 | end 175 | 176 | o.on("--fds-max-sunk-cost DURATION", String, wrap("Only let a scaledown occur if there is an instance this close to its hourly billing point")) do |v| 177 | time = Vector.time_string_to_seconds v 178 | if time > 1.hour 179 | puts "--fds-max-sunk-cost duration must be < 1 hour" 180 | exit 1 181 | end 182 | 183 | opts[:flexible_down_scaling][:max_sunk_cost] = time 184 | end 185 | 186 | o.separator "" 187 | o.on("--[no-]fds-variable-thresholds", wrap("Enable Variable Thresholds")) do |v| 188 | opts[:flexible_down_scaling][:variable_thresholds] = v 189 | end 190 | 191 | o.on("--fds-n-low NUM", Integer, wrap("Number of nodes corresponding to --fds-g-low. (default: 1 more than the group's minimum size)")) do |v| 192 | opts[:flexible_down_scaling][:n_low] = v 193 | end 194 | 195 | o.on("--fds-n-high NUM", Integer, wrap("Number of nodes corresponding to --fds-g-high. (default: the group's maximum size)")) do |v| 196 | opts[:flexible_down_scaling][:n_high] = v 197 | end 198 | 199 | o.on("--fds-m PERCENTAGE", Float, wrap("Maximum target utilization. 
Will default to the CPUUtilization alarm threshold.")) do |v| 200 | opts[:flexible_down_scaling][:m] = v / 100 201 | end 202 | 203 | o.on("--fds-g-high PERCENTAGE", Float, wrap("Capacity headroom to apply when scaling down from --fds-n-high nodes, as a percentage. e.g. if this is 90%, then will not scale down from --fds-n-high nodes until expected utilization on the remaining nodes is at or below 90% of --fds-m. (default: 100)")) do |v| 204 | opts[:flexible_down_scaling][:g_high] = v / 100 205 | end 206 | 207 | o.on("--fds-g-low PERCENTAGE", Float, wrap("Capacity headroom to apply when scaling down from --fds-n-low nodes, as a percentage. e.g. if this is 75%, then will not scale down from --fds-n-low nodes until expected utilization on the remaining nodes is at or below 75% of --fds-m. When scaling down from a number of nodes other than --fds-n-high or --fds-n-low, will use a capacity headroom linearly interpolated from --fds-g-high and --fds-g-low. (default: 100)")) do |v| 208 | opts[:flexible_down_scaling][:g_low] = v / 100 209 | end 210 | 211 | o.on("--fds-print-variable-thresholds", wrap("Calculates and displays the thresholds that will be used for each asg, and does not execute any downscaling policies. (For debugging).")) do |v| 212 | opts[:flexible_down_scaling][:print_variable_thresholds] = true 213 | end 214 | 215 | end.parse!(@argv) 216 | 217 | if opts[:groups].empty? && opts[:fleet].nil? 218 | puts "No groups were specified." 219 | exit 1 220 | end 221 | 222 | if !opts[:groups].empty? && !opts[:fleet].nil? 223 | puts "You can't specify --groups and --fleet." 224 | exit 1 225 | end 226 | 227 | if opts[:predictive_scaling][:enabled] 228 | ps = opts[:predictive_scaling] 229 | if ps[:lookback_windows].empty? || ps[:lookahead_window].nil? 230 | puts "You must specify lookback windows and a lookahead window for Predictive Scaling." 
231 | exit 1 232 | end 233 | end 234 | 235 | if opts[:flexible_down_scaling][:enabled] 236 | fds = opts[:flexible_down_scaling] 237 | if fds[:up_down_cooldown].nil? || 238 | fds[:down_down_cooldown].nil? 239 | puts "You must specify both up-to-down and down-to-down cooldown periods for Flexible Down Scaling." 240 | exit 1 241 | end 242 | end 243 | 244 | Vector::HLogger.enable(!opts[:quiet]) 245 | 246 | @config = opts 247 | end 248 | end 249 | end 250 | -------------------------------------------------------------------------------- /lib/vector/functions/flexible_down_scaling.rb: -------------------------------------------------------------------------------- 1 | require 'vector' 2 | 3 | module Vector 4 | module Function 5 | class FlexibleDownScaling 6 | include Vector::HLogger 7 | 8 | def initialize(options) 9 | @cloudwatch = options[:cloudwatch] 10 | @dry_run = options[:dry_run] 11 | @up_down_cooldown = options[:up_down_cooldown] 12 | @down_down_cooldown = options[:down_down_cooldown] 13 | @max_sunk_cost = options[:max_sunk_cost] 14 | @variable_thresholds = options[:variable_thresholds] 15 | @n_low = options[:n_low] 16 | @n_high = options[:n_high] 17 | @m = options[:m] 18 | @g_low = options[:g_low] 19 | @g_high = options[:g_high] 20 | @debug_variable_thresholds = options[:print_variable_thresholds] 21 | end 22 | 23 | def run_for(group, ps_check_procs) 24 | result = { :triggered => false } 25 | 26 | hlog_ctx("fds") do 27 | hlog_ctx("group:#{group.name}") do 28 | # don't check if no config was specified 29 | if @up_down_cooldown.nil? && @down_down_cooldown.nil? 30 | hlog("No cooldown periods specified, exiting") 31 | return result 32 | end 33 | 34 | # don't bother checking for a scaledown if desired capacity is 35 | # already at the minimum size... 
36 | if group.desired_capacity == group.min_size 37 | hlog("Group is already at minimum size, exiting") 38 | return result 39 | end 40 | 41 | scaledown_policies = group.scaling_policies.select do |policy| 42 | policy.scaling_adjustment < 0 43 | end 44 | 45 | scaledown_policies.each do |policy| 46 | hlog_ctx("policy:#{policy.name}") do 47 | # TODO: support adjustment types other than ChangeInCapacity here 48 | if policy.adjustment_type == "ChangeInCapacity" && 49 | ps_check_procs && 50 | ps_check_procs.any? {|ps_check_proc| 51 | ps_check_proc.call(group.desired_capacity + policy.scaling_adjustment, self) } 52 | hlog("Predictive scaleup would trigger a scaleup if group were shrunk") 53 | next 54 | end 55 | 56 | alarms = policy.alarms.keys.map do |alarm_name| 57 | @cloudwatch.alarms[alarm_name] 58 | end 59 | 60 | # only consider disabled alarms (enabled alarms will trigger 61 | # the policy automatically) 62 | disabled_alarms = alarms.select do |alarm| 63 | !alarm.enabled? 64 | end 65 | 66 | # Do this logic first in case the user is just trying to print out 67 | # the thresholds. 68 | if @variable_thresholds 69 | # variable_thresholds currently requires a CPUUtilization alarm to function 70 | vt_cpu_alarm = disabled_alarms.find {|alarm| alarm.metric_name == "CPUUtilization" } 71 | 72 | # remove this alarm from the check, since we're not checking its alarm status 73 | # below. 
74 | disabled_alarms.delete(vt_cpu_alarm) 75 | 76 | unless vt_cpu_alarm 77 | hlog("Variable thresholds requires an alarm on CPUUtilization, skipping") 78 | next 79 | end 80 | 81 | @n_low ||= group.min_size + 1 82 | @n_high ||= group.max_size 83 | @m ||= vt_cpu_alarm.threshold / 100 84 | 85 | if @g_low == @g_high 86 | hlog("g_low == g_high (#{@g_low}), not attempting to use flexible thresholds.") 87 | next 88 | end 89 | 90 | if @n_low == @n_high 91 | hlog("n_low == n_high (#{@n_low}), not attempting to use flexible thresholds.") 92 | next 93 | end 94 | 95 | if @debug_variable_thresholds 96 | puts " n_low: #{@n_low}" 97 | puts " n_high: #{@n_high}" 98 | puts " m: #{@m}" 99 | puts " g_low: #{@g_low}" 100 | puts " g_high: #{@g_high}" 101 | puts 102 | puts " N Threshold" 103 | ([@n_low, group.min_size].min + 1).upto([@n_high, group.max_size].max) do |i| 104 | puts " %2d %.1f%%" % [i, (variable_threshold(i, @n_low, @n_high, @m, @g_low, @g_high) * 100)] 105 | end 106 | next 107 | end 108 | end 109 | 110 | unless disabled_alarms.all? {|alarm| alarm.state_value == "ALARM" } 111 | hlog("Not all alarms are in ALARM state") 112 | next 113 | end 114 | 115 | if @variable_thresholds 116 | threshold = variable_threshold(group.desired_capacity, @n_low, @n_high, @m, @g_low, @g_high) 117 | 118 | stats = vt_cpu_alarm.metric.statistics( 119 | :start_time => Time.now - (vt_cpu_alarm.period * vt_cpu_alarm.evaluation_periods), 120 | :end_time => Time.now, 121 | :statistics => [ vt_cpu_alarm.statistic ], 122 | :period => vt_cpu_alarm.period) 123 | 124 | if stats.datapoints.length < vt_cpu_alarm.evaluation_periods 125 | hlog("Could not get enough datapoints for checking variable threshold"); 126 | next 127 | end 128 | 129 | if stats.datapoints.any? 
{|dp| dp[vt_cpu_alarm.statistic.downcase.to_sym] > (threshold * 100) } 130 | hlog("Not all datapoints are beneath the variable threshold #{(threshold * 100).to_i}: #{stats.datapoints}") 131 | next 132 | end 133 | 134 | hlog("Variable threshold: #{(threshold * 100).to_i}, #{group.desired_capacity} nodes") 135 | end 136 | 137 | unless outside_cooldown_period(group) 138 | hlog("Group is not outside the specified cooldown periods") 139 | next 140 | end 141 | 142 | unless has_eligible_scaledown_instance(group) 143 | hlog("Group does not have an instance eligible for scaledown due to max_sunk_cost") 144 | next 145 | end 146 | 147 | if @dry_run 148 | hlog("Executing policy (DRY RUN)") 149 | else 150 | hlog("Executing policy") 151 | policy.execute(:honor_cooldown => true) 152 | end 153 | 154 | result[:triggered] = true 155 | 156 | # no need to evaluate other scaledown policies 157 | return result 158 | end 159 | end 160 | end 161 | end 162 | 163 | result 164 | end 165 | 166 | protected 167 | 168 | def variable_threshold(n, n_low, n_high, m, g_low, g_high) 169 | m_high = g_high * m 170 | m_low = g_low * m 171 | a = (m_high - m_low).to_f / (n_high - n_low).to_f 172 | b = m_low - (n_low * a) 173 | res = (a * n + b) * (1.0 - (1.0 / n)) 174 | res 175 | end 176 | 177 | def has_eligible_scaledown_instance(group) 178 | return true if @max_sunk_cost.nil? 179 | 180 | group.ec2_instances.select {|i| i.status == :running }.each do |instance| 181 | # get amount of time until hitting the instance renewal time 182 | time_left = ((instance.launch_time.min - Time.now.min) % 60).minutes 183 | 184 | # if we're within 1 minute, assume we won't be able to terminate it 185 | # in time anyway and ignore it. 
186 | if time_left > 1.minute and time_left < @max_sunk_cost 187 | # we only care if there is at least one instance within the window 188 | # where we can scale down 189 | return true 190 | end 191 | end 192 | 193 | false 194 | end 195 | 196 | def outside_cooldown_period(group) 197 | @cached_outside_cooldown ||= {} 198 | if @cached_outside_cooldown.has_key? group 199 | return @cached_outside_cooldown[group] 200 | end 201 | 202 | activities = previous_scaling_activities(group) 203 | return nil if activities.nil? 204 | 205 | if activities[:up] 206 | hlog "Last scale up #{(Time.now - activities[:up]).minutes.inspect} ago" 207 | end 208 | if activities[:down] 209 | hlog "Last scale down #{(Time.now - activities[:down]).minutes.inspect} ago" 210 | end 211 | result = true 212 | 213 | # check up-down 214 | if @up_down_cooldown && activities[:up] && 215 | Time.now - activities[:up] < @up_down_cooldown 216 | result = false 217 | end 218 | 219 | # check down-down 220 | if @down_down_cooldown && activities[:down] && 221 | Time.now - activities[:down] < @down_down_cooldown 222 | result = false 223 | end 224 | 225 | result 226 | end 227 | 228 | # Looks at the GroupDesiredCapacity metric for the specified 229 | # group, and finds the most recent change in value. 230 | # 231 | # @returns 232 | # * nil if there was a problem getting data. There may have been 233 | # scaling events or not, we don't know. 234 | # * a hash with two keys, :up and :down, with values indicating 235 | # when the last corresponding activity happened. If the 236 | # activity was not seen in the examined time period, the value 237 | # is nil. 238 | def previous_scaling_activities(group) 239 | metric = @cloudwatch.metrics. 240 | with_namespace("AWS/AutoScaling"). 241 | with_metric_name("GroupDesiredCapacity"). 
242 | filter('dimensions', [{ 243 | :name => "AutoScalingGroupName", 244 | :value => group.name 245 | }]).first 246 | 247 | return nil unless metric 248 | 249 | start_time = Time.now - [ @up_down_cooldown, @down_down_cooldown ].max 250 | end_time = Time.now 251 | 252 | stats = metric.statistics( 253 | :start_time => start_time, 254 | :end_time => end_time, 255 | :statistics => [ "Average" ], 256 | :period => 60) 257 | 258 | # check if we got enough datapoints... if we didn't, we need to 259 | # assume bad data and inform the caller. this code is basically 260 | # checking if the # of received datapoints is within 50% of the 261 | # expected datapoints. 262 | got_datapoints = stats.datapoints.length 263 | requested_datapoints = (end_time - start_time) / 60 264 | if !Vector.within_threshold(0.5, got_datapoints, requested_datapoints) 265 | return nil 266 | end 267 | 268 | # iterate over the datapoints in reverse, looking for the first 269 | # change in value, which should be the most recent scaling 270 | # activity 271 | activities = { :down => nil, :up => nil } 272 | last_value = nil 273 | stats.datapoints.sort {|a,b| b[:timestamp] <=> a[:timestamp] }.each do |dp| 274 | next if dp[:average].nil? 275 | 276 | unless last_value.nil? 277 | if dp[:average] != last_value 278 | direction = (last_value < dp[:average]) ? :down : :up 279 | activities[direction] ||= dp[:timestamp] 280 | end 281 | end 282 | 283 | last_value = dp[:average] 284 | break unless activities.values.any? {|v| v.nil? 
}
285 |         end
286 | 
287 |         activities
288 |       end
289 |     end
290 |   end
291 | end
292 | 
--------------------------------------------------------------------------------
/lib/vector/functions/predictive_scaling.rb:
--------------------------------------------------------------------------------
1 | module Vector
2 |   module Function
3 |     class PredictiveScaling
4 |       include Vector::HLogger
5 | 
6 |       def initialize(options)
7 |         @cloudwatch = options[:cloudwatch]
8 |         @dry_run = options[:dry_run]
9 |         @lookback_windows = options[:lookback_windows]
10 |         @lookahead_window = options[:lookahead_window]
11 |         @valid_threshold = options[:valid_threshold]
12 |         @valid_period = options[:valid_period]
13 |       end
14 | 
15 |       def run_for(group)
16 |         result = { :check_procs => [], :triggered => false }
17 | 
18 |         hlog_ctx "ps" do
19 |           hlog_ctx "group:#{group.name}" do
20 |             return result if @lookback_windows.length == 0
21 | 
22 |             scaleup_policies = group.scaling_policies.select do |policy|
23 |               policy.scaling_adjustment > 0
24 |             end
25 | 
26 |             scaleup_policies.each do |policy|
27 |               hlog_ctx "policy:#{policy.name}" do
28 | 
29 |                 policy.alarms.keys.each do |alarm_name|
30 |                   alarm = @cloudwatch.alarms[alarm_name]
31 |                   hlog_ctx "alarm:#{alarm.name}" do
32 |                     hlog "Metric #{alarm.metric.name}"
33 | 
34 |                     unless alarm.enabled?
35 |                       hlog "Skipping disabled alarm"
36 |                       next
37 |                     end
38 | 
39 |                     # Note that everywhere we say "load" what we mean is
40 |                     # "metric value * number of nodes"
41 |                     now_load, now_num = load_for(group, alarm.metric,
42 |                       Time.now, @valid_period)
43 | 
44 |                     if now_load.nil?
45 |                       hlog "Could not get current total for metric"
46 |                       next
47 |                     end
48 | 
49 |                     @lookback_windows.each do |window|
50 |                       hlog_ctx "window:#{window.inspect.gsub ' ', ''}" do
51 |                         then_load, = load_for(group, alarm.metric,
52 |                           Time.now - window, @valid_period)
53 | 
54 |                         if then_load.nil?
55 |                           hlog "Could not get past total value for metric"
56 |                           next
57 |                         end
58 | 
59 |                         # check that the past total utilization is within
60 |                         # threshold% of the current total utilization
61 |                         if @valid_threshold &&
62 |                            !Vector.within_threshold(@valid_threshold, now_load, then_load)
63 |                           hlog "Past metric total value not within threshold (current #{now_load}, then #{then_load})"
64 |                           next
65 |                         end
66 | 
67 |                         past_load, = load_for(group, alarm.metric,
68 |                           Time.now - window + @lookahead_window,
69 |                           alarm.period)
70 | 
71 |                         if past_load.nil?
72 |                           hlog "Could not get past + #{@lookahead_window.inspect} total value for metric"
73 |                           next
74 |                         end
75 | 
76 |                         # now take the past total load and divide it by the
77 |                         # current number of instances to get the predicted value
78 | 
79 |                         # (we capture our original log context here in order to display
80 |                         # the source of these checks later when this proc is called by
81 |                         # scaledown stuff).
82 |                         orig_ctx = hlog_ctx_string
83 |                         check_proc = Proc.new do |num_nodes, logger|
84 |                           predicted_value = past_load.to_f / num_nodes
85 | 
86 |                           log_str = "Predicted #{alarm.metric.name}: #{predicted_value} (#{num_nodes} nodes)"
87 | 
88 |                           # Tack on the original context if we're in a different logger
89 |                           # (for the case where this is called during scaledown checks).
90 |                           if orig_ctx != logger.hlog_ctx_string
91 |                             log_str += " (from #{orig_ctx})"
92 |                           end
93 | 
94 |                           logger.hlog log_str
95 | 
96 |                           check_alarm_threshold(alarm, predicted_value)
97 |                         end
98 |                         result[:check_procs] << check_proc
99 | 
100 |                         if check_proc.call(now_num, self)
101 |                           if @dry_run
102 |                             hlog "Executing policy (DRY RUN)"
103 |                           else
104 |                             hlog "Executing policy"
105 |                             policy.execute(honor_cooldown: true)
106 |                           end
107 | 
108 |                           result[:triggered] = true
109 | 
110 |                           # don't need to evaluate further windows or policies on this group
111 |                           return result
112 |                         end
113 |                       end
114 |                     end
115 |                   end
116 |                 end
117 |               end
118 |             end
119 |           end
120 |         end
121 | 
122 |         result
123 |       end
124 | 
125 |       protected
126 | 
127 |       def check_alarm_threshold(alarm, value)
128 |         case alarm.comparison_operator
129 |         when "GreaterThanOrEqualToThreshold"
130 |           value >= alarm.threshold
131 |         when "GreaterThanThreshold"
132 |           value > alarm.threshold
133 |         when "LessThanThreshold"
134 |           value < alarm.threshold
135 |         when "LessThanOrEqualToThreshold"
136 |           value <= alarm.threshold
137 |         end
138 |       end
139 | 
140 |       def load_for(group, metric, time, window)
141 |         num_instances_metric = @cloudwatch.metrics.
142 |           with_namespace("AWS/AutoScaling").
143 |           with_metric_name("GroupInServiceInstances").
144 |           filter('dimensions', [{
145 |             :name => 'AutoScalingGroupName',
146 |             :value => group.name
147 |           }]).first
148 | 
149 |         unless num_instances_metric
150 |           raise "Could not find GroupInServiceInstances metric for #{group.name}"
151 |         end
152 | 
153 |         start_time = time - (window / 2)
154 |         end_time = time + (window / 2)
155 | 
156 |         avg = average_for_metric(metric, start_time, end_time)
157 |         num = average_for_metric(num_instances_metric, start_time, end_time)
158 | 
159 |         if avg.nil?
160 |           return [ nil, nil ]
161 |         end
162 | 
163 |         [ avg * num, num ]
164 |       end
165 | 
166 |       def average_for_metric(metric, start_time, end_time)
167 |         stats = metric.statistics(
168 |           :start_time => start_time,
169 |           :end_time => end_time,
170 |           :statistics => [ "Average" ],
171 |           :period => 60)
172 | 
173 |         return nil if stats.datapoints.length == 0
174 | 
175 |         sum = stats.datapoints.inject(0) do |r, dp|
176 |           r + dp[:average]
177 |         end
178 | 
179 |         sum.to_f / stats.datapoints.length
180 |       end
181 |     end
182 |   end
183 | end
184 | 
--------------------------------------------------------------------------------
/lib/vector/version.rb:
--------------------------------------------------------------------------------
1 | module Vector
2 |   VERSION = "0.0.6"
3 | end
4 | 
--------------------------------------------------------------------------------
/vector.gemspec:
--------------------------------------------------------------------------------
1 | # coding: utf-8
2 | lib = File.expand_path('../lib', __FILE__)
3 | $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4 | require 'vector/version'
5 | 
6 | Gem::Specification.new do |spec|
7 |   spec.name          = "vector"
8 |   spec.version       = Vector::VERSION
9 |   spec.authors       = ["Zach Wily"]
10 |   spec.email         = ["zach@zwily.com"]
11 |   spec.summary       = %q{AWS Auto-Scaling Assistant}
12 |   spec.homepage      = "http://github.com/instructure/vector"
13 |   spec.license       = "MIT"
14 | 
15 |   spec.files         = `git ls-files`.split($/)
16 |   spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
17 |   spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
18 |   spec.require_paths = ["lib"]
19 | 
20 |   spec.add_dependency "aws-sdk"
21 |   spec.add_dependency "aws-asg-fleet"
22 |   spec.add_dependency "activesupport"
23 | 
24 |   spec.add_development_dependency "bundler", "~> 1.3"
25 |   spec.add_development_dependency "rake"
26 | end
--------------------------------------------------------------------------------
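Both the datapoint-count check in `previous_scaling_activities` and the past-vs-current load check in `predictive_scaling.rb` call `Vector.within_threshold`, a helper defined in `lib/vector.rb` (not included in this excerpt). The sketch below is a hypothetical re-implementation inferred from those call sites, illustrating only the semantics they rely on; the gem's actual helper may differ:

```ruby
# Hypothetical sketch of Vector.within_threshold, inferred from its call
# sites above; the real implementation lives in lib/vector.rb.
module VectorSketch
  # True when `actual` is within `threshold` (a fraction, e.g. 0.5 = 50%)
  # of `expected`, relative to `expected`.
  def self.within_threshold(threshold, actual, expected)
    return false if expected.zero?
    (actual - expected).abs / expected.to_f <= threshold
  end
end

# 45 of 60 expected CloudWatch datapoints arrived: 25% off, acceptable.
VectorSketch.within_threshold(0.5, 45, 60)  # => true
# Only 20 of 60 arrived: ~67% off, so the data is treated as unreliable.
VectorSketch.within_threshold(0.5, 20, 60)  # => false
```

Under this reading, `within_threshold(0.5, got_datapoints, requested_datapoints)` returning false makes `previous_scaling_activities` return `nil`, signalling to the cooldown logic that the CloudWatch data was too sparse to trust.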