├── requirements.txt ├── NOTICE ├── .github └── PULL_REQUEST_TEMPLATE.md ├── es-create-cwalarms-iamrole.json ├── README.md ├── LICENSE.txt ├── es-create-cwalarms.py └── es-check-cwalarms.py /requirements.txt: -------------------------------------------------------------------------------- 1 | boto3>=1.4.7 2 | argparse>=1.4.0 3 | 4 | 5 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Check CloudWatch Alarms for Elasticsearch 2 | Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. 3 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | *Issue #, if available:* 2 | 3 | *Description of changes:* 4 | 5 | 6 | By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. 
7 | -------------------------------------------------------------------------------- /es-create-cwalarms-iamrole.json: -------------------------------------------------------------------------------- 1 | { 2 | "Version": "2012-10-17", 3 | "Statement": [ 4 | { 5 | "Effect": "Allow", 6 | "Action": "logs:CreateLogGroup", 7 | "Resource": "*" 8 | }, 9 | { 10 | "Effect": "Allow", 11 | "Action": [ 12 | "logs:CreateLogStream", 13 | "logs:PutLogEvents" 14 | ], 15 | "Resource": [ 16 | "*" 17 | ] 18 | }, 19 | { 20 | "Effect": "Allow", 21 | "Action": [ 22 | "es:ESHttp*" 23 | ], 24 | "Resource": "*" 25 | }, 26 | { 27 | "Sid": "EsDomains", 28 | "Effect": "Allow", 29 | "Action": [ 30 | "es:ListDomainNames" 31 | ], 32 | "Resource": "*" 33 | }, 34 | { 35 | "Sid": "CloudwatchAlarms", 36 | "Effect": "Allow", 37 | "Action": [ 38 | "cloudwatch:DescribeAlarms", 39 | "cloudwatch:DescribeAlarmsForMetric", 40 | "cloudwatch:DisableAlarmActions", 41 | "cloudwatch:EnableAlarmActions", 42 | "cloudwatch:ListMetrics", 43 | "cloudwatch:PutMetricAlarm" 44 | ], 45 | "Resource": [ 46 | "*" 47 | ] 48 | } 49 | ] 50 | } 51 | 52 | 53 | 54 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | Amazon Elasticsearch Service Stats and Alarms 3 | ==================== 4 | This is a set of scripts that addresses some common needs found in a complex Amazon Elasticsearch Service environment. It currently consists of two scripts: 5 | 6 | 1. es-create-cwalarms.py: script that creates a set of recommended CloudWatch alarms for specific Amazon ES metrics. 7 | 2. es-check-cwalarms.py: script that can be run stand-alone or as a scheduled Lambda to check whether the correct alarms are set. 8 | 9 | The scripts are provided as starting points; while they provide some generally useful functionality, they will likely need customization for your specific environment and requirements. 
10 | 11 | The scripts are written in Python 2.7 and use Boto3. 12 | 13 | Note that the terms "domain" (used by Amazon Elasticsearch Service) and "cluster" (used colloquially) are used interchangeably here. 14 | 15 | es-check-cwalarms-lambda 16 | ------ 17 | This Python 2.7 script checks whether the recommended/desired alarms are set. It can be run from the command line or set up as a scheduled Lambda. This is useful, for example, when clusters were originally set up without the desired alarms, or when configuration changes (cluster size increases being a common one) require updates to alarms. 18 | 19 | The script assumes you have the permissions listed in the provided sample IAM role (es-create-cwalarms-iamrole.json). 20 | 21 | The desired alarms are listed in an array embedded in the code (esAlarms). This ugly hack allows parameters to be updated based on current domain stats. 22 | ``` 23 | >python es-checkcwalarms.py -h 24 | usage: es-checkcwalarms.py [-h] [-e ESPREFIX] [-n NOTIFY] 25 | [-f FREE] [-p PROFILE] [-r REGION] 26 | 27 | Checks a set of recommended CloudWatch alarms for Amazon Elasticsearch Service domains (optionally, those beginning with a given prefix). 28 | 29 | optional arguments: 30 | -h, --help show this help message and exit 31 | -e ESPREFIX, --esprefix ESPREFIX 32 | Only check Amazon Elasticsearch Service domains that begin with this prefix. 33 | -n NOTIFY, --notify NOTIFY 34 | List of CloudWatch alarm actions; e.g. ['arn:aws:sns:xxxx'] 35 | -f FREE, --free FREE Minimum free storage (MB) on which to alarm 36 | -p PROFILE, --profile PROFILE 37 | IAM profile name to use 38 | -r REGION, --region REGION 39 | AWS region to check. 
Defaults to us-east-1 40 | ``` 41 | Sample of output: 42 | ``` 43 | List of Amazon Elasticsearch Service domains starting with given prefix (awses): [u'awsestest53'] 44 | ======================================================================================================= 45 | Starting checks for Amazon Elasticsearch Service domain awsestest53 , version is 5.3 46 | Automated snapshot hour is: 0 47 | 2 instances t2.small.elasticsearch 48 | awsestest53 Does not have Dedicated Masters!! 49 | awsestest53 Does not have Zone Awareness enabled!! 50 | EBS enabled: True type: gp2 size (GB): 10 0 Iops; 20480 total storage (MB) 51 | Desired free storage set to (in MB): 4096.0 52 | awsestest53 Test-Elasticsearch-awsestest53-ClusterStatus.yellow-Alarm ClusterStatus.yellow Alarm ok; definition matches. 53 | awsestest53 Test-Elasticsearch-awsestest53-CPUUtilization-Alarm Threshold does not match. Should be: 80.0 ; is 70.0 54 | awsestest53 Test-Elasticsearch-awsestest53-CPUUtilization-Alarm AlarmActions does not match. Should be: ['arn:aws:sns:us-west-2:123456789012:sendnotification'] ; is ['arn:aws:sns:us-west-2:123456789000:sendnotification'] 55 | 56 | ``` 57 | 58 | To run as a Lambda: 59 | Follow the instructions [here](https://docs.aws.amazon.com/lambda/latest/dg/get-started-create-function.html), with the following changes: 60 | * For Runtime, use Python 2.7 61 | * For IAM Role, use a Lambda execution role with the added permissions in the provided sample IAM role (es-create-cwalarms-iamrole.json) 62 | * For Timeout: increase the value to 3 minutes (depending on the number of domains you have) 63 | * If desired, add environment variables: 64 | * esprefix: only check domains beginning with this prefix 65 | * esfree: minimum free space 66 | * Edit the Lambda code inline, and paste in the contents of es-check-cwalarms.py. 67 | 68 | Save and test the Lambda. 
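
The two optional environment variables listed above could be read inside the Lambda roughly as follows. This is a minimal sketch, not the code in es-check-cwalarms.py: the helper names `read_lambda_settings` and `filter_domains` are hypothetical, and the 2048 MB floor is an assumption that mirrors `MIN_ES_FREESPACE` in es-create-cwalarms.py.

```python
import os

# Assumption: mirrors MIN_ES_FREESPACE (MB) in es-create-cwalarms.py.
DEFAULT_FREE_MB = 2048.0


def read_lambda_settings(environ=None):
    """Hypothetical helper: read the optional esprefix/esfree settings."""
    environ = os.environ if environ is None else environ
    prefix = environ.get("esprefix", "")  # empty prefix matches every domain
    try:
        free_mb = float(environ.get("esfree", DEFAULT_FREE_MB))
    except ValueError:
        free_mb = DEFAULT_FREE_MB  # fall back if the value is not numeric
    # Never go below the assumed minimum free storage.
    return prefix, max(free_mb, DEFAULT_FREE_MB)


def filter_domains(domain_names, prefix):
    """Hypothetical helper: keep only domain names that begin with prefix."""
    return [name for name in domain_names if name.startswith(prefix)]
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out locally before pasting anything into the Lambda console.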
69 | 70 | es-create-cwalarms 71 | ------- 72 | This Python 2.7 script can be run from the command line to create the recommended set of alarms for a specific Amazon ES cluster. 73 | 74 | It assumes you have the permissions listed in the provided sample IAM role (es-create-cwalarms-iamrole.json). 75 | 76 | The alarms to be created are listed in an array embedded in the code (esAlarms). This ugly hack allows parameters to be updated based on current domain stats. 77 | ``` 78 | > python es-create-cwalarms.py -h 79 | Starting ... at 2017-09-13 22:40:08.493000 80 | usage: es-create-cwalarms.py [-h] -c CLUSTER -a ACCOUNT [-e ENV] [-n NOTIFY] [-f FREE] [-p PROFILE] [-r REGION] 81 | 82 | Create a set of recommended CloudWatch alarms for a given Amazon Elasticsearch Service domain. 83 | 84 | optional arguments: 85 | -h, --help show this help message and exit 86 | -c CLUSTER, --cluster CLUSTER 87 | Amazon Elasticsearch domain name (e.g., testcluster1) 88 | -a ACCOUNT, --account ACCOUNT 89 | AWS account id of the owning account (needed for metric dimension). 90 | -e ENV, --env ENV Environment (e.g., Test, or Prod). Prepended to the alarm name. 91 | -n NOTIFY, --notify NOTIFY 92 | List of CloudWatch alarm actions; e.g. ['arn:aws:sns:xxxx'] 93 | -f FREE, --free FREE Minimum free storage (MB) on which to alarm 94 | -p PROFILE, --profile PROFILE 95 | IAM profile name to use 96 | -r REGION, --region REGION 97 | AWS region for the cluster. 98 | > python es-create-cwalarms.py -c testcluster -a 123456789012 -e Test -p iamuser1 -r us-west-2 99 | ``` 100 | Check the results by going to the CloudWatch Alarms console, or by running es-check-cwalarms.py (see above). 
The naming convention for the alarms created is: 101 | ```{Environment}-{domain}-{MetricName}-alarm``` 102 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 
35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. 
In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. 
We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright 2018 Amazon.com 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /es-create-cwalarms.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | Given the name of an Amazon Elasticsearch Service cluster, create the set of recommended CloudWatch alarms. 4 | 5 | Naming convention for the alarms are: {Environment}-{domain}-{MetricName}-alarm 6 | 7 | Requires the following permissions: 8 | * create CloudWatch alarms 9 | (per http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/permissions-reference-cw.html ) 10 | cloudwatch:DescribeAlarms 11 | cloudwatch:DescribeAlarmsForMetric 12 | cloudwatch:EnableAlarmActions | DisableAlarmActions (depending on options chosen) 13 | cloudwatch:PutMetricAlarm 14 | ... The managed policy CloudWatchFullAccess provides the needed permissions. 15 | 16 | * To check that free space is appropriately defined, also need to be able to check the E/S cluster definitions. 17 | 18 | 19 | Expects the following parameters: 20 | env environment; is used in the Alarm name only. 
Default: Test 21 | clusterName Amazon Elasticsearch Service domain name on which the alarms are to be created 22 | clientId the account Id of the owning AWS account (needed for CloudWatch alarm dimension) 23 | alarmActions list of SNS arns to be notified when the alarm is fired 24 | free minimum amount of free storage to assign, if no other information is available 25 | 26 | @author Veronika Megler 27 | @date August 2017 28 | 29 | """ 30 | 31 | from __future__ import print_function # Python 2/3 compatibility 32 | 33 | # Need to: pip install elasticsearch; boto3; requests_aws4auth; requests 34 | # First: import standard modules 35 | import string 36 | import json 37 | import time 38 | from datetime import datetime 39 | import sys 40 | import os 41 | import logging 42 | import traceback 43 | import boto3 44 | import argparse 45 | import ast 46 | import collections 47 | 48 | DEFAULT_REGION = 'us-east-1' 49 | DEFAULT_SNSTOPIC = "sendnotification" 50 | 51 | Alarm = collections.namedtuple('Alarm', ['metric', 'stat', 'period', 'eval_period', 'operator', 'threshold', 'alarmAction']) 52 | 53 | # Amazon Elasticsearch Service settings 54 | ES_NAME_SPACE = 'AWS/ES' # set for these Amazon Elasticsearch Service alarms 55 | # The following table lists instance types with instance storage, for free storage calculations. 
56 | # It must be updated when instance definitions change 57 | # See: https://aws.amazon.com/elasticsearch-service/pricing/ , select your region 58 | # Definitions are in GB 59 | DISK_SPACE = {"r3.large.elasticsearch": 32, 60 | "r3.xlarge.elasticsearch": 80, 61 | "r3.2xlarge.elasticsearch": 160, 62 | "r3.4xlarge.elasticsearch": 320, 63 | "r3.8xlarge.elasticsearch": 640, 64 | "m3.medium.elasticsearch": 4, 65 | "m3.large.elasticsearch": 32, 66 | "m3.xlarge.elasticsearch": 80, 67 | "m3.2xlarge.elasticsearch": 160, 68 | "i2.xlarge.elasticsearch": 800, 69 | "i2.2xlarge.elasticsearch": 1600, 70 | "i3.large.elasticsearch": 475, 71 | "i3.xlarge.elasticsearch": 950, 72 | "i3.2xlarge.elasticsearch": 1900, 73 | "i3.4xlarge.elasticsearch": 3800, 74 | "i3.8xlarge.elasticsearch": 7600, 75 | "i3.16xlarge.elasticsearch": 15200 76 | } 77 | 78 | MIN_ES_FREESPACE = 2048.0 # default amount of free space (in MB). ALSO minimum set by AWS ES 79 | MIN_ES_FREESPACE_PERCENT = .25 # Alarm threshold as a fraction of total storage (0.25 = 25% free space) 80 | DEFAULT_ES_FREESPACE = MIN_ES_FREESPACE 81 | 82 | LOG_LEVELS = {'CRITICAL': 50, 'ERROR': 40, 'WARNING': 30, 'INFO': 20, 'DEBUG': 10} 83 | 84 | def init_logging(): 85 | # Setup logging because debugging with print can get ugly. 86 | logger = logging.getLogger() 87 | logger.setLevel(logging.DEBUG) 88 | logging.getLogger("boto3").setLevel(logging.WARNING) 89 | logging.getLogger('botocore').setLevel(logging.WARNING) 90 | logging.getLogger('nose').setLevel(logging.WARNING) 91 | 92 | return logger 93 | 94 | def setup_local_logging(logger, log_level = 'INFO'): 95 | # Set the Logger so if running locally, it will print out to the main screen. 
96 | handler = logging.StreamHandler() 97 | formatter = logging.Formatter( 98 | '%(asctime)s %(name)-12s %(levelname)-8s %(message)s',datefmt='%Y-%m-%dT%H:%M:%SZ' 99 | ) 100 | handler.setFormatter(formatter) 101 | logger.addHandler(handler) 102 | if log_level in LOG_LEVELS: 103 | logger.setLevel(LOG_LEVELS[log_level]) 104 | else: 105 | logger.setLevel(LOG_LEVELS['INFO']) 106 | 107 | return logger 108 | 109 | def set_log_level(logger, log_level = 'INFO'): 110 | # There is some stuff that needs to go here. 111 | if log_level in LOG_LEVELS: 112 | logger.setLevel(LOG_LEVELS[log_level]) 113 | else: 114 | logger.setLevel(LOG_LEVELS['INFO']) 115 | 116 | return logger 117 | 118 | def convert_unicode(data): 119 | ''' 120 | Takes a unicode input, and returns the same as utf-8 121 | ''' 122 | if isinstance(data, basestring): 123 | return str(data) 124 | elif isinstance(data, collections.Mapping): 125 | return dict(map(convert_unicode, data.iteritems())) 126 | elif isinstance(data, collections.Iterable): 127 | return type(data)(map(convert_unicode, data)) 128 | #else: 129 | return data 130 | 131 | def str_convert_unicode(data): 132 | return str(convert_unicode(data)) 133 | 134 | def get_default_alarm_actions(region, account, snstopic): 135 | # A default alarmActions can be hardcoded, to allow for easier standardization. 136 | alarmActions = ["arn:aws:sns:" + str(region) + ":" + str(account) + ":" + str(snstopic)] 137 | return alarmActions 138 | 139 | def get_args(): 140 | """ 141 | Parse command line arguments and populate args object. 
142 | The args object is passed to functions as argument 143 | 144 | Returns: 145 | object (ArgumentParser): arguments and configuration settings 146 | """ 147 | try: 148 | currentAccount = boto3.client('sts').get_caller_identity().get('Account') 149 | except Exception as e: # e.g.: botocore.exceptions.EndpointConnectionError 150 | logger.critical(str(e)) 151 | logger.critical("Exiting.") 152 | sys.exit() 153 | parser = argparse.ArgumentParser(description = 'Create a set of recommended CloudWatch alarms for a given Amazon Elasticsearch Service domain.', 154 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 155 | parser.add_argument('-n', '--notify', nargs='+', required = False, default=[], 156 | help = "List of CloudWatch alarm actions; e.g. ['arn:aws:sns:xxxx']") 157 | parser.add_argument("-c", "--cluster", required = True, type = str, help = "Amazon Elasticsearch Service domain name (e.g., testcluster1)") 158 | parser.add_argument("-a", "--account", required = False, type = str, default=currentAccount, # str, not int: account ids can start with zeros 159 | help = "AWS account id of the owning account (needed for metric dimension).") 160 | parser.add_argument("-f", "--free", required = False, type = float, default=DEFAULT_ES_FREESPACE, 161 | help = "Minimum free storage (MB) on which to alarm") 162 | parser.add_argument("-p", "--profile", required = False, type = str, default='default', 163 | help = "IAM profile name to use") 164 | parser.add_argument("-r", "--region", required = False, type = str, default=DEFAULT_REGION, help = "AWS region for the cluster.") 165 | 166 | if len(sys.argv) == 1: 167 | parser.error('Insufficient arguments provided. Exiting for safety.') 168 | logging.critical("Insufficient arguments provided. 
Exiting for safety.") 169 | sys.exit() 170 | args = parser.parse_args() 171 | # Reset minimum allowable, if less than AWS ES min 172 | if args.free < MIN_ES_FREESPACE: 173 | logger.info("Freespace of " + str(args.free) + " is less than the minimum for AES of " + str(MIN_ES_FREESPACE) + ". Setting to " + str(MIN_ES_FREESPACE)) 174 | args.free = MIN_ES_FREESPACE 175 | args.prog = parser.prog 176 | 177 | if args.notify == []: 178 | args.notify = get_default_alarm_actions(args.region,args.account,DEFAULT_SNSTOPIC) 179 | logger.info("Starting at " + str(datetime.utcnow()) + ". Using parameters: " + str(args)) 180 | return args 181 | 182 | def calc_storage(b3session, domainStatus, wantesfree): 183 | # Calculate the amount of storage the free space alarm requires 184 | esfree = float(wantesfree) # Start with given desired free storage amount 185 | ebs = False 186 | if not 'ElasticsearchClusterConfig' in domainStatus: 187 | logger.error("No ElasticsearchClusterConfig available. Setting desired storage to default: " + str(esfree)) 188 | return esfree 189 | clusterConfig = domainStatus['ElasticsearchClusterConfig'] 190 | is_ebs_enabled = 'EBSOptions' in domainStatus and domainStatus['EBSOptions']['EBSEnabled'] 191 | if is_ebs_enabled: 192 | ebsOptions = domainStatus['EBSOptions'] 193 | iops = "No Iops" 194 | if "Iops" in ebsOptions: 195 | iops = str(ebsOptions['Iops']) + " Iops" 196 | totalStorage = int(clusterConfig['InstanceCount']) * int(ebsOptions['VolumeSize']) * 1024 # Convert to MB 197 | logger.info(' '.join(["EBS enabled:", str(ebsOptions['EBSEnabled']), "type:", str(ebsOptions['VolumeType']), "size (GB):", str(ebsOptions['VolumeSize']), str(iops), str(totalStorage), " total storage (MB)"])) 198 | ebs = True 199 | esfree = float(int(float(totalStorage) * MIN_ES_FREESPACE_PERCENT)) 200 | else: 201 | logger.warning("EBS not in use. 
Using instance storage only.") 202 | if clusterConfig['InstanceType'] in DISK_SPACE: 203 | esfree = DISK_SPACE[clusterConfig['InstanceType']] * 1024 * MIN_ES_FREESPACE_PERCENT * clusterConfig['InstanceCount'] 204 | logger.info("Instance storage definition found for: %s GB; free storage calculated as: %s MB", DISK_SPACE[clusterConfig['InstanceType']], esfree) 205 | else: 206 | # InstanceType not found in DISK_SPACE. What's going on? (some instance types change to/from EBS, over time, it seems) 207 | logger.warning(clusterConfig['InstanceType'] + " is using instance storage, but definition of its disk space is not available.") 208 | logger.info("Desired free storage set to (in MB): " + str(esfree)) 209 | return esfree 210 | 211 | def get_nodes_expected(domainStatus): 212 | return domainStatus['ElasticsearchClusterConfig']['InstanceCount'] 213 | 214 | ############################################################################### 215 | # 216 | # MAIN 217 | # 218 | ############################################################################### 219 | def main(): 220 | startTime = datetime.utcnow() 221 | print("Starting ... 
at {}".format(startTime)) 222 | 223 | global logger 224 | logger = init_logging() 225 | os.environ['log_level'] = os.environ.get('log_level', "INFO") 226 | 227 | logger = setup_local_logging(logger, os.environ['log_level']) 228 | 229 | event = {'log_level': 'INFO'} 230 | os.environ['AWS_REGION'] = os.environ.get('AWS_REGION', DEFAULT_REGION) 231 | 232 | args = get_args() 233 | theAlarmAction = get_default_alarm_actions(args.region, args.account, DEFAULT_SNSTOPIC) if args.notify is None else args.notify 234 | esDomain = args.cluster 235 | b3session = boto3.Session(profile_name=args.profile, region_name=args.region) 236 | 237 | # Get current ES config details 238 | esclient = b3session.client("es") 239 | response = esclient.describe_elasticsearch_domain(DomainName=esDomain) 240 | if 'DomainStatus' not in response: 241 | # For whatever reason, didn't get a response from this domain. 242 | logger.error("No domainStatus response received from domain " + esDomain + "; no alarms created") 243 | return 244 | 245 | domainStatus = response['DomainStatus'] 246 | 247 | # The following array specifies the statistics we wish to create for each Amazon ES cluster. 
248 | # The stats are selected per the following documentation: 249 | # http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains.html#es-managedomains-cloudwatchmetrics 250 | # Array format: 251 | # (MetricName, Statistic, Period, EvaluationPeriods [int], ComparisonOperator, Threshold [float] ) 252 | # ComparisonOperator: 'GreaterThanOrEqualToThreshold'|'GreaterThanThreshold'|'LessThanThreshold'|'LessThanOrEqualToThreshold' 253 | esAlarms = [ 254 | Alarm(metric='ClusterStatus.red', stat='Maximum', period=60, eval_period=1, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction), 255 | Alarm(metric='ClusterStatus.yellow', stat='Maximum', period=60, eval_period=1, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction), 256 | Alarm(metric="FreeStorageSpace", stat="Minimum", period=60, eval_period=1, 257 | operator="LessThanOrEqualToThreshold", threshold=float(calc_storage(b3session, domainStatus, args.free)), alarmAction=theAlarmAction ), 258 | Alarm(metric='ClusterIndexWritesBlocked', stat='Maximum', period=60, eval_period=5, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction), 259 | Alarm(metric='Nodes', stat='Maximum', period=86400, eval_period=1, operator='LessThanThreshold', threshold=float(get_nodes_expected(domainStatus)), alarmAction=theAlarmAction), 260 | Alarm(metric='AutomatedSnapshotFailure', stat='Maximum', period=60, eval_period=1, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction), 261 | Alarm(metric='CPUUtilization', stat='Average', period=900, eval_period=3, operator='GreaterThanOrEqualToThreshold', threshold=80.0, alarmAction=theAlarmAction), 262 | Alarm(metric='JVMMemoryPressure', stat='Average', period=900, eval_period=1, operator='GreaterThanOrEqualToThreshold', threshold=80.0, alarmAction=theAlarmAction), 263 | ] 264 | 265 | if 
domainStatus['ElasticsearchClusterConfig']['DedicatedMasterEnabled']: 266 | # The following alarms apply for domains with dedicated master nodes. 267 | logger.info(esDomain + " has Dedicated Masters. Adding Master alarms.") 268 | esAlarms.append(Alarm(metric='MasterCPUUtilization', stat='Maximum', period=900, eval_period=3, operator='GreaterThanOrEqualToThreshold', threshold=50.0, alarmAction=theAlarmAction)) 269 | esAlarms.append(Alarm(metric='MasterJVMMemoryPressure', stat='Maximum', period=900, eval_period=1, operator='GreaterThanOrEqualToThreshold', threshold=80.0, alarmAction=theAlarmAction)) 270 | esAlarms.append(Alarm(metric='MasterReachableFromNode', stat='Maximum', period=60, eval_period=5, operator='LessThanOrEqualToThreshold', threshold=0.0, alarmAction=theAlarmAction)) 271 | 272 | if "EncryptionAtRestOptions" in domainStatus and domainStatus["EncryptionAtRestOptions"]["Enabled"]: 273 | # The following alarms are available for domains with encryption at rest 274 | logger.info(' '.join([esDomain, "is using encryption - adding KMS key alarms. Key:", str_convert_unicode(domainStatus["EncryptionAtRestOptions"]["KmsKeyId"])])) 275 | esAlarms.append(Alarm(metric='KMSKeyError', stat='Maximum', period=60, eval_period=1, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction)) 276 | esAlarms.append(Alarm(metric='KMSKeyInaccessible', stat='Maximum', period=60, eval_period=5, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction)) 277 | 278 | # Unless you add the correct dimensions, the alarm will not correctly "connect" to the metric 279 | # How do you know what's correct? 
- at the bottom of http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/es-metricscollected.html 280 | dimensions = [ {"Name": "DomainName", "Value": esDomain }, 281 | {"Name": "ClientId", "Value": str(args.account) } 282 | ] 283 | cwclient = b3session.client("cloudwatch", region_name=args.region) 284 | 285 | # For each alarm in the array, create the CloudWatch alarm for this cluster 286 | # NOTE: If you specify an Action with an SNS topic in the wrong region, you'll get a message that you've chosen an invalid region 287 | # on the put_metric_alarm. 288 | theAlarmAction = args.notify 289 | for esAlarm in esAlarms: 290 | alarmName = '-'.join(['AES', esDomain, esAlarm.metric, 'Alarm']) 291 | response = cwclient.put_metric_alarm( 292 | AlarmName=alarmName, 293 | AlarmDescription=alarmName, 294 | #ActionsEnabled=True|False, 295 | #OKActions=['string'], 296 | AlarmActions=esAlarm.alarmAction, 297 | #InsufficientDataActions=['string'], 298 | MetricName=esAlarm.metric, 299 | Namespace=ES_NAME_SPACE, 300 | Statistic=esAlarm.stat, 301 | #ExtendedStatistic='string', 302 | Dimensions=dimensions, 303 | Period=esAlarm.period, #Unit='Seconds'|'Microseconds'|'Milliseconds'|'Bytes'|'Kilobytes'|'Megabytes'|'Gigabytes'|'Terabytes'|'Bits'|'Kilobits'|'Megabits'|'Gigabits'|'Terabits'|'Percent'|'Count'|'Bytes/Second'|'Kilobytes/Second'|'Megabytes/Second'|'Gigabytes/Second'|'Terabytes/Second'|'Bits/Second'|'Kilobits/Second'|'Megabits/Second'|'Gigabits/Second'|'Terabits/Second'|'Count/Second'|'None', 304 | EvaluationPeriods=esAlarm.eval_period, 305 | Threshold=esAlarm.threshold, 306 | ComparisonOperator=esAlarm.operator 307 | #TreatMissingData='string', 308 | #EvaluateLowSampleCountPercentile='string' 309 | ) 310 | logger.info("Created " + alarmName) 311 | 312 | logger.info("Finished creating " + str(len(esAlarms)) + " alarms for domain " + esDomain + "!") 313 | 314 | 315 | 316 | 317 | if __name__ == "__main__": 318 | main() 319 | 
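The alarm-creation loop above can also be expressed as a small pure function that builds the keyword arguments for `put_metric_alarm`, which makes the naming and dimension logic testable without AWS credentials. This is a minimal sketch, not part of the original script: `build_alarm_kwargs` is a hypothetical helper, and the `Alarm` namedtuple is assumed to have the fields used in the calls above (`metric`, `stat`, `period`, `eval_period`, `operator`, `threshold`, `alarmAction`).

```python
import collections

# Assumed to mirror the Alarm namedtuple used by es-create-cwalarms.py.
Alarm = collections.namedtuple(
    "Alarm",
    ["metric", "stat", "period", "eval_period", "operator", "threshold", "alarmAction"],
)

ES_NAME_SPACE = "AWS/ES"


def build_alarm_kwargs(domain, account, alarm):
    """Hypothetical helper: keyword arguments for cloudwatch.put_metric_alarm."""
    alarm_name = "-".join(["AES", domain, alarm.metric, "Alarm"])
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": alarm_name,
        "AlarmActions": alarm.alarmAction,
        "MetricName": alarm.metric,
        "Namespace": ES_NAME_SPACE,
        "Statistic": alarm.stat,
        # Both dimensions are needed for the alarm to attach to the ES metric.
        "Dimensions": [
            {"Name": "DomainName", "Value": domain},
            {"Name": "ClientId", "Value": str(account)},
        ],
        "Period": alarm.period,
        "EvaluationPeriods": alarm.eval_period,
        "Threshold": alarm.threshold,
        "ComparisonOperator": alarm.operator,
    }
```

With such a helper, the loop body would reduce to `cwclient.put_metric_alarm(**build_alarm_kwargs(esDomain, args.account, esAlarm))`.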
-------------------------------------------------------------------------------- /es-check-cwalarms.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | Checks the alarms set up for each Amazon Elasticsearch Service domain in this region. 4 | Can be run as a Lambda or as a standalone Python program. 5 | 6 | Requires the following permissions: 7 | * ability to output Lambda operations messages to CloudWatch logs (logs:*) 8 | Probably only need: logs:CreateLogGroup; logs:CreateLogStream, logs:PutLogEvents 9 | * create CloudWatch alarms 10 | (per http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/permissions-reference-cw.html ) 11 | cloudwatch:DescribeAlarms 12 | cloudwatch:DescribeAlarmsForMetric 13 | cloudwatch:EnableAlarmActions | DisableAlarmActions (depending on options chosen) 14 | cloudwatch:PutMetricAlarm 15 | 16 | * ability to list Elasticsearch domains (es:ESHttpGet) 17 | (per http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-configuration-api.html ) 18 | es:ListDomainNames 19 | 20 | This code can be run from the command line, or as a Lambda function. 21 | If run as a Lambda: expects the following environment variables: 22 | DEFAULT_DOMAIN_PREFIX string prefix for Amazon Elasticsearch Service domain names: only check that set of domains; e.g. 'test-' 23 | region string region in which the Lambda is to be run 24 | # WARNING!! The alarmActions can be hardcoded, to allow for easier standardization. BUT make sure they're what you want! 25 | alarmActions string array of actions; e.g. 
'["arn:aws:sns:us-west-2:123456789012:sendnotification"]' 26 | 27 | @author Veronika Megler 28 | @date August 2017 29 | 30 | """ 31 | from __future__ import print_function # Python 2/3 compatibility 32 | 33 | # https://www.npmjs.com/package/serverless-python-requirements 34 | try: 35 | import unzip_requirements 36 | except ImportError: 37 | pass 38 | 39 | # First: import standard modules 40 | import string 41 | import json 42 | import time 43 | from datetime import datetime 44 | import sys 45 | import os 46 | import logging 47 | import traceback 48 | import boto3 49 | import argparse 50 | import collections 51 | from botocore.exceptions import ClientError # caught in ESDomain.get_domain_stats 52 | MIN_ES_FREESPACE = 2048.0 # default amount of free space (in MB). ALSO minimum set by AWS ES 53 | ES_FREESPACE_PERCENT = .20 # Recommended 20% free space 54 | DEFAULT_ES_FREESPACE = MIN_ES_FREESPACE 55 | 56 | DEFAULT_DOMAIN_PREFIX = "" 57 | DEFAULT_REGION = 'us-east-1' 58 | DEFAULT_SNSTOPIC = 'sendnotification' 59 | DEFAULT_ENVIRONMENT = 'Test' 60 | AWS_REGION = DEFAULT_REGION 61 | 62 | Alarm = collections.namedtuple('Alarm', ['metric', 'stat', 'period', 'evalPeriod', 'operator', 'threshold', 'alarmAction']) 63 | 64 | 65 | 66 | # Amazon Elasticsearch Service settings 67 | ES_NAME_SPACE = 'AWS/ES' # set for these Amazon Elasticsearch Service alarms 68 | # The following table lists instance types with instance storage, for free storage calculations. 
69 | # It must be updated when instance definitions change 70 | # See: https://aws.amazon.com/elasticsearch-service/pricing/ , select your region 71 | # Definitions are in GB 72 | DISK_SPACE = {"r3.large.elasticsearch": 32, 73 | "r3.xlarge.elasticsearch": 80, 74 | "r3.2xlarge.elasticsearch": 160, 75 | "r3.4xlarge.elasticsearch": 320, 76 | "r3.8xlarge.elasticsearch": 640, 77 | "m3.medium.elasticsearch": 4, 78 | "m3.large.elasticsearch": 32, 79 | "m3.xlarge.elasticsearch": 80, 80 | "m3.2xlarge.elasticsearch": 160, 81 | "i2.xlarge.elasticsearch": 800, 82 | "i2.2xlarge.elasticsearch": 1600, 83 | "i3.large.elasticsearch": 475, 84 | "i3.xlarge.elasticsearch": 950, 85 | "i3.2xlarge.elasticsearch": 1900, 86 | "i3.4xlarge.elasticsearch": 3800, 87 | "i3.8xlarge.elasticsearch": 7600, 88 | "i3.16xlarge.elasticsearch": 15200 89 | } 90 | 91 | LOG_LEVELS = {'CRITICAL': 50, 'ERROR': 40, 'WARNING': 30, 'INFO': 20, 'DEBUG': 10} 92 | 93 | def init_logging(): 94 | # Setup logging because debugging with print can get ugly. 95 | logger = logging.getLogger() 96 | logger.setLevel(logging.DEBUG) 97 | logging.getLogger("boto3").setLevel(logging.WARNING) 98 | logging.getLogger('botocore').setLevel(logging.WARNING) 99 | logging.getLogger('nose').setLevel(logging.WARNING) 100 | logging.Formatter.converter = time.gmtime 101 | 102 | return logger 103 | 104 | def setup_local_logging(logger, log_level = 'INFO'): 105 | # Set the Logger so if running locally, it will print out to the main screen. 
106 | handler = logging.StreamHandler() 107 | formatter = logging.Formatter( 108 | '%(asctime)s %(name)-12s %(levelname)-8s %(message)s',datefmt='%Y-%m-%dT%H:%M:%SZ' 109 | ) 110 | handler.setFormatter(formatter) 111 | logger.addHandler(handler) 112 | if log_level in LOG_LEVELS: 113 | logger.setLevel(LOG_LEVELS[log_level]) 114 | else: 115 | logger.setLevel(LOG_LEVELS['INFO']) 116 | 117 | return logger 118 | 119 | def set_log_level(logger, log_level = 'INFO'): 120 | # There is some stuff that needs to go here. 121 | if log_level in LOG_LEVELS: 122 | logger.setLevel(LOG_LEVELS[log_level]) 123 | else: 124 | logger.setLevel(LOG_LEVELS['INFO']) 125 | 126 | return logger 127 | 128 | def get_default_alarm_actions(region, account, snstopic): 129 | # A default alarmActions can be hardcoded, to allow for easier standardization. BUT make sure it's what you want! 130 | alarmActions = ["arn:aws:sns:" + str(region) + ":" + str(account) + ":" + str(snstopic)] 131 | return alarmActions 132 | 133 | def get_args(): 134 | """ 135 | Parse command line arguments and populate args object. 136 | The args object is passed to functions as argument 137 | 138 | Returns: 139 | object (ArgumentParser): arguments and configuration settings 140 | """ 141 | parser = argparse.ArgumentParser(description = 'Create a set of recommended CloudWatch alarms for a given Amazon Elasticsearch Service domain.') 142 | parser.add_argument('-n', '--notify', nargs='+', required = False, default=[], 143 | help = "List of CloudWatch alarm actions; e.g. 
['arn:aws:sns:xxxx']") 144 | parser.add_argument("-e", "--esprefix", required = False, type = str, default = "", 145 | help = "Only check AWS Elasticsearch domains that begin with this prefix.") 146 | parser.add_argument("-f", "--free", required = False, type = float, default=DEFAULT_ES_FREESPACE, help = "Minimum free storage (MB) on which to alarm") 147 | parser.add_argument("-p", "--profile", required = False, type = str, default='default', 148 | help = "IAM profile name to use") 149 | parser.add_argument("-r", "--region", required = False, type = str, default=DEFAULT_REGION, help = "AWS region for the domain. Default: " + DEFAULT_REGION) 150 | 151 | args = parser.parse_args() 152 | args.prog = parser.prog 153 | try: 154 | args.account = boto3.client('sts').get_caller_identity().get('Account') 155 | except Exception as e: # e.g.: botocore.exceptions.EndpointConnectionError 156 | logger.critical(str(e)) 157 | logger.critical("Exiting.") 158 | sys.exit(1) 159 | # Reset minimum allowable, if less than AWS ES min 160 | if args.free < MIN_ES_FREESPACE: 161 | logger.info("Freespace of " + str(args.free) + " is less than the minimum for AES of " + str(MIN_ES_FREESPACE) + ". Setting to " + str(MIN_ES_FREESPACE)) 162 | args.free = MIN_ES_FREESPACE 163 | if args.notify == []: 164 | args.notify = get_default_alarm_actions(args.region,args.account,DEFAULT_SNSTOPIC) 165 | logger.info("Starting at " + str(datetime.utcnow()) + ". 
Using parameters: " + str(args)) 166 | return args 167 | 168 | def convert_unicode(data): 169 | ''' 170 | Takes a unicode input, and returns the same as utf-8 171 | ''' 172 | if isinstance(data, basestring): 173 | return str(data) 174 | elif isinstance(data, collections.Mapping): 175 | return dict(map(convert_unicode, data.iteritems())) 176 | elif isinstance(data, collections.Iterable): 177 | return type(data)(map(convert_unicode, data)) 178 | #else: 179 | return data 180 | 181 | def str_convert_unicode(data): 182 | return str(convert_unicode(data)) 183 | 184 | def get_domains_list(esclient, domainprefix): 185 | # Returns the list of Elasticsearch domains that start with this prefix 186 | domainNamesList = esclient.list_domain_names() 187 | names = map(lambda domain: domain['DomainName'], domainNamesList['DomainNames']) 188 | return [name for name in names if name.startswith(domainprefix)] 189 | 190 | class AlarmChecker(object): 191 | # Check the details of alarm, against expected values for this esAlarm 192 | # Alarm(MetricName, Statistic, Period, EvaluationPeriods [int], ComparisonOperator, Threshold [float], AlarmActions ) 193 | def __init__(self, domain, alarm): 194 | self.alarm = alarm 195 | self.domain = domain 196 | 197 | def check_statistics(self, expected_value): 198 | return self._check_field('Statistic', expected_value) 199 | 200 | def check_period(self, expected_value): 201 | return self._check_field('Period', expected_value) 202 | 203 | def check_evalPeriod(self, expected_value): 204 | return self._check_field('EvaluationPeriods', expected_value) 205 | 206 | def check_operator(self, expected_value): 207 | return self._check_field('ComparisonOperator', expected_value) 208 | 209 | def check_threshold(self, expected_value): 210 | return self._check_field('Threshold', expected_value) 211 | 212 | def check_alarm_actions(self, expected_value): 213 | return self._check_field('AlarmActions', expected_value) 214 | 215 | def _check_field(self, field_name, 
expected_value): 216 | actual_value = self.alarm[field_name] 217 | is_alarm_okay = actual_value == expected_value 218 | if not is_alarm_okay: 219 | logger.warning(' '.join([self.domain, "Alarm:", field_name, "does not match for", self.alarm['AlarmName'], "Should be:", str(expected_value), "but is", str(actual_value)])) 220 | return is_alarm_okay 221 | 222 | class ESDomain(object): 223 | ''' 224 | This class represents the Amazon Elasticsearch Service domain 225 | ''' 226 | 227 | def __init__(self, botoes, domain, desiredEsFree, theAlarmAction): 228 | self.domain = domain 229 | self.dedicatedMasters = False 230 | self.esfree = desiredEsFree # Minimum free to allow, if no other info available 231 | self.ebs = False 232 | self.kmsenabled = False 233 | # The following array specifies the alarms we wish to create for each Amazon ES domain. 234 | # We may need to reset some parameters per domain stats, so we reset it for each domain. 235 | # The stats are selected per the following documentation: 236 | # http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains.html#es-managedomains-cloudwatchmetrics 237 | # Array format: 238 | # (MetricName, Statistic, Period, EvaluationPeriods [int], ComparisonOperator, Threshold [float] ) 239 | # ComparisonOperator: 'GreaterThanOrEqualToThreshold'|'GreaterThanThreshold'|'LessThanThreshold'|'LessThanOrEqualToThreshold' 240 | self.esAlarms = [ 241 | Alarm(metric='ClusterStatus.yellow', stat='Maximum', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction), 242 | Alarm(metric='ClusterStatus.red', stat='Maximum', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction), 243 | Alarm(metric='CPUUtilization', stat='Average', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=80.0, alarmAction=theAlarmAction), 244 | Alarm(metric='JVMMemoryPressure', stat='Average', period=60, 
evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=85.0, alarmAction=theAlarmAction), 245 | Alarm(metric='ClusterIndexWritesBlocked', stat='Maximum', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction) 246 | # OPTIONAL 247 | , Alarm(metric='AutomatedSnapshotFailure', stat='Maximum', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction) 248 | ] 249 | 250 | # For other checks: get basic domain definition, and check options against best practices 251 | try: 252 | self.get_domain_stats(botoes) 253 | except: 254 | self.domainStatus = None 255 | # For whatever reason, didn't get a response from this domain; return the default alarms. 256 | logger.error("No domainStatus response received from domain " + domain + "; no best practices checks performed; not all alarms created") 257 | raise 258 | 259 | # Figure out how much storage the domain has, and should have 260 | self.esAlarms.append(Alarm(metric="FreeStorageSpace", stat="Minimum", period=60, evalPeriod=5, 261 | operator="LessThanOrEqualToThreshold", threshold=float(self.calc_storage()), alarmAction=theAlarmAction ) ) 262 | 263 | if self.dedicatedMasters: 264 | # The following alarms apply for domains with dedicated master nodes. 
265 | self.esAlarms.append(Alarm(metric='MasterCPUUtilization', stat='Maximum', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=80.0, alarmAction=theAlarmAction)) 266 | self.esAlarms.append(Alarm(metric='MasterJVMMemoryPressure', stat='Maximum', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=80.0, alarmAction=theAlarmAction)) 267 | self.esAlarms.append(Alarm(metric='MasterReachableFromNode', stat='Maximum', period=60, evalPeriod=5, operator='LessThanOrEqualToThreshold', threshold=0.0, alarmAction=theAlarmAction)) 268 | 269 | if self.kmsenabled: 270 | # The following alarms are available for domains with encryption at rest 271 | self.esAlarms.append(Alarm(metric='KMSKeyError', stat='Maximum', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction)) 272 | self.esAlarms.append(Alarm(metric='KMSKeyInaccessible', stat='Maximum', period=60, evalPeriod=5, operator='GreaterThanOrEqualToThreshold', threshold=1.0, alarmAction=theAlarmAction)) 273 | 274 | return 275 | 276 | def get_alarms(self): 277 | return self.esAlarms 278 | 279 | def get_esfree(self): 280 | return self.esfree 281 | 282 | def check_vpc_options(self): 283 | # VPC Endpoint 284 | if "VPCOptions" in self.domainStatus: 285 | vpcOptions = self.domainStatus["VPCOptions"] 286 | logger.info(' '.join([self.domain, "VPC:", str_convert_unicode(vpcOptions["VPCId"]), 287 | "AZs:", str_convert_unicode(vpcOptions["AvailabilityZones"]), 288 | "subnets:", str_convert_unicode(vpcOptions["SubnetIds"]), 289 | " security groups:", str_convert_unicode(vpcOptions["SecurityGroupIds"])])) 290 | else: 291 | logger.warning(self.domain + " Not using VPC Endpoint") 292 | return 293 | 294 | def check_encryption_at_rest(self): 295 | # Encryption at rest 296 | is_encryption_enabled = 'EncryptionAtRestOptions' in self.domainStatus and self.domainStatus['EncryptionAtRestOptions']['Enabled'] 297 | if is_encryption_enabled: 298 | 
encryptionAtRestOptions = self.domainStatus["EncryptionAtRestOptions"] 299 | self.kmsenabled = encryptionAtRestOptions["Enabled"] 300 | logger.info(' '.join([self.domain, "EncryptionAtRestOptions: ", str_convert_unicode(encryptionAtRestOptions["Enabled"]), 301 | "Key:", str_convert_unicode(encryptionAtRestOptions["KmsKeyId"])])) 302 | else: 303 | logger.warning(self.domain + " Not using Encryption at Rest") 304 | return 305 | 306 | def check_endpoint(self): 307 | endpoint = None 308 | if "Endpoint" in self.domainStatus: 309 | endpoint = self.domainStatus["Endpoint"] 310 | elif "Endpoints" in self.domainStatus: 311 | endpoint = convert_unicode(self.domainStatus["Endpoints"]["vpc"]) 312 | self.endpoint = endpoint 313 | logger.info(self.domain + " endpoint: " + str(endpoint)) 314 | return 315 | 316 | def get_domain_stats(self, botoes): 317 | # First: get the domain stats, and check the basic domain options against best practices 318 | # TO FIX: If get throttled on this call (beyond boto3 throttling recovery), wait and retry 319 | response = None 320 | domain = self.domain 321 | try: 322 | response = botoes.describe_elasticsearch_domain(DomainName=domain) 323 | except ClientError as e: 324 | logger.error("Error on getting domain stats from " + str(domain) + str(e)) 325 | raise e 326 | domainStatus = response['DomainStatus'] 327 | self.esversion = domainStatus["ElasticsearchVersion"] 328 | self.domainStatus = domainStatus 329 | logger.info("=======================================================================================================") 330 | logger.info("Starting checks for Amazon Elasticsearch Service domain {}, version is {}".format(domain, self.esversion)) 331 | self.check_endpoint() 332 | self.check_vpc_options() 333 | self.check_encryption_at_rest() 334 | 335 | # Zone Awareness 336 | if not domainStatus['ElasticsearchClusterConfig']['ZoneAwarenessEnabled']: 337 | logger.warning(domain + " Does not have Zone Awareness enabled") 338 | else: 339 | 
logger.info(domain + " Has Zone Awareness enabled") 340 | self.nodes_and_masters() 341 | self.log_publishing() 342 | 343 | logger.info(' '.join([domain, "Automated snapshot hour (UTC):", str(self.domainStatus["SnapshotOptions"]['AutomatedSnapshotStartHour'])])) 344 | 345 | return 346 | 347 | def nodes_and_masters(self): 348 | # Data nodes and Masters 349 | clusterConfig = self.domainStatus['ElasticsearchClusterConfig'] 350 | logger.info(self.domain + " Instance configuration: " + str(clusterConfig['InstanceCount']) + " instances; type: " + str(clusterConfig['InstanceType'])) 351 | if int(clusterConfig['InstanceCount']) % 2 == 1: 352 | logger.warning(self.domain + " Instance count is ODD. Best practice is for an even number of data nodes and zone awareness.") 353 | if clusterConfig['DedicatedMasterEnabled']: 354 | self.dedicatedMasters = True 355 | logger.info(' '.join([self.domain, str(clusterConfig['DedicatedMasterCount']), "masters; type:", clusterConfig['DedicatedMasterType']])) 356 | if int(clusterConfig['DedicatedMasterCount']) % 2 == 0: 357 | logger.warning(self.domain + " Dedicated master count is even - risk of split brain.!!") 358 | else: 359 | logger.warning(self.domain + " Does not have Dedicated Masters." + str(clusterConfig['DedicatedMasterEnabled'])) 360 | return 361 | 362 | def log_publishing(self): 363 | if "LogPublishingOptions" in self.domainStatus: 364 | msg = "" 365 | logpub = self.domainStatus["LogPublishingOptions"] 366 | if "INDEX_SLOW_LOGS" in logpub and "Enabled" in logpub["INDEX_SLOW_LOGS"]: 367 | msg += "Index slow logs enabled: " + str(logpub["INDEX_SLOW_LOGS"]["Enabled"]) + ". 
" 368 | if "SEARCH_SLOW_LOGS" in logpub and "Enabled" in logpub["SEARCH_SLOW_LOGS"]: 369 | msg += "Search slow logs enabled: " + str(logpub["SEARCH_SLOW_LOGS"]["Enabled"]) 370 | if msg == "": 371 | logger.info(self.domain + " Neither index nor search slow logs are enabled.") 372 | else: 373 | logger.info(self.domain + ' ' + msg) 374 | else: 375 | logger.info(self.domain + " Neither index nor search slow logs are enabled.") 376 | return 377 | 378 | def calc_storage(self): 379 | ebs = False 380 | es_freespace = float(MIN_ES_FREESPACE) # Set up a default min free space 381 | if self.domainStatus == None: 382 | logger.warning("No domain statistics available; using default for minimum storage.") 383 | return es_freespace 384 | domainStatus = self.domainStatus 385 | 386 | # Storage calculation. 387 | clusterConfig = domainStatus['ElasticsearchClusterConfig'] 388 | is_ebs_enabled = 'EBSOptions' in domainStatus and domainStatus['EBSOptions']['EBSEnabled'] 389 | if is_ebs_enabled: 390 | ebsOptions = domainStatus['EBSOptions'] 391 | iops = "No Iops" 392 | if "Iops" in ebsOptions: 393 | iops = str(ebsOptions['Iops']) + " Iops" 394 | totalStorage = int(clusterConfig['InstanceCount']) * int(ebsOptions['VolumeSize']) * 1024 # Convert to MB 395 | logger.info(' '.join([self.domain, "EBS enabled:", str(ebsOptions['EBSEnabled']), "type:", str(ebsOptions['VolumeType']), "size (GB):", str(ebsOptions['VolumeSize']), str(iops), "Total storage (MB):", str(totalStorage)])) 396 | ebs = True 397 | es_freespace = float(int(float(totalStorage) * ES_FREESPACE_PERCENT)) 398 | else: 399 | logger.warning(self.domain + " EBS not in use. 
Using instance storage only.") 400 | if clusterConfig['InstanceType'] in DISK_SPACE: 401 | es_freespace = DISK_SPACE[clusterConfig['InstanceType']] * 1024 * ES_FREESPACE_PERCENT * clusterConfig['InstanceCount'] 402 | logger.info(' '.join([self.domain, "Instance storage definition found for:", str(DISK_SPACE[clusterConfig['InstanceType']]), "GB; free storage calculated to:", str(es_freespace), "MB"])) 403 | else: 404 | # InstanceType not found in DISK_SPACE. What's going on? (some instance types change to/from EBS, over time, it seems) 405 | logger.warning(self.domain + " " + str(clusterConfig['InstanceType']) + " is using instance storage, but definition of its disk space is not available.") 406 | 407 | logger.info(' '.join([self.domain, "Desired free storage set to (in MB):", str(es_freespace)])) 408 | self.ebs = ebs 409 | self.es_freespace = es_freespace 410 | return es_freespace 411 | 412 | def process_alarms(esclient, cwclient, espref, desiredEsFree, account, alarmActions): 413 | ourDomains = get_domains_list(esclient, espref) 414 | # Now we've got the list 415 | logger.info("List of Amazon Elasticsearch Service domains starting with given prefix (" + espref + "): " + ', '.join(ourDomains)) 416 | 417 | # Now, get a list of the CloudWatch alarms, for each domain. 418 | # We could cut this list significantly if we knew what the alarm-name-prefix would always be. But, we don't, right now. 419 | # TO DO: so ... given that ... add pagination, in case there's too many alarms. 420 | 421 | missingAlarms = 0 422 | okAlarms = 0 423 | notOkAlarms = 0 424 | esdomain = None 425 | 426 | for domain in ourDomains: 427 | # First: check the basic domain options against best practices 428 | try: 429 | esdomain = ESDomain(esclient, domain, desiredEsFree, alarmActions) 430 | except Exception as e: 431 | # There may still be alarms defined. 432 | # So, carry on, check the default alarms for this cluster. 
433 | logger.warning(domain + " " + str(e)) 434 | continue # ESDomain could not be built, so there is no alarm list to check; skip this domain 435 | esAlarms = esdomain.get_alarms() 436 | 437 | # Now, check the individual CloudWatch alarms 438 | # Unless you add the correct dimensions, the alarm will not correctly "connect" to the metric 439 | # How do you know what's correct? - at the bottom of http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/es-metricscollected.html 440 | dimensions = [ {"Name": "DomainName", "Value": domain }, 441 | {"Name": "ClientId", "Value": str(account) } 442 | ] 443 | 444 | # Get the list of alarms that have been set. 445 | for esAlarm in esAlarms: 446 | metricsList = cwclient.describe_alarms_for_metric( 447 | MetricName=esAlarm.metric, 448 | Namespace=ES_NAME_SPACE, 449 | Dimensions=dimensions 450 | ) 451 | alarms = metricsList['MetricAlarms'] 452 | # check the alarm(s) we got back, to make sure they're the ones we want. 453 | if len(alarms) < 1: 454 | logger.warning(domain + " Missing alarm: " + str(esAlarm)) 455 | missingAlarms += 1 456 | else: 457 | for alarm in alarms: 458 | okAlarm = True 459 | alarm_checker = AlarmChecker(domain, alarm) 460 | okAlarm = alarm_checker.check_statistics(esAlarm.stat) and \ 461 | alarm_checker.check_period(esAlarm.period) and \ 462 | alarm_checker.check_threshold(esAlarm.threshold) and \ 463 | alarm_checker.check_evalPeriod(esAlarm.evalPeriod) and \ 464 | alarm_checker.check_operator(esAlarm.operator) and \ 465 | alarm_checker.check_alarm_actions(esAlarm.alarmAction) 466 | if okAlarm: 467 | logger.info(' '.join([domain, "Alarm definition matches:", alarm['MetricName'], alarm['AlarmName']])) 468 | okAlarms += 1 469 | else: 470 | notOkAlarms += 1 471 | logger.info("=======================================================================================================") 472 | logger.info("Successfully finished processing!") 473 | logger.info("Alarm status summary: across " + str(len(ourDomains)) + " domains:") 474 | logger.info(" Ok alarms " + str(okAlarms)) 475 | logger.info(" Missing 
alarms " + str(missingAlarms)) 476 | logger.info(" Not matching alarms " + str(notOkAlarms)) 477 | return 478 | 479 | ############################################################################### 480 | # 481 | # MAIN 482 | # 483 | ############################################################################### 484 | 485 | def lambda_handler(event, context): 486 | """The Lambda function handler 487 | 488 | Args: 489 | event: The event passed by Lambda 490 | context: The context passed by Lambda 491 | 492 | """ 493 | 494 | print('Received event:' + json.dumps(event)) 495 | 496 | try: 497 | global logger 498 | logger = init_logging() 499 | logger = set_log_level(logger, os.environ.get('log_level', 'INFO')) 500 | 501 | logger.debug("Running function lambda_handler") 502 | 503 | 504 | except SystemExit: 505 | logger.error("Exiting") 506 | sys.exit(1) 507 | except ValueError: 508 | sys.exit(1) 509 | except: 510 | print ("Unexpected error!\n Stack Trace:", traceback.format_exc()) 511 | 512 | # Check environment variables 513 | espref = os.environ.get('esprefix', DEFAULT_DOMAIN_PREFIX) 514 | desiredEsFree = float(os.environ.get('esfree', DEFAULT_ES_FREESPACE)) 515 | account = boto3.client('sts').get_caller_identity().get('Account') 516 | AWS_REGION = os.environ.get('AWS_REGION', DEFAULT_REGION) 517 | esAlarmActions = json.loads(os.environ['alarmActions']) if 'alarmActions' in os.environ else get_default_alarm_actions(AWS_REGION,account,DEFAULT_SNSTOPIC) 518 | 519 | # Establish credentials 520 | session_var = boto3.session.Session() 521 | credentials = session_var.get_credentials() 522 | esregion = session_var.region_name or DEFAULT_REGION 523 | esclient = boto3.client("es") 524 | cwclient = boto3.client("cloudwatch") 525 | 526 | process_alarms(esclient, cwclient, espref, desiredEsFree, account, esAlarmActions) 527 | return 'Success' 528 | 529 | def main(): 530 | global logger 531 | logger = init_logging() 532 | os.environ['log_level'] = os.environ.get('log_level', "INFO") 533 | 534 | logger = 
setup_local_logging(logger, os.environ['log_level']) 535 | 536 | event = {'log_level': 'INFO'} 537 | os.environ['AWS_REGION'] = os.environ.get('AWS_REGION', DEFAULT_REGION) 538 | 539 | args = get_args() 540 | b3session = boto3.Session(profile_name=args.profile, region_name=args.region) 541 | 542 | esclient = b3session.client("es") 543 | cwclient = b3session.client("cloudwatch") 544 | process_alarms(esclient, cwclient, args.esprefix, args.free, args.account, args.notify) 545 | 546 | # Standalone harness, to run the same checks from the command line 547 | if __name__ == "__main__": 548 | main() 549 | 550 | 551 | 552 | --------------------------------------------------------------------------------
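The `convert_unicode` helper in es-check-cwalarms.py relies on `basestring`, `dict.iteritems()` and `collections.Mapping`, which exist only on Python 2, even though the script imports `print_function` for Python 2/3 compatibility. A minimal Python 3 sketch of the same idea follows. It is an assumption about the intended behaviour (recursively turn text inside nested containers into `str`), not the author's implementation, and the `bytes` branch is an addition for illustration.

```python
from collections.abc import Iterable, Mapping


def convert_unicode(data):
    """Recursively convert text inside nested containers to plain str (Python 3 sketch)."""
    if isinstance(data, bytes):
        # Assumption: byte strings are UTF-8 encoded.
        return data.decode("utf-8")
    if isinstance(data, str):
        return data
    if isinstance(data, Mapping):
        return {convert_unicode(k): convert_unicode(v) for k, v in data.items()}
    if isinstance(data, Iterable):
        # Rebuild the same container type (list, tuple, set) from converted items.
        return type(data)(convert_unicode(item) for item in data)
    return data


def str_convert_unicode(data):
    return str(convert_unicode(data))
```

Strings are tested before the generic `Iterable` branch because a `str` is itself iterable; checking it later would recurse forever on single characters.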