├── requirements.txt ├── dashboard1.png ├── dashboard2.png ├── NOTICE.txt ├── dist ├── redshift-advanced-monitoring-1.1.zip ├── redshift-advanced-monitoring-1.2.zip ├── redshift-advanced-monitoring-1.3.zip ├── redshift-advanced-monitoring-1.4.zip ├── redshift-advanced-monitoring-1.5.zip ├── redshift-advanced-monitoring-1.6.zip ├── redshift-advanced-monitoring-1.7.zip └── redshift-advanced-monitoring-1.8.zip ├── user-queries.json ├── .gitignore ├── deploy.sh ├── lambda_function.py ├── redshift-monitoring-cli.py ├── deploy-non-vpc.yaml ├── deploy-vpc.yaml ├── monitoring-queries.json ├── LICENSE.txt ├── redshift_monitoring.py └── README.md /requirements.txt: -------------------------------------------------------------------------------- 1 | pg8000==1.29.4 2 | pgpasslib==1.1.0 3 | -------------------------------------------------------------------------------- /dashboard1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dashboard1.png -------------------------------------------------------------------------------- /dashboard2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dashboard2.png -------------------------------------------------------------------------------- /NOTICE.txt: -------------------------------------------------------------------------------- 1 | amazon-redshift-monitoring 2 | Copyright 2016-2016 Amazon.com, Inc. or its affiliates. All Rights Reserved. 3 | -------------------------------------------------------------------------------- /dist/redshift-advanced-monitoring-1.1.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dist/redshift-advanced-monitoring-1.1.zip -------------------------------------------------------------------------------- /dist/redshift-advanced-monitoring-1.2.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dist/redshift-advanced-monitoring-1.2.zip -------------------------------------------------------------------------------- /dist/redshift-advanced-monitoring-1.3.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dist/redshift-advanced-monitoring-1.3.zip -------------------------------------------------------------------------------- /dist/redshift-advanced-monitoring-1.4.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dist/redshift-advanced-monitoring-1.4.zip -------------------------------------------------------------------------------- /dist/redshift-advanced-monitoring-1.5.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dist/redshift-advanced-monitoring-1.5.zip -------------------------------------------------------------------------------- /dist/redshift-advanced-monitoring-1.6.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dist/redshift-advanced-monitoring-1.6.zip 
-------------------------------------------------------------------------------- /dist/redshift-advanced-monitoring-1.7.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dist/redshift-advanced-monitoring-1.7.zip -------------------------------------------------------------------------------- /dist/redshift-advanced-monitoring-1.8.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-redshift-monitoring/HEAD/dist/redshift-advanced-monitoring-1.8.zip -------------------------------------------------------------------------------- /user-queries.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "query": "select count(9) from sensor_data", 4 | "name":"SensorDataCanary", 5 | "unit":"Count", 6 | "type":"interval" 7 | } 8 | ] -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | pg8000 2 | pg8000-1.10.5.dist-info 3 | six-1.10.0.dist-info 4 | six.py 5 | six.pyc 6 | /lib/ 7 | .idea 8 | .project 9 | .pydevproject 10 | 11 | *.pyc 12 | .DS_Store 13 | -------------------------------------------------------------------------------- /deploy.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | ver=`python3 -c 'import redshift_monitoring as rm; print(rm.__version__);'` 4 | 5 | if [ "$1" = "" -o "$1" = "bin" ] ; then 6 | for r in `aws ec2 describe-regions --query Regions[*].RegionName --output text`; do aws s3 cp dist/redshift-advanced-monitoring-$ver.zip s3://awslabs-code-$r/RedshiftAdvancedMonitoring/redshift-advanced-monitoring-$ver.zip --acl public-read --region $r; done 7 | fi 8 | 9 | if [ "$1" = "" -o "$1" = "yaml" ] ; then 10 | for r in `aws ec2 describe-regions --query Regions[*].RegionName --output text`; do aws s3 cp deploy-vpc.yaml s3://awslabs-code-$r/RedshiftAdvancedMonitoring/deploy-vpc.yaml --acl public-read --region $r; done 11 | 12 | for r in `aws ec2 describe-regions --query Regions[*].RegionName --output text`; do aws s3 cp deploy-non-vpc.yaml s3://awslabs-code-$r/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml --acl public-read --region $r; done 13 | fi 14 | -------------------------------------------------------------------------------- /lambda_function.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2016-2016 Amazon.com, Inc. or its affiliates. All Rights Reserved. 4 | # Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at 5 | # http://aws.amazon.com/apache2.0/ 6 | # or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
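# Lambda entry point for the scheduled invocation. Configuration is resolved from the invocation event first and the process environment second (see redshift_monitoring.monitor_cluster); for local testing, redshift-monitoring-cli.py builds an equivalent event from command-line arguments.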
7 | 8 | import os 9 | import sys 10 | import redshift_monitoring 11 | 12 | def lambda_handler(event, context): 13 | # resolve the configuration from the sources required 14 | config_sources = [event, os.environ] 15 | redshift_monitoring.monitor_cluster(config_sources) 16 | return 'Finished' 17 | 18 | if __name__ == "__main__": 19 | lambda_handler({}, None)  # pass an empty event when run directly: sys.argv[0] is just the script path, so configuration resolves from the environment 20 | -------------------------------------------------------------------------------- /redshift-monitoring-cli.py: -------------------------------------------------------------------------------- 1 | import redshift_monitoring as rm 2 | import os 3 | import json 4 | from argparse import ArgumentParser 5 | 6 | config = {"DbUser": None, 7 | "EncryptedPassword": None, 8 | "HostName": None, 9 | "HostPort": None, 10 | "DatabaseName": None, 11 | "ClusterName": None, 12 | "DEBUG": False, 13 | "AWS_REGION": None 14 | } 15 | 16 | parser = ArgumentParser() 17 | 18 | for c in config.keys(): 19 | parser.add_argument(f"--{c}", dest=c, required=False if c in ["AWS_REGION", "DEBUG", "HostPort"] else True) 20 | 21 | args = parser.parse_args() 22 | 23 | for arg in vars(args): 24 | config[arg] = getattr(args, arg) 25 | 26 | if config.get("AWS_REGION") is None: 27 | if "AWS_REGION" not in os.environ or os.environ.get("AWS_REGION") is None: 28 | raise Exception("AWS_REGION must be exported in the environment or part of arguments") 29 | else: 30 | config["AWS_REGION"] = os.environ["AWS_REGION"] 31 | 32 | if config.get("HostPort") is None: 33 | config["HostPort"] = 5439 34 | 35 | print("Using the following event definition - can be used for testing in AWS Lambda") 36 | print(json.dumps([config])) 37 | response = rm.monitor_cluster([config]) 38 | -------------------------------------------------------------------------------- /deploy-non-vpc.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Transform: AWS::Serverless-2016-10-31 3 | Parameters: 4 | ClusterName: 5 | Default: my-redshift-cluster 6 | Description: Cluster Name 7 | Type: String 8 | AllowedPattern: .* 9 | DbUser: 10 | Default: My DB User 11 | Description: Name of the database user to connect to 12 | Type: String 13 | AllowedPattern: .* 14 | EncryptedPassword: 15 | Default: Base64 Encoded Encrypted Password 16 | Description: Password encrypted with AWS KMS (leave blank to use IAM authentication token) 17 | Type: String 18 | AllowedPattern: .* 19 | KmsKeyARN: 20 | Default: arn:aws:kms:us-east-1:123456789012:key/MyKey 21 | Description: KMS Key ARN used to decrypt the password (leave blank to use IAM authentication token) 22 | Type: String 23 | AllowedPattern: ^$|arn:aws:kms:[a-zA-Z0-9-]+:\d{12}:key\/.* 24 | HostName: 25 | Default: my-redshift-cluster.XXXXXXXXXXXX.redshift.amazonaws.com 26 | Description: Cluster Endpoint Address 27 | Type: String 28 | AllowedPattern: .*\.redshift\.amazonaws\.com$ 29 | HostPort: 30 | Default: 5439 31 | Description: Database Port 32 | Type: Number 33 | MinValue: 1024 34 | MaxValue: 65535 35 | DatabaseName: 36 | Default: mydb 37 | Description: Database Name to connect to 38 | Type: String 39 | AllowedPattern: .* 40 | AggregationInterval: 41 | Default: 1 hour 42 | Description: Interval for aggregating statistics 43 | Type: String 44 | AllowedValues: 45 | - 1 hour 46 | - 10 minutes 47 | Conditions: 48 | UseKms: !Not 49 | - !Equals 50 | - !Ref KmsKeyARN 51 | - '' 52 | Resources: 53 | ScheduledFunction: 54 | Type: AWS::Serverless::Function 55 | Properties: 56 | Handler: 
lambda_function.lambda_handler 57 | Runtime: python3.9 58 | CodeUri: 59 | Bucket: !Sub awslabs-code-${AWS::Region} 60 | Key: RedshiftAdvancedMonitoring/redshift-advanced-monitoring-1.8.zip 61 | MemorySize: 192 62 | Timeout: 900 63 | Tags: 64 | Name: RedshiftAdvancedMonitoring 65 | Role: !GetAtt ScheduledServiceIAMRole.Arn 66 | Events: 67 | Timer: 68 | Type: Schedule 69 | Properties: 70 | Schedule: rate(1 hour) 71 | Input: 72 | !Sub | 73 | { 74 | "DbUser":"${DbUser}", 75 | "EncryptedPassword":"${EncryptedPassword}", 76 | "ClusterName":"${ClusterName}", 77 | "HostName":"${HostName}", 78 | "HostPort":"${HostPort}", 79 | "DatabaseName":"${DatabaseName}", 80 | "AggregationInterval":"${AggregationInterval}" 81 | } 82 | ScheduledServiceIAMRole: 83 | Type: "AWS::IAM::Role" 84 | Properties: 85 | RoleName: "LambdaRedshiftMonitoringRole" 86 | Path: "/" 87 | AssumeRolePolicyDocument: 88 | Version: "2012-10-17" 89 | Statement: 90 | - 91 | Sid: "AllowLambdaServiceToAssumeRole" 92 | Effect: "Allow" 93 | Action: 94 | - "sts:AssumeRole" 95 | Principal: 96 | Service: 97 | - "lambda.amazonaws.com" 98 | Policies: 99 | - 100 | PolicyName: "LambdaRedshiftMonitoringPolicy" 101 | PolicyDocument: 102 | Version: "2012-10-17" 103 | Statement: 104 | - 105 | Effect: "Allow" 106 | Action: 107 | - "cloudwatch:PutMetricData" 108 | Resource: "*" 109 | ManagedPolicyArns: 110 | - "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole" 111 | - !If [UseKms, !Ref KmsDecryptPolicy, !Ref GetClusterCredentialsPolicy] 112 | KmsDecryptPolicy: 113 | Condition: UseKms 114 | Type: "AWS::IAM::ManagedPolicy" 115 | Properties: 116 | PolicyDocument: 117 | Version: "2012-10-17" 118 | Statement: 119 | - 120 | Effect: "Allow" 121 | Action: 122 | - "kms:Decrypt" 123 | Resource: !Ref KmsKeyARN 124 | GetClusterCredentialsPolicy: 125 | Type: "AWS::IAM::ManagedPolicy" 126 | Properties: 127 | PolicyDocument: 128 | Version: "2012-10-17" 129 | Statement: 130 | - 131 | Effect: "Allow" 132 | Action: 133 | - "redshift:GetClusterCredentials" 134 | Resource: 135 | - !Sub "arn:aws:redshift:${AWS::Region}:${AWS::AccountId}:dbname:${ClusterName}/${DatabaseName}" 136 | - !Sub "arn:aws:redshift:${AWS::Region}:${AWS::AccountId}:dbuser:${ClusterName}/${DbUser}" 137 | -------------------------------------------------------------------------------- /deploy-vpc.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Transform: AWS::Serverless-2016-10-31 3 | Parameters: 4 | ClusterName: 5 | Default: my-redshift-cluster 6 | Description: Cluster Name 7 | Type: String 8 | AllowedPattern: .* 9 | DbUser: 10 | Default: My DB User 11 | Description: Name of the database user to connect to 12 | Type: String 13 | AllowedPattern: .* 14 | EncryptedPassword: 15 | Default: Base64 Encoded Encrypted Password 16 | Description: Password encrypted with AWS KMS (leave blank to use IAM authentication token) 17 | Type: String 18 | AllowedPattern: .* 19 | KmsKeyARN: 20 | Default: arn:aws:kms:us-east-1:123456789012:key/MyKey 21 | Description: KMS Key ARN used to decrypt the password (leave blank to use IAM authentication token) 22 | Type: String 23 | AllowedPattern: ^$|arn:aws:kms:[a-zA-Z0-9-]+:\d{12}:key\/.* 24 | HostName: 25 | Default: my-redshift-cluster.XXXXXXXXXXXX.redshift.amazonaws.com 26 | Description: Cluster Endpoint Address 27 | Type: String 28 | AllowedPattern: .*\.redshift\.amazonaws\.com$ 29 | HostPort: 30 | Default: 5439 31 | Description: Database Port 32 | Type: Number 33 |
MinValue: 1024 34 | MaxValue: 65535 35 | DatabaseName: 36 | Default: mydb 37 | Description: Database Name to connect to 38 | Type: String 39 | AllowedPattern: .* 40 | SecurityGroups: 41 | Default: mygroup1, mygroup2 42 | Description: Security Groups as CSV list to use for the deployed function (may be required for Redshift security policy) 43 | Type: CommaDelimitedList 44 | SubnetIds: 45 | Default: subnet1, subnet2, subnet3 46 | Description: List of private Subnets in VPC in which the function will egress network connections 47 | Type: CommaDelimitedList 48 | AggregationInterval: 49 | Default: 1 hour 50 | Description: Interval for aggregating statistics 51 | Type: String 52 | AllowedValues: 53 | - 1 hour 54 | - 10 minutes 55 | Conditions: 56 | UseKms: !Not 57 | - !Equals 58 | - !Ref KmsKeyARN 59 | - '' 60 | Resources: 61 | ScheduledFunction: 62 | Type: AWS::Serverless::Function 63 | Properties: 64 | Handler: lambda_function.lambda_handler 65 | Runtime: python3.9 66 | CodeUri: 67 | Bucket: !Sub awslabs-code-${AWS::Region} 68 | Key: RedshiftAdvancedMonitoring/redshift-advanced-monitoring-1.8.zip 69 | MemorySize: 192 70 | Timeout: 900 71 | Tags: 72 | Name: RedshiftAdvancedMonitoring 73 | Role: !GetAtt ScheduledServiceIAMRole.Arn 74 | VpcConfig: 75 | SecurityGroupIds: 76 | !Ref SecurityGroups 77 | SubnetIds: 78 | !Ref SubnetIds 79 | Events: 80 | Timer: 81 | Type: Schedule 82 | Properties: 83 | Schedule: rate(1 hour) 84 | Input: 85 | !Sub | 86 | { 87 | "DbUser":"${DbUser}", 88 | "EncryptedPassword":"${EncryptedPassword}", 89 | "ClusterName":"${ClusterName}", 90 | "HostName":"${HostName}", 91 | "HostPort":"${HostPort}", 92 | "DatabaseName":"${DatabaseName}", 93 | "AggregationInterval":"${AggregationInterval}" 94 | } 95 | ScheduledServiceIAMRole: 96 | Type: "AWS::IAM::Role" 97 | Properties: 98 | RoleName: "LambdaRedshiftMonitoringRole" 99 | Path: "/" 100 | AssumeRolePolicyDocument: 101 | Version: "2012-10-17" 102 | Statement: 103 | - 104 | Sid: "AllowLambdaServiceToAssumeRole" 105 | Effect: "Allow" 106 | Action: 107 | - "sts:AssumeRole" 108 | Principal: 109 | Service: 110 | - "lambda.amazonaws.com" 111 | Policies: 112 | - 113 | PolicyName: "LambdaRedshiftMonitoringPolicy" 114 | PolicyDocument: 115 | Version: "2012-10-17" 116 | Statement: 117 | - 118 | Effect: "Allow" 119 | Action: 120 | - "cloudwatch:PutMetricData" 121 | Resource: "*" 122 | ManagedPolicyArns: 123 | - "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole" 124 | - !If [UseKms, !Ref KmsDecryptPolicy, !Ref GetClusterCredentialsPolicy] 125 | KmsDecryptPolicy: 126 | Condition: UseKms 127 | Type: "AWS::IAM::ManagedPolicy" 128 | Properties: 129 | PolicyDocument: 130 | Version: "2012-10-17" 131 | Statement: 132 | - 133 | Effect: "Allow" 134 | Action: 135 | - "kms:Decrypt" 136 | Resource: !Ref KmsKeyARN 137 | GetClusterCredentialsPolicy: 138 | Type: "AWS::IAM::ManagedPolicy" 139 | Properties: 140 | PolicyDocument: 141 | Version: "2012-10-17" 142 | Statement: 143 | - 144 | Effect: "Allow" 145 | Action: 146 | - "redshift:GetClusterCredentials" 147 | Resource: 148 | - !Sub "arn:aws:redshift:${AWS::Region}:${AWS::AccountId}:dbname:${ClusterName}/${DatabaseName}" 149 | - !Sub "arn:aws:redshift:${AWS::Region}:${AWS::AccountId}:dbuser:${ClusterName}/${DbUser}" 150 | -------------------------------------------------------------------------------- /monitoring-queries.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "query": "SELECT /* Lambda CloudWatch Exporter */ 
count(a.attname) FROM pg_namespace n, pg_class c, pg_attribute a WHERE n.oid = c.relnamespace AND c.oid = a.attrelid AND a.attnum > 0 AND NOT a.attisdropped and n.nspname NOT IN ('information_schema','pg_catalog','pg_toast') AND format_encoding(a.attencodingtype::integer) = 'none' AND c.relkind='r' AND a.attsortkeyord != 1", 4 | "name": "ColumnsNotCompressed", 5 | "unit": "Count", 6 | "type": "value" 7 | }, 8 | 9 | { 10 | "query": "SELECT /* Lambda CloudWatch Exporter */ sum(nvl(s.num_qs,0)) FROM svv_table_info t LEFT JOIN (SELECT tbl, COUNT(distinct query) num_qs FROM stl_scan s WHERE s.userid > 1 AND starttime >= GETDATE() - INTERVAL '1 hour' GROUP BY tbl) s ON s.tbl = t.table_id WHERE t.sortkey1 IS NULL", 11 | "name": "QueriesScanNoSort", 12 | "unit": "Count", 13 | "type": "value" 14 | }, 15 | 16 | { 17 | "query": "SELECT /* Lambda CloudWatch Exporter */ SUM(w.total_queue_time) / 1000000.0 FROM stl_wlm_query w WHERE w.queue_start_time >= GETDATE() - INTERVAL '1 hour' AND w.total_queue_time > 0", 18 | "name": "TotalWLMQueueTime", 19 | "unit": "Seconds", 20 | "type": "value" 21 | }, 22 | 23 | { 24 | "query": "SELECT /* Lambda CloudWatch Exporter */ count(distinct query) FROM svl_query_report WHERE is_diskbased='t' AND (LABEL LIKE 'hash%%' OR LABEL LIKE 'sort%%' OR LABEL LIKE 'aggr%%') AND userid > 1 AND start_time >= GETDATE() - INTERVAL '1 hour'", 25 | "name": "DiskBasedQueries", 26 | "unit": "Count", 27 | "type": "value" 28 | }, 29 | 30 | { 31 | "query": "select /* Lambda CloudWatch Exporter */ avg(datediff(ms,startqueue,startwork)) from stl_commit_stats where startqueue >= GETDATE() - INTERVAL '1 hour'", 32 | "name": "AvgCommitQueueTime", 33 | "unit": "Milliseconds", 34 | "type": "value" 35 | }, 36 | 37 | { 38 | "query": "select /* Lambda CloudWatch Exporter */ count(distinct l.query) from stl_alert_event_log as l where l.userid >1 and l.event_time >= GETDATE() - INTERVAL '1 hour'", 39 | "name": "TotalAlerts", 40 | "unit": "Count", 41 | "type": "value" 42 | }, 43 | 44 | { 45 | "query": "select /* Lambda CloudWatch Exporter */ avg(datediff(ms, starttime, endtime)) from stl_query where starttime >= GETDATE() - INTERVAL '1 hour'", 46 | "name": "AverageQueryTime", 47 | "unit": "Milliseconds", 48 | "type": "value" 49 | }, 50 | 51 | { 52 | "query": "select /* Lambda CloudWatch Exporter */ sum(packets) from stl_dist where starttime >= GETDATE() - INTERVAL '1 hour'", 53 | "name": "Packets", 54 | "unit": "Count", 55 | "type": "value" 56 | }, 57 | 58 | { 59 | "query": "select /* Lambda CloudWatch Exporter */ sum(total) from (select count(query) total from stl_dist where starttime >= GETDATE() - INTERVAL '1 hour' group by query having sum(packets) > 1000000)", 60 | "name": "QueriesWithHighTraffic", 61 | "unit": "Count", 62 | "type": "value" 63 | }, 64 | 65 | { 66 | "query": "select /* Lambda CloudWatch Exporter */ count(event) from stl_connection_log where event = 'initiating session' and username != 'rdsdb' and pid not in (select pid from stl_connection_log where event = 'disconnecting session')", 67 | "name": "DbConnections", 68 | "unit": "Count", 69 | "type": "value" 70 | }, 71 | 72 | { 73 | "query": "select count(*) from svv_transactions t WHERE t.lockable_object_type = 'transactionid' and pid != pg_backend_pid()", 74 | "name": "OpenTransactions", 75 | "unit": "Count", 76 | "type": "value" 77 | }, 78 | 79 | { 80 | "query": "select count(*) from svv_transactions t WHERE t.granted = 'f' and t.pid != pg_backend_pid()", 81 | "name": "UngrantedLocks", 82 | "unit": "Count", 83 | "type": "value" 
84 | }, 85 | 86 | { 87 | "query": "select case when count(*) > 0 then 1 else 0 end from svv_transactions where xid = (select max(xid) from stl_vacuum)", 88 | "name": "VacuumRunning", 89 | "unit": "Count", 90 | "type": "value" 91 | }, 92 | 93 | { 94 | "query": "select case when total_query_slots > 15 then 1 else 0 end status from (select sum(num_query_tasks) total_query_slots from STV_WLM_SERVICE_CLASS_CONFIG where service_class > 5 and name != 'Short query queue')", 95 | "name": "WLMQuerySlotCountWarning", 96 | "unit": "None", 97 | "type": "value" 98 | }, 99 | 100 | { 101 | "query": "SELECT /* Lambda CloudWatch Exporter */ count(1) FROM stv_blocklist WHERE tombstone<>0", 102 | "name": "TombstoneCount", 103 | "unit": "Count", 104 | "type": "value", 105 | "comment": "A high tombstone block count could cause unexpected disk-full issues." 106 | }, 107 | 108 | { 109 | "query": "SELECT /* Lambda CloudWatch Exporter */ sum(bytes/1000000) mb_last_hour FROM svl_query_summary WHERE query IN (SELECT query FROM stl_query WHERE userid>=100 AND endtime > GETDATE() - INTERVAL '1 hour')", 110 | "name": "MBDataProcessedInLastHour", 111 | "unit": "Megabytes", 112 | "type": "value", 113 | "comment": "An indicator that evaluates the throughput of processed data (in MB) for user queries completed within the last hour. This query can be heavy and may need more than 1 minute." 114 | }, 115 | 116 | { 117 | "query": "SELECT /* Lambda CloudWatch Exporter */ SUM(CASE WHEN source_query IS NOT NULL THEN 1 ELSE 0 END)*100.0 / COUNT(*) as cache_hit_pct FROM svl_qlog WHERE userid>=100 AND starttime > GETDATE() - INTERVAL '1 days'", 118 | "name": "QueryCacheHitPercentage", 119 | "unit": "Percent", 120 | "type": "value", 121 | "comment": "The percentage of queries that were served from the query cache." 122 | }, 123 | 124 | { 125 | "query": "SELECT /* Lambda CloudWatch Exporter */ MAX(AGE(datfrozenxid)) as max_tid FROM pg_database WHERE datname NOT IN ('padb_harvest','dev')", 126 | "name": "MaxTransactionId", 127 | "unit": "Count", 128 | "type": "value", 129 | "comment": "Once the transaction ID approaches ~2 billion, the cluster needs a resize to reset the transaction ID." 130 | }, 131 | 132 | { 133 | "query": "SELECT /* Lambda CloudWatch Exporter */ COALESCE(SUM(usage_in_seconds),0) concurrency_scaling_usage FROM svcs_concurrency_scaling_usage WHERE end_time > GETDATE() - INTERVAL '1 hour'", 134 | "name": "ConcurrencyScalingUsage", 135 | "unit": "Seconds", 136 | "type": "value", 137 | "comment": "Concurrency scaling cluster usage in seconds for the last hour." 138 | } 139 | 140 | ] -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. 
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /redshift_monitoring.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import os 4 | import sys 5 | 6 | # Copyright 2016-2016 Amazon.com, Inc. or its affiliates. All Rights Reserved. 7 | # Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at 8 | # http://aws.amazon.com/apache2.0/ 9 | # or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 10 | 11 | # add the lib directory to the path 12 | sys.path.append(os.path.join(os.path.dirname(__file__), "lib")) 13 | sys.path.append(os.path.join(os.path.dirname(__file__), "sql")) 14 | 15 | import boto3 16 | import base64 17 | import pg8000.native 18 | import datetime 19 | import json 20 | import pgpasslib 21 | 22 | #### Static Configuration 23 | ssl = True 24 | interval = '1 hour' 25 | ################## 26 | 27 | __version__ = "1.8" 28 | debug = False 29 | pg8000.paramstyle = "qmark" 30 | NAME = "Lambda CloudWatch Exporter" 31 | 32 | 33 | def run_external_commands(command_set_type, file_name, conn, cluster): 34 | if not os.path.exists(file_name): 35 | return [] 36 | 37 | external_commands = None 38 | try: 39 | with open(file_name, 'r') as f: 40 | external_commands = json.load(f) 41 | except ValueError: 42 | # handle a malformed user query set gracefully: in Python 3 a bad 43 | # file raises json.JSONDecodeError (a ValueError subclass), and the 44 | # old Python 2 e.message check no longer exists 45 | return [] 46 | 47 | output_metrics = [] 48 | 49 | for command in external_commands: 50 | if command['type'] == 'value': 51 | cmd_type = "Query" 52 | else: 53 | cmd_type = "Canary" 54 | 55 | print("Executing %s %s: %s" % (command_set_type, cmd_type, command['name'])) 56 | 57 | try: 58 | t = datetime.datetime.now() 59 | interval, result = run_command(conn, command['query']) 60 | 61 | for row in result: 62 | value, *_ = row 63 | 64 | # append a cloudwatch metric for the value, or the elapsed interval, based upon the configured 'type' value 65 | if command['type'] == 'value': 66 | output_metrics.append({ 67 | 'MetricName': command['name'], 68 | 'Dimensions': [ 69 | {'Name': 'ClusterIdentifier', 'Value': cluster} 70 | ], 71 | 'Timestamp': t, 72 | 'Value': 0 if value is None else value, 73 | 'Unit': command['unit'] 74 | }) 75 | else: 76 | output_metrics.append({ 77 | 'MetricName': command['name'], 78 | 'Dimensions': [ 79 | {'Name': 'ClusterIdentifier', 'Value': cluster} 80 | ], 81 | 'Timestamp': t, 82 | 'Value': interval, 83 | 'Unit': 'Milliseconds' 84 | }) 85 | except Exception as e: 86 | print("Exception running external command %s" % command['name']) 87 | print(e) 88 | 89 | return output_metrics 90 | 91 | 92 | def run_command(conn, statement) -> tuple: 93 | if debug: 94 | print("Running Statement: %s" % 
statement) 95 | 96 | t = datetime.datetime.now() 97 | output = conn.run(statement) 98 | interval = (datetime.datetime.now() - t).total_seconds() * 1000  # elapsed time in ms; .microseconds alone would drop whole seconds 99 | 100 | return interval, output 101 | 102 | 103 | def gather_service_class_stats(conn, cluster): 104 | metrics = [] 105 | runtime, service_class_info = run_command(conn, ''' 106 | SELECT DATE_TRUNC('hour', a.service_class_start_time) AS metrics_ts, 107 | TRIM(d.name) as service_class, 108 | COUNT(a.query) AS query_count, 109 | SUM(a.total_exec_time) AS sum_exec_time, 110 | sum(case when a.total_queue_time > 0 then 1 else 0 end) count_queued_queries, 111 | SUM(a.total_queue_time) AS sum_queue_time, 112 | count(c.is_diskbased) as count_diskbased_segments 113 | FROM stl_wlm_query a 114 | JOIN stv_wlm_classification_config b ON a.service_class = b.action_service_class 115 | LEFT OUTER JOIN (select query, SUM(CASE when is_diskbased = 't' then 1 else 0 end) is_diskbased 116 | from svl_query_summary 117 | group by query) c on a.query = c.query 118 | JOIN stv_wlm_service_class_config d on a.service_class = d.service_class 119 | WHERE a.service_class > 5 120 | AND a.service_class_start_time > DATEADD(hour, -2, current_date) 121 | GROUP BY DATE_TRUNC('hour', a.service_class_start_time), 122 | d.name 123 | ''') 124 | 125 | def add_metric(metric_name, service_class_id, metric_value, ts): 126 | metrics.append({ 127 | 'MetricName': metric_name, 128 | 'Dimensions': [{'Name': 'ClusterIdentifier', 'Value': cluster}, 129 | {'Name': 'ServiceClassID', 'Value': str(service_class_id)}], 130 | 'Timestamp': ts, 131 | 'Value': metric_value 132 | }) 133 | 134 | for service_class in service_class_info: 135 | add_metric('ServiceClass-Queued', service_class[1], service_class[4], service_class[0]) 136 | add_metric('ServiceClass-QueueTime', service_class[1], service_class[5], service_class[0]) 137 | add_metric('ServiceClass-Executed', service_class[1], service_class[2], service_class[0]) 138 | add_metric('ServiceClass-ExecTime', service_class[1], service_class[3], service_class[0]) 139 | add_metric('ServiceClass-DiskbasedQuerySegments', service_class[1], service_class[6], service_class[0]) 140 | 141 | return metrics 142 | 143 | 144 | def gather_table_stats(conn, cluster): 145 | interval, result = run_command(conn, 146 | f"select /* {NAME} */ \"schema\" || '.' 
|| \"table\" as table, encoded, max_varchar, unsorted, stats_off, tbl_rows, skew_sortkey1, skew_rows from svv_table_info") 147 | tables_not_compressed = 0 148 | max_skew_ratio = 0 149 | total_skew_ratio = 0 150 | number_tables_skew = 0 151 | number_tables = 0 152 | max_skew_sort_ratio = 0 153 | total_skew_sort_ratio = 0 154 | number_tables_skew_sort = 0 155 | number_tables_statsoff = 0 156 | max_varchar_size = 0 157 | max_unsorted_pct = 0 158 | total_rows = 0 159 | 160 | for table in result: 161 | table_name, encoded, max_varchar, unsorted, stats_off, tbl_rows, skew_sortkey1, skew_rows, *_ = table 162 | number_tables += 1 163 | if encoded == 'N': 164 | tables_not_compressed += 1 165 | if skew_rows is not None: 166 | if skew_rows > max_skew_ratio: 167 | max_skew_ratio = skew_rows 168 | total_skew_ratio += skew_rows 169 | number_tables_skew += 1 170 | if skew_sortkey1 is not None: 171 | if skew_sortkey1 > max_skew_sort_ratio: 172 | max_skew_sort_ratio = skew_sortkey1 173 | total_skew_sort_ratio += skew_sortkey1 174 | number_tables_skew_sort += 1 175 | if stats_off is not None and stats_off > 5: 176 | number_tables_statsoff += 1 177 | if max_varchar is not None and max_varchar > max_varchar_size: 178 | max_varchar_size = max_varchar 179 | if unsorted is not None and unsorted > max_unsorted_pct: 180 | max_unsorted_pct = unsorted 181 | if tbl_rows is not None: 182 | total_rows += tbl_rows 183 | 184 | if number_tables_skew > 0: 185 | avg_skew_ratio = total_skew_ratio / number_tables_skew 186 | else: 187 | avg_skew_ratio = 0 188 | 189 | if number_tables_skew_sort > 0: 190 | avg_skew_sort_ratio = total_skew_sort_ratio / number_tables_skew_sort 191 | else: 192 | avg_skew_sort_ratio = 0 193 | 194 | # build up the metrics to put in cloudwatch 195 | metrics = [] 196 | 197 | def add_metric(metric_name, value, unit): 198 | metrics.append({ 199 | 'MetricName': metric_name, 200 | 'Dimensions': [ 201 | {'Name': 'ClusterIdentifier', 'Value': cluster} 202 | ], 203 | 'Timestamp': datetime.datetime.utcnow(), 204 | 'Value': value, 205 | 'Unit': unit 206 | }) 207 | 208 | units_count = 'Count' 209 | units_none = 'None' 210 | units_pct = 'Percent' 211 | 212 | add_metric('TablesNotCompressed', tables_not_compressed, units_count) 213 | add_metric('MaxSkewRatio', max_skew_ratio, units_none) 214 | add_metric('MaxSkewSortRatio', max_skew_sort_ratio, units_none) 215 | add_metric('AvgSkewRatio', avg_skew_ratio, units_none) 216 | add_metric('AvgSkewSortRatio', avg_skew_sort_ratio, units_none) 217 | add_metric('Tables', number_tables, units_count) 218 | add_metric('Rows', total_rows, units_count) 219 | add_metric('TablesStatsOff', number_tables_statsoff, units_count) 220 | add_metric('MaxVarcharSize', max_varchar_size, units_none) 221 | add_metric('MaxUnsorted', max_unsorted_pct, units_pct) 222 | 223 | return metrics 224 | 225 | 226 | # nasty hack for backward compatibility, to extract label values from os.environ or event 227 | def get_config_value(labels, configs): 228 | for l in labels: 229 | for c in configs: 230 | if l in c: 231 | if debug: 232 | print("Resolved label value %s from config" % l) 233 | 234 | return c[l] 235 | 236 | return None 237 | 238 | 239 | def monitor_cluster(config_sources): 240 | aws_region = get_config_value(['AWS_REGION'], config_sources) 241 | 242 | set_debug = get_config_value(['DEBUG', 'debug', ], config_sources) 243 | if set_debug is not None and ((isinstance(set_debug, bool) and set_debug) or set_debug.upper() == 'TRUE'): 244 | global debug 245 | debug = True 246 | 247 | kms = 
boto3.client('kms', region_name=aws_region) 248 | cw = boto3.client('cloudwatch', region_name=aws_region) 249 | redshift = boto3.client('redshift', region_name=aws_region) 250 | 251 | if debug: 252 | print("Connected to AWS KMS & CloudWatch in %s" % aws_region) 253 | 254 | user = get_config_value(['DbUser', 'db_user', 'dbUser'], config_sources) 255 | host = get_config_value(['HostName', 'cluster_endpoint', 'dbHost', 'db_host'], config_sources) 256 | port = int(get_config_value(['HostPort', 'db_port', 'dbPort'], config_sources)) 257 | database = get_config_value(['DatabaseName', 'db_name', 'db'], config_sources) 258 | cluster = get_config_value(['ClusterName', 'cluster_name', 'clusterName'], config_sources) 259 | 260 | global interval 261 | interval = get_config_value(['AggregationInterval', 'agg_interval', 'aggregtionInterval'], config_sources) 262 | 263 | pwd = None 264 | try: 265 | pwd = pgpasslib.getpass(host, port, database, user) 266 | except pgpasslib.FileNotFound as e: 267 | pass 268 | 269 | # check if unencrypted password exists if no pgpasslib 270 | if pwd is None: 271 | pwd = get_config_value(['db_pwd'], config_sources) 272 | 273 | # check for encrypted password if the above two don't exist 274 | if pwd is None: 275 | enc_password = get_config_value(['EncryptedPassword', 'encrypted_password', 'encrypted_pwd', 'dbPassword'], 276 | config_sources) 277 | if enc_password: 278 | 279 | # resolve the authorisation context, if there is one, and decrypt the password 280 | auth_context = get_config_value(['kms_auth_context'], config_sources)  # the labels argument must be a list, not a bare string 281 | 282 | if auth_context is not None: 283 | auth_context = json.loads(auth_context) 284 | 285 | try: 286 | if auth_context is None: 287 | pwd = kms.decrypt(CiphertextBlob=base64.b64decode(enc_password))[ 288 | 'Plaintext'] 289 | else: 290 | pwd = kms.decrypt(CiphertextBlob=base64.b64decode(enc_password), EncryptionContext=auth_context)[ 291 | 'Plaintext'] 292 | except: 293 | print('KMS access failed: exception %s' % sys.exc_info()[1]) 294 | print('Encrypted Password: %s' % enc_password) 295 | print('Encryption Context %s' % auth_context) 296 | 297 | # check for credentials using IAM database authentication 298 | if pwd is None: 299 | try: 300 | cluster_credentials = redshift.get_cluster_credentials(DbUser=user, 301 | DbName=database, 302 | ClusterIdentifier=cluster, 303 | AutoCreate=False) 304 | user = cluster_credentials['DbUser'] 305 | pwd = cluster_credentials['DbPassword'] 306 | 307 | except: 308 | print('GetClusterCredentials failed: exception %s' % sys.exc_info()[1]) 309 | 310 | # Connect to the cluster 311 | try: 312 | if debug: 313 | print('Connecting to Redshift: %s' % host) 314 | 315 | conn = pg8000.native.Connection(user, host=host, database=database, port=port, password=pwd, ssl_context=True, 316 | tcp_keepalive=True, application_name=NAME) 317 | conn.autocommit = True 318 | except: 319 | print('Redshift Connection Failed: exception %s' % sys.exc_info()[1]) 320 | raise 321 | 322 | if debug: 323 | print('Successfully Connected to Cluster') 324 | 325 | # set application name 326 | set_name = f"set application_name to '{NAME}-v{__version__}'" 327 | 328 | if debug: 329 | print(set_name) 330 | 331 | run_command(conn, set_name) 332 | 333 | # collect table statistics 334 | put_metrics = gather_table_stats(conn, cluster) 335 | 336 | # collect service class statistics 337 | put_metrics.extend(gather_service_class_stats(conn, cluster)) 338 | 339 | # run the externally configured commands and append their values onto the put metrics 340 | 
put_metrics.extend(run_external_commands('Redshift Diagnostic', 'monitoring-queries.json', conn, cluster)) 341 | 342 | # run the supplied user commands and append their values onto the put metrics 343 | put_metrics.extend(run_external_commands('User Configured', 'user-queries.json', conn, cluster)) 344 | 345 | # add a metric for how many metrics we're exporting (whoa inception) 346 | put_metrics.extend([{ 347 | 'MetricName': 'CloudwatchMetricsExported', 348 | 'Dimensions': [ 349 | {'Name': 'ClusterIdentifier', 'Value': cluster} 350 | ], 351 | 'Timestamp': datetime.datetime.utcnow(), 352 | 'Value': len(put_metrics), 353 | 'Unit': 'Count' 354 | }]) 355 | 356 | max_metrics = 20 357 | group = 0 358 | print("Publishing %s CloudWatch Metrics" % (len(put_metrics))) 359 | 360 | for x in range(0, len(put_metrics), max_metrics): 361 | group += 1 362 | 363 | # slice the metrics into blocks of 20 or just the remaining metrics 364 | put = put_metrics[x:(x + max_metrics)] 365 | 366 | if debug: 367 | print("Metrics group %s: %s Datapoints" % (group, len(put))) 368 | print(put) 369 | try: 370 | cw.put_metric_data( 371 | Namespace='Redshift', 372 | MetricData=put 373 | ) 374 | except: 375 | print('Pushing metrics to CloudWatch failed: exception %s' % sys.exc_info()[1]) 376 | raise 377 | 378 | conn.close() 379 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Redshift Advanced Monitoring 2 | 3 | ## Goals 4 | Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that uses columnar storage to minimise IO, provide high data compression rates, and deliver fast query performance. This GitHub project provides an advanced monitoring system for Amazon Redshift that is completely serverless, based on AWS Lambda and Amazon CloudWatch: a Lambda function runs on a schedule, connects to the configured Redshift cluster, and publishes custom CloudWatch metrics that surface common possible issues. 5 | 6 | Most of the graphs are based on information provided in AWS Big Data Blog articles and the Redshift documentation: 7 | 8 | * [Top 10 Performance Tuning Techniques for Amazon Redshift](https://blogs.aws.amazon.com/bigdata/post/Tx31034QG0G3ED1/Top-10-Performance-Tuning-Techniques-for-Amazon-Redshift) 9 | * [Advanced table design playbook](https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization) 10 | 11 | ## Installation 12 | 13 | This function can be deployed automatically using an AWS Serverless Application Model (SAM) template in CloudFormation. Use the links below for your region to walk through the CloudFormation deployment. 14 | 15 | You must supply parameters for your cluster name, endpoint address and port, master username, the encrypted password, and the aggregation interval to be used by the monitoring scripts (default 1 hour). 
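If you prefer the CLI to the console links below, a minimal deployment sketch looks like the following (assuming the template has been downloaded locally; the parameter names come from the templates in this repository, while all values shown are placeholders you must replace):

```bash
# Deploy the non-VPC template; CAPABILITY_NAMED_IAM is required because the
# template creates an IAM role with an explicit RoleName.
aws cloudformation deploy \
  --stack-name RedshiftAdvancedMonitoring \
  --template-file deploy-non-vpc.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
      ClusterName=my-redshift-cluster \
      DbUser=monitoring_user \
      EncryptedPassword="$ENCRYPTED_PASSWORD" \
      KmsKeyARN=arn:aws:kms:us-east-1:123456789012:key/MyKey \
      HostName=my-redshift-cluster.XXXXXXXXXXXX.redshift.amazonaws.com \
      HostPort=5439 \
      DatabaseName=mydb \
      AggregationInterval="1 hour"
```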
16 | 17 | The SAM stack will create: 18 | 19 | * An IAM Role called LambdaRedshiftMonitoringRole 20 | * This IAM Role will have a single linked IAM Policy called LambdaRedshiftMonitoringPolicy that can: 21 | * Decrypt the KMS Key used to encrypt the cluster password (kms:Decrypt) 22 | * Emit CloudWatch metrics (cloudwatch:PutMetricData) 23 | 24 | |Region | VPC Template | Non-VPC Template | 25 | |---- |---- | ----| 26 | |ap-northeast-1 | [](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-northeast-1.amazonaws.com/awslabs-code-ap-northeast-1/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-northeast-1.amazonaws.com/awslabs-code-ap-northeast-1/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |ap-northeast-2 | [](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-2#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-northeast-2.amazonaws.com/awslabs-code-ap-northeast-2/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-2#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-northeast-2.amazonaws.com/awslabs-code-ap-northeast-2/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |ap-south-1 | [](https://console.aws.amazon.com/cloudformation/home?region=ap-south-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-south-1.amazonaws.com/awslabs-code-ap-south-1/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=ap-south-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-south-1.amazonaws.com/awslabs-code-ap-south-1/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |ap-southeast-1 | [](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-southeast-1.amazonaws.com/awslabs-code-ap-southeast-1/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-southeast-1.amazonaws.com/awslabs-code-ap-southeast-1/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |ap-southeast-2 | [](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-southeast-2.amazonaws.com/awslabs-code-ap-southeast-2/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ap-southeast-2.amazonaws.com/awslabs-code-ap-southeast-2/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |ca-central-1 | [](https://console.aws.amazon.com/cloudformation/home?region=ca-central-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ca-central-1.amazonaws.com/awslabs-code-ca-central-1/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=ca-central-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-ca-central-1.amazonaws.com/awslabs-code-ca-central-1/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |eu-central-1 | 
[](https://console.aws.amazon.com/cloudformation/home?region=eu-central-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-eu-central-1.amazonaws.com/awslabs-code-eu-central-1/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=eu-central-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-eu-central-1.amazonaws.com/awslabs-code-eu-central-1/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |eu-west-1 | [](https://console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-eu-west-1.amazonaws.com/awslabs-code-eu-west-1/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-eu-west-1.amazonaws.com/awslabs-code-eu-west-1/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |eu-west-2 | [](https://console.aws.amazon.com/cloudformation/home?region=eu-west-2#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-eu-west-2.amazonaws.com/awslabs-code-eu-west-2/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=eu-west-2#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-eu-west-2.amazonaws.com/awslabs-code-eu-west-2/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |eu-west-3 | [](https://console.aws.amazon.com/cloudformation/home?region=eu-west-3#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-eu-west-3.amazonaws.com/awslabs-code-eu-west-3/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=eu-west-3#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-eu-west-3.amazonaws.com/awslabs-code-eu-west-3/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |sa-east-1 | [](https://console.aws.amazon.com/cloudformation/home?region=sa-east-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-sa-east-1.amazonaws.com/awslabs-code-sa-east-1/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=sa-east-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-sa-east-1.amazonaws.com/awslabs-code-sa-east-1/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |us-east-1 | [](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3.amazonaws.com/awslabs-code-us-east-1/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3.amazonaws.com/awslabs-code-us-east-1/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |us-east-2 | [](https://console.aws.amazon.com/cloudformation/home?region=us-east-2#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-us-east-2.amazonaws.com/awslabs-code-us-east-2/RedshiftAdvancedMonitoring/deploy-vpc.yaml) | [](https://console.aws.amazon.com/cloudformation/home?region=us-east-2#/stacks/new?stackName=RedshiftAdvancedMonitoring&templateURL=https://s3-us-east-2.amazonaws.com/awslabs-code-us-east-2/RedshiftAdvancedMonitoring/deploy-non-vpc.yaml) | |us-west-1 | 
## Configuration

### Static Configuration (Bad - deprecated after v1.2)

You can edit the variables at the top of the script and rebuild. Please note that anyone who has access to the Lambda function code will also have access to these configuration values. The variables include:

* user: The database user.
* enc_password: The password encrypted with the KMS key.
* host: The endpoint DNS name of the Redshift cluster.
* port: The port used by the Redshift cluster.
* database: The database name of the Redshift cluster.
* ssl: Whether to use SSL to connect to the cluster.
* cluster: A cluster name; your graphs in CloudWatch will use it to reference the Redshift cluster.
* interval: The interval at which you run your Lambda function; 1 hour is recommended.

### Environment Variables (Better)

Alternatively, you can now use [Lambda Environment Variables](http://docs.aws.amazon.com/lambda/latest/dg/env_variables.html) for configuration, including:

```json
"Environment": {
    "Variables": {
        "encrypted_password": "KMS encrypted password",
        "db_port": "database port number",
        "cluster_name": "display name for cloudwatch metrics",
        "db_name": "database name",
        "db_user": "database user name",
        "cluster_endpoint": "cluster DNS name"
    }
}
```

### Configuring with Events (Best)

This option allows you to send the configuration as part of the Scheduled Event, which means a single Lambda function can monitor multiple clusters. Event-based configuration overrides any Environment Variables you've configured. An example event looks like:

```json
{
    "DbUser": "master",
    "EncryptedPassword": "AQECAHh+YtzV/K7+L/VDT7h2rYDCWFSUugXGqMxzWGXynPCHpQAAAGkwZwYJKoZIhvcNAQcGoFowWAIBADBTBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDM8DWMFELclZ2s7cmwIBEIAmyVGjoB7F4HbwU5Y1lq7GVQ3UU3MaE10LWieCKMHOtVhJioi+IHw=",
    "ClusterName": "energy-demo",
    "HostName": "energy-demo.c7bpmf3ajaft.eu-west-1.redshift.amazonaws.com",
    "HostPort": "5439",
    "DatabaseName": "master",
    "AggregationInterval": "1 hour"
}
```

The old environment variable names are still accepted for backward compatibility, but if you supply environment variables with the names above, those will be used instead. A sketch of creating such a scheduled rule follows below.
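As referenced above, here is a hedged sketch of wiring up such a Scheduled Event with boto3. The rule name and function ARN are placeholders, and the configuration mirrors the example event:

```python
import json
import boto3

events = boto3.client('events')

# Per-cluster configuration, matching the event shape shown above;
# substitute your own cluster details and encrypted password
config = {
    "DbUser": "master",
    "EncryptedPassword": "<kms encrypted password>",
    "ClusterName": "energy-demo",
    "HostName": "energy-demo.c7bpmf3ajaft.eu-west-1.redshift.amazonaws.com",
    "HostPort": "5439",
    "DatabaseName": "master",
    "AggregationInterval": "1 hour"
}

# Create a rule that fires hourly, matching the AggregationInterval
events.put_rule(Name='RedshiftMonitoringEnergyDemo',
                ScheduleExpression='rate(1 hour)')

# Point the rule at the monitoring function, passing the config as the event
events.put_targets(
    Rule='RedshiftMonitoringEnergyDemo',
    Targets=[{
        'Id': '1',
        'Arn': '<lambda function arn>',
        'Input': json.dumps(config)
    }]
)
```

Note that the Lambda function must also grant CloudWatch Events permission to invoke it (for example via `aws lambda add-permission`); the console does this for you when you add the event source there.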
## Manual Deployment Instructions

* If you are rebuilding the function, download and install the dependencies:
>pip install -r requirements.txt -t .

* Assemble and compress the Lambda function package:
>./build.sh

If you are including any user defined query extensions, then build with:

>./build.sh --include-user-queries

Please note that the tagged version on GitHub does not include any user queries.

* Create a Lambda function with the following parameters:
    * Runtime: Python 3 (the pinned pg8000 dependency requires it)
    * Upload the generated zip file
    * Handler: `lambda_function.lambda_handler`
    * Role: Use the role created above
    * Memory: 256MB
    * Timeout: 5 minutes
    * VPC: Use the same VPC as the Redshift cluster. You will need at least two private subnets that are allowed to access the Redshift cluster in its Security Group, and those subnets' route tables should point to a NAT Gateway for internet access. You cannot use public subnets. You can read more in this [AWS blog post](https://aws.amazon.com/blogs/aws/new-access-resources-in-a-vpc-from-your-lambda-functions/)

* Add an Event Source to the Lambda function with a Scheduled Event that runs at the same frequency as the aggregation interval you configured.

## Confirming Successful Execution

* After a period of time, you can check your CloudWatch metrics and create alarms. You can also create a Dashboard with all of the graphs for an at-a-glance view of your database, like this one:

![Dashboard1](dashboard1.png)
![Dashboard2](dashboard2.png)

# Extensions

The published CloudWatch metrics are all configured in a JSON file called `monitoring-queries.json`. These queries were built by the AWS Redshift database engineering and support teams, and they provide detailed metrics about the operation of your cluster.

If you would like to create your own queries to be instrumented via AWS CloudWatch, such as user 'canary' queries which help you see the performance of your cluster over time, these can be added to the [user-queries.json](user-queries.json) file. The file is a JSON array, with each query having the following structure:

```json
{
    "query": "my select query that returns a numeric value",
    "name": "MyCanaryQuery",
    "unit": "Count | Seconds | Milliseconds | Whatever",
    "type": "(value | interval)"
}
```

The last attribute, `type`, is probably the most important. If you use `value`, the value returned by your query is exported to CloudWatch with the indicated unit and metric name. If you use `interval`, the runtime of your query is instrumented as elapsed milliseconds, giving you the ability to create the desired 'canary' query. A short sketch of this distinction follows below.
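To make the `value`/`interval` distinction concrete, here is a minimal sketch of how a user query could be executed and emitted under each setting. The function name, the `Redshift` namespace, and the dimension shape are assumptions for illustration, not the module's exact implementation; `cursor` is an open pg8000 cursor:

```python
import time
import boto3

cloudwatch = boto3.client('cloudwatch')

def run_user_query(cursor, query_config, cluster_name):
    # Time the query so 'interval' queries can report their runtime
    start = time.time()
    cursor.execute(query_config['query'])
    elapsed_ms = (time.time() - start) * 1000

    if query_config['type'] == 'value':
        # Export the first column of the first row with the configured unit
        value = cursor.fetchone()[0]
        unit = query_config['unit']
    else:
        # 'interval': export the query's elapsed runtime in milliseconds
        value = elapsed_ms
        unit = 'Milliseconds'

    cloudwatch.put_metric_data(
        Namespace='Redshift',
        MetricData=[{
            'MetricName': query_config['name'],
            'Dimensions': [{'Name': 'ClusterIdentifier', 'Value': cluster_name}],
            'Value': value,
            'Unit': unit
        }]
    )
```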
----

Copyright 2016-2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.