├── .gitignore
├── README.md
├── check_haproxy_queue
├── check_joyent_zone_mem
├── check_postgres_replication
├── check_sidekiq_queue
└── check_twemproxy
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
nagios-checks
=============

Various Nagios checks that we use at Wanelo.

check_joyent_zone_mem
---------------------
This script uses the Joyent tool "jinf" (or "sm-meminfo", when available) to verify that used RAM on the zone stays within the given percentage thresholds.

Usage:
```
./check_joyent_zone_mem [-w <warn_perc>] [-c <critical_perc>]
```

Example:
```
./check_joyent_zone_mem -w 75 -c 90
RSS OK : my-host.prod 47% used (4334Mb free)|rss=47%;75;90
```

check_sidekiq_queue
-------------------
Peeks into the Sidekiq queue using redis-cli and validates that the queue depth is within a given warning/critical range.

Usage:
```
./check_sidekiq_queue [-h <host>] [-p <port>] [-a <password>] ([-q <queue>] | [-s retry|schedule]) [-n <namespace>] [-d <db>] [-w <warn>] [-c <crit>] ([-i <ignore_queues>])
```

Defaults: localhost, 6379, no password, default queue, no namespace, db=0, warning at 500, critical at 1000.

```
./check_sidekiq_queue -h 10.100.1.12 -q activity -w 200 -c 1000
SIDEKIQ OK : redis-host.prod 0 on activity|sidekiq_queue_activity=0;200;1000
```

The -q flag checks the size of a regular Sidekiq queue, while the -s flag checks the size of
the retry or schedule Sidekiq system queues.

To check all Sidekiq queues, set -q to 'all'; the thresholds are then compared against the largest queue.
To exclude specific queues from that check, pass a comma-separated list of queue names via -i. The -i option can only be used together with -q all.

The following example checks the largest queue among all Sidekiq queues except `monitor_queue` and `execute_queue`:
```
./check_sidekiq_queue -h 10.100.1.12 -q all -i monitor_queue,execute_queue -w 200 -c 1000
SIDEKIQ OK : redis-host.prod 86 on activity|sidekiq_queue_activity=86;200;1000
```

check_postgres_replication
--------------------------
Checks the transaction log position on a master PostgreSQL host and a replica, and warns if the replica
is behind by a certain amount of data.

```
Usage: ./check_postgres_replication [ options ]
    -h --host       replica host (default 127.0.0.1)
    -m --master     master fqdn or ip (required)
    -U --user       database user (default postgres)
    -x --units      units of measurement to display (KB or MB, default MB)
    -w --warning    warning threshold in bytes (default 10MB)
    -c --critical   critical threshold in bytes (default 15MB)
```

Note that `--units` is only used in the response. No math is done to translate `--warning` or `--critical`,
which must be given in bytes. Thus, a 20MB warning would be set as 20971520.

check_twemproxy
---------------
Nagios check that uses the twemproxy status page, and returns OK when all backend servers
in the sharded cluster are connected, or CRITICAL otherwise.
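The check itself reads the stats that twemproxy serves as JSON over its stats port and parses them with Ruby's JSON library. To eyeball the raw data the same way the check does — a quick sketch, assuming twemproxy runs on 192.168.10.100 with the default stats port of 22222:

```
nc 192.168.10.100 22222 | ruby -r json -e 'puts JSON.pretty_generate(JSON.parse(STDIN.read))'
```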

```
Usage: ./check_twemproxy [-H host] [-p port]
```

Dependencies: ruby with JSON parser installed.

Example:

```
check_twemproxy --host 192.168.10.100
TWEMPROXY CRITICAL : servers: shard003,shard006, clusters: twitter_feed | disconnects=2;timeouts=0;clusters=[twitter_feed];disconnected_shards=shard003,shard006;timedout_shards=
```

```
check_twemproxy --host 192.168.10.100
TWEMPROXY OK : 192.168.10.100
```
--------------------------------------------------------------------------------
/check_haproxy_queue:
--------------------------------------------------------------------------------
#!/bin/bash
# ========================================================================================
# HAProxy nagios check of current queue depth.
#
# 2013 Wanelo Inc, Apache License.
#
# Usage: ./check_haproxy_queue [-s <stats_socket>] [-b <backend>]
#                              [-w <warn>] [-c <crit>]
#        -b --backend       name of backend to watch (ie "app_servers")
#        -s --stats_socket  location of haproxy stats socket (default /var/run/haproxy.sock)
#        -w --warning       warning threshold (default 100)
#        -c --critical      critical threshold (default 500)
# ========================================================================================

# Nagios return codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

QUEUE=0
NODENAME=`cat /etc/nodename`
STATS_SOCKET=/var/run/haproxy.sock

# default thresholds (queued requests)
WARNING_THRESHOLD=100
CRITICAL_THRESHOLD=500

# Parse parameters
while [ $# -gt 0 ]; do
  case "$1" in
    -b | --backend)
            shift
            BACKEND=$1
            ;;
    -s | --stats_socket)
            shift
            STATS_SOCKET=$1
            ;;
    -w | --warning)
            shift
            WARNING_THRESHOLD=$1
            ;;
    -c | --critical)
            shift
            CRITICAL_THRESHOLD=$1
            ;;
    *)  echo "Unknown argument: $1"
        exit $STATE_UNKNOWN
        ;;
  esac
  shift
done

function result {
  DESCRIPTION=$1
  STATUS=$2

  if [ -z "$MESSAGE" ]; then
    MESSAGE="current queue size is $QUEUE"
  fi

  echo "QUEUE $DESCRIPTION : ${NODENAME} $MESSAGE|queue=${QUEUE};${WARNING_THRESHOLD};${CRITICAL_THRESHOLD}"
  exit $STATUS
}

function exit_on_error {
  if [ $1 -ne 0 ]; then
    MESSAGE=$2
    result "CRITICAL" $STATE_CRITICAL
  fi
}

if [ -z "$BACKEND" ]; then
  MESSAGE="missing backend parameter"
  result "CRITICAL" $STATE_CRITICAL
fi

# "show stat -1 2 -1" asks the stats socket for backend rows only;
# field 3 of the CSV output is qcur, the current queue depth
STATS=`echo "show stat -1 2 -1" | nc -U $STATS_SOCKET | grep "$BACKEND"`
exit_on_error $? "error checking stats socket"
QUEUE=`echo $STATS | cut -d',' -f3`

# Output response
if [ $QUEUE -ge $WARNING_THRESHOLD ] && [ $QUEUE -lt $CRITICAL_THRESHOLD ]; then
  result "WARNING" $STATE_WARNING
elif [ $QUEUE -ge $CRITICAL_THRESHOLD ]; then
  result "CRITICAL" $STATE_CRITICAL
else
  result "OK" $STATE_OK
fi
--------------------------------------------------------------------------------
/check_joyent_zone_mem:
--------------------------------------------------------------------------------
#!/bin/bash
# ========================================================================================
# Joyent Zone Memory Plugin for Nagios based on jinf/sm-meminfo utility
#
# Wanelo Inc, Apache License.
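#
# Reports used RSS as a percentage of the zone's memory cap; thresholds
# are integer percentages of used memory.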
#
# Usage: ./check_joyent_zone_mem [-w <warn_perc>] [-c <critical_perc>]
# Eg:    ./check_joyent_zone_mem -w 60 -c 80   # warning at 60% or higher used, critical at 80%
# ========================================================================================

# Nagios return codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# prefer sm-meminfo when available, fall back to jinf
if [ -f "/opt/local/bin/sm-meminfo" ]; then
  MEM_CMD="sm-meminfo -p rss"
else
  MEM_CMD="jinf -p -m"
fi

WARNING_THRESHOLD=70
CRITICAL_THRESHOLD=85

# Parse parameters
while [ $# -gt 0 ]; do
  case "$1" in
    -w | --warning)
            shift
            WARNING_THRESHOLD=$1
            ;;
    -c | --critical)
            shift
            CRITICAL_THRESHOLD=$1
            ;;
    *)  echo "Unknown argument: $1"
        exit $STATE_UNKNOWN
        ;;
  esac
  shift
done

PATH=/opt/local/bin:$PATH
# mem_* values are reported in bytes; derive used percentage and free MB
read TOTAL_MEM USED_MEM FREE_MEM <<< $($MEM_CMD | grep mem_ | cut -f 2 -d ':')
RSS=$(($USED_MEM * 100 / $TOTAL_MEM))
RSS_FREE=$(($FREE_MEM / (1024 * 1024)))
NODENAME=`cat /etc/nodename`

function result {
  DESCRIPTION=$1
  STATUS=$2
  echo "RSS $DESCRIPTION : ${NODENAME} ${RSS}% used (${RSS_FREE}Mb free)|rss=${RSS}%;${WARNING_THRESHOLD};${CRITICAL_THRESHOLD}"
  exit $STATUS
}

if [ $RSS -ge $WARNING_THRESHOLD ] && [ $RSS -lt $CRITICAL_THRESHOLD ]; then
  result "WARNING" $STATE_WARNING
elif [ $RSS -ge $CRITICAL_THRESHOLD ]; then
  result "CRITICAL" $STATE_CRITICAL
else
  result "OK" $STATE_OK
fi
--------------------------------------------------------------------------------
/check_postgres_replication:
--------------------------------------------------------------------------------
#!/bin/bash
# ========================================================================================
# Postgres replication lag nagios check using psql and bash.
#
# 2013 Wanelo Inc, Apache License.
# This script expects psql to be in the PATH.
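#
# Note: the queries below use pg_current_xlog_location()/pg_last_xlog_replay_location(),
# which exist through PostgreSQL 9.6; in PostgreSQL 10+ they were renamed to
# pg_current_wal_lsn()/pg_last_wal_replay_lsn().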
#
# Usage: ./check_postgres_replication [ -h <host> ] [ -m <master> ] [ -U <user> ] [ -x <units> ]
#                                     [ -w <warn_bytes> ] [ -c <crit_bytes> ]
#        -h --host       replica host (default 127.0.0.1)
#        -m --master     master fqdn or ip (required)
#        -U --user       database user (default postgres)
#        -x --units      units of measurement to display (KB or MB, default MB)
#        -w --warning    warning threshold in bytes (default 10MB)
#        -c --critical   critical threshold in bytes (default 15MB)
# ========================================================================================

# Nagios return codes
readonly STATE_OK=0
readonly STATE_WARNING=1
readonly STATE_CRITICAL=2
readonly STATE_UNKNOWN=3

readonly ARGS="$@"

# default thresholds in bytes
readonly DEFAULT_WARNING_THRESHOLD=10485760
readonly DEFAULT_CRITICAL_THRESHOLD=15728640

readonly DEFAULT_HOST="127.0.0.1"
readonly DEFAULT_USER=postgres
readonly DEFAULT_UNITS=MB

readonly PATH=/opt/local/bin:${PATH}
readonly NODENAME=$(cat /etc/nodename)
readonly MASTER_SQL="SELECT pg_current_xlog_location()"
readonly REPLICA_SQL="SELECT pg_last_xlog_replay_location()"
readonly REPLICA_TIME_LAG="select now() - pg_last_xact_replay_timestamp()"
readonly ERR=/tmp/repl_check.$$

usage() {
  cat <<EOF
Usage: ./check_postgres_replication [ -h <host> ] [ -m <master> ] [ -U <user> ] [ -x <units> ]
                                    [ -w <warn_bytes> ] [ -c <crit_bytes> ]
       -h --host       replica host (default 127.0.0.1)
       -m --master     master fqdn or ip (required)
       -U --user       database user (default postgres)
       -x --units      units of measurement to display (KB or MB, default MB)
       -w --warning    warning threshold in bytes (default 10MB)
       -c --critical   critical threshold in bytes (default 15MB)

       --help          show this message
       --verbose       enable debug output (set -x)
EOF
}

# Parse parameters: translate long options to short ones, then run getopts
parse_arguments() {
  local arg args=""
  for arg; do
    local delim=""
    case "$arg" in
      --host)       args="${args}-h ";;
      --master)     args="${args}-m ";;
      --user)       args="${args}-U ";;
      --units)      args="${args}-x ";;
      --warning)    args="${args}-w ";;
      --critical)   args="${args}-c ";;
      --help)       args="${args}-H ";;
      --verbose)    args="${args}-v ";;
      *) [[ "${arg:0:1}" == "-" ]] || delim="\""
         args="${args}${delim}${arg}${delim} ";;
    esac
  done

  eval set -- $args

  while getopts "h:m:U:x:w:c:Hv" OPTION
  do
    case $OPTION in
      v)
        set -x
        ;;
      H)
        usage
        exit
        ;;
      h)
        local host=$OPTARG
        ;;
      m)
        readonly MASTER=$OPTARG
        ;;
      U)
        local user=$OPTARG
        ;;
      x)
        local units=$OPTARG
        ;;
      w)
        local warning_threshold=$OPTARG
        ;;
      c)
        local critical_threshold=$OPTARG
        ;;
    esac
  done

  readonly USER=${user:-$DEFAULT_USER}
  readonly HOST=${host:-$DEFAULT_HOST}
  readonly UNITS=${units:-$DEFAULT_UNITS}
  readonly WARNING_THRESHOLD=${warning_threshold:-$DEFAULT_WARNING_THRESHOLD}
  readonly CRITICAL_THRESHOLD=${critical_threshold:-$DEFAULT_CRITICAL_THRESHOLD}
}

check_required_arguments() {
  if [ -z "$MASTER" ]; then
    echo "pass master host in parameters via -m flag"
    exit 1
  fi
}

normalize_units() {
  # Error checking of arguments
  case "$UNITS" in
    KB)
      readonly DIVISOR=1024
      ;;
    MB)
      readonly DIVISOR=1048576
      ;;
    *)
      echo "Incorrect unit of measurement"
      usage
      exit 1
      ;;
  esac
}

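# Print the Nagios status line with perfdata and exit.
# args: description (OK/WARNING/CRITICAL), Nagios exit status, lag in bytes, time lag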
result() {
  local description=$1
  local status=$2
  local diff=$3
  local time_lag=$4

  local error=$(cat $ERR 2>/dev/null)

  if [[ "${status}" -eq "${STATE_CRITICAL}" && -n "${error}" ]]; then
    local message="replication check error ${error}"
  else
    local diff_units=$(bytes_to_units $diff)
    local message="replication lag is ${diff_units}${UNITS} : time lag is ${time_lag}"
  fi
  echo "REPLICATION $description : ${NODENAME} $message|repl=${diff},time_lag=${time_lag};${WARNING_THRESHOLD};${CRITICAL_THRESHOLD}"
  rm -f $ERR
  exit $status
}

get_replica_current_xlog() {
  echo $(psql -U $USER -Atc "$REPLICA_SQL" -h $HOST 2>$ERR)
}

get_master_current_xlog() {
  echo $(psql -U $USER -Atc "$MASTER_SQL" -h $MASTER 2>$ERR)
}

check_replica_time_lag() {
  echo $(psql -U $USER -Atc "${REPLICA_TIME_LAG}" -h ${HOST} 2>${ERR})
}

# any psql failure is reported as CRITICAL with the captured stderr
check_errors() {
  if [ $1 -ne 0 ]; then
    result "CRITICAL" $STATE_CRITICAL
  fi
}

xlog_to_bytes() {
  # http://eulerto.blogspot.com/2011/11/understanding-wal-nomenclature.html
  local logid="${1%%/*}"
  local offset="${1##*/}"
  echo $((0xFF000000 * 0x$logid + 0x$offset))
}

bytes_to_units() {
  local diff=$1
  if [ -z "$diff" ]; then
    echo "ERROR: NO DATA AVAILABLE"
  else
    echo $(( $diff / $DIVISOR ))
  fi
}

main() {
  parse_arguments $ARGS
  check_required_arguments
  normalize_units

  local replica_xlog=$(get_replica_current_xlog)
  check_errors $?
  local replica_bytes=$(xlog_to_bytes ${replica_xlog})

  if [ -z "${replica_xlog}" ]; then
    echo -n "Unable to find replica XLOG replay location" > $ERR
    result "CRITICAL" $STATE_CRITICAL
  fi

  # Query the master for its current xlog position
  local master_xlog=$(get_master_current_xlog)
  check_errors $?
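  # Convert the master's xlog position to an absolute byte offset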
  local master_bytes=$(xlog_to_bytes $master_xlog)

  # Calculate xlog diff in bytes
  local diff=$(($master_bytes - $replica_bytes))

  local time_lag=$(check_replica_time_lag)

  # Output response
  if [ $diff -ge $WARNING_THRESHOLD ] && [ $diff -lt $CRITICAL_THRESHOLD ]; then
    result "WARNING" $STATE_WARNING $diff $time_lag
  elif [ $diff -ge $CRITICAL_THRESHOLD ]; then
    result "CRITICAL" $STATE_CRITICAL $diff $time_lag
  else
    result "OK" $STATE_OK $diff $time_lag
  fi

  rm -f $ERR
}

main
--------------------------------------------------------------------------------
/check_sidekiq_queue:
--------------------------------------------------------------------------------
#!/bin/bash
# ========================================================================================
# Sidekiq Queue Size Nagios Check
#
# (c) Wanelo Inc, Distributed under Apache License
#
# Usage:
#   To check a regular queue:
#     ./check_sidekiq_queue [-h <host>] [-p <port>] [-a <password>] [-q <queue>] [-n <namespace>] [-d <db>] [-w <warn>] [-c <crit>] [-i <ignore_queues>]
#     Eg: ./check_sidekiq_queue -w 500 -c 2000   # warning at 500 or higher, critical at 2000 or higher
#
#   To check the schedule or retry (system) queue:
#     ./check_sidekiq_queue [-h <host>] [-a <password>] [-s schedule|retry] [-n <namespace>] [-d <db>] [-w <warn>] [-c <crit>]
#
# ========================================================================================

# Nagios return codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

WARNING_THRESHOLD=500
CRITICAL_THRESHOLD=1000
QUEUE="default"
SYSTEM=""
NAMESPACE=""
HOST="127.0.0.1"
PORT="6379"
PASS=""
DB=0

# Parse parameters
while [ $# -gt 0 ]; do
  case "$1" in
    -d | --db)
            shift
            DB=$1
            ;;
    -h | --hostname)
            shift
            HOST=$1
            ;;
    -p | --port)
            shift
            PORT=$1
            ;;
    -a | --password)
            shift
            PASS=$1
            ;;
    -q | --queue)
            shift
            QUEUE=$1
            ;;
    -i | --ignore_queues)
            shift
            IGNORE_QUEUES=$1
            ;;
    -n | --namespace)
            shift
            NAMESPACE=$1
            ;;
    -s | --system)
            shift
            SYSTEM=$1
            ;;
    -w | --warning)
            shift
            WARNING_THRESHOLD=$1
            ;;
    -c | --critical)
            shift
            CRITICAL_THRESHOLD=$1
            ;;
    *)  echo "Unknown argument: $1"
        exit $STATE_UNKNOWN
        ;;
  esac
  shift
done

PATH=/opt/local/bin:$PATH
NODENAME=$HOSTNAME

ERR=/tmp/redis-cli.error.$$
rm -f $ERR

function result {
  DESCRIPTION=$1
  STATUS=$2
  echo "SIDEKIQ $DESCRIPTION : ${NODENAME} ${QUEUE_SIZE} on ${QUEUE}|sidekiq_queue_${QUEUE}=${QUEUE_SIZE};${WARNING_THRESHOLD};${CRITICAL_THRESHOLD}"
  rm -f $ERR
  exit $STATUS
}

if [ "$QUEUE" != "default" -a -n "$SYSTEM" ]; then
  result "CRITICAL invalid usage: pass -q or -s but not both" $STATE_CRITICAL
fi

if [ -n "$IGNORE_QUEUES" -a "$QUEUE" != "all" ]; then
  result "CRITICAL invalid usage: -i can only be used together with -q all" $STATE_CRITICAL
fi

if [ -n "$SYSTEM" -a "$SYSTEM" != "schedule" -a "$SYSTEM" != "retry" ] ; then
  result "CRITICAL invalid usage: -s expects one of schedule or retry" $STATE_CRITICAL
fi

if [ ! -z "$PASS" ]; then
  PASS="-a $PASS"
fi
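# Sidekiq keeps each queue as a Redis list under queue:<name>, the set
# 'queues' holds all known queue names, and the retry/schedule system
# queues are sorted sets -- hence the llen, smembers and zcard calls below.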
-z "$NAMESPACE" ]; then 114 | NAMESPACE="$NAMESPACE:" 115 | fi 116 | 117 | if [ -n "$SYSTEM" ]; then 118 | QUEUE_SIZE=`redis-cli -h $HOST -p $PORT $PASS -n $DB zcard ${NAMESPACE}$SYSTEM 2>$ERR | cut -d " " -f 1` 119 | QUEUE=$SYSTEM 120 | elif [ "$QUEUE" == "all" ]; then 121 | ALL_QUEUES=`redis-cli -h $HOST -p $PORT $PASS -n $DB smembers queues 2>$ERR` 122 | QUEUE_SIZE=-1 123 | QUEUE="none" 124 | for EACH_QUEUE in $ALL_QUEUES; do 125 | [[ "$IGNORE_QUEUES" == *"$EACH_QUEUE"* ]] && continue 126 | THIS_QUEUE_SIZE=`redis-cli -h $HOST -p $PORT $PASS -n $DB llen ${NAMESPACE}queue:$EACH_QUEUE 2>$ERR | cut -d " " -f 1` 127 | if [ "$THIS_QUEUE_SIZE" -gt "$QUEUE_SIZE" ]; then 128 | QUEUE_SIZE=$THIS_QUEUE_SIZE 129 | QUEUE=$EACH_QUEUE 130 | fi 131 | done 132 | else 133 | QUEUE_SIZE=`redis-cli -h $HOST -p $PORT $PASS -n $DB llen ${NAMESPACE}queue:$QUEUE 2>$ERR | cut -d " " -f 1` 134 | fi 135 | 136 | if [ -s "$ERR" ]; then 137 | QUEUE_SIZE=`cat $ERR` 138 | result "CRITICAL" $STATE_CRITICAL 139 | fi 140 | 141 | if [ $QUEUE_SIZE -ge $WARNING_THRESHOLD ] && [ $QUEUE_SIZE -lt $CRITICAL_THRESHOLD ]; then 142 | result "WARNING" $STATE_WARNING 143 | elif [ $QUEUE_SIZE -ge $CRITICAL_THRESHOLD ]; then 144 | result "CRITICAL" $STATE_CRITICAL 145 | else 146 | result "OK" $STATE_OK 147 | fi 148 | 149 | # ensure that output from stderr is cleaned up 150 | rm -f $ERR 151 | -------------------------------------------------------------------------------- /check_twemproxy: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env ruby 2 | # ======================================================================================== 3 | # Twemproxy Status Check using JSON status page 4 | # 5 | # (c) Wanelo Inc, Distributed under Apache License 6 | # 7 | # Usage: ./check_twemproxy [-H host] [-p port] 8 | # 9 | # Dependencies: ruby with JSON parser installed. 10 | # 11 | # Returns OK/SUCCESS when all servers in the sharded cluster are connected, or 12 | # CRITICAL otherwise. 13 | # ======================================================================================== 14 | 15 | require 'optparse' 16 | require 'json' 17 | 18 | DEFAULT_PORT = 22222 19 | DEFAULT_WARNING_THRESHOLD = 0 20 | DEFAULT_CRITICAL_THRESHOLD = 10 21 | 22 | options = Struct.new('Options', :host, :port, :verbose, :warning_threshold, :critical_threshold).new 23 | options.port = DEFAULT_PORT 24 | options.warning_threshold = DEFAULT_WARNING_THRESHOLD 25 | options.critical_threshold = DEFAULT_CRITICAL_THRESHOLD 26 | 27 | optparse = OptionParser.new do |opts| 28 | opts.banner = 'Usage: check_twemproxy [-h host] [-p port]' 29 | 30 | opts.on('-H', '--host HOST', String, 'Host name or IP address') do |h| 31 | options.host = h 32 | end 33 | 34 | opts.on('-p', '--port PORT', Integer, "Port (#{DEFAULT_PORT})") do |p| 35 | options.port = p 36 | end 37 | 38 | opts.on('-w', '--warning COUNT', Integer, "Warning threshold for server problems (#{DEFAULT_WARNING_THRESHOLD})") do |w| 39 | options.warning_threshold = w 40 | end 41 | 42 | opts.on('-c', '--critical COUNT', Integer, "Critical threshold for server problems (#{DEFAULT_CRITICAL_THRESHOLD})") do |w| 43 | options.critical_threshold = w 44 | end 45 | 46 | opts.on('-v', '--verbose', 'Run verbosely') do |p| 47 | options.verbose = true 48 | end 49 | 50 | opts.on('-?', '--help', 'Display this screen') do 51 | puts opts 52 | exit 53 | end 54 | end 55 | 56 | begin 57 | optparse.parse! 
  raise OptionParser::MissingArgument.new('host is required') unless options.host
rescue OptionParser::InvalidOption, OptionParser::MissingArgument => e
  puts e.message
  puts optparse
  exit 3
end

class TwemproxyCheck
  STATE_OK=0
  STATE_WARNING=1
  STATE_CRITICAL=2
  STATE_UNKNOWN=3

  LAST_CHECK_PATTERN = '/tmp/twemproxy-%s'

  attr_accessor :disconnect_count, :error_clusters, :disconnected_servers, :options, :timeout_count, :timedout_servers

  def initialize(options)
    @options = options
    @disconnect_count = 0
    @error_clusters = Hash.new(0)
    @disconnected_servers = Hash.new(0)
    @timeout_count = 0
    @timedout_servers = Hash.new(0)
  end

  def check!
    begin
      check_twemproxy!
      persist!
      exit!
    rescue => e
      unknown!(e)
    end
  end

  protected

  # A server counts as a problem when it has no open server connections
  # and its request counter has advanced since the last persisted sample.
  def check_twemproxy!
    return if last_check_data.nil?

    check_data.keys.find_all { |k| check_data[k].is_a?(Hash) }.each do |cluster|
      check_data[cluster].keys.find_all { |v| check_data[cluster][v].is_a?(Hash) }.each do |server|
        check_timeouts!(cluster, server)
        next if connections_ok?(cluster, server)
        next if no_new_requests?(cluster, server)

        self.disconnect_count += 1
        self.disconnected_servers[server] += 1
        self.error_clusters[cluster] += 1
      end
    end
  end

  def persist!
    ::File.open(last_check_filename, 'w') do |file|
      file.write(JSON.dump(check_data))
    end
  end

  def exit!
    return critical! if critical?
    return warning! if warning?
    ok!
  end

  private

  def connections_ok?(cluster, server)
    check_data[cluster][server]['server_connections'].to_i > 0
  end

  def no_new_requests?(cluster, server)
    check_data[cluster][server]['requests'].to_i - last_check_data[cluster][server]['requests'].to_i <= 0
  end

  def timeouts_for(cluster, server)
    check_data[cluster][server]['server_timedout'].to_i - last_check_data[cluster][server]['server_timedout'].to_i
  end

  def check_timeouts!(cluster, server)
    return if timeouts_for(cluster, server) == 0
    self.timeout_count += 1
    self.error_clusters[cluster] += 1
    self.timedout_servers[server] += 1
  end

  # twemproxy serves its stats as JSON over a plain TCP port
  def check_data
    @check_data ||= JSON.parse(`nc #{options.host} #{options.port}`)
  end

  # sample persisted by the previous run; ignored when older than five minutes
  def last_check_data
    @last_check_data ||= begin
      return nil unless ::File.exist?(last_check_filename)
      return nil if ::File.ctime(last_check_filename) < Time.now - (5 * 60)
      JSON.parse(::File.read(last_check_filename))
    end
  end

  def last_check_filename
    LAST_CHECK_PATTERN % options.host
  end

  def ok?
    !critical? && !warning?
  end

  def critical?
    disconnect_count > options.critical_threshold
  end

  def warning?
    disconnect_count > options.warning_threshold
  end

  def ok!
    puts "TWEMPROXY OK : #{message}"
    dump_data
    exit STATE_OK
  end

  def critical!
    puts "TWEMPROXY CRITICAL : #{message}"
    dump_data
    exit STATE_CRITICAL
  end
  def warning!
    puts "TWEMPROXY WARNING : #{message}"
    dump_data
    exit STATE_WARNING
  end

  def unknown!(e)
    puts "TWEMPROXY UNKNOWN : #{e.message}"
    exit STATE_UNKNOWN
  end

  def message
    return options.host if ok?
    "servers: #{disconnected_servers.keys.join(',')}, clusters: #{error_clusters.keys.join(',')} | " \
      "disconnects=#{disconnect_count};" \
      "timeouts=#{timeout_count};" \
      "clusters=[#{error_clusters.keys.join(',')}];" \
      "disconnected_shards=#{disconnected_servers.keys.join(',')};" \
      "timedout_shards=#{timedout_servers.keys.join(',')}"
  end

  def dump_data
    return unless options.verbose
    check_data.keys.find_all { |k| check_data[k].is_a?(Hash) }.each do |cluster|
      check_data[cluster].keys.find_all { |v| check_data[cluster][v].is_a?(Hash) }.each do |server|
        puts "#{cluster}/#{server}: #{check_data[cluster][server]}"
      end
    end
  end
end

TwemproxyCheck.new(options).check!
--------------------------------------------------------------------------------