├── .gitignore
├── README.md
├── check_haproxy_queue
├── check_joyent_zone_mem
├── check_postgres_replication
├── check_sidekiq_queue
└── check_twemproxy


/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | nagios-checks
 2 | =============
 3 | 
 4 | Various nagios checks that we use at Wanelo.
 5 | 
 6 | check_joyent_zone_mem 
 7 | ---------------------
 8 | This script will use the Joyent tool "jinf" to validate that free RAM on the zone is within specified percentage thresholds.
 9 | 
10 | Usage: 
11 | ```
12 | ./check_joyent_zone_mem  [-w <warn_perc>] [-c <critical_perc>]
13 | ```
14 | 
15 | Example:
16 | ```
17 | ./check_joyent_zone_mem -w 75 -c 90 
18 | RSS OK : my-host.prod 47% used (4334Mb free)|rss=47%;70;85
19 | ```
20 | 
21 | check_sidekiq_queue
22 | -------------------
23 | Peeks into the Sidekiq queue using redis-cli and validates the queue depth is within a given warning/critical range.
24 | 
25 | Usage: 
26 | ```
27 | ./check_sidekiq_queue [-h host] [-p <port> ] [-a password] ([-q queue] || [ -s retry|schedule ]) [-n namespace] [-d db] [-w warn_perc] [-c critical_perc] ([-i <ignore_queues>])
28 | ```
29 | 
30 | Defaults: localhost, 6379, no password, default queue, no namespace, db=0, warning at 500, critical at 1000.
31 | 
32 | ```
33 | ./check_sidekiq_queue -h 10.100.1.12 -q activity -w 200 -c 1000
34 | SIDEKIQ OK : redis-host.prod 0 on activity|sidekiq_queue_activity=0;200;1000
35 | ```
36 | 
37 | By passing -q flag you will be getting a size of a regular sidekiq queue, while passing -s flag allows checking the size of
38 | retry and schedule sidekiq system queues.
39 | 
40 | To check for all sidekiq queues, -q flag can be set to 'all'. Thresholds will be compared for the largest queue from all the queues.
41 | To check for all sidekiq queues <b>except</b> a list of queues, -i can be passed. This option can only be used with -q flag equal to 'all'
42 | 
43 | The following example checks threshold for the largest queue among all sidekiq queues except queues `monitor_queue` and `execute_queue`
44 | ```
45 | ./check_sidekiq_queue -h 10.100.1.12 -q all -i monitor_queue,execute_queue -w 200 -c 1000
46 | SIDEKIQ OK : redis-host.prod 86 on activity|sidekiq_queue_activity=0;200;1000
47 | ```
48 | 
49 | check_postgres_replication
50 | --------------------------
51 | Checks transaction log position on a master PostgreSQL host and a replica and warns if the replica
52 | is behind by a certain amount of data.
53 | 
54 | ```
55 | Usage: ./check_postgres_replication [ options ]
56 |    -h   --host       replica host (default 127.0.0.1)
57 |    -m   --master     master fqdn or ip (required)
58 |    -U   --user       database user (default postgres)
59 |    -x   --units      units of measurement to display (KB or MB, default MB)
60 |    -w   --warning    warning threshold in bytes (default 10MB)
61 |    -c   --critical   critical threshold in bytes (default 15MB)
62 | ```
63 | 
64 | Note that `--units` is only used in the response. No math is done to translate `--warning` or `--critical`,
65 | which should be set as bytes. Thus, a 20MB warning would be set as 20971520.
66 | 
67 | check_twemproxy
68 | ---------------
69 | Nagios check that utilizes twemproxy status page, and returns OK/SUCCESS when all backend servers
70 | in the sharded cluster are connected, or CRITICAL otherwise.
71 | 
72 | ```
73 | Usage: ./check_twemproxy [-h host] [-p port]
74 | ```
75 | 
76 | Dependencies: ruby with JSON parser installed.
77 | 
78 | Example:
79 | 
80 | ```
81 | check_twemproxy --host  192.168.10.100
82 | TWEMPROXY CRITICAL : 192.168.10.100 error with redis cluster [twitter_feed] problem shards: shard003,shard006
83 | ```
84 | 
85 | ```
86 | check_twemproxy --host  192.168.10.100
87 | TWEMPROXY OK
88 | ```
89 | 


--------------------------------------------------------------------------------
/check_haproxy_queue:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | # ========================================================================================
 3 | # HAProxy nagios check of current queue depth.
 4 | #
 5 | # 2013 Wanelo Inc, Apache License.
 6 | #
 7 | # Usage: ./check_haproxy_queue [-s stats_socket] [-b backend ]
 8 | #                              [-w <warn_cnt>] [-c <critical_cnt>]
 9 | #   -b   --backend          name of backend to watch (ie "app_servers")
10 | #   -s   --stats_socket     location of haproxy stats socket (default /var/run/haproxy.sock)
11 | #   -w   --warning          warning threshold (default 100)
12 | #   -c   --critical         critical threshold (default 500)
13 | # ========================================================================================
14 | 
15 | # Nagios return codes
16 | STATE_OK=0
17 | STATE_WARNING=1
18 | STATE_CRITICAL=2
19 | STATE_UNKNOWN=3
20 | 
21 | QUEUE=0
22 | NODENAME=`cat /etc/nodename`
23 | STATS_SOCKET=/var/run/haproxy.sock
24 | 
25 | # set thresholds in bytes
26 | WARNING_THRESHOLD=100
27 | CRITICAL_THRESHOLD=500
28 | 
29 | # Parse parameters
30 | while [ $# -gt 0 ]; do
31 |     case "$1" in
32 |         -b | --backend)
33 |                 shift
34 |                 BACKEND=$1
35 |                 ;;
36 |         -s | --stats_socket)
37 |                 shift
38 |                 STATS_SOCKET=$1
39 |                 ;;
40 |         -w | --warning)
41 |                 shift
42 |                 WARNING_THRESHOLD=$1
43 |                 ;;
44 |         -c | --critical)
45 |                 shift
46 |                 CRITICAL_THRESHOLD=$1
47 |                 ;;
48 |         *)  echo "Unknown argument: $1"
49 |             exit $STATE_UNKNOWN
50 |             ;;
51 |         esac
52 | shift
53 | done
54 | 
55 | function result {
56 |   DESCRIPTION=$1
57 |   STATUS=$2
58 | 
59 |   if [ -z "$MESSAGE" ]; then
60 |     MESSAGE="current queue size is $QUEUE"
61 |   fi
62 | 
63 |   echo "QUEUE $DESCRIPTION : ${NODENAME} $MESSAGE|queue=${QUEUE};${WARNING_THRESHOLD};${CRITICAL_THRESHOLD}"
64 |   exit $STATUS
65 | }
66 | 
67 | function exit_on_error {
68 |   if [ $1 -ne 0 ]; then
69 |     MESSAGE=$2
70 |     result "CRITICAL" $STATE_CRITICAL
71 |   fi
72 | }
73 | 
74 | if [ -z "$BACKEND" ]; then
75 |   MESSAGE="missing backend parameter"
76 |   result "CRITICAL" $STATE_CRITICAL
77 | fi
78 | 
79 | STATS=`echo "show stat -1 2 -1" | nc -U $STATS_SOCKET | grep $BACKEND`
80 | exit_on_error $? "error checking stats socket"
81 | QUEUE=`echo $STATS | cut -d',' -f3`
82 | 
83 | # Output response
84 | if [ $QUEUE -ge $WARNING_THRESHOLD ] && [ $QUEUE -lt $CRITICAL_THRESHOLD ]; then
85 |   result "WARNING" $STATE_WARNING
86 | elif [ $QUEUE -ge $CRITICAL_THRESHOLD ]; then
87 |   result "CRITICAL" $STATE_CRITICAL
88 | else
89 |   result "OK" $STATE_OK
90 | fi
91 | 


--------------------------------------------------------------------------------
/check_joyent_zone_mem:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | # ========================================================================================
 3 | # Joyent Zone Memory Plugin for Nagios based on jinf/sm-meminfo utility
 4 | #
 5 | # Wanelo Inc, Apache License.
 6 | #
 7 | # Usage: ./check_joyent_zone_mem  [-w <warn_perc>] [-c <critical_perc>]
 8 | # Eg:    ./check_joyent_zone_mem  -w 60 -c 80   # warning at 60% or higher used, critical at 80%
 9 | # ========================================================================================
10 | 
11 | # Nagios return codes
12 | STATE_OK=0
13 | STATE_WARNING=1
14 | STATE_CRITICAL=2
15 | STATE_UNKNOWN=3
16 | 
17 | if [ -f "/opt/local/bin/sm-meminfo" ]; then
18 |   MEM_CMD="sm-meminfo -p rss"
19 | else
20 |   MEM_CMD="jinf -p -m"
21 | fi
22 | 
23 | WARNING_THRESHOLD=70
24 | CRITICAL_THRESHOLD=85
25 | # Parse parameters
26 | while [ $# -gt 0 ]; do
27 |     case "$1" in
28 |         -w | --warning)
29 |                 shift
30 |                 WARNING_THRESHOLD=$1
31 |                 ;;
32 |         -c | --critical)
33 |                 shift
34 |                 CRITICAL_THRESHOLD=$1
35 |                 ;;
36 |         *)  echo "Unknown argument: $1"
37 |             exit $STATE_UNKNOWN
38 |             ;;
39 |         esac
40 | shift
41 | done
42 | 
43 | PATH=/opt/local/bin:$PATH
44 | read TOTAL_MEM USED_MEM FREE_MEM <<< $($MEM_CMD | grep mem_ | cut -f 2 -d ':')
45 | RSS=$(($USED_MEM * 100 / $TOTAL_MEM))
46 | RSS_FREE=$(($FREE_MEM / (1024 * 1024)))
47 | NODENAME=`cat /etc/nodename`
48 | 
49 | function result {
50 |   DESCRIPTION=$1
51 |   STATUS=$2
52 |   echo "RSS $DESCRIPTION : ${NODENAME} ${RSS}% used (${RSS_FREE}Mb free)|rss=${RSS}%;${WARNING_THRESHOLD};${CRITICAL_THRESHOLD}"
53 |   exit $STATUS
54 | }
55 | 
56 | if [ $RSS -ge $WARNING_THRESHOLD ] && [ $RSS -lt $CRITICAL_THRESHOLD ]; then
57 |   result "WARNING" $STATE_WARNING
58 | elif [ $RSS -ge $CRITICAL_THRESHOLD ]; then
59 |   result "CRITICAL" $STATE_CRITICAL
60 | else
61 |   result "OK" $STATE_OK
62 | fi
63 | 


--------------------------------------------------------------------------------
/check_postgres_replication:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | # ========================================================================================
  3 | # Postgres replication lag nagios check using psql and bash.
  4 | #
  5 | # 2013 Wanelo Inc, Apache License.
  6 | # This script expects psql to be in the PATH.
  7 | #
  8 | # Usage: ./check_postgres_replication [ -h <host> ] [ -m <master> ] [ -U user ] [ -x <units> ]
  9 | #                                      [-w <warn_perc>] [-c <critical_perc>]
 10 | #   -h   --host       replica host (default 127.0.0.1)
 11 | #   -m   --master     master fqdn or ip (required)
 12 | #   -U   --user       database user (default postgres)
 13 | #   -x   --units      units of measurement to display (KB or MB, default MB)
 14 | #   -w   --warning    warning threshold (default 10MB)
 15 | #   -c   --critical   critical threshold (default 15MB)
 16 | # ========================================================================================
 17 | 
 18 | # Nagios return codes
 19 | readonly STATE_OK=0
 20 | readonly STATE_WARNING=1
 21 | readonly STATE_CRITICAL=2
 22 | readonly STATE_UNKNOWN=3
 23 | 
 24 | readonly ARGS="$@"
 25 | 
 26 | # set thresholds in bytes
 27 | readonly DEFAULT_WARNING_THRESHOLD=10485760
 28 | readonly DEFAULT_CRITICAL_THRESHOLD=15728640
 29 | 
 30 | readonly DEFAULT_HOST="127.0.0.1"
 31 | readonly DEFAULT_USER=postgres
 32 | readonly DEFAULT_UNITS=MB
 33 | 
 34 | readonly PATH=/opt/local/bin:${PATH}
 35 | readonly NODENAME=$(cat /etc/nodename)
 36 | readonly MASTER_SQL="SELECT pg_current_xlog_location()"
 37 | readonly REPLICA_SQL="SELECT pg_last_xlog_replay_location()"
 38 | readonly REPLICA_TIME_LAG="select now() - pg_last_xact_replay_timestamp()"
 39 | readonly ERR=/tmp/repl_chec.$$
 40 | 
 41 | usage() {
 42 |   cat <<-EOF
 43 | Usage: ./check_postgres_replication [ -h <host> ] [ -m <master> ] [ -U user ] [ -x <units> ]
 44 |                                     [-w <warn_perc>] [-c <critical_perc>]
 45 |    -h   --host       replica host (default 127.0.0.1)
 46 |    -m   --master     master fqdn or ip (required)
 47 |    -U   --user       database user (default postgres)
 48 |    -x   --units      units of measurement to display (KB or MB, default MB)
 49 |    -w   --warning    warning threshold (default 10MB)
 50 |    -c   --critical   critical threshold (default 15MB)
 51 | 
 52 |         --help       show this message
 53 |         --verbose
 54 | EOF
 55 | }
 56 | 
 57 | # Parse parameters
 58 | parse_arguments() {
 59 |   local arg=$1
 60 |   for arg; do
 61 |     local delim=""
 62 |     case "$arg" in
 63 |       --host)      args="${args}-h ";;
 64 |       --master)    args="${args}-m ";;
 65 |       --user)      args="${args}-U ";;
 66 |       --units)     args="${args}-x ";;
 67 |       --warning)   args="${args}-w ";;
 68 |       --critical)  args="${args}-c ";;
 69 |       --help)      args="${args}-H ";;
 70 |       --verbose)   args="${args}-v ";;
 71 |       *) [[ "${arg:0:1}" == "-" ]] || delim="\""
 72 |           args="${args}${delim}${arg}${delim} ";;
 73 |     esac
 74 |   done
 75 | 
 76 |   eval set -- $args
 77 | 
 78 |   while getopts "h:m:U:x:w:c:Hv" OPTION
 79 |   do
 80 |     case $OPTION in
 81 |     v)
 82 |         set -x
 83 |         ;;
 84 |     H)
 85 |         usage
 86 |         exit
 87 |         ;;
 88 |     h)
 89 |         local host=$OPTARG
 90 |         ;;
 91 |     m)
 92 |         readonly MASTER=$OPTARG
 93 |         ;;
 94 |     U)
 95 |         local user=$OPTARG
 96 |         ;;
 97 |     x)
 98 |         local units=$OPTARG
 99 |         ;;
100 |     w)
101 |         local warning_threshold=$OPTARG
102 |         ;;
103 |     c)
104 |         local critical_threshold=$OPTARG
105 |         ;;
106 |     esac
107 |   done
108 | 
109 |   readonly USER=${user:-$DEFAULT_USER}
110 |   readonly HOST=${host:-$DEFAULT_HOST}
111 |   readonly UNITS=${units:-$DEFAULT_UNITS}
112 |   readonly WARNING_THRESHOLD=${warning_threshold:-$DEFAULT_WARNING_THRESHOLD}
113 |   readonly CRITICAL_THRESHOLD=${critical_threshold:-$DEFAULT_CRITICAL_THRESHOLD}
114 | }
115 | 
116 | check_required_arguments() {
117 |   if [ -z "$MASTER" ]; then
118 |     echo "pass master host in parameters via -m flag"
119 |     exit 1
120 |   fi
121 | }
122 | 
123 | normalize_units() {
124 |   # Error checking of arguments
125 |   case "$UNITS" in
126 |     KB)
127 |       readonly DIVISOR=1024
128 |       ;;
129 |     MB)
130 |       readonly DIVISOR=1048576
131 |       ;;
132 |     *)
133 |       echo "Incorrect unit of measurement"
134 |       usage
135 |       exit 1
136 |       ;;
137 |   esac
138 | }
139 | 
140 | result() {
141 |   local description=$1
142 |   local status=$2
143 |   local diff=$3
144 |   local time_lag=$4
145 | 
146 |   local error=$(cat $ERR 2>/dev/null)
147 | 
148 |   if [[ "${status}" -eq "${STATE_CRITICAL}" && ! -z "${error}" ]]; then
149 |     local message="replication check error ${error}"
150 |   else
151 |     local diff_units=$(bytes_to_units $diff)
152 |     local message="replication lag is ${diff_units}${UNITS} : time lag is ${time_lag}"
153 |   fi
154 |   echo "REPLICATION $description : ${NODENAME} $message|repl=${diff},time_lag=${time_lag};${WARNING_THRESHOLD};${CRITICAL_THRESHOLD}"
155 |   rm -f $ERR
156 |   exit $status
157 | }
158 | 
159 | get_replica_current_xlog() {
160 |   echo $(psql -U $USER -Atc "$REPLICA_SQL" -h $HOST 2>$ERR)
161 | }
162 | 
163 | get_master_current_xlog() {
164 |   echo $(psql -U $USER -Atc "$MASTER_SQL" -h $MASTER 2>$ERR)
165 | }
166 | 
167 | check_replica_time_lag() {
168 |   echo $(psql -U $USER -Atc "${REPLICA_TIME_LAG}" -h ${HOST} 2>${ERR})
169 | }
170 | 
171 | check_errors() {
172 |   if [ $1 -ne 0 ]; then
173 |     result "CRITICAL" $STATE_CRITICAL
174 |   fi
175 | }
176 | 
177 | xlog_to_bytes() {
178 |   # http://eulerto.blogspot.com/2011/11/understanding-wal-nomenclature.html
179 |   local logid="${1%%/*}"
180 |   local offset="${1##*/}"
181 |   echo $((0xFF000000 * 0x$logid + 0x$offset))
182 | }
183 | 
184 | bytes_to_units() {
185 |   local diff=$1
186 |   if [ -z "$diff" ]; then
187 |     echo "ERROR: NO DATA AVAILABLE"
188 |   else
189 |     echo $(( $diff / $DIVISOR ))
190 |   fi
191 | }
192 | 
193 | main() {
194 |   parse_arguments $ARGS
195 |   check_required_arguments
196 |   normalize_units
197 | 
198 |   local replica_xlog=$(get_replica_current_xlog)
199 |   check_errors $?
200 |   local replica_bytes=$(xlog_to_bytes ${replica_xlog})
201 | 
202 |   if [ -z "${replica_xlog}" ]; then
203 |     echo -n "Unable to find replica XLOG replay location" > $ERR
204 |     result "CRITICAL" $STATE_CRITICAL
205 |   fi
206 | 
207 |   # Query master and replica for latest xlog
208 |   local master_xlog=$(get_master_current_xlog)
209 |   check_errors $?
210 |   local master_bytes=$(xlog_to_bytes $master_xlog)
211 | 
212 |   # Calculate xlog diff in bytes
213 |   local diff=$(($master_bytes - $replica_bytes))
214 | 
215 |   local time_lag=$(check_replica_time_lag)
216 | 
217 |   # Output response
218 |   if [ $diff -ge $WARNING_THRESHOLD ] && [ $diff -lt $CRITICAL_THRESHOLD ]; then
219 |     result "WARNING" $STATE_WARNING $diff $time_lag
220 |   elif [ $diff -ge $CRITICAL_THRESHOLD ]; then
221 |     result "CRITICAL" $STATE_CRITICAL $diff $time_lag
222 |   else
223 |     result "OK" $STATE_OK $diff $time_lag
224 |   fi
225 | 
226 |   rm -f $ERR
227 | }
228 | 
229 | main
230 | 


--------------------------------------------------------------------------------
/check_sidekiq_queue:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | # ========================================================================================
  3 | # Sidekiq Queue Size Nagios Check
  4 | #
  5 | # (c) Wanelo Inc, Distributed under Apache License
  6 | #
  7 | # Usage:
  8 | # To check a regular queue:
  9 | #        ./check_sidekiq_queue [ -h <host> ] [-p <port> ] [ -a <password> ] [ -q <default> ] [ <-n mq> ] [ -d <redis-db> ] [-w <warn_perc>] [-c <critical_perc>] [-i <ignore_queues>]
 10 | # Eg:    ./check_sidekiq_queue -w 500 -c 2000   # warning at 500 or higher used, critical at 2000 or higher
 11 | #
 12 | # To check schedule or retry (system) queue:
 13 | #        ./check_sidekiq_queue [ -h <host> ] [ -a <password> ] [ -s <schedule|retry> ] [ <-n mq> ] [ -d <redis-db> ] [-w <warn_perc>] [-c <critical_perc>]
 14 | #
 15 | # ========================================================================================
 16 | 
 17 | # Nagios return codes
 18 | STATE_OK=0
 19 | STATE_WARNING=1
 20 | STATE_CRITICAL=2
 21 | STATE_UNKNOWN=3
 22 | 
 23 | WARNING_THRESHOLD=500
 24 | CRITICAL_THRESHOLD=1000
 25 | QUEUE="default"
 26 | SYSTEM=""
 27 | NAMESPACE=""
 28 | HOST="127.0.0.1"
 29 | PORT="6379"
 30 | PASS=""
 31 | DB=0
 32 | 
 33 | # Parse parameters
 34 | while [ $# -gt 0 ]; do
 35 |     case "$1" in
 36 |         -d | --db)
 37 |                 shift
 38 |                 DB=$1
 39 |                 ;;
 40 |         -h | --hostname)
 41 |                 shift
 42 |                 HOST=$1
 43 |                 ;;
 44 |         -p | --port)
 45 |                 shift
 46 |                 PORT=$1
 47 |                 ;;
 48 |         -a | --password)
 49 |                 shift
 50 |                 PASS=$1
 51 |                 ;;
 52 |         -q | --queue)
 53 |                 shift
 54 |                 QUEUE=$1
 55 |                 ;;
 56 |         -i | --ignore_queues)
 57 |                 shift
 58 |                 IGNORE_QUEUES=$1
 59 |                 ;;
 60 |         -n | --namespace)
 61 |                 shift
 62 |                 NAMESPACE=$1
 63 |                 ;;
 64 |         -s | --system)
 65 |                 shift
 66 |                 SYSTEM=$1
 67 |                 ;;
 68 |         -w | --warning)
 69 |                 shift
 70 |                 WARNING_THRESHOLD=$1
 71 |                 ;;
 72 |         -c | --critical)
 73 |                 shift
 74 |                 CRITICAL_THRESHOLD=$1
 75 |                 ;;
 76 |         *)  echo "Unknown argument: $1"
 77 |             exit $STATE_UNKNOWN
 78 |             ;;
 79 |         esac
 80 | shift
 81 | done
 82 | 
 83 | PATH=/opt/local/bin:$PATH
 84 | NODENAME=$HOSTNAME
 85 | 
 86 | ERR=/tmp/redis-cli.error.$$
 87 | rm -f $ERR
 88 | 
 89 | function result {
 90 |   DESCRIPTION=$1
 91 |   STATUS=$2
 92 |   echo "SIDEKIQ $DESCRIPTION : ${NODENAME} ${QUEUE_SIZE} on ${QUEUE}|sidekiq_queue_${QUEUE}=${QUEUE_SIZE};${WARNING_THRESHOLD};${CRITICAL_THRESHOLD}"
 93 |   rm -f $ERR
 94 |   exit $STATUS
 95 | }
 96 | 
 97 | if [ "$QUEUE" != "default" -a -n "$SYSTEM" ]; then
 98 |   result "CRITICAL invalid usage: pass -q or -s but not both", $STATE_CRITICAL
 99 | fi
100 | 
101 | if [ -n "$IGNORE_QUEUES" -a "$QUEUE" != "all" ]; then
102 |   result "CRITICAL invalid usage: ignore_queues can only be used with QUEUE = all as value", $STATE_CRITICAL
103 | fi
104 | 
105 | if [ -n "$SYSTEM" -a "$SYSTEM" != "schedule" -a "$SYSTEM" != "retry" ] ; then
106 |   result "CRITICAL invalid usage: -s expect one of schedule or retry", $STATE_CRITICAL
107 | fi
108 | 
109 | if [ ! -z "$PASS"  ]; then
110 |   PASS="-a $PASS"
111 | fi
112 | 
113 | if [ ! -z "$NAMESPACE"  ]; then
114 |  NAMESPACE="$NAMESPACE:"
115 | fi
116 | 
117 | if [ -n "$SYSTEM" ]; then
118 |   QUEUE_SIZE=`redis-cli -h $HOST -p $PORT $PASS -n $DB zcard ${NAMESPACE}$SYSTEM 2>$ERR | cut -d " " -f 1`
119 |   QUEUE=$SYSTEM
120 | elif [ "$QUEUE" == "all" ]; then
121 |   ALL_QUEUES=`redis-cli -h $HOST -p $PORT $PASS -n $DB smembers queues 2>$ERR`
122 |   QUEUE_SIZE=-1
123 |   QUEUE="none"
124 |   for EACH_QUEUE in $ALL_QUEUES; do
125 |     [[ "$IGNORE_QUEUES" == *"$EACH_QUEUE"* ]] && continue
126 |     THIS_QUEUE_SIZE=`redis-cli -h $HOST -p $PORT $PASS -n $DB llen ${NAMESPACE}queue:$EACH_QUEUE 2>$ERR | cut -d " " -f 1`
127 |     if [ "$THIS_QUEUE_SIZE" -gt "$QUEUE_SIZE" ]; then
128 |       QUEUE_SIZE=$THIS_QUEUE_SIZE
129 |       QUEUE=$EACH_QUEUE
130 |     fi
131 |   done
132 | else
133 |   QUEUE_SIZE=`redis-cli -h $HOST -p $PORT $PASS -n $DB llen ${NAMESPACE}queue:$QUEUE 2>$ERR | cut -d " " -f 1`
134 | fi
135 | 
136 | if [ -s "$ERR" ];  then
137 |   QUEUE_SIZE=`cat $ERR`
138 |   result "CRITICAL" $STATE_CRITICAL
139 | fi
140 | 
141 | if [ $QUEUE_SIZE -ge $WARNING_THRESHOLD ] && [ $QUEUE_SIZE -lt $CRITICAL_THRESHOLD ]; then
142 |   result "WARNING" $STATE_WARNING
143 | elif [ $QUEUE_SIZE -ge $CRITICAL_THRESHOLD ]; then
144 |   result "CRITICAL" $STATE_CRITICAL
145 | else
146 |   result "OK" $STATE_OK
147 | fi
148 | 
149 | # ensure that output from stderr is cleaned up
150 | rm -f $ERR
151 | 


--------------------------------------------------------------------------------
/check_twemproxy:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env ruby
  2 | # ========================================================================================
  3 | # Twemproxy Status Check using JSON status page
  4 | #
  5 | # (c) Wanelo Inc, Distributed under Apache License
  6 | #
  7 | # Usage: ./check_twemproxy [-H host] [-p port]
  8 | #
  9 | # Dependencies: ruby with JSON parser installed.
 10 | #
 11 | # Returns OK/SUCCESS when all servers in the sharded cluster are connected, or
 12 | # CRITICAL otherwise.
 13 | # ========================================================================================
 14 | 
 15 | require 'optparse'
 16 | require 'json'
 17 | 
 18 | DEFAULT_PORT = 22222
 19 | DEFAULT_WARNING_THRESHOLD = 0
 20 | DEFAULT_CRITICAL_THRESHOLD = 10
 21 | 
 22 | options = Struct.new('Options', :host, :port, :verbose, :warning_threshold, :critical_threshold).new
 23 | options.port = DEFAULT_PORT
 24 | options.warning_threshold = DEFAULT_WARNING_THRESHOLD
 25 | options.critical_threshold = DEFAULT_CRITICAL_THRESHOLD
 26 | 
 27 | optparse = OptionParser.new do |opts|
 28 |   opts.banner = 'Usage: check_twemproxy [-h host] [-p port]'
 29 | 
 30 |   opts.on('-H', '--host HOST', String, 'Host name or IP address') do |h|
 31 |     options.host = h
 32 |   end
 33 | 
 34 |   opts.on('-p', '--port PORT', Integer, "Port (#{DEFAULT_PORT})") do |p|
 35 |     options.port = p
 36 |   end
 37 | 
 38 |   opts.on('-w', '--warning COUNT', Integer, "Warning threshold for server problems (#{DEFAULT_WARNING_THRESHOLD})") do |w|
 39 |     options.warning_threshold = w
 40 |   end
 41 | 
 42 |   opts.on('-c', '--critical COUNT', Integer, "Critical threshold for server problems (#{DEFAULT_CRITICAL_THRESHOLD})") do |w|
 43 |     options.critical_threshold = w
 44 |   end
 45 | 
 46 |   opts.on('-v', '--verbose', 'Run verbosely') do |p|
 47 |     options.verbose = true
 48 |   end
 49 | 
 50 |   opts.on('-?', '--help', 'Display this screen') do
 51 |     puts opts
 52 |     exit
 53 |   end
 54 | end
 55 | 
 56 | begin
 57 |   optparse.parse!
 58 |   raise OptionParser::MissingArgument.new('host is required') unless options.host
 59 | rescue OptionParser::InvalidOption, OptionParser::MissingArgument => e
 60 |   puts e.message
 61 |   puts optparse
 62 |   exit 3
 63 | end
 64 | 
 65 | class TwemproxyCheck
 66 |   STATE_OK=0
 67 |   STATE_WARNING=1
 68 |   STATE_CRITICAL=2
 69 |   STATE_UNKNOWN=3
 70 | 
 71 |   LAST_CHECK_PATTERN = '/tmp/twemproxy-%s'
 72 | 
 73 |   attr_accessor :disconnect_count, :error_clusters, :disconnected_servers, :options, :timeout_count, :timedout_servers
 74 | 
 75 |   def initialize(options)
 76 |     @options = options
 77 |     @disconnect_count = 0
 78 |     @error_clusters = Hash.new(0)
 79 |     @disconnected_servers = Hash.new(0)
 80 |     @timeout_count = 0
 81 |     @timedout_servers = Hash.new(0)
 82 |   end
 83 | 
 84 |   def check!
 85 |     begin
 86 |       check_twemproxy!
 87 |       persist!
 88 |       exit!
 89 |     rescue => e
 90 |       unknown!(e)
 91 |     end
 92 |   end
 93 | 
 94 |   protected
 95 | 
 96 |   def check_twemproxy!
 97 |     return if last_check_data.nil?
 98 | 
 99 |     check_data.keys.find_all { |k| check_data[k].is_a?(Hash) }.each do |cluster|
100 |       check_data[cluster].keys.find_all { |v| check_data[cluster][v].is_a?(Hash) }.each do |server|
101 |         check_timeouts!(cluster, server)
102 |         next if connections_ok?(cluster, server)
103 |         next if no_new_requests?(cluster, server)
104 | 
105 |         self.disconnect_count += 1
106 |         self.disconnected_servers[server] += 1
107 |         self.error_clusters[cluster] += 1
108 |       end
109 |     end
110 |   end
111 | 
112 |   def persist!
113 |     ::File.open(last_check_filename, 'w') do |file|
114 |       file.write(JSON.dump(check_data))
115 |     end
116 |   end
117 | 
118 |   def exit!
119 |     return critical! if critical?
120 |     return warning! if warning?
121 |     ok!
122 |   end
123 | 
124 |   private
125 | 
126 |   def connections_ok?(cluster, server)
127 |     check_data[cluster][server]['server_connections'].to_i > 0
128 |   end
129 | 
130 |   def no_new_requests?(cluster, server)
131 |     check_data[cluster][server]['requests'].to_i - last_check_data[cluster][server]['requests'].to_i <= 0
132 |   end
133 | 
134 |   def timeouts_for(cluster, server)
135 |     check_data[cluster][server]['server_timedout'].to_i - last_check_data[cluster][server]['server_timedout'].to_i
136 |   end
137 | 
138 |   def check_timeouts!(cluster, server)
139 |     return if timeouts_for(cluster, server) == 0
140 |     self.timeout_count += 1
141 |     self.error_clusters[cluster] += 1
142 |     self.timedout_servers[server] += 1
143 |   end
144 | 
145 |   def check_data
146 |     @check_data ||= JSON.parse(`nc #{options.host} #{options.port}`)
147 |   end
148 | 
149 |   def last_check_data
150 |     @last_check_data ||= begin
151 |       return nil unless ::File.exist?(last_check_filename)
152 |       return nil if ::File.ctime(last_check_filename) < Time.now - (5 * 60)
153 |       JSON.parse(::File.read(last_check_filename))
154 |     end
155 |   end
156 | 
157 |   def last_check_filename
158 |     LAST_CHECK_PATTERN % options.host
159 |   end
160 | 
161 |   def ok?
162 |     !critical? && !warning?
163 |   end
164 | 
165 |   def critical?
166 |     disconnect_count > options.critical_threshold
167 |   end
168 | 
169 |   def warning?
170 |     disconnect_count > options.warning_threshold
171 |   end
172 | 
173 |   def ok!
174 |     puts "TWEMPROXY OK : #{message}"
175 |     dump_data
176 |     exit STATE_OK
177 |   end
178 | 
179 |   def critical!
180 |     puts "TWEMPROXY CRITICAL : #{message}"
181 |     dump_data
182 |     exit STATE_CRITICAL
183 |   end
184 | 
185 |   def warning!
186 |     puts "TWEMPROXY WARNING : #{message}"
187 |     dump_data
188 |     exit STATE_WARNING
189 |   end
190 | 
191 |   def unknown!(e)
192 |     puts "TWEMPROXY UNKNOWN : #{e.message}"
193 |     exit STATE_UNKNOWN
194 |   end
195 | 
196 |   def message
197 |     return options.host if ok?
198 |     "servers: #{disconnected_servers.keys.join(',')}, clusters: #{error_clusters.keys.join(',')} | " \
199 |     "disconnects=#{disconnect_count};" \
200 |     "timeouts=#{timeout_count};" \
201 |     "clusters=[#{error_clusters.keys.join(',')}];" \
202 |     "disconnected_shards=#{disconnected_servers.keys.join(',')};" \
203 |     "timedout_shards=#{timedout_servers.keys.join(',')}"
204 |   end
205 | 
206 |   def dump_data
207 |     return unless options.verbose
208 |     check_data.keys.find_all { |k| check_data[k].is_a?(Hash) }.each do |cluster|
209 |       check_data[cluster].keys.find_all { |v| check_data[cluster][v].is_a?(Hash) }.each do |server|
210 |         puts "#{cluster}/#{server}: #{check_data[cluster][server].to_s}"
211 |       end
212 |     end
213 |   end
214 | end
215 | 
216 | TwemproxyCheck.new(options).check!
217 | 


--------------------------------------------------------------------------------