├── README.md ├── cassandra_repair_scheduler.py ├── casstop ├── poison_pill_tester └── stop_cassandra_repairs /README.md: -------------------------------------------------------------------------------- 1 | cassandra_tools 2 | ======= 3 | 4 | # casstop 5 | 6 | "top"-like tool for Cassandra. It provides a real-time view of the state of your Cassandra cluster. Requires http://mx4j.sourceforge.net/. 7 | 8 | ## Usage 9 | ``` 10 | casstop $NODENAME [$NODENAME ...] 11 | ``` 12 | 13 | # stop_cassandra_repairs 14 | 15 | Cassandra repairs have an unfortunate tendency to hang, and there is no stock tool to kill off a hung repair, so it ties up resources on the affected nodes until they are restarted. stop_cassandra_repairs uses MX4J to stop any outstanding repairs on the nodes you give it. Requires http://mx4j.sourceforge.net/. 16 | 17 | ## Usage 18 | ``` 19 | stop_cassandra_repairs $HUNG_NODE [$HUNG_NODE ...] 20 | ``` 21 | 22 | # cassandra_repair_scheduler.py 23 | 24 | Script for scheduling repairs on your cluster. Requires 25 | https://github.com/BrianGallew/cassandra_range_repair to work.
26 | 27 | ## Basic Usage 28 | ``` 29 | echo '0 */4 * * * root /usr/local/bin/cassandra_repair_scheduler.py' >> /etc/crontab 30 | ``` 31 | 32 | ## Help 33 | ``` 34 | usage: cassandra_repair_scheduler.py [-h] [-v] [-d] [--syslog FACILITY] 35 | [--logfile FILENAME] [-H HOSTNAME] 36 | [-p PORT] [-U USERNAME] [-P PASSWORD] 37 | [-t TTL] [-k KEYSPACE] 38 | [--cqlversion CQLVERSION] 39 | [-r RANGE_REPAIR_TOOL] 40 | 41 | optional arguments: 42 | -h, --help show this help message and exit 43 | -v, --verbose Verbose output 44 | -d, --debug Debugging output 45 | --syslog FACILITY Send log messages to the syslog 46 | --logfile FILENAME Send log messages to a file 47 | -H HOSTNAME, --hostname HOSTNAME 48 | Hostname (default: mactheknife.local) 49 | -p PORT, --port PORT Port (default: 9160) 50 | -U USERNAME, --username USERNAME 51 | Username (if necessary) 52 | -P PASSWORD, --password PASSWORD 53 | Password. (prompt if user provided but not password) 54 | -t TTL, --ttl TTL TTL (default: 1728000) 55 | -k KEYSPACE, --keyspace KEYSPACE 56 | Keyspace to use (default: operations) 57 | --cqlversion CQLVERSION 58 | CQL version (default: 3.0.5) 59 | -r RANGE_REPAIR_TOOL, --range_repair_tool RANGE_REPAIR_TOOL 60 | Range repair tool path (default: 61 | /usr/local/bin/range_repair.py) 62 | --watch See the live repair status. 63 | ``` 64 | 65 | # poison_pill_tester 66 | 67 | Script for discovering wide rows. You'll need to convert your data to JSON first. 68 | 69 | ## Usage 70 | ``` 71 | nodetool snapshot keyspace suspect_column_family 72 | #(cd into the relevant snapshot directory) 73 | for d in *-Data.db 74 | do 75 | e=$(echo $d | sed s,-Data.db,.json,) 76 | sstable2json $d > $e 77 | done 78 | poison_pill_tester *.json 79 | ``` 80 | 81 | # MX4J 82 | 83 | http://mx4j.sourceforge.net/ is, among other things, a JMX<->HTML bridge. 84 | 85 | ## Installation 86 | 87 | MX4J is available via "apt-get install libmx4j-java" on Ubuntu (that's 88 | where it's currently used).
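Every tool in this collection talks to that bridge with plain HTTP GETs against the node's MX4J port. A rough sketch of what such a read looks like — the hostname and MBean below are illustrative only; port 8081 and the `getattribute`/`template=identity` URL shape are the ones casstop itself uses:

```python
# Minimal sketch of an MX4J attribute read, in the style casstop uses.
# Assumptions: the MX4J HTTP adaptor listens on port 8081 and the
# "identity" template returns raw XML; hostname/MBean are illustrative.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

URL_TEMPLATE = ('http://{host}:8081/getattribute'
                '?objectname={mbean}&attribute={attr}&template=identity')

def attribute_url(host, mbean, attr):
    """Build the MX4J URL that reads one attribute of one MBean."""
    return URL_TEMPLATE.format(host=host, mbean=mbean, attr=attr)

def read_attribute(host, mbean, attr, timeout=30):
    """Fetch the attribute as raw XML; the caller parses it (casstop
    uses xmltodict for this)."""
    return urlopen(attribute_url(host, mbean, attr), None, timeout).read()

# For example, a node's operation mode (the STATUS column in casstop):
# read_attribute('cassandra-01.example.com',
#                'org.apache.cassandra.db:type=StorageService',
#                'OperationMode')
```

(casstop's operation classes hit the same adaptor's `invoke` endpoint to call MBean operations rather than read attributes.)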
89 | 90 | Once that library is installed, you need to ensure that it's in Cassandra's 91 | load path, most easily done by symlinking it like this: 92 | 93 | ``` 94 | ln -s /usr/share/java/mx4j-tools.jar /usr/share/cassandra/lib/mx4j-tools.jar 95 | ``` 96 | 97 | Finally, you need to load it into Cassandra. Edit cassandra-env.sh and add 98 | the following line: 99 | 100 | ``` 101 | JVM_OPTS="${JVM_OPTS} -Dmx4jaddress=$(ip addr show dev eth0 | grep 'inet ' | sed -e s,inet,, -e 's,/.*,,')" 102 | ``` 103 | 104 | NB: that assumes you have only one address on eth0, and that eth0 is what 105 | you want MX4J listening on. 106 | -------------------------------------------------------------------------------- /cassandra_repair_scheduler.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | # Author: Brian Gallew or 4 | 5 | """ 6 | Run repairs on a regularly scheduled basis. Drop this in cron, not more 7 | often than hourly. 8 | 9 | Workflow: 10 | 1) Insert a record (with TTL) into the queue table. 11 | 2) Select all records from the queue table. 12 | 2a) If ours isn't the first one, delete ours and then exit. 13 | 3) Select all records from the status table. 14 | 4) If our status exists and is "running", delete our queue record and exit. 15 | 5) If our status exists and is "completed" and the completion time is too 16 | recent, delete our queue record and exit.
17 | 6) Create/replace our status with "running" (with TTL) 18 | 7) Delete our queue record 19 | 8) Run a repair 20 | 9) Replace our status with "completed" (with TTL) 21 | 22 | """ 23 | 24 | import logging, logging.handlers 25 | import argparse 26 | import platform 27 | import getpass 28 | import time 29 | import subprocess 30 | import cql 31 | import curses 32 | import curses.wrapper 33 | import threading 34 | 35 | COMPLETED = "Completed" 36 | DELAY = 'delay' 37 | 38 | class CqlWrapper(object): 39 | 40 | """Keep all of the CQL-specific stuff in here so we can have consistent 41 | retry handling, etc. 42 | 43 | Updates to SCHEMA may require updates to create_schema. 44 | """ 45 | SCHEMA = [ 46 | """CREATE KEYSPACE "{keyspace}" 47 | WITH replication = {{'class' : 'NetworkTopologyStrategy', 48 | {data_center_replication_map}}} 49 | AND durable_writes = false; 50 | """, 51 | """USE {keyspace};""", 52 | """CREATE TABLE "mutex" ( 53 | nodename varchar, 54 | data_center varchar, 55 | PRIMARY KEY ((nodename), data_center)) 56 | WITH comment='Poor MUTEX implementation' 57 | """, 58 | """CREATE TABLE "repair_status" ( 59 | nodename varchar, 60 | data_center varchar, 61 | repair_status varchar, 62 | PRIMARY KEY ((nodename), data_center)) 63 | WITH comment='Repair status of each node' 64 | """, 65 | ] 66 | GET_STATUS = """SELECT "repair_status" FROM "repair_status" 67 | WHERE "nodename" = :nodename AND "data_center" = :data_center""" 68 | GET_LOCAL_STATUS = """SELECT "nodename", "repair_status" FROM "repair_status" 69 | WHERE "data_center" = :data_center ALLOW FILTERING""" 70 | # This next statement could get ugly if you have 1000+ nodes.
71 | GET_ALL_STATUS = """SELECT "nodename", "data_center", "repair_status", WRITETIME("repair_status") FROM "repair_status" """ 72 | MUTEX_START = """INSERT INTO "mutex" ("nodename", "data_center") 73 | VALUES (:nodename, :data_center) USING TTL :ttl""" 74 | MUTEX_CHECK = """SELECT "nodename", "data_center" FROM "mutex" """ 75 | MUTEX_CLEANUP = """DELETE FROM "mutex" WHERE "nodename" = :nodename AND "data_center" = :data_center""" 76 | SELECT_ALL_DATACENTERS = """SELECT data_center FROM system.peers""" 77 | SELECT_MY_DATACENTER = """SELECT data_center FROM system.local""" 78 | REPAIR_START = """INSERT INTO "repair_status" ("nodename", "data_center", "repair_status") 79 | VALUES (:nodename, :data_center, 'Started') USING TTL :ttl""" 80 | REPAIR_UPDATE = """UPDATE "repair_status" USING TTL :ttl SET "repair_status" = :newstatus 81 | WHERE "nodename" = :nodename AND "data_center" = :data_center""" 82 | REPAIR_CLEANUP = """DELETE FROM "repair_status" WHERE "nodename" = :nodename AND "data_center" = :data_center""" 83 | 84 | def __init__(self, option_group): 85 | """Set up and manage our connection. 86 | :param option_group: result of CLI parsing 87 | """ 88 | self.option_group = option_group 89 | self.nodename = option_group.hostname 90 | self.conn = None 91 | try: 92 | self.standard_connection() 93 | except: 94 | self.create_schema() 95 | self.data_center = self.get_data_center() 96 | return 97 | 98 | def get_data_center(self): 99 | """Get our data center tag. 100 | :returns: data_center""" 101 | result = self.query_or_die( 102 | self.SELECT_MY_DATACENTER, "Looking for my datacenter") 103 | if not result: 104 | logging.fatal( 105 | "No data center in local data. Still bootstrapping?") 106 | exit(1) 107 | return result[0][0] 108 | 109 | def standard_connection(self): 110 | """Set up a connection to Cassandra. 
111 | """ 112 | logging.debug('connecting to %s', self.option_group.keyspace) 113 | self.conn = cql.connect(self.option_group.hostname, 114 | self.option_group.port, 115 | self.option_group.keyspace, 116 | user=self.option_group.username, 117 | password=self.option_group.password, 118 | cql_version=self.option_group.cqlversion) 119 | return 120 | 121 | def create_schema(self): 122 | """Creates the schema if it doesn't exist using the CQL in self.SCHEMA. 123 | Each query in there will be formatted with locals(), so if you 124 | update self.SCHEMA, be sure to update this function, too. 125 | """ 126 | logging.info('creating schema') 127 | self.conn = cql.connect(self.option_group.hostname, 128 | self.option_group.port, 129 | "system", 130 | user=self.option_group.username, 131 | password=self.option_group.password, 132 | cql_version=self.option_group.cqlversion) 133 | 134 | data_center = self.query_or_die(self.SELECT_ALL_DATACENTERS, 135 | "Unable to determine the local data center") 136 | if not data_center: 137 | logging.fatal( 138 | "No peers defined, repairs on a single-node cluster are silly") 139 | exit(0) 140 | 141 | # Cassandra doesn't support 'SELECT foo, 1 FROM ..." or DISTINCT, 142 | # so we have do something a little complicated to deduplicate the 143 | # results and then produce the desired string. 144 | data_center_replication_map = {} 145 | for row in data_center: 146 | data_center_replication_map[row[0]] = None 147 | data_center_replication_map = ", ".join( 148 | ["'%s':3" % x for x in data_center_replication_map]) 149 | 150 | # This declaration is just so that "keyspace" will appear in locals. 
151 | # pylint: disable=unused-variable 152 | keyspace = self.option_group.keyspace 153 | # pylint: enable=unused-variable 154 | for cql_query in self.SCHEMA: 155 | self.query(cql_query.format(**locals())) 156 | return 157 | 158 | def query_or_die(self, query_string, error_message, consistency_level="LOCAL_QUORUM", **kwargs): 159 | """Execute a query, on exception print an error message and exit. 160 | :param query_string: CQL to perform 161 | :param error_message: printed on error 162 | :param kwargs: dictionary to use for parameter substitution in the CQL 163 | """ 164 | try: 165 | return self.query(query_string, consistency_level=consistency_level, **kwargs) 166 | except Exception as e: 167 | logging.fatal("%s: %s", error_message, e) 168 | exit(1) 169 | 170 | def query(self, query_string, consistency_level="LOCAL_QUORUM", **kwargs): 171 | """Execute a query. 172 | :param query_string: CQL to perform 173 | :param kwargs: dictionary to use for parameter substitution in the CQL 174 | :returns: query results 175 | """ 176 | if not self.conn: 177 | self.standard_connection() 178 | cursor = self.conn.cursor() 179 | logging.debug("Query: %s, arguments: %s", query_string, str(kwargs)) 180 | cursor.execute( 181 | query_string.encode('ascii'), kwargs, consistency_level=consistency_level) 182 | data = cursor.fetchall() 183 | cursor.close() 184 | logging.debug(str(data)) 185 | return data 186 | 187 | def get_all_status(self): 188 | """Get the status of all repairs. 189 | """ 190 | while True: 191 | try: 192 | return self.query(self.GET_ALL_STATUS, consistency_level="ONE") 193 | except: 194 | self.close() 195 | time.sleep(1) 196 | return [] 197 | 198 | def close(self): 199 | """Shut down the connection gracefully.""" 200 | self.conn.close() 201 | self.conn = None 202 | return 203 | 204 | def check_should_run(self): 205 | """Check to see if it is appropriate to start up. 
206 | :returns: boolean 207 | """ 208 | logging.debug("Check to see if we're already running a repair") 209 | result = self.query_or_die( 210 | self.GET_STATUS, "Checking status", 211 | nodename=self.nodename, data_center=self.data_center) 212 | # If there's any result at all, either a run is in progress, or the 213 | # last completed run hasn't expired yet. Either way, bail. 214 | if result: 215 | logging.info("Repair in progress: %s", result[0][0]) 216 | return False 217 | 218 | logging.debug("Check to see if anyone else in the local ring is running a repair") 219 | result = self.query_or_die(self.GET_LOCAL_STATUS, 220 | "Checking local ring status", 221 | nodename=self.nodename, 222 | data_center=self.data_center) 223 | 224 | if result: 225 | already_running = [x[0] for x in result if x[1] != COMPLETED] 226 | if already_running: 227 | logging.info("Another node is repairing.: %s", already_running[0]) 228 | return False 229 | self.query_or_die(self.MUTEX_START, "Starting MUTEX", 230 | nodename=self.nodename, 231 | data_center=self.data_center, 232 | ttl=self.option_group.ttl) 233 | # Totally arbitrary delay here, because I don't trust C*. 
234 | logging.debug('Five second pause here') 235 | time.sleep(5) 236 | result = self.query_or_die(self.MUTEX_CHECK, "Checking MUTEX", 237 | consistency_level="ONE", 238 | data_center=self.data_center) 239 | if not result or not [x[0] for x in result if x[1] == self.data_center][0] == self.nodename: 240 | self.query(self.MUTEX_CLEANUP, nodename=self.nodename, data_center=self.data_center) 241 | return False 242 | return True 243 | 244 | def claim_repair(self): 245 | """Insert a row claiming that we're starting the repair, 246 | then remove the MUTEX.""" 247 | self.query_or_die(self.REPAIR_START, 248 | "Starting Repair", nodename=self.nodename, 249 | data_center=self.data_center, ttl=self.option_group.ttl) 250 | self.query_or_die(self.MUTEX_CLEANUP, 251 | "Dropping MUTEX record", 252 | nodename=self.nodename, 253 | data_center=self.data_center) 254 | self.close() 255 | return 256 | 257 | def run_repair(self): 258 | """Run the entire repair""" 259 | cmd = [self.option_group.range_repair_tool, 260 | "-D", self.data_center, 261 | "-H", self.nodename, 262 | "--dry-run"] # So we get a list of commands to run. 
if self.option_group.local: 264 | cmd.append("--local") 265 | logging.debug("getting repair steps, this may take a while") 266 | repair_steps = subprocess.check_output(cmd).split('\n') 267 | for line in repair_steps: 268 | if not line: 269 | continue 270 | step, repair_command = line.split(" ", 1) 271 | try: 272 | self.query(self.REPAIR_UPDATE, nodename=self.nodename, 273 | newstatus=step, data_center=self.data_center, 274 | ttl=self.option_group.ttl) 275 | except: 276 | logging.warning("Failed to update repair status, continuing anyway") 277 | self.close() # Individual repairs may be slow 278 | logging.debug(repair_command) 279 | subprocess.call(repair_command, shell=True) 280 | try: 281 | self.query(self.REPAIR_UPDATE, nodename=self.nodename, 282 | ttl=self.option_group.ttl, data_center=self.data_center, 283 | newstatus=COMPLETED) 284 | except: 285 | logging.warning("Failed to update repair status, continuing anyway") 286 | return 287 | 288 | def reset_repair_status(self): 289 | """Reset the repair status by removing the records from the database. 290 | """ 291 | self.query(self.MUTEX_CLEANUP, nodename=self.nodename, data_center=self.data_center) 292 | self.query(self.REPAIR_CLEANUP, nodename=self.nodename, data_center=self.data_center) 293 | return 294 | 295 | def status_update_loop(connection, options, status_dict): 296 | """Update a status dictionary periodically with the results of a CQL query. 297 | This never exits, and should be used inside a thread.
:param connection: CqlWrapper (used for queries) 299 | :param options: control dictionary 300 | :param status_dict: a dictionary which will hold all of the returned data 301 | """ 302 | while True: 303 | start_time = time.time() 304 | new_names = [] 305 | for row in connection.get_all_status(): 306 | status_dict[row[0]] = row 307 | new_names.append(row[0]) 308 | logging.debug("status_update_loop: status: %s", str(status_dict)) 309 | for nodename in status_dict.keys(): 310 | if nodename not in new_names: del status_dict[nodename] 311 | end_time = time.time() 312 | delta_time = options[DELAY] - (end_time - start_time) 313 | if delta_time > 0: time.sleep(delta_time) 314 | 315 | def row_sort_function(left, right): 316 | '''Function for sorting query results for the entire cluster. Using a real 317 | function because it'll be called a lot, and lambdas can get expensive. 318 | 319 | :param left: first item to check 320 | :param right: second item to check 321 | 322 | ''' 323 | return cmp(left[3], right[3]) 324 | 325 | def format_time(seconds): 326 | """Convert time in seconds to a human-readable value. 327 | :param seconds: Time to be converted 328 | :returns: formatted string 329 | """ 330 | if seconds < 60: 331 | return "{seconds:0.2f} seconds".format(seconds=seconds) 332 | seconds = seconds / 60.0 333 | if seconds < 60: 334 | return "{seconds:0.2f} minutes".format(seconds=seconds) 335 | seconds = seconds / 60.0 336 | if seconds < 24: 337 | return "{seconds:0.2f} hours".format(seconds=seconds) 338 | seconds = seconds / 24.0 339 | return "{seconds:0.2f} days".format(seconds=seconds) 340 | 341 | def screen_update_loop(window, options, status_dict): 342 | """Updates a curses window with data from the status dictionary. 343 | This never exits, and should be used inside a thread.
344 | :param window: curses window 345 | :param options: control dictionary 346 | :param status_dict: a dictionary which will hold all of the returned data 347 | """ 348 | status_message = "Running: {running:3d} Complete: {complete:3d} Refresh: {delay:d}" 349 | status_format = "{hostname:35s} {status:20s} {delay}" 350 | color_warning = curses.color_pair(1) 351 | color_bad = curses.color_pair(2) 352 | color_green = curses.color_pair(3) 353 | curses.init_pair(1, curses.COLOR_YELLOW, curses.COLOR_BLACK) 354 | curses.init_pair(2, curses.COLOR_RED, curses.COLOR_BLACK) 355 | curses.init_pair(3, curses.COLOR_GREEN, curses.COLOR_BLACK) 356 | 357 | while True: 358 | start_time = time.time() 359 | (RESTY, _) = window.getmaxyx() 360 | window.clear() 361 | display_data = [] 362 | complete_data = [] 363 | for row in status_dict: 364 | if status_dict[row][2] == COMPLETED: 365 | complete_data.append(status_dict[row]) 366 | else: 367 | display_data.append(status_dict[row]) 368 | 369 | window.addstr(1, 0, status_message.format(complete=len(complete_data), 370 | running=len(display_data), 371 | delay=options[DELAY])) 372 | display_data.sort(row_sort_function) 373 | complete_data.sort(row_sort_function) 374 | window.addstr(3, 0, status_format.format(hostname="Hostname", 375 | status="Status", 376 | delay="Time since last update"), curses.A_BOLD) 377 | display_data.extend(complete_data) 378 | logging.debug("all data: %s", str(display_data)) 379 | current_row = 4 380 | for line in display_data: 381 | if current_row > RESTY-1: 382 | break 383 | delta = time.time() - (line[3]/1000000.0) 384 | if delta > 4*3600 and not line[2] == COMPLETED: 385 | attribute = color_bad 386 | elif delta > 2*3600 and not line[2] == COMPLETED: 387 | attribute = color_warning 388 | else: 389 | attribute = color_green 390 | window.addstr(current_row, 0, status_format.format(hostname=line[0], 391 | status=line[2], 392 | delay=format_time(delta)), 393 | attribute) 394 | current_row += 1 395 | delta_time = 
options[DELAY] - (time.time() - start_time) 396 | window.refresh() 397 | if delta_time > 0: time.sleep(delta_time) 398 | 399 | def watch(main_window, connection): 400 | """Query Cassandra for the current repair status and display. 401 | :param main_window: curses window 402 | :param connection: CqlWrapper object 403 | """ 404 | status_dict = {} 405 | option_dict = {} 406 | option_dict[DELAY] = 5 407 | update_thread = threading.Thread(target=status_update_loop, 408 | args=(connection, option_dict, status_dict)) 409 | update_thread.daemon = True 410 | update_thread.start() 411 | 412 | redraw_thread = threading.Thread(target=screen_update_loop, 413 | args=(main_window, option_dict, status_dict)) 414 | redraw_thread.daemon = True 415 | redraw_thread.start() 416 | 417 | while 1: 418 | try: 419 | key = main_window.getkey() 420 | if key == 'q': break 421 | elif key == '+': option_dict[DELAY] = option_dict[DELAY] + 1 422 | elif key == '-' and option_dict[DELAY] > 1: option_dict[DELAY] = option_dict[DELAY] - 1 423 | except KeyboardInterrupt: raise SystemExit 424 | except: pass 425 | return 426 | 427 | 428 | def setup_logging(option_group): 429 | """Sets up logging in a syslog format by log level 430 | :param option_group: options as returned by the argument parser 431 | """ 432 | stderr_log_format = "%(levelname) -8s %(asctime)s %(funcName)s line:%(lineno)d: %(message)s" 433 | file_log_format = "%(asctime)s - %(levelname)s - %(message)s" 434 | logger = logging.getLogger() 435 | if option_group.debug: 436 | logger.setLevel(level=logging.DEBUG) 437 | elif option_group.verbose: 438 | logger.setLevel(level=logging.INFO) 439 | else: 440 | logger.setLevel(level=logging.WARNING) 441 | 442 | # First, clear out any default handlers 443 | for handler in logger.handlers: 444 | logger.removeHandler(handler) 445 | 446 | if option_group.syslog: 447 | # Use standard format here because timestamp and level will be added by 448 | # syslogd.
logger.addHandler(logging.handlers.SysLogHandler(facility=option_group.syslog)) 450 | if option_group.logfile: 451 | logger.addHandler(logging.FileHandler(option_group.logfile)) 452 | logger.handlers[-1].setFormatter(logging.Formatter(file_log_format)) 453 | if not logger.handlers: 454 | logger.addHandler(logging.StreamHandler()) 455 | logger.handlers[-1].setFormatter(logging.Formatter(stderr_log_format)) 456 | return 457 | 458 | 459 | def cli_parsing(): 460 | """Parse the command line. 461 | :returns: option set 462 | """ 463 | parser = argparse.ArgumentParser() 464 | parser.add_argument("-v", "--verbose", action='store_true', 465 | default=False, help="Verbose output") 466 | parser.add_argument("-d", "--debug", action='store_true', 467 | default=False, help="Debugging output") 468 | parser.add_argument("--syslog", metavar="FACILITY", 469 | help="Send log messages to the syslog") 470 | parser.add_argument("--logfile", metavar="FILENAME", 471 | help="Send log messages to a file") 472 | parser.add_argument("-H", "--hostname", default=platform.node(), 473 | help="Hostname (default: %(default)s)") 474 | parser.add_argument("-p", "--port", default=9160, type=int, 475 | help="Port (default: %(default)d)") 476 | parser.add_argument("-U", "--username", 477 | help="Username (if necessary)") 478 | parser.add_argument("-P", "--password", 479 | help="Password. 
(prompt if user provided but not password)") 480 | parser.add_argument("-t", "--ttl", default=3600 * 24 * 20, type=int, 481 | help="TTL (default: %(default)d)") 482 | parser.add_argument("-k", "--keyspace", default="operations", 483 | help="Keyspace to use (default: %(default)s)") 484 | parser.add_argument("--cqlversion", default="3.0.5", 485 | help="CQL version (default: %(default)s)") 486 | parser.add_argument("-r", "--range_repair_tool", 487 | default="/usr/local/bin/range_repair.py", 488 | help="Range repair tool path (default: %(default)s)") 489 | parser.add_argument("--local", default=False, action="store_true", 490 | help="Run the repairs in the local ring only") 491 | parser.add_argument("--watch", action="store_true", default=False, 492 | help="Watch the live repair status") 493 | parser.add_argument("--reset", action="store_true", default=False, 494 | help="Reset the repair status for the host") 495 | options = parser.parse_args() 496 | setup_logging(options) 497 | if options.username and not options.password: 498 | options.password = getpass.getpass( 499 | 'Password for %s: ' % options.username) 500 | return options 501 | 502 | 503 | def main(): 504 | """Main entry point. Runs the actual program here.""" 505 | logging.debug('main') 506 | options = cli_parsing() 507 | connection = CqlWrapper(options) 508 | if options.reset: 509 | connection.reset_repair_status() 510 | exit() 511 | if not options.watch: 512 | if connection.check_should_run(): 513 | connection.claim_repair() 514 | # Arguably, this should not be done in the connection. 515 | connection.run_repair() 516 | else: 517 | curses.wrapper(watch, connection) 518 | connection.close() 519 | 520 | return 521 | 522 | 523 | if __name__ == '__main__': 524 | main() 525 | -------------------------------------------------------------------------------- /casstop: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | 3 | # Author: Brian Gallew or 4 | 5 | import sys, xmltodict, urllib2, optparse, threading, re, curses, curses.wrapper 6 | import time, socket, json, logging, collections, traceback, signal, termios 7 | import pprint 8 | 9 | _INTERNED = ['Cluster', 'DC', 'EXTENDED_STATUS', 'Hostname', 'HOSTNAMES', 10 | 'LIVE', 'Load', 'READ_LATENCY_FIFTEEN_MINUTE', 'LABEL', 11 | 'READ_LATENCY_FIVE_MINUTE', 'READ_LATENCY_INSTANTANEOUS', 12 | 'READ_LATENCY_ONE_MINUTE', 'READ_RATE_FIFTEEN_MINUTE', 13 | 'READ_RATE_FIVE_MINUTE', 'READ_RATE_ONE_MINUTE', 'Severity', 'STATUS', 14 | 'WRITE_LATENCY_FIFTEEN_MINUTE', 'WRITE_LATENCY_FIVE_MINUTE', 15 | 'WRITE_LATENCY_INSTANTANEOUS', 'WRITE_LATENCY_ONE_MINUTE', 16 | 'WRITE_RATE_FIFTEEN_MINUTE', 'WRITE_RATE_FIVE_MINUTE', 'DEAD', 17 | 'WRITE_RATE_ONE_MINUTE', 'PendingTasks', 'read_latency_averages', 18 | 'write_latency_averages', 'RACK', 'CLUSTER_NAME', 'Compactions', 19 | 'ITEM', 'JAVA_OBJECT', 'URL', 'VALUE', 'OPERATION','ONE', 'FIVE', 'FIFTEEN', 'POPS', 20 | ] 21 | 22 | for value in _INTERNED: locals()[value.upper()] = value 23 | 24 | _debuginfo = collections.deque() 25 | def debug(item): 26 | _debuginfo.append(item) 27 | if len(_debuginfo) > 160: del _debuginfo[0] 28 | return 29 | 30 | def sigwinch_handler(n, frame): 31 | curses.initscr() 32 | return 33 | 34 | class MovingAverages(object): 35 | '''Handle moving averages as often displayed by programs like top(8).''' 36 | def __init__(self): 37 | self.queue = collections.deque() # Where we keep our data stashed away 38 | self.one = self.five = self.fifteen = 0.0 39 | return 40 | def add(self, value): 41 | '''Add a new value, timestamped appropriately. 
Throw away any old 42 | values, then re-compute the moving averages.''' 43 | now = time.time() 44 | self.queue.appendleft((value, now)) 45 | # These two lines discard old stuff 46 | old = now - 900 # 15 minutes 47 | while self.queue and self.queue[-1][1] < old: del self.queue[-1] 48 | total = 0.0 49 | count = 0 50 | 51 | then = now - 60 52 | while count < len(self.queue): 53 | value, timestamp = self.queue[count] 54 | if timestamp < then: break 55 | count += 1 56 | total += value 57 | self.one = total/count 58 | 59 | then = now - 300 60 | while count < len(self.queue): 61 | value, timestamp = self.queue[count] 62 | if timestamp < then: break 63 | count += 1 64 | total += value 65 | self.five = total/count 66 | 67 | while count < len(self.queue): 68 | value, timestamp = self.queue[count] 69 | count += 1 70 | total += value 71 | self.fifteen = total/count 72 | return self 73 | 74 | 75 | class CursedIntDataAttribute(dict): 76 | url_template = 'http://{Hostname:s}:8081/{OPERATION:s}?objectname={JAVA_OBJECT:s}&attribute={ITEM:s}&operation={ITEM:s}&template=identity' 77 | bean_designator = 'MBean' 78 | return_value_designators = ['Attribute', '@value'] 79 | default_value = 0 80 | operation = 'getattribute' 81 | datatype = int 82 | default_format = '{VALUE:>4d}' 83 | def __init__(self, hostname, java_object, item, *args, **kwargs): 84 | '''Set some default values for the dictionary, largely for debugging and interpolation purposes''' 85 | dict.__init__(self, *args, **kwargs) 86 | self[HOSTNAME] = hostname 87 | self[JAVA_OBJECT] = java_object 88 | self[ITEM] = item 89 | self[OPERATION] = self.operation # This seems silly, but it lets subclasses override while still letting us do interpolation. 
90 | self[URL] = self.url_template.format(**self) 91 | self[VALUE] = self.default_value 92 | return None 93 | 94 | def __call__(self): 95 | '''Make the requisite HTTP request to get a new data item, storing the 96 | coerced result into self[VALUE] (or store the default value if some part of 97 | the process fails).''' 98 | try: 99 | data = {} 100 | fh = urllib2.urlopen(self[URL], None, 30) 101 | data_string = fh.read() 102 | fh.close() 103 | data = xmltodict.parse(data_string) 104 | if not data: 105 | debug('%(Hostname)s:%(ITEM)s.__call__: no results returned for %(URL)s' % self) 106 | self[VALUE] = self.default_value 107 | else: 108 | self[VALUE] = self.datatype(data[self.bean_designator][self.return_value_designators[0]][self.return_value_designators[1]]) 109 | debug('%(Hostname)s:%(ITEM)s.__call__: set value to %(VALUE)s' % self) 110 | except Exception as e: 111 | debug(('%(Hostname)s:%(ITEM)s.__call__: Unable to load data for %(URL)s: ' % self) + str(e) + str(data)) 112 | self[VALUE] = self.default_value 113 | return self[VALUE] 114 | 115 | def draw(self, window, y, x, color=0, warning=None, critical=None, newfmt=None, length=0): 116 | '''Standard display method for these values. 117 | 118 | (window,y,x) are the expected curses items. 119 | 120 | "color" is the curses attribute set to use by default. I'm 121 | cheating big-time here and assuming that the default color should 122 | be the result of curses.color_pair(0) AND that the result of that 123 | will always be 0. This is probably FRAGILE. 124 | 125 | "warning" and "critical", if set, should be a list/tuple where the 126 | first item is a test value and the second item is a curses 127 | attribute set. If either of the tests is true, the appropriate 128 | attribute set will override the default attribute set. 129 | 130 | newfmt is a format string using the new style of 131 | formatting. 132 | 133 | length, if greater than zero, will guarantee that the displayed 134 | string does not exceed a certain length.
135 | 136 | ''' 137 | debug('draw: keys=%s' % str(self.keys())) 138 | if newfmt: display_value = newfmt.format(**self) 139 | else: display_value = str(self) 140 | if critical and self[VALUE] > critical[0]: color = critical[1] 141 | elif warning and self[VALUE] > warning[0]: color = warning[1] 142 | if length: window.addnstr(y, x, display_value, length, color) 143 | else: window.addstr(y, x, display_value, color) 144 | return 145 | 146 | 147 | def __add__(self, other): 148 | try: return self[VALUE] + other 149 | except: return self[VALUE] + other[VALUE] 150 | def __div__(self, other): 151 | try: return self[VALUE] / other 152 | except: return self[VALUE] / other[VALUE] 153 | def __str__(self): return self.default_format.format(**self) 154 | def type_coercion_data_dict_to_int(self, datastring): 155 | try: 156 | return sum(eval(datastring.replace('=', ':')).values()) 157 | except: 158 | debug('CursedIntDictDataOperation.type_coercion: unable to eval %s' % datastring) 159 | return 0 160 | 161 | class CursedStringDataAttribute(CursedIntDataAttribute): 162 | datatype = str 163 | default_value = 'nodata' 164 | default_format = '{VALUE:s}' 165 | 166 | class CursedFloatDataAttribute(CursedIntDataAttribute): 167 | datatype = float 168 | default_value = 0.0 169 | default_format = '{VALUE:>6.2f}' 170 | 171 | class CursedIntDictDataOperation(CursedIntDataAttribute): 172 | '''This is designed to invoke a JMX function which returns a (possibly 173 | ordered) Dict where all the values are INTs. 
It will use the sum of those 174 | values as its result.''' 175 | datatype = CursedIntDataAttribute.type_coercion_data_dict_to_int 176 | bean_designator = 'MBeanOperation' 177 | return_value_designators = ['Operation', '@return'] 178 | operation = 'invoke' 179 | default_format = '{VALUE:>8d}' 180 | 181 | 182 | class CursedSeverity(CursedFloatDataAttribute): 183 | def __init__(self, hostname): 184 | '''Cheating here for no good reason other than to emphasize the specialness of this one.''' 185 | CursedFloatDataAttribute.__init__(self, hostname, 'org.apache.cassandra.db:type=CompactionManager', PENDINGTASKS) 186 | self[COMPACTIONS] = CursedStringDataAttribute(hostname, 'org.apache.cassandra.db:type=CompactionManager', 'Compactions') 187 | return None 188 | 189 | def __call__(self): 190 | '''Kick off a thread to get the compaction data before we get our own, then rejoin it. Parallelism FTW!''' 191 | compact = self[COMPACTIONS] 192 | t = threading.Thread(target=compact) 193 | t.daemon = True 194 | t.start() 195 | CursedFloatDataAttribute.__call__(self) 196 | t.join() 197 | total = 0.0 198 | done = 0.0 199 | if len(compact[VALUE]) > 4: compact[VALUE] = compact[VALUE][2:-2].replace('}','') 200 | for row in compact[VALUE].split(','): 201 | if 'total=' in row: total += int(row.split('total=')[-1]) 202 | if 'completed=' in row: done += int(row.split('completed=')[-1]) 203 | if total: percent = (total - done)/total 204 | else: percent = 1 205 | self[VALUE] = (self[VALUE] - len(compact[VALUE].split('},'))) + percent 206 | 207 | 208 | class CursedByteDataAttribute(CursedFloatDataAttribute): 209 | default_format = '{VALUE:>6.2f} {LABEL}' 210 | def __init__(self, *args, **kwargs): 211 | CursedFloatDataAttribute.__init__(self, *args, **kwargs) 212 | self[LABEL] = 'B' 213 | return None 214 | def __str__(self): 215 | '''Bytes are useful things, but my mind thinks in megs, gigs, etc.
216 | ''' 217 | raw = self[VALUE] 218 | label = 'B' 219 | for l in ['KB', 'MB', 'GB', 'TB', 'PB']: 220 | if raw < 1024: break 221 | raw = raw/1024.0 222 | label = l 223 | return self.default_format.format(VALUE=raw, LABEL=label) 224 | 225 | 226 | class CursedLatencyAverage(CursedIntDataAttribute): 227 | datatype = float 228 | default_value = 0.0 229 | default_format = '{ONE:>6.2f}/{FIVE:>6.2f}/{FIFTEEN:>6.2f}' 230 | def __init__(self, *args, **kwargs): 231 | CursedIntDataAttribute.__init__(self, *args, **kwargs) 232 | self.averages = MovingAverages() 233 | self[VALUE] = 0.0 234 | self[ONE] = 0.0 235 | self[FIVE] = 0.0 236 | self[FIFTEEN] = 0.0 237 | return None 238 | def __call__(self): 239 | '''Cassandra hard-codes latency to be measured in MICROseconds. I want to 240 | keep track of, and display in, seconds. 241 | 242 | ''' 243 | raw = CursedIntDataAttribute.__call__(self)/1000000.0 244 | self[VALUE] = raw 245 | self.averages.add(raw) 246 | self[ONE] = self.averages.one 247 | self[FIVE] = self.averages.five 248 | self[FIFTEEN] = self.averages.fifteen 249 | return raw 250 | 251 | def draw(self, window, y, x, color=0, warning=None, critical=None, newfmt=None, length=0, averages=False): 252 | '''Standard display method for these values. 253 | 254 | (window,y,x) are the expected curses items. 255 | 256 | "color" is the curses attribute set to use by default. I'm 257 | cheating big-time here and assuming that the default color should 258 | be the result of curses.color_pair(0) AND that the result of that 259 | will always be 0. This is probably FRAGILE. 260 | 261 | "warning" and "critical", if set, should be a list/tuple where the 262 | first item is a test value and the second item is a curses 263 | attribute set. If either of the tests is true, the appropriate 264 | attribute set will override the default attribute set. 265 | 266 | newfmt is a format string using the new style of 267 | formatting. 
268 | 269 | length, if greater than zero, will guarantee that the displayed 270 | string does not exceed a certain length. 271 | 272 | ''' 273 | debug('draw: keys=%s' % str(self.keys())) 274 | if newfmt: display_value = newfmt.format(**self) 275 | elif averages: display_value = '/'.join([self.default_format]*3).format(**self) 276 | else: display_value = self.default_format.format(**self) 277 | if critical and self[VALUE] > critical[0]: color = critical[1] 278 | elif warning and self[VALUE] > warning[0]: color = warning[1] 279 | if length: window.addnstr(y, x, display_value, length, color) 280 | else: window.addstr(y, x, display_value, color) 281 | return 282 | 283 | 284 | # This is here because 285 | # 1) It has to be after all of the individual JMX object type declarations, 286 | # 2) it has to be before CursedCluster. 287 | # 288 | # Its purpose is to provide a single list of keys and functions. Each of 289 | # the functions is initialized as function(hostname, *args) 290 | host_attribute_set = { 291 | LOAD: (CursedByteDataAttribute, ('org.apache.cassandra.db:type=StorageService', LOAD)), 292 | SEVERITY: (CursedSeverity, ()), 293 | STATUS: (CursedStringDataAttribute, ('org.apache.cassandra.db:type=StorageService', 'OperationMode')), 294 | READ_LATENCY_INSTANTANEOUS: (CursedLatencyAverage, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency', '75thPercentile')), 295 | READ_RATE_ONE_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency', 'OneMinuteRate')), 296 | READ_RATE_FIVE_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency', 'FiveMinuteRate')), 297 | READ_RATE_FIFTEEN_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency', 'FifteenMinuteRate')), 298 | WRITE_LATENCY_INSTANTANEOUS: (CursedLatencyAverage, 
('org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency', '75thPercentile')), 299 | WRITE_RATE_ONE_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency', 'OneMinuteRate')), 300 | WRITE_RATE_FIVE_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency', 'FiveMinuteRate')), 301 | WRITE_RATE_FIFTEEN_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency', 'FifteenMinuteRate')), 302 | } 303 | 304 | class CursedCluster(CursedStringDataAttribute): 305 | '''This is a really specialized version of CursedStringDataAttribute, since 306 | there should be only one *and* it's going to do a lot of string 307 | processing and item creation. 308 | 309 | ''' 310 | datatype = str 311 | default_value = '' 312 | default_format = '{VALUE}' 313 | ENDPOINT_SPLITTER = re.compile('^/', re.MULTILINE).split 314 | 315 | def __init__(self, hostname, delay=300): 316 | '''In addition to the superclass startup, we extract the value of delay 317 | from the passed in arguments (defaulting to 300), set a couple of 318 | instance variables, load up the initial data, and fire up the refresh loop. 319 | ''' 320 | self.delay = delay 321 | CursedStringDataAttribute.__init__(self, hostname, 'org.apache.cassandra.net:type=FailureDetector', 'AllEndpointStates') 322 | self[HOSTNAMES] = {} 323 | self[POPS] = [] 324 | # Get the cluster name, it should never change! 325 | self[CLUSTER_NAME] = CursedStringDataAttribute(hostname, 326 | 'org.apache.cassandra.db:type=StorageService', 327 | 'ClusterName')() 328 | self() # Load the data up once! 
        t = threading.Thread(target=self._refresh_loop)
        t.daemon = True
        t.start()
        return None
    def __call__(self):
        '''As well as the standard superclass functionality, we parse the returned
        value into a bunch of hosts, each with a few static attributes, and
        initialize each host with a set of data.  Adding more data items to
        check should be done HERE.

        '''
        data_string = CursedIntDataAttribute.__call__(self)
        new_host_list = {}
        new_pop_list = {}
        for row in self.ENDPOINT_SPLITTER(data_string):
            try:
                debug(row)
                if not row: continue
                if 'STATUS:remov' in row: continue
                row_pieces = row.split('\n ')
                complete_address_set = socket.gethostbyaddr(row_pieces[0])
                debug('get_gossip_information: gethostbyaddr returned %s' % str(complete_address_set))
                endpoint = complete_address_set[0].replace('.cint', '')
                new_host = new_host_list.setdefault(endpoint, {})
                new_host[LIVE] = True
                for line in row_pieces[1:]:
                    key, value = line.split(':', 1)
                    if key == RACK:
                        rack_string = value.strip()
                        new_host[key] = ' '*(5-len(rack_string)) + rack_string
                    if key == DC:
                        new_pop_list[value] = True
                        new_host[key] = value.strip()
                for key in host_attribute_set:
                    function, args = host_attribute_set[key]
                    new_host[key] = function(endpoint, *args)
                new_host[READ_RATE_ONE_MINUTE].default_format = "{VALUE:5.0f}"
                new_host[WRITE_RATE_ONE_MINUTE].default_format = "{VALUE:5.0f}"

            except Exception as e: debug('CursedCluster.__call__: ' + str(e))
        if new_host_list: self[HOSTNAMES] = new_host_list
        if new_pop_list: self[POPS] = new_pop_list.keys()
        return self
    def _refresh_loop(self):
        '''Simple little infinite loop defined on the class because I think it's
        cleaner than a lambda.

        '''
        while 1:
            time.sleep(self.delay)
            self()

class Cluster(object):
    # this is kind of an evil faux-function-definition
    ENDPOINT_SPLITTER = re.compile('^/', re.MULTILINE).split
    refresh_delay = 3
    refresh = True

    def __init__(self, hostname, header_window, data_window, status_window):
        self.compaction_averages = MovingAverages()
        self.header_window = header_window
        self.data_window = data_window
        self.status_window = status_window

        self.last_refresh = 0.0
        self.sort_order = 0
        self.cluster_data = CursedCluster(hostname)
        self.dead_nodes = []
        self.item = SEVERITY
        self.redraw_semaphore = threading.Semaphore()
        self.redraw_lock = threading.Lock()

        # xterm title escape sequence: ESC ]0; title BEL
        sys.stdout.write('\033]0; Cassandra Top - %s \007' % self.cluster_data[CLUSTER_NAME])
        self.draw_data = self.draw_cluster_data
        self.title = 'Cluster Summary'
        self.good = curses.color_pair(0)
        self.warning = curses.color_pair(1)
        self.bad = curses.color_pair(2)
        self.green = curses.color_pair(3)
        curses.init_pair(1, curses.COLOR_YELLOW, curses.COLOR_BLACK)
        curses.init_pair(2, curses.COLOR_RED, curses.COLOR_BLACK)
        curses.init_pair(3, curses.COLOR_GREEN, curses.COLOR_BLACK)
        return

    def dispatch(self, group, function):
        group.append(threading.Thread(target=function))
        group[-1].daemon = True
        group[-1].start()
        return
    def rejoin(self, group):
        for t in group: t.join()
        return
    def __call__(self):
        '''this is the updating loop'''
        self.redraw_semaphore.acquire() # Prevent the drawing routine from
                                        # doing anything until we have
                                        # data.
        drawer = threading.Thread(target = self.draw)
        drawer.daemon = True
        drawer.start()
        while 1:
            now = time.time()
            thread_group = []
            for host in self.cluster_data[HOSTNAMES].values():
                for item in host.values():
                    if isinstance(item, dict): self.dispatch(thread_group, item)
            self.rejoin(thread_group)
            for hostname in self.cluster_data[HOSTNAMES]:
                host = self.cluster_data[HOSTNAMES][hostname]
                host[LIVE] = True
                if host[STATUS][VALUE] != 'NORMAL':
                    host[LIVE] = False
                    debug('%s marked down because "%s" is not "NORMAL"' % (hostname, host[STATUS]))
                if host[LOAD][VALUE] == 0.0:
                    host[LIVE] = False
                    debug('%s marked down because the load is 0.0 (may just be new)' % hostname)
            then = time.time()
            self.stop_refresh()
            self.redraw_semaphore.release()
            self.start_refresh()
            self.last_refresh = then - now
            left = self.refresh_delay - self.last_refresh
            if left > 0: time.sleep(left)
        return

    def draw(self):
        while 1:
            self.redraw_semaphore.acquire()
            try:
                self.draw_header()
                self.draw_data()
                self.draw_status()
            except: debug(traceback.format_exc())
        return

    def draw_labelled_item(self, window, starty, startx, label, value, warning=None, critical=None, fmt=None, hilight=False, length=0):
        if fmt: display_value = fmt % value
        else: display_value = str(value)
        if hilight: window.addstr(starty, startx, label, curses.A_BOLD | self.green)
        else: window.addstr(starty, startx, label, curses.A_BOLD)
        if critical and value > critical: color = self.bad
        elif warning and value > warning: color = self.warning
        else: color = self.good
        if length: window.addnstr(display_value, length, color)
        else: window.addstr(display_value, color)
        return

    def draw_header(self):
        dead_count = len([x for x in self.cluster_data[HOSTNAMES].values() if not x[LIVE]])
        host_count = len(self.cluster_data[HOSTNAMES]) + dead_count
        compaction_data = 0.0
        read_rate_one = 0.0
        read_rate_five = 0.0
        read_rate_fifteen = 0.0
        write_rate_one = 0.0
        write_rate_five = 0.0
        write_rate_fifteen = 0.0
        read_latency_one = 0.0
        read_latency_five = 0.0
        read_latency_fifteen = 0.0
        write_latency_one = 0.0
        write_latency_five = 0.0
        write_latency_fifteen = 0.0
        for x in self.cluster_data[HOSTNAMES].values():
            compaction_data += x[SEVERITY][VALUE]
            read_rate_one += x[READ_RATE_ONE_MINUTE][VALUE]
            read_rate_five += x[READ_RATE_FIVE_MINUTE][VALUE]
            read_rate_fifteen += x[READ_RATE_FIFTEEN_MINUTE][VALUE]
            write_rate_one += x[WRITE_RATE_ONE_MINUTE][VALUE]
            write_rate_five += x[WRITE_RATE_FIVE_MINUTE][VALUE]
            write_rate_fifteen += x[WRITE_RATE_FIFTEEN_MINUTE][VALUE]
            read_latency_one += x[READ_LATENCY_INSTANTANEOUS][ONE]
            read_latency_five += x[READ_LATENCY_INSTANTANEOUS][FIVE]
            read_latency_fifteen += x[READ_LATENCY_INSTANTANEOUS][FIFTEEN]
            write_latency_one += x[WRITE_LATENCY_INSTANTANEOUS][ONE]
            write_latency_five += x[WRITE_LATENCY_INSTANTANEOUS][FIVE]
            write_latency_fifteen += x[WRITE_LATENCY_INSTANTANEOUS][FIFTEEN]
        self.compaction_averages.add(compaction_data)
        self.header_window.clear()
        self.draw_labelled_item(self.header_window, 0, 0, 'Live Nodes: ', len(self.cluster_data[HOSTNAMES]))
        self.draw_labelled_item(self.header_window, 1, 0, 'Dead Nodes: ', dead_count, warning=host_count*0.25, critical=host_count*0.5)
        self.draw_labelled_item(self.header_window, 0, 16, 'Compactions: ', (self.compaction_averages.one, self.compaction_averages.five, self.compaction_averages.fifteen), fmt='%5.2f/%5.2f/%5.2f')
        self.draw_labelled_item(self.header_window, 0, 53, 'Rrate: ', (read_rate_one, read_rate_five, read_rate_fifteen), fmt='%5.0f/%5.0f/%5.0f')
        self.draw_labelled_item(self.header_window, 1, 53, 'Wrate: ', (write_rate_one, write_rate_five, write_rate_fifteen), fmt='%5.0f/%5.0f/%5.0f')
        self.draw_labelled_item(self.header_window, 0, 78, 'Rlatency: ', (read_latency_one/1000, read_latency_five/1000, read_latency_fifteen/1000), fmt='%5.2f/%5.2f/%5.2f')
        self.draw_labelled_item(self.header_window, 1, 78, 'Wlatency: ', (write_latency_one/1000, write_latency_five/1000, write_latency_fifteen/1000), fmt='%5.2f/%5.2f/%5.2f')
        self.refresh and self.header_window.refresh()
        return

    def size_convert(self, value):
        label = 'B'
        for l in ['KB', 'MB', 'GB', 'TB', 'PB']:
            if value < 1024: break
            value = value/1024.0
            label = l
        return (value, label)

    def draw_cluster_data(self):
        (RESTY, RESTX) = self.data_window.getmaxyx()
        self.data_window.clear()
        self.data_window.standout()
        self.draw_labelled_item(self.data_window, 0, 1, DC, '', hilight=(self.sort_order == 0))
        self.draw_labelled_item(self.data_window, 0, 5, 'Nodes', '')
        self.draw_labelled_item(self.data_window, 0, 11, 'Racks', '')
        self.draw_labelled_item(self.data_window, 0, 20, 'Load', '', hilight=(self.sort_order == 1))
        self.draw_labelled_item(self.data_window, 0, 28, 'Comps', '', hilight=(self.sort_order == 2))
        self.draw_labelled_item(self.data_window, 0, 35, 'Rlat', '', hilight=(self.sort_order == 3))
        self.draw_labelled_item(self.data_window, 0, 40, 'Rrate', '', hilight=(self.sort_order == 4))
        self.draw_labelled_item(self.data_window, 0, 47, 'Wlat', '', hilight=(self.sort_order == 5))
        self.draw_labelled_item(self.data_window, 0, 52, 'Wrate', '', hilight=(self.sort_order == 6))

        self.data_window.standend()
        summarized_data = {}
        debug('draw_cluster_data: new summary created')
        for host in self.cluster_data[HOSTNAMES].values():
            dc = host[DC]
            debug('draw_cluster_data: DC is ' + dc)
            if not summarized_data.has_key(dc):
                debug('draw_cluster_data: added DC - ' + dc)
                summarized_data[dc] = host.copy()
                summarized_data[dc][LIVE] = host[LIVE] and 1 or 0
                summarized_data[dc][DEAD] = not host[LIVE] and 1 or 0
                summarized_data[dc][RACK] = {host[RACK]: True}
            else:
                if host[LIVE]: summarized_data[dc][LIVE] += 1
                else: summarized_data[dc][DEAD] += 1
                summarized_data[dc][RACK][host[RACK]] = True
                for key in [LOAD, SEVERITY,
                            READ_LATENCY_INSTANTANEOUS,
                            READ_RATE_ONE_MINUTE,
                            WRITE_LATENCY_INSTANTANEOUS,
                            WRITE_RATE_ONE_MINUTE]:
                    try: summarized_data[dc][key] = host[key] + summarized_data[dc][key]
                    except Exception as e: debug('draw_cluster_data - exception when summarizing: ' + str(e))
        # Do sorting here
        y = 0
        sort_key = [DC, LOAD, SEVERITY, READ_LATENCY_INSTANTANEOUS,
                    READ_RATE_ONE_MINUTE, WRITE_LATENCY_INSTANTANEOUS, WRITE_RATE_ONE_MINUTE][self.sort_order]
        for row in sorted(summarized_data.values(), cmp=lambda x, y: cmp(x[sort_key], y[sort_key]), reverse = (self.sort_order != 0)):
            y += 1
            self.draw_labelled_item(self.data_window, y, 0, '', row[DC])
            self.draw_labelled_item(self.data_window, y, 5, '', row[LIVE] + row[DEAD], fmt='%2d/', length=3)
            self.draw_labelled_item(self.data_window, y, 8, '', row[DEAD], length=2, critical=row[LIVE]/3, warning=0)
            self.draw_labelled_item(self.data_window, y, 12, '', len(row[RACK]), fmt='%2d', length=2)
            self.draw_labelled_item(self.data_window, y, 16, '', '%7.2f %s' % self.size_convert(row[LOAD]), length=10)
            self.draw_labelled_item(self.data_window, y, 27, '', row[SEVERITY], fmt='%6.2f', length=6)
            self.draw_labelled_item(self.data_window, y, 34, '', row[READ_LATENCY_INSTANTANEOUS], fmt='%5.0f', length=6)
            self.draw_labelled_item(self.data_window, y, 40, '', row[READ_RATE_ONE_MINUTE], fmt='%5.0f', length=6)
            self.draw_labelled_item(self.data_window, y, 46, '', row[WRITE_LATENCY_INSTANTANEOUS], fmt='%5.0f', length=6)
            self.draw_labelled_item(self.data_window, y, 52, '', row[WRITE_RATE_ONE_MINUTE], fmt='%5.0f', length=6)

        self.refresh and self.data_window.refresh()
        return

    def draw_data_dict_item(self, y, x, datadict, key, length=0, fmt = None):
        if not fmt:
            if length: fmt = '{0:%d}' % length
            else: fmt = '{0}'
        try:
            value = fmt.format(datadict[key])
        except Exception as e:
            self.data_window.addstr(y, x, 'NODATA', self.bad)
            debug('draw_data_dict_item:' + key + ' ' + str(e))
            return
        if length:
            self.data_window.addnstr(y, x, value, length)
        else: self.data_window.addstr(y, x, value)
        return

    def sorted_host_key_order(self):
        if self.draw_data == self.draw_host_data:
            if self.sort_order == 1: # Sort by DC/host
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][x][DC]+x, self.cluster_data[HOSTNAMES][y][DC]+y))

            if self.sort_order == 2: # Sort by RACK/host
                return sorted(sorted(self.cluster_data[HOSTNAMES].keys()), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][x][RACK], self.cluster_data[HOSTNAMES][y][RACK]))

            if self.sort_order == 3: # Sort by LOAD
                return sorted(sorted(self.cluster_data[HOSTNAMES].keys()), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][LOAD][VALUE], self.cluster_data[HOSTNAMES][x][LOAD][VALUE]))

            if self.sort_order == 4: # Sort by SEVERITY (compactions)
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][SEVERITY][VALUE], self.cluster_data[HOSTNAMES][x][SEVERITY][VALUE]))

            if self.sort_order == 5:
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][READ_LATENCY_INSTANTANEOUS][VALUE], self.cluster_data[HOSTNAMES][x][READ_LATENCY_INSTANTANEOUS][VALUE]))

            if self.sort_order == 6:
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][READ_RATE_ONE_MINUTE][VALUE], self.cluster_data[HOSTNAMES][x][READ_RATE_ONE_MINUTE][VALUE]))

            if self.sort_order == 7:
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][WRITE_LATENCY_INSTANTANEOUS][VALUE], self.cluster_data[HOSTNAMES][x][WRITE_LATENCY_INSTANTANEOUS][VALUE]))

            if self.sort_order == 8:
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][WRITE_RATE_ONE_MINUTE][VALUE], self.cluster_data[HOSTNAMES][x][WRITE_RATE_ONE_MINUTE][VALUE]))

            # sort_order == 0 is the default, so we'll leave it as a fall-through
            return sorted(self.cluster_data[HOSTNAMES].keys())

        if self.draw_data == self.draw_cluster_item:
            # sorts by DC by Host regardless
            return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][x][DC]+x, self.cluster_data[HOSTNAMES][y][DC]+y))
        pass


    def draw_host_data(self):
        (RESTY, RESTX) = self.data_window.getmaxyx()
        self.data_window.clear()
        self.draw_labelled_item(self.data_window, 0, 0, HOSTNAME, '', hilight=(self.sort_order == 0))
        self.draw_labelled_item(self.data_window, 0, 10, DC, '', hilight=(self.sort_order == 1))
        self.draw_labelled_item(self.data_window, 0, 16, 'Rack', '', hilight=(self.sort_order == 2))
        self.draw_labelled_item(self.data_window, 0, 22, 'Status', '')
        self.draw_labelled_item(self.data_window, 0, 31, 'Load', '', hilight=(self.sort_order == 3))
        self.draw_labelled_item(self.data_window, 0, 40, 'Comps', '', hilight=(self.sort_order == 4))
        self.draw_labelled_item(self.data_window, 0, 47, 'Rlat', '', hilight=(self.sort_order == 5))
        self.draw_labelled_item(self.data_window, 0, 52, 'Rrate', '', hilight=(self.sort_order == 6))
        self.draw_labelled_item(self.data_window, 0, 59, 'Wlat', '', hilight=(self.sort_order == 7))
        self.draw_labelled_item(self.data_window, 0, 64, 'Wrate', '', hilight=(self.sort_order == 8))

        host_list = self.sorted_host_key_order()
        y = 0
        for host in host_list:
            y += 1
            if not y < RESTY: continue
            data_set = self.cluster_data[HOSTNAMES][host]
            if not data_set[LIVE]: self.data_window.addnstr(y, 0, host.split('.')[0], 10, self.bad)
            else: self.data_window.addnstr(y, 0, host.split('.')[0], 10)
            self.draw_data_dict_item(y, 10, data_set, DC, length=5)
            self.draw_data_dict_item(y, 16, data_set, RACK, length=5)
            data_set[STATUS].draw(self.data_window, y, 22, length=6)
            data_set[LOAD].draw(self.data_window, y, 29, length=9)
            data_set[SEVERITY].draw(self.data_window, y, 40, length=5)
            data_set[READ_LATENCY_INSTANTANEOUS].draw(self.data_window, y, 46, length=5)
            data_set[READ_RATE_ONE_MINUTE].draw(self.data_window, y, 52, length=5)
            data_set[WRITE_LATENCY_INSTANTANEOUS].draw(self.data_window, y, 58, length=5)
            data_set[WRITE_RATE_ONE_MINUTE].draw(self.data_window, y, 64, length=5)
            if data_set.get(STATUS, None) == DEAD: self.draw_data_dict_item(y, 50, data_set, EXTENDED_STATUS)
        self.refresh and self.data_window.refresh()
        return

    def draw_cluster_item(self):
        self.data_window.clear()
        self.draw_labelled_item(self.data_window, 0, 0, HOSTNAME, '')
        self.draw_labelled_item(self.data_window, 0, 30, DC, '')
        self.draw_labelled_item(self.data_window, 0, 62, 'Cluster', '')
        host_list = self.sorted_host_key_order()
        cluster_total = None
        writer = ClusterObject(self.data_window, self.item, self.cluster_data[HOSTNAMES][host_list[0]][DC], 1)

        for host in host_list:
            if not writer.dc == self.cluster_data[HOSTNAMES][host][DC]:
                if cluster_total == None: cluster_total = writer.finish()
                elif getattr(cluster_total, 'count', None): cluster_total = map(sum, zip(cluster_total, writer.finish()))
                else: cluster_total += writer.finish()
                writer = ClusterObject(self.data_window, self.item, self.cluster_data[HOSTNAMES][host][DC], writer.row+1)
            writer.entry(self.cluster_data[HOSTNAMES][host], host.split('.')[0])
        if cluster_total == None: cluster_total = writer.finish()
        elif getattr(cluster_total, 'count', None):
            cluster_total = map(sum, zip(cluster_total, writer.finish()))
        else:
            cluster_total += writer.finish()
        if self.item == LOAD:
            value, label = self.size_convert(cluster_total)
            self.data_window.addstr(1, 60, writer.fmt.format(VALUE=value, LABEL=label))
        else:
            self.data_window.addstr(1, 60, writer.fmt.format(VALUE=cluster_total))

        self.refresh and self.data_window.refresh()
        return


    def draw_status(self):
        (RESTY, RESTX) = self.status_window.getmaxyx()
        status_message = 'Update frequency: %ds (%0.2f)' % (self.refresh_delay, self.last_refresh)
        l = len(status_message)
        t = len(self.title)
        if l+t+5 > RESTX:
            tx = l+1
            tlen = RESTX - l - 4
        else:
            tx = (RESTX - t) - 3
            tlen = t
        self.status_window.clear()
        self.status_window.addstr(status_message)
        try:
            self.status_window.addnstr(0, tx, self.title, tlen, curses.color_pair(3) | curses.A_STANDOUT)
        except:
            raise Exception('tx=%d, RESTX=%d, RESTY=%d' % (tx, RESTX, RESTY))
        self.refresh and self.status_window.refresh()
        return
    def stop_refresh(self):
        return self.redraw_lock.acquire()

    def start_refresh(self):
        try: return self.redraw_lock.release()
        except: return

class ClusterObject(object):
    '''Utility class to make printing a data item for a ring just a
    little neater'''
    def __init__(self, window, item, dc, row):
        self.window = window
        self.item = item
        self.dc = dc
        self.row = self.top_row = row
        self.maxy, self.maxx = self.window.getmaxyx()
        if isinstance(item, basestring): self.total = 0
        else: self.total = [0] * len(item)
        return

    def entry(self, obj, hostname):
        self.fmt = obj[self.item].default_format
        self.total += obj[self.item][VALUE]
        if self.row < self.maxy:
            try:
                self.window.addstr(self.row, 0, hostname)
                obj[self.item].draw(self.window, self.row, 15)
            except: debug(traceback.format_exc())
        self.row += 1
        return

    def finish(self):
        if self.top_row < self.maxy:
            value = self.total
            try:
                self.window.addstr(self.top_row, 30, self.dc)
                self.window.addstr(self.top_row, 35, self.fmt.format(VALUE=value))
            except: debug(traceback.format_exc())
        return self.total

helpstrings = [
    ('', 'Summary information in the first couple lines is for the entire cluster.'),
    ('', ''),
    ('q', 'Exit the program (immediately)'),
    ('c', 'Display cluster summary data'),
    ('h', 'Display host data'),
    ('s', 'Display severity (compaction) data'),
    ('l', 'Display load data'),
    ('r', 'Display read data'),
    ('w', 'Display write data'),
    ('', ''),
    ('+', 'Increase the delay between updates (takes effect after next update)'),
    ('-', 'Decrease the delay between updates (takes effect after next update)'),
    ('', ''),
    ('1-9', 'Column to sort on, OR switch value sets in read/write data'),
    ('<>', 'Previous/next sort column, OR switch value sets in read/write data'),
    ('', ''),
    ('?', 'This help screen'),
    ]


def display_help(topscr):
    (RESTY, RESTX) = topscr.getmaxyx()
    helpscr = topscr.subwin(RESTY-4, RESTX-4, 2, 2)
    (RESTY, RESTX) = helpscr.getmaxyx()
    helpscr.clrtobot()
    helpscr.box()
    y = 1
    for parts in helpstrings:
        y += 1
        helpscr.addnstr(y, 3, '%s: %s' % parts, RESTX-4)
    helpscr.addstr(RESTY-1, 3, 'Press any key to leave help')
    helpscr.refresh()
    key = helpscr.getkey()
    helpscr.erase()
    curses.doupdate()


def display_initial(topscr):
    (RESTY, RESTX) = topscr.getmaxyx()
    message = 'Please wait while I perform the initial data fetch'
    width = len(message) + 4
    helpscr = topscr.subwin(3, width, RESTY/2-1, (RESTX-width)/2)
    helpscr.clrtobot()
    helpscr.box()
    helpscr.addstr(1, 2, message)
    helpscr.refresh()
    helpscr.erase()
    curses.doupdate()


def main(stdscr, hostname):
    display_initial(stdscr)
    (RESTY, RESTX) = stdscr.getmaxyx()
    header_win = stdscr.subwin(5, RESTX, 0, 0)
    data_win = stdscr.subwin(RESTY-6, RESTX, 5, 0)
    status_win = stdscr.subwin(1, RESTX-19, RESTY-1, 0)
    target = Cluster(hostname, header_win, data_win, status_win)
    debug(str(target.cluster_data))
    if not target.cluster_data:
        debug('Unable to contact any seeds')
        raise SystemExit('Unable to contact any seeds')
    target.redraw_semaphore.release()
    cluster = threading.Thread(target = target)
    cluster.daemon = True
    cluster.start()
    stdscr.addstr(RESTY-1, RESTX-19, "Press '?' for help")
    stdscr.refresh()
    read_list = [(READ_RATE_ONE_MINUTE, 'Read Rate (1 minute)'),
                 (READ_RATE_FIVE_MINUTE, 'Read Rate (5 minutes)'),
                 (READ_RATE_FIFTEEN_MINUTE, 'Read Rate (15 minutes)'),
                 (READ_LATENCY_ONE_MINUTE, 'Read Latency (1 minute)'),
                 (READ_LATENCY_FIVE_MINUTE, 'Read Latency (5 minutes)'),
                 (READ_LATENCY_FIFTEEN_MINUTE, 'Read Latency (15 minutes)'),
                 ]
    write_list = [(WRITE_RATE_ONE_MINUTE, 'Write Rate (1 minute)'),
                  (WRITE_RATE_FIVE_MINUTE, 'Write Rate (5 minutes)'),
                  (WRITE_RATE_FIFTEEN_MINUTE, 'Write Rate (15 minutes)'),
                  (WRITE_LATENCY_ONE_MINUTE, 'Write Latency (1 minute)'),
                  (WRITE_LATENCY_FIVE_MINUTE, 'Write Latency (5 minutes)'),
                  (WRITE_LATENCY_FIFTEEN_MINUTE, 'Write Latency (15 minutes)'),
                  ]
    while 1:
        try:
            key = stdscr.getkey()
            if key == 'q': break
            elif key == '+': Cluster.refresh_delay = Cluster.refresh_delay + 1
            elif key == '-':
                if Cluster.refresh_delay > 1: Cluster.refresh_delay = Cluster.refresh_delay - 1
            elif key == 'c':
                target.draw_data = target.draw_cluster_data
                target.title = 'Cluster Summary'
            elif key == 'h':
                target.draw_data = target.draw_host_data
                target.title = 'Hosts Summary'
                target.sort_order = 0
            elif key == 's':
                target.item = SEVERITY
                target.draw_data = target.draw_cluster_item
                target.title = 'Compactions'
            elif key == 'l':
                target.item = LOAD
                target.draw_data = target.draw_cluster_item
                target.title = 'Load'
            elif key == 'r':
                target.sort_order = target.sort_order % len(read_list)
                target.draw_data = target.draw_cluster_item
                target.item, target.title = read_list[target.sort_order]
            elif key == 'w':
                target.sort_order = target.sort_order % len(write_list)
                target.draw_data = target.draw_cluster_item
                target.item, target.title = write_list[target.sort_order]
            elif key in '1234567890':
                value = (int(key) - 1 + 10) % 10
                if target.title in ['Compactions', 'Load']: pass
                elif target.title == 'Cluster Summary':
                    if value < 8 and value > -1: target.sort_order = value
                elif target.title == 'Hosts Summary':
                    if value < 10 and value > -1: target.sort_order = value
                elif target.title in [x[1] for x in read_list]:
                    if value < 6 and value > -1:
                        target.sort_order = value % len(read_list)
                        target.item, target.title = read_list[target.sort_order]
                elif target.title in [x[1] for x in write_list]:
                    if value < 6 and value > -1:
                        target.sort_order = value % len(write_list)
                        target.item, target.title = write_list[target.sort_order]
            elif key in '<>':
                if target.title in ['Compactions', 'Load']: pass
                elif target.title == 'Cluster Summary':
                    if key == '>': target.sort_order = (target.sort_order + 1) % 8
                    else: target.sort_order = (target.sort_order + 7) % 8
                elif target.title == 'Hosts Summary':
                    if key == '>': target.sort_order = (target.sort_order + 1) % 10
                    else: target.sort_order = (target.sort_order + 9) % 10
                elif target.title in [x[1] for x in read_list]:
                    if key == '>': target.sort_order = (target.sort_order + 1) % len(read_list)
                    else: target.sort_order = (target.sort_order + 5) % len(read_list)
                    target.item, target.title = read_list[target.sort_order]
                elif target.title in [x[1] for x in write_list]:
                    if key == '>': target.sort_order = (target.sort_order + 1) % len(write_list)
                    else: target.sort_order = (target.sort_order + 5) % len(write_list)
                    target.item, target.title = write_list[target.sort_order]

            if key == '?':
                target.stop_refresh()
                display_help(stdscr)
                target.start_refresh()
            target.redraw_semaphore.release()
        except KeyboardInterrupt: raise SystemExit
        except: pass
    return target

def one_shot(key, hostname):
    '''Extract the name of a status item, get that item, print it to stdout,
    and exit.

    Parameters:
      key      - Name of a status variable
      hostname - A host name.  Just one, really.

    Return value: does not return

    '''
    if not host_attribute_set.has_key(key):
        logging.fatal('No such status item: %s', key)
        exit(-1)
    function, args = host_attribute_set[key]
    data_item = function(hostname, *args)
    data_item()
    print data_item[VALUE]
    exit()

def tp_stat(key, hostname):
    '''Extract one value from JMX, kind of like nodetool tpstats.
    '''
    data_item = CursedStringDataAttribute(hostname, 'org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=%sStage,name=%sTasks' % key, 'Value')
    data_item()
    print data_item[VALUE]
    exit()

def random_stat(key, hostname, jmx_object):
    '''Extract one attribute value from an arbitrary JMX object.
    '''
    data_item = CursedStringDataAttribute(hostname, jmx_object, key)
    data_item()
    print data_item[VALUE]
    exit()

parser = optparse.OptionParser(description='Top-like program for Cassandra.  '
                               'The seed_host is used as the starting point to discover the cluster.',
                               usage = '%prog [options] seed_host')
parser.add_option('-d', '--debug', dest='debug', default=False, action='store_true')
parser.add_option('-o', '--one-shot', help='Variable name to extract from the server once.  Valid status variables are: ' + ' '.join(host_attribute_set.keys()))
parser.add_option('-t', '--tpstat', nargs=2, help='Variable and status to extract from the server (e.g. --tpstat ReadStage Pending)')

options, args = parser.parse_args()
if not args:
    parser.print_usage()
    exit(-1)
if options.debug: logging.basicConfig(level=logging.DEBUG)
else: logging.basicConfig(level=logging.WARNING)

if options.one_shot: one_shot(options.one_shot, args[0])
if options.tpstat: tp_stat(options.tpstat, args[0])

signal.signal(signal.SIGWINCH, sigwinch_handler)
old_tty = termios.tcgetattr(sys.stdin.fileno())
retdata = object()
try:
    retdata = curses.wrapper(main, args[0])
except KeyboardInterrupt:
    pass
except Exception:
    raise

termios.tcsetattr(sys.stdin.fileno(), termios.TCSANOW, old_tty)
#logging.debug(pprint.pformat(getattr(retdata, 'cluster_data', None)))
logging.debug('%s', 'live nodes')
logging.debug(getattr(retdata, 'hostnames', None))
logging.debug('dead nodes')
logging.debug(pprint.pformat(getattr(retdata, 'dead_nodes', None)))

logging.debug('debuginfo')
logging.debug(pprint.pformat(list(_debuginfo)))
--------------------------------------------------------------------------------
/poison_pill_tester:
--------------------------------------------------------------------------------
#!
/usr/bin/env python 2 | 3 | import json, sys, pprint 4 | 5 | for filename in sys.argv[1:]: 6 | longest = 0 7 | widest = 0 8 | long_item = None 9 | wide_item = None 10 | try: 11 | for row in json.load(open(filename)): 12 | try: 13 | length = len(str(row)) 14 | if length > longest: 15 | longest = length 16 | long_item = row 17 | except: 18 | pass 19 | try: 20 | length = len(row['columns']) 21 | if length > widest: 22 | widest = length 23 | wide_item = row 24 | except: 25 | pass 26 | try: print filename, widest, wide_item['key'], longest, long_item['key'] 27 | except: pass 28 | except: pass 29 | -------------------------------------------------------------------------------- /stop_cassandra_repairs: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | # Author: Brian Gallew or 4 | 5 | if [ -z "${1}" ] ; then 6 | print "Usage: ${0} hostname [hostname ...]" 7 | exit 8 | fi 9 | 10 | while test -n "${1}" ; do 11 | wget -q -O /dev/null "http://${1}:8081/invoke?operation=forceTerminateAllRepairSessions&objectname=org.apache.cassandra.db%3Atype%3DStorageService" 12 | shift 13 | done 14 | --------------------------------------------------------------------------------
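
The `wget` call in `stop_cassandra_repairs` is just MX4J's HTTP bridge invoking a JMX operation: the MBean `ObjectName` is percent-encoded into the query string. A minimal sketch of the same URL construction in Python (the helper name `mx4j_invoke_url` and its defaults are ours, not part of the repo; MX4J is assumed to listen on port 8081 as in the script):

```python
# Sketch only: rebuilds the MX4J "invoke" URL hard-coded in stop_cassandra_repairs.
MBEAN = 'org.apache.cassandra.db:type=StorageService'
OPERATION = 'forceTerminateAllRepairSessions'

def mx4j_invoke_url(host, port=8081, operation=OPERATION, mbean=MBEAN):
    """Return the MX4J HTTP URL that invokes a JMX operation on an MBean."""
    # MX4J expects the ObjectName percent-encoded: ':' -> %3A, '=' -> %3D.
    encoded = mbean.replace(':', '%3A').replace('=', '%3D')
    return 'http://%s:%d/invoke?operation=%s&objectname=%s' % (
        host, port, operation, encoded)
```

For `mx4j_invoke_url('node1')` this reproduces the exact URL the shell script fetches with `wget`; fetching it is what actually terminates the repair sessions on that node.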
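
The wide-row scan in `poison_pill_tester` boils down to tracking two rows per file: the one with the most columns and the one with the longest serialized form. A hypothetical repackaging of that core logic as a reusable function (the name `widest_and_longest` is ours; input rows follow the `sstable2json` shape, a list of dicts with `key` and `columns`):

```python
# Sketch of poison_pill_tester's scan loop as a function; names are our invention.
def widest_and_longest(rows):
    """Return (widest_key, longest_key) for sstable2json-style rows."""
    widest = longest = 0
    wide_key = long_key = None
    for row in rows:
        ncols = len(row.get('columns', []))   # column count -> "wide" row
        if ncols > widest:
            widest, wide_key = ncols, row['key']
        size = len(str(row))                  # serialized size -> "long" row
        if size > longest:
            longest, long_key = size, row['key']
    return wide_key, long_key
```

Note the two winners need not be the same row: a row with one huge column value can be the longest without being the widest, which is why the script reports both.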