├── README.md ├── cassandra_repair_scheduler.py ├── casstop ├── poison_pill_tester └── stop_cassandra_repairs /README.md: -------------------------------------------------------------------------------- 1 | cassandra_tools 2 | ======= 3 | 4 | # casstop 5 | 6 | "top"-like tool for Cassandra. It provides a real-time view of the state of your Cassandra cluster. Requires http://mx4j.sourceforge.net/. 7 | 8 | ## Usage 9 | ``` 10 | casstop $NODENAME [$NODENAME ...] 11 | ``` 12 | 13 | # stop_cassandra_repairs 14 | 15 | Cassandra repairs have an unfortunate tendency to hang, and there is no stock tool to kill off a hung repair, so it ties up resources on the affected nodes until they are restarted. stop_cassandra_repairs uses MX4J to stop any outstanding repairs on the nodes you give it. Requires http://mx4j.sourceforge.net/. 16 | 17 | ## Usage 18 | ``` 19 | stop_cassandra_repairs $HUNG_NODE [$HUNG_NODE ...] 20 | ``` 21 | 22 | # cassandra_repair_scheduler.py 23 | 24 | Script for scheduling repairs on your cluster. Requires 25 | https://github.com/BrianGallew/cassandra_range_repair to work.
26 | 27 | ## Basic Usage 28 | ``` 29 | echo '0 */4 * * * root /usr/local/bin/cassandra_repair_scheduler.py' >> /etc/crontab 30 | ``` 31 | 32 | ## Help 33 | ``` 34 | usage: cassandra_repair_scheduler.py [-h] [-v] [-d] [--syslog FACILITY] 35 | [--logfile FILENAME] [-H HOSTNAME] 36 | [-p PORT] [-U USERNAME] [-P PASSWORD] 37 | [-t TTL] [-k KEYSPACE] 38 | [--cqlversion CQLVERSION] 39 | [-r RANGE_REPAIR_TOOL] 40 | 41 | optional arguments: 42 | -h, --help show this help message and exit 43 | -v, --verbose Verbose output 44 | -d, --debug Debugging output 45 | --syslog FACILITY Send log messages to the syslog 46 | --logfile FILENAME Send log messages to a file 47 | -H HOSTNAME, --hostname HOSTNAME 48 | Hostname (default: mactheknife.local) 49 | -p PORT, --port PORT Port (default: 9160) 50 | -U USERNAME, --username USERNAME 51 | Username (if necessary) 52 | -P PASSWORD, --password PASSWORD 53 | Password. (prompt if user provided but not password) 54 | -t TTL, --ttl TTL TTL (default: 1728000) 55 | -k KEYSPACE, --keyspace KEYSPACE 56 | Keyspace to use (default: operations) 57 | --cqlversion CQLVERSION 58 | CQL version (default: 3.0.5) 59 | -r RANGE_REPAIR_TOOL, --range_repair_tool RANGE_REPAIR_TOOL 60 | Range repair tool path (default: 61 | /usr/local/bin/range_repair.py) 62 | --watch See the live repair status. 63 | ``` 64 | 65 | # poison_pill_tester 66 | 67 | Script for discovering wide rows. You'll need to convert your data to JSON first. 68 | 69 | ## Usage 70 | ``` 71 | nodetool snapshot keyspace suspect_column_family 72 | #(cd into the relevant snapshot directory) 73 | for d in *-Data.db 74 | do 75 | e=$(echo $d | sed s,-Data.db,.json,) 76 | sstable2json $d > $e 77 | done 78 | poison_pill_tester *.json 79 | ``` 80 | 81 | # MX4J 82 | 83 | http://mx4j.sourceforge.net/ is, among other things, a JMX<->HTML bridge. 84 | 85 | ## Installation 86 | 87 | MX4J is available via "apt-get install libmx4j-java" on Ubuntu (that's 88 | where it's currently used).
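Every tool in this collection talks to that bridge with plain HTTP GETs against the node's MX4J port. A rough sketch of what such a read looks like — the hostname and MBean below are illustrative only; port 8081 and the `getattribute`/`template=identity` URL shape are the ones casstop itself uses:

```python
# Minimal sketch of an MX4J attribute read, in the style casstop uses.
# Assumptions: the MX4J HTTP adaptor listens on port 8081 and the
# "identity" template returns raw XML; hostname/MBean are illustrative.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

URL_TEMPLATE = ('http://{host}:8081/getattribute'
                '?objectname={mbean}&attribute={attr}&template=identity')

def attribute_url(host, mbean, attr):
    """Build the MX4J URL that reads one attribute of one MBean."""
    return URL_TEMPLATE.format(host=host, mbean=mbean, attr=attr)

def read_attribute(host, mbean, attr, timeout=30):
    """Fetch the attribute as raw XML; the caller parses it (casstop
    uses xmltodict for this)."""
    return urlopen(attribute_url(host, mbean, attr), None, timeout).read()

# For example, a node's operation mode (the STATUS column in casstop):
# read_attribute('cassandra-01.example.com',
#                'org.apache.cassandra.db:type=StorageService',
#                'OperationMode')
```

(casstop's operation classes hit the same adaptor's `invoke` endpoint to call MBean operations rather than read attributes.)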
89 | 90 | Once that library is installed, you need to ensure that it's in Cassandra's 91 | load path, most easily done by symlinking it like this: 92 | 93 | ``` 94 | ln -s /usr/share/java/mx4j-tools.jar /usr/share/cassandra/lib/mx4j-tools.jar 95 | ``` 96 | 97 | Finally, you need to load it into Cassandra. Edit cassandra-env.sh and add 98 | the following line: 99 | 100 | ``` 101 | JVM_OPTS="${JVM_OPTS} -Dmx4jaddress=$(ip addr show dev eth0 | grep 'inet ' | sed -e s,inet,, -e 's,/.*,,')" 102 | ``` 103 | 104 | NB: that assumes you have only one address on eth0, and that eth0 is what 105 | you want MX4J listening on. 106 | -------------------------------------------------------------------------------- /cassandra_repair_scheduler.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | # Author: Brian Gallew or 4 | 5 | """ 6 | Run repairs on a regularly scheduled basis. Drop this in cron, not more 7 | often than hourly. 8 | 9 | Workflow: 10 | 1) Insert a record (with TTL) into the queue table. 11 | 2) Select all records from the queue table. 12 | 2a) If ours isn't the first one, delete ours and then exit. 13 | 3) Select all records from the status table. 14 | 4) If our status exists and is "running", delete our queue record and exit. 15 | 5) If our status exists and is "completed" and the completion time is too 16 | recent, delete our queue record and exit.
17 | 6) Create/replace our status with "running" (with TTL) 18 | 7) Delete our queue record 19 | 8) Run a repair 20 | 9) Replace our status with "completed" (with TTL) 21 | 22 | """ 23 | 24 | import logging, logging.handlers 25 | import argparse 26 | import platform 27 | import getpass 28 | import time 29 | import subprocess 30 | import cql 31 | import curses 32 | import curses.wrapper 33 | import threading 34 | 35 | COMPLETED = "Completed" 36 | DELAY = 'delay' 37 | 38 | class CqlWrapper(object): 39 | 40 | """Keep all of the CQL-specific stuff in here so we can have consistent 41 | retry handling, etc. 42 | 43 | Updates to SCHEMA may require updates to create_schema. 44 | """ 45 | SCHEMA = [ 46 | """CREATE KEYSPACE "{keyspace}" 47 | WITH replication = {{'class' : 'NetworkTopologyStrategy', 48 | {data_center_replication_map}}} 49 | AND durable_writes = false; 50 | """, 51 | """USE {keyspace};""", 52 | """CREATE TABLE "mutex" ( 53 | nodename varchar, 54 | data_center varchar, 55 | PRIMARY KEY ((nodename), data_center)) 56 | WITH comment='Poor MUTEX implementation' 57 | """, 58 | """CREATE TABLE "repair_status" ( 59 | nodename varchar, 60 | data_center varchar, 61 | repair_status varchar, 62 | PRIMARY KEY ((nodename), data_center)) 63 | WITH comment='Repair status of each node' 64 | """, 65 | ] 66 | GET_STATUS = """SELECT "repair_status" FROM "repair_status" 67 | WHERE "nodename" = :nodename AND "data_center" = :data_center""" 68 | GET_LOCAL_STATUS = """SELECT "nodename", "repair_status" FROM "repair_status" 69 | WHERE "data_center" = :data_center ALLOW FILTERING""" 70 | # This next statement could get ugly if you have 1000+ nodes.
71 | GET_ALL_STATUS = """SELECT "nodename", "data_center", "repair_status", WRITETIME("repair_status") FROM "repair_status" """ 72 | MUTEX_START = """INSERT INTO "mutex" ("nodename", "data_center") 73 | VALUES (:nodename, :data_center) USING TTL :ttl""" 74 | MUTEX_CHECK = """SELECT "nodename", "data_center" FROM "mutex" """ 75 | MUTEX_CLEANUP = """DELETE FROM "mutex" WHERE "nodename" = :nodename AND "data_center" = :data_center""" 76 | SELECT_ALL_DATACENTERS = """SELECT data_center FROM system.peers""" 77 | SELECT_MY_DATACENTER = """SELECT data_center FROM system.local""" 78 | REPAIR_START = """INSERT INTO "repair_status" ("nodename", "data_center", "repair_status") 79 | VALUES (:nodename, :data_center, 'Started') USING TTL :ttl""" 80 | REPAIR_UPDATE = """UPDATE "repair_status" USING TTL :ttl SET "repair_status" = :newstatus 81 | WHERE "nodename" = :nodename AND "data_center" = :data_center""" 82 | REPAIR_CLEANUP = """DELETE FROM "repair_status" WHERE "nodename" = :nodename AND "data_center" = :data_center""" 83 | 84 | def __init__(self, option_group): 85 | """Set up and manage our connection. 86 | :param option_group: result of CLI parsing 87 | """ 88 | self.option_group = option_group 89 | self.nodename = option_group.hostname 90 | self.conn = None 91 | try: 92 | self.standard_connection() 93 | except: 94 | self.create_schema() 95 | self.data_center = self.get_data_center() 96 | return 97 | 98 | def get_data_center(self): 99 | """Get our data center tag. 100 | :returns: data_center""" 101 | result = self.query_or_die( 102 | self.SELECT_MY_DATACENTER, "Looking for my datacenter") 103 | if not result: 104 | logging.fatal( 105 | "No data center in local data. Still bootstrapping?") 106 | exit(1) 107 | return result[0][0] 108 | 109 | def standard_connection(self): 110 | """Set up a connection to Cassandra. 
111 | """ 112 | logging.debug('connecting to %s', self.option_group.keyspace) 113 | self.conn = cql.connect(self.option_group.hostname, 114 | self.option_group.port, 115 | self.option_group.keyspace, 116 | user=self.option_group.username, 117 | password=self.option_group.password, 118 | cql_version=self.option_group.cqlversion) 119 | return 120 | 121 | def create_schema(self): 122 | """Creates the schema if it doesn't exist using the CQL in self.SCHEMA. 123 | Each query in there will be formatted with locals(), so if you 124 | update self.SCHEMA, be sure to update this function, too. 125 | """ 126 | logging.info('creating schema') 127 | self.conn = cql.connect(self.option_group.hostname, 128 | self.option_group.port, 129 | "system", 130 | user=self.option_group.username, 131 | password=self.option_group.password, 132 | cql_version=self.option_group.cqlversion) 133 | 134 | data_center = self.query_or_die(self.SELECT_ALL_DATACENTERS, 135 | "Unable to determine the local data center") 136 | if not data_center: 137 | logging.fatal( 138 | "No peers defined, repairs on a single-node cluster are silly") 139 | exit(0) 140 | 141 | # Cassandra doesn't support 'SELECT foo, 1 FROM ..." or DISTINCT, 142 | # so we have do something a little complicated to deduplicate the 143 | # results and then produce the desired string. 144 | data_center_replication_map = {} 145 | for row in data_center: 146 | data_center_replication_map[row[0]] = None 147 | data_center_replication_map = ", ".join( 148 | ["'%s':3" % x for x in data_center_replication_map]) 149 | 150 | # This declaration is just so that "keyspace" will appear in locals. 
151 | # pylint: disable=unused-variable 152 | keyspace = self.option_group.keyspace 153 | # pylint: enable=unused-variable 154 | for cql_query in self.SCHEMA: 155 | self.query(cql_query.format(**locals())) 156 | return 157 | 158 | def query_or_die(self, query_string, error_message, consistency_level="LOCAL_QUORUM", **kwargs): 159 | """Execute a query, on exception print an error message and exit. 160 | :param query_string: CQL to perform 161 | :param error_message: printed on error 162 | :param kwargs: dictionary to use for parameter substitution in the CQL 163 | """ 164 | try: 165 | return self.query(query_string, consistency_level=consistency_level, **kwargs) 166 | except Exception as e: 167 | logging.fatal("%s: %s", error_message, e) 168 | exit(1) 169 | 170 | def query(self, query_string, consistency_level="LOCAL_QUORUM", **kwargs): 171 | """Execute a query. 172 | :param query_string: CQL to perform 173 | :param kwargs: dictionary to use for parameter substitution in the CQL 174 | :returns: query results 175 | """ 176 | if not self.conn: 177 | self.standard_connection() 178 | cursor = self.conn.cursor() 179 | logging.debug("Query: %s, arguments: %s", query_string, str(kwargs)) 180 | cursor.execute( 181 | query_string.encode('ascii'), kwargs, consistency_level=consistency_level) 182 | data = cursor.fetchall() 183 | cursor.close() 184 | logging.debug(str(data)) 185 | return data 186 | 187 | def get_all_status(self): 188 | """Get the status of all repairs. 189 | """ 190 | while True: 191 | try: 192 | return self.query(self.GET_ALL_STATUS, consistency_level="ONE") 193 | except: 194 | self.close() 195 | time.sleep(1) 196 | return [] 197 | 198 | def close(self): 199 | """Shut down the connection gracefully.""" 200 | self.conn.close() 201 | self.conn = None 202 | return 203 | 204 | def check_should_run(self): 205 | """Check to see if it is appropriate to start up. 
206 | :returns: boolean 207 | """ 208 | logging.debug("Check to see if we're already running a repair") 209 | result = self.query_or_die( 210 | self.GET_STATUS, "Checking status", 211 | nodename=self.nodename, data_center=self.data_center) 212 | # If there's any result at all, either a run is in progress, or the 213 | # last completed run hasn't expired yet. Either way, bail. 214 | if result: 215 | logging.info("Repair in progress: %s", result[0][0]) 216 | return False 217 | 218 | logging.debug("Check to see if anyone else in the local ring is running a repair") 219 | result = self.query_or_die(self.GET_LOCAL_STATUS, 220 | "Checking local ring status", 221 | nodename=self.nodename, 222 | data_center=self.data_center) 223 | 224 | if result: 225 | already_running = [x[0] for x in result if x[1] != COMPLETED] 226 | if already_running: 227 | logging.info("Another node is repairing.: %s", already_running[0]) 228 | return False 229 | self.query_or_die(self.MUTEX_START, "Starting MUTEX", 230 | nodename=self.nodename, 231 | data_center=self.data_center, 232 | ttl=self.option_group.ttl) 233 | # Totally arbitrary delay here, because I don't trust C*. 
234 | logging.debug('Five second pause here') 235 | time.sleep(5) 236 | result = self.query_or_die(self.MUTEX_CHECK, "Checking MUTEX", 237 | consistency_level="ONE", 238 | data_center=self.data_center) 239 | if not result or not [x[0] for x in result if x[1] == self.data_center][0] == self.nodename: 240 | self.query(self.MUTEX_CLEANUP, nodename=self.nodename, data_center=self.data_center) 241 | return False 242 | return True 243 | 244 | def claim_repair(self): 245 | """Insert a row claiming that we're starting the repair, 246 | then remove the MUTEX.""" 247 | self.query_or_die(self.REPAIR_START, 248 | "Starting Repair", nodename=self.nodename, 249 | data_center=self.data_center, ttl=self.option_group.ttl) 250 | self.query_or_die(self.MUTEX_CLEANUP, 251 | "Dropping MUTEX record", 252 | nodename=self.nodename, 253 | data_center=self.data_center) 254 | self.close() 255 | return 256 | 257 | def run_repair(self): 258 | """Run the entire repair""" 259 | cmd = [self.option_group.range_repair_tool, 260 | "-D", self.data_center, 261 | "-H", self.nodename, 262 | "--dry-run"] # So we get a list of commands to run. 
if self.option_group.local: 264 | cmd.append("--local") 265 | logging.debug("getting repair steps, this may take a while") 266 | repair_steps = subprocess.check_output(cmd).split('\n') 267 | for line in repair_steps: 268 | if not line: 269 | continue 270 | step, repair_command = line.split(" ", 1) 271 | try: 272 | self.query(self.REPAIR_UPDATE, nodename=self.nodename, 273 | newstatus=step, data_center=self.data_center, 274 | ttl=self.option_group.ttl) 275 | except: 276 | logging.warning("Failed to update repair status, continuing anyway") 277 | self.close() # Individual repairs may be slow 278 | logging.debug(repair_command) 279 | subprocess.call(repair_command, shell=True) 280 | try: 281 | self.query(self.REPAIR_UPDATE, nodename=self.nodename, 282 | ttl=self.option_group.ttl, data_center=self.data_center, 283 | newstatus=COMPLETED) 284 | except: 285 | logging.warning("Failed to update repair status, continuing anyway") 286 | return 287 | 288 | def reset_repair_status(self): 289 | """Reset the repair status by removing the records from the database. 290 | """ 291 | self.query(self.MUTEX_CLEANUP, nodename=self.nodename, data_center=self.data_center) 292 | self.query(self.REPAIR_CLEANUP, nodename=self.nodename, data_center=self.data_center) 293 | return 294 | 295 | def status_update_loop(connection, options, status_dict): 296 | """Update a status dictionary periodically with the results of a CQL query. 297 | This never exits, and should be used inside a thread.
:param connection: CqlWrapper (used for queries) 299 | :param options: control dictionary 300 | :param status_dict: a dictionary which will hold all of the returned data 301 | """ 302 | while True: 303 | start_time = time.time() 304 | new_names = [] 305 | for row in connection.get_all_status(): 306 | status_dict[row[0]] = row 307 | new_names.append(row[0]) 308 | logging.debug("status_update_loop: status: %s", str(status_dict)) 309 | for nodename in status_dict.keys(): 310 | if nodename not in new_names: del status_dict[nodename] 311 | end_time = time.time() 312 | delta_time = options[DELAY] - (end_time - start_time) 313 | if delta_time > 0: time.sleep(delta_time) 314 | 315 | def row_sort_function(left, right): 316 | '''Function for sorting query results for the entire cluster. Using a real 317 | function because it'll be called a lot, and lambdas can get expensive. 318 | 319 | :param left: first item to check 320 | :param right: second item to check 321 | 322 | ''' 323 | return cmp(left[3], right[3]) 324 | 325 | def format_time(seconds): 326 | """Convert time in seconds to a human-readable value. 327 | :param seconds: Time to be converted 328 | :returns: formatted string 329 | """ 330 | if seconds < 60: 331 | return "{seconds:0.2f} seconds".format(seconds=seconds) 332 | seconds = seconds / 60.0 333 | if seconds < 60: 334 | return "{seconds:0.2f} minutes".format(seconds=seconds) 335 | seconds = seconds / 60.0 336 | if seconds < 24: 337 | return "{seconds:0.2f} hours".format(seconds=seconds) 338 | seconds = seconds / 24.0 339 | return "{seconds:0.2f} days".format(seconds=seconds) 340 | 341 | def screen_update_loop(window, options, status_dict): 342 | """Updates a curses window with data from the status dictionary. 343 | This never exits, and should be used inside a thread.
344 | :param window: curses window 345 | :param options: control dictionary 346 | :param status_dict: a dictionary which will hold all of the returned data 347 | """ 348 | status_message = "Running: {running:3d} Complete: {complete:3d} Refresh: {delay:d}" 349 | status_format = "{hostname:35s} {status:20s} {delay}" 350 | color_warning = curses.color_pair(1) 351 | color_bad = curses.color_pair(2) 352 | color_green = curses.color_pair(3) 353 | curses.init_pair(1, curses.COLOR_YELLOW, curses.COLOR_BLACK) 354 | curses.init_pair(2, curses.COLOR_RED, curses.COLOR_BLACK) 355 | curses.init_pair(3, curses.COLOR_GREEN, curses.COLOR_BLACK) 356 | 357 | while True: 358 | start_time = time.time() 359 | (RESTY, _) = window.getmaxyx() 360 | window.clear() 361 | display_data = [] 362 | complete_data = [] 363 | for row in status_dict: 364 | if status_dict[row][2] == COMPLETED: 365 | complete_data.append(status_dict[row]) 366 | else: 367 | display_data.append(status_dict[row]) 368 | 369 | window.addstr(1, 0, status_message.format(complete=len(complete_data), 370 | running=len(display_data), 371 | delay=options[DELAY])) 372 | display_data.sort(row_sort_function) 373 | complete_data.sort(row_sort_function) 374 | window.addstr(3, 0, status_format.format(hostname="Hostname", 375 | status="Status", 376 | delay="Time since last update"), curses.A_BOLD) 377 | display_data.extend(complete_data) 378 | logging.debug("all data: %s", str(display_data)) 379 | current_row = 4 380 | for line in display_data: 381 | if current_row > RESTY-1: 382 | break 383 | delta = time.time() - (line[3]/1000000.0) 384 | if delta > 4*3600 and not line[2] == COMPLETED: 385 | attribute = color_bad 386 | elif delta > 2*3600 and not line[2] == COMPLETED: 387 | attribute = color_warning 388 | else: 389 | attribute = color_green 390 | window.addstr(current_row, 0, status_format.format(hostname=line[0], 391 | status=line[2], 392 | delay=format_time(delta)), 393 | attribute) 394 | current_row += 1 395 | delta_time = 
options[DELAY] - (time.time() - start_time) 396 | window.refresh() 397 | if delta_time > 0: time.sleep(delta_time) 398 | 399 | def watch(main_window, connection): 400 | """Query Cassandra for the current repair status and display. 401 | :param main_window: curses window 402 | :param connection: CqlWrapper object 403 | """ 404 | status_dict = {} 405 | option_dict = {} 406 | option_dict[DELAY] = 5 407 | update_thread = threading.Thread(target=status_update_loop, 408 | args=(connection, option_dict, status_dict)) 409 | update_thread.daemon = True 410 | update_thread.start() 411 | 412 | redraw_thread = threading.Thread(target=screen_update_loop, 413 | args=(main_window, option_dict, status_dict)) 414 | redraw_thread.daemon = True 415 | redraw_thread.start() 416 | 417 | while 1: 418 | try: 419 | key = main_window.getkey() 420 | if key == 'q': break 421 | elif key == '+': option_dict[DELAY] = option_dict[DELAY] + 1 422 | elif key == '-' and option_dict[DELAY] > 1: option_dict[DELAY] = option_dict[DELAY] - 1 423 | except KeyboardInterrupt: raise SystemExit 424 | except: pass 425 | return 426 | 427 | 428 | def setup_logging(option_group): 429 | """Sets up logging in a syslog format by log level 430 | :param option_group: options as returned by the argument parser 431 | """ 432 | stderr_log_format = "%(levelname) -8s %(asctime)s %(funcName)s line:%(lineno)d: %(message)s" 433 | file_log_format = "%(asctime)s - %(levelname)s - %(message)s" 434 | logger = logging.getLogger() 435 | if option_group.debug: 436 | logger.setLevel(level=logging.DEBUG) 437 | elif option_group.verbose: 438 | logger.setLevel(level=logging.INFO) 439 | else: 440 | logger.setLevel(level=logging.WARNING) 441 | 442 | # First, clear out any default handlers 443 | for handler in logger.handlers: 444 | logger.removeHandler(handler) 445 | 446 | if option_group.syslog: 447 | # Use standard format here because timestamp and level will be added by 448 | # syslogd.
logger.addHandler(logging.handlers.SysLogHandler(facility=option_group.syslog)) 450 | if option_group.logfile: 451 | logger.addHandler(logging.FileHandler(option_group.logfile)) 452 | logger.handlers[-1].setFormatter(logging.Formatter(file_log_format)) 453 | if not logger.handlers: 454 | logger.addHandler(logging.StreamHandler()) 455 | logger.handlers[-1].setFormatter(logging.Formatter(stderr_log_format)) 456 | return 457 | 458 | 459 | def cli_parsing(): 460 | """Parse the command line. 461 | :returns: option set 462 | """ 463 | parser = argparse.ArgumentParser() 464 | parser.add_argument("-v", "--verbose", action='store_true', 465 | default=False, help="Verbose output") 466 | parser.add_argument("-d", "--debug", action='store_true', 467 | default=False, help="Debugging output") 468 | parser.add_argument("--syslog", metavar="FACILITY", 469 | help="Send log messages to the syslog") 470 | parser.add_argument("--logfile", metavar="FILENAME", 471 | help="Send log messages to a file") 472 | parser.add_argument("-H", "--hostname", default=platform.node(), 473 | help="Hostname (default: %(default)s)") 474 | parser.add_argument("-p", "--port", default=9160, type=int, 475 | help="Port (default: %(default)d)") 476 | parser.add_argument("-U", "--username", 477 | help="Username (if necessary)") 478 | parser.add_argument("-P", "--password", 479 | help="Password. 
(prompt if user provided but not password)") 480 | parser.add_argument("-t", "--ttl", default=3600 * 24 * 20, type=int, 481 | help="TTL (default: %(default)d)") 482 | parser.add_argument("-k", "--keyspace", default="operations", 483 | help="Keyspace to use (default: %(default)s)") 484 | parser.add_argument("--cqlversion", default="3.0.5", 485 | help="CQL version (default: %(default)s)") 486 | parser.add_argument("-r", "--range_repair_tool", 487 | default="/usr/local/bin/range_repair.py", 488 | help="Range repair tool path (default: %(default)s)") 489 | parser.add_argument("--local", default=False, action="store_true", 490 | help="Run the repairs in the local ring only") 491 | parser.add_argument("--watch", action="store_true", default=False, 492 | help="Watch the live repair status") 493 | parser.add_argument("--reset", action="store_true", default=False, 494 | help="Reset the repair status for the host") 495 | options = parser.parse_args() 496 | setup_logging(options) 497 | if options.username and not options.password: 498 | options.password = getpass.getpass( 499 | 'Password for %s: ' % options.username) 500 | return options 501 | 502 | 503 | def main(): 504 | """Main entry point. Runs the actual program here.""" 505 | logging.debug('main') 506 | options = cli_parsing() 507 | connection = CqlWrapper(options) 508 | if options.reset: 509 | connection.reset_repair_status() 510 | exit() 511 | if not options.watch: 512 | if connection.check_should_run(): 513 | connection.claim_repair() 514 | # Arguably, this should not be done in the connection. 515 | connection.run_repair() 516 | else: 517 | curses.wrapper(watch, connection) 518 | connection.close() 519 | 520 | return 521 | 522 | 523 | if __name__ == '__main__': 524 | main() 525 | -------------------------------------------------------------------------------- /casstop: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | 3 | # Author: Brian Gallew or 4 | 5 | import sys, xmltodict, urllib2, optparse, threading, re, curses, curses.wrapper 6 | import time, socket, json, logging, collections, traceback, signal, termios 7 | import pprint 8 | 9 | _INTERNED = ['Cluster', 'DC', 'EXTENDED_STATUS', 'Hostname', 'HOSTNAMES', 10 | 'LIVE', 'Load', 'READ_LATENCY_FIFTEEN_MINUTE', 'LABEL', 11 | 'READ_LATENCY_FIVE_MINUTE', 'READ_LATENCY_INSTANTANEOUS', 12 | 'READ_LATENCY_ONE_MINUTE', 'READ_RATE_FIFTEEN_MINUTE', 13 | 'READ_RATE_FIVE_MINUTE', 'READ_RATE_ONE_MINUTE', 'Severity', 'STATUS', 14 | 'WRITE_LATENCY_FIFTEEN_MINUTE', 'WRITE_LATENCY_FIVE_MINUTE', 15 | 'WRITE_LATENCY_INSTANTANEOUS', 'WRITE_LATENCY_ONE_MINUTE', 16 | 'WRITE_RATE_FIFTEEN_MINUTE', 'WRITE_RATE_FIVE_MINUTE', 'DEAD', 17 | 'WRITE_RATE_ONE_MINUTE', 'PendingTasks', 'read_latency_averages', 18 | 'write_latency_averages', 'RACK', 'CLUSTER_NAME', 'Compactions', 19 | 'ITEM', 'JAVA_OBJECT', 'URL', 'VALUE', 'OPERATION','ONE', 'FIVE', 'FIFTEEN', 'POPS', 20 | ] 21 | 22 | for value in _INTERNED: locals()[value.upper()] = value 23 | 24 | _debuginfo = collections.deque() 25 | def debug(item): 26 | _debuginfo.append(item) 27 | if len(_debuginfo) > 160: del _debuginfo[0] 28 | return 29 | 30 | def sigwinch_handler(n, frame): 31 | curses.initscr() 32 | return 33 | 34 | class MovingAverages(object): 35 | '''Handle moving averages as often displayed by programs like top(8).''' 36 | def __init__(self): 37 | self.queue = collections.deque() # Where we keep our data stashed away 38 | self.one = self.five = self.fifteen = 0.0 39 | return 40 | def add(self, value): 41 | '''Add a new value, timestamped appropriately. 
Throw away any old 42 | values, then re-compute the moving averages.''' 43 | now = time.time() 44 | self.queue.appendleft((value, now)) 45 | # These two lines discard old stuff 46 | old = now - 900 # 15 minutes 47 | while self.queue and self.queue[-1][1] < old: del self.queue[-1] 48 | total = 0.0 49 | count = 0 50 | 51 | then = now - 60 52 | while count < len(self.queue): 53 | value, timestamp = self.queue[count] 54 | if timestamp < then: break 55 | count += 1 56 | total += value 57 | self.one = total/count 58 | 59 | then = now - 300 60 | while count < len(self.queue): 61 | value, timestamp = self.queue[count] 62 | if timestamp < then: break 63 | count += 1 64 | total += value 65 | self.five = total/count 66 | 67 | while count < len(self.queue): 68 | value, timestamp = self.queue[count] 69 | count += 1 70 | total += value 71 | self.fifteen = total/count 72 | return self 73 | 74 | 75 | class CursedIntDataAttribute(dict): 76 | url_template = 'http://{Hostname:s}:8081/{OPERATION:s}?objectname={JAVA_OBJECT:s}&attribute={ITEM:s}&operation={ITEM:s}&template=identity' 77 | bean_designator = 'MBean' 78 | return_value_designators = ['Attribute', '@value'] 79 | default_value = 0 80 | operation = 'getattribute' 81 | datatype = int 82 | default_format = '{VALUE:>4d}' 83 | def __init__(self, hostname, java_object, item, *args, **kwargs): 84 | '''Set some default values for the dictionary, largely for debugging and interpolation purposes''' 85 | dict.__init__(self, *args, **kwargs) 86 | self[HOSTNAME] = hostname 87 | self[JAVA_OBJECT] = java_object 88 | self[ITEM] = item 89 | self[OPERATION] = self.operation # This seems silly, but it lets subclasses override while still letting us do interpolation. 
90 | self[URL] = self.url_template.format(**self) 91 | self[VALUE] = self.default_value 92 | return None 93 | 94 | def __call__(self): 95 | '''Make the requisite HTTP request to get a new data item, storing the 96 | coerced result into self[VALUE] (or store the default value if some part of 97 | the process fails).''' 98 | try: 99 | data = {} 100 | fh = urllib2.urlopen(self[URL], None, 30) 101 | data_string = fh.read() 102 | fh.close() 103 | data = xmltodict.parse(data_string) 104 | if not data: 105 | debug('%(Hostname)s:%(ITEM)s.__call__: no results returned for %(URL)s' % self) 106 | self[VALUE] = self.default_value 107 | else: 108 | self[VALUE] = self.datatype(data[self.bean_designator][self.return_value_designators[0]][self.return_value_designators[1]]) 109 | debug('%(Hostname)s:%(ITEM)s.__call__: set value to %(VALUE)s' % self) 110 | except Exception as e: 111 | debug(('%(Hostname)s:%(ITEM)s.__call__: Unable to load data for %(URL)s: ' % self) + str(e) + str(data)) 112 | self[VALUE] = self.default_value 113 | return self[VALUE] 114 | 115 | def draw(self, window, y, x, color=0, warning=None, critical=None, newfmt=None, length=0): 116 | '''Standard display method for these values. 117 | 118 | (window,y,x) are the expected curses items. 119 | 120 | "color" is the curses attribute set to use by default. I'm 121 | cheating big-time here and assuming that the default color should 122 | be the result of curses.color_pair(0) AND that the result of that 123 | will always be 0. This is probably FRAGILE. 124 | 125 | "warning" and "critical", if set, should be a list/tuple where the 126 | first item is a test value and the second item is a curses 127 | attribute set. If either of the tests is true, the appropriate 128 | attribute set will override the default attribute set. 129 | 130 | newfmt is a format string using the new style of 131 | formatting. 132 | 133 | length, if greater than zero, will guarantee that the displayed 134 | string does not exceed a certain length.
135 | 136 | ''' 137 | debug('draw: keys=%s' % str(self.keys())) 138 | if newfmt: display_value = newfmt.format(**self) 139 | else: display_value = str(self) 140 | if critical and self[VALUE] > critical[0]: color = critical[1] 141 | elif warning and self[VALUE] > warning[0]: color = warning[1] 142 | if length: window.addnstr(y, x, display_value, length, color) 143 | else: window.addstr(y, x, display_value, color) 144 | return 145 | 146 | 147 | def __add__(self, other): 148 | try: return self[VALUE] + other 149 | except: return self[VALUE] + other[VALUE] 150 | def __div__(self, other): 151 | try: return self[VALUE] / other 152 | except: return self[VALUE] / other[VALUE] 153 | def __str__(self): return self.default_format.format(**self) 154 | def type_coercion_data_dict_to_int(self, datastring): 155 | try: 156 | return sum(eval(datastring.replace('=', ':')).values()) 157 | except: 158 | debug('CursedIntDictDataOperation.type_coercion: unable to eval %s' % datastring) 159 | return 0 160 | 161 | class CursedStringDataAttribute(CursedIntDataAttribute): 162 | datatype = str 163 | default_value = 'nodata' 164 | default_format = '{VALUE:s}' 165 | 166 | class CursedFloatDataAttribute(CursedIntDataAttribute): 167 | datatype = float 168 | default_value = 0.0 169 | default_format = '{VALUE:>6.2f}' 170 | 171 | class CursedIntDictDataOperation(CursedIntDataAttribute): 172 | '''This is designed to invoke a JMX function which returns a (possibly 173 | ordered) Dict where all the values are INTs. 
It will use the sum of those 174 | values as its result.''' 175 | datatype = CursedIntDataAttribute.type_coercion_data_dict_to_int 176 | bean_designator = 'MBeanOperation' 177 | return_value_designators = ['Operation', '@return'] 178 | operation = 'invoke' 179 | default_format = '{VALUE:>8d}' 180 | 181 | 182 | class CursedSeverity(CursedFloatDataAttribute): 183 | def __init__(self, hostname): 184 | '''Cheating here for no good reason other than to emphasize the specialness of this one.''' 185 | CursedFloatDataAttribute.__init__(self, hostname, 'org.apache.cassandra.db:type=CompactionManager', PENDINGTASKS) 186 | self[COMPACTIONS] = CursedStringDataAttribute(hostname, 'org.apache.cassandra.db:type=CompactionManager', 'Compactions') 187 | return None 188 | 189 | def __call__(self): 190 | '''Kick off a thread to get the compaction data before we get our own, then rejoin it. Parallelism FTW!''' 191 | compact = self[COMPACTIONS] 192 | t = threading.Thread(target=compact) 193 | t.daemon = True 194 | t.start() 195 | CursedFloatDataAttribute.__call__(self) 196 | t.join() 197 | total = 0.0 198 | done = 0.0 199 | if len(compact[VALUE]) > 4: compact[VALUE] = compact[VALUE][2:-2].replace('}','') 200 | for row in compact[VALUE].split(','): 201 | if 'total=' in row: total += int(row.split('total=')[-1]) 202 | if 'completed=' in row: done += int(row.split('completed=')[-1]) 203 | if total: percent = (total - done)/total 204 | else: percent = 1 205 | self[VALUE] = (self[VALUE] - len(compact[VALUE].split('},'))) + percent 206 | 207 | 208 | class CursedByteDataAttribute(CursedFloatDataAttribute): 209 | default_format = '{VALUE:>6.2f} {LABEL}' 210 | def __init__(self, *args, **kwargs): 211 | CursedFloatDataAttribute.__init__(self, *args, **kwargs) 212 | self[LABEL] = 'B' 213 | return None 214 | def __str__(self): 215 | '''Bytes are useful things, but my mind thinks in megs, gigs, etc.
216 | ''' 217 | raw = self[VALUE] 218 | label = 'B' 219 | for l in ['KB', 'MB', 'GB', 'TB', 'PB']: 220 | if raw < 1024: break 221 | raw = raw/1024.0 222 | label = l 223 | return self.default_format.format(VALUE=raw, LABEL=label) 224 | 225 | 226 | class CursedLatencyAverage(CursedIntDataAttribute): 227 | datatype = float 228 | default_value = 0.0 229 | default_format = '{ONE:>6.2f}/{FIVE:>6.2f}/{FIFTEEN:>6.2f}' 230 | def __init__(self, *args, **kwargs): 231 | CursedIntDataAttribute.__init__(self, *args, **kwargs) 232 | self.averages = MovingAverages() 233 | self[VALUE] = 0.0 234 | self[ONE] = 0.0 235 | self[FIVE] = 0.0 236 | self[FIFTEEN] = 0.0 237 | return None 238 | def __call__(self): 239 | '''Cassandra hard-codes latency to be measured in MICROseconds. I want to 240 | keep track of, and display in, seconds. 241 | 242 | ''' 243 | raw = CursedIntDataAttribute.__call__(self)/1000000.0 244 | self[VALUE] = raw 245 | self.averages.add(raw) 246 | self[ONE] = self.averages.one 247 | self[FIVE] = self.averages.five 248 | self[FIFTEEN] = self.averages.fifteen 249 | return raw 250 | 251 | def draw(self, window, y, x, color=0, warning=None, critical=None, newfmt=None, length=0, averages=False): 252 | '''Standard display method for these values. 253 | 254 | (window,y,x) are the expected curses items. 255 | 256 | "color" is the curses attribute set to use by default. I'm 257 | cheating big-time here and assuming that the default color should 258 | be the result of curses.color_pair(0) AND that the result of that 259 | will always be 0. This is probably FRAGILE. 260 | 261 | "warning" and "critical", if set, should be a list/tuple where the 262 | first item is a test value and the second item is a curses 263 | attribute set. If either of the tests is true, the appropriate 264 | attribute set will override the default attribute set. 265 | 266 | newfmt is a format string using the new style of 267 | formatting. 
268 | 269 | length, if greater than zero, will guarantee that the displayed 270 | string does not exceed a certain length. 271 | 272 | ''' 273 | debug('draw: keys=%s' % str(self.keys())) 274 | if newfmt: display_value = newfmt.format(**self) 275 | elif averages: display_value = '/'.join([self.default_format]*3).format(**self) 276 | else: display_value = self.default_format.format(**self) 277 | if critical and self[VALUE] > critical[0]: color = critical[1] 278 | elif warning and self[VALUE] > warning[0]: color = warning[1] 279 | if length: window.addnstr(y, x, display_value, length, color) 280 | else: window.addstr(y, x, display_value, color) 281 | return 282 | 283 | 284 | # This is here because 285 | # 1) It has to be after all of the individual JMX object type declarations, 286 | # 2) it has to be before CursedCluster. 287 | # 288 | # Its purpose is to provide a single list of keys and functions. Each of 289 | # the functions is initialized as function(hostname, *args) 290 | host_attribute_set = { 291 | LOAD: (CursedByteDataAttribute, ('org.apache.cassandra.db:type=StorageService', LOAD)), 292 | SEVERITY: (CursedSeverity, ()), 293 | STATUS: (CursedStringDataAttribute, ('org.apache.cassandra.db:type=StorageService', 'OperationMode')), 294 | READ_LATENCY_INSTANTANEOUS: (CursedLatencyAverage, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency', '75thPercentile')), 295 | READ_RATE_ONE_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency', 'OneMinuteRate')), 296 | READ_RATE_FIVE_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency', 'FiveMinuteRate')), 297 | READ_RATE_FIFTEEN_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency', 'FifteenMinuteRate')), 298 | WRITE_LATENCY_INSTANTANEOUS: (CursedLatencyAverage, 
('org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency', '75thPercentile')), 299 | WRITE_RATE_ONE_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency', 'OneMinuteRate')), 300 | WRITE_RATE_FIVE_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency', 'FiveMinuteRate')), 301 | WRITE_RATE_FIFTEEN_MINUTE: (CursedFloatDataAttribute, ('org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency', 'FifteenMinuteRate')), 302 | } 303 | 304 | class CursedCluster(CursedStringDataAttribute): 305 | '''This is a really specialized version of CursedStringDataAttribute, since 306 | there should be only one *and* it's going to do a lot of string 307 | processing and item creation. 308 | 309 | ''' 310 | datatype = str 311 | default_value = '' 312 | default_format = '{VALUE}' 313 | ENDPOINT_SPLITTER = re.compile('^/', re.MULTILINE).split 314 | 315 | def __init__(self, hostname, delay=300): 316 | '''In addition to the superclass startup, we extract the value of delay 317 | from the passed in arguments (defaulting to 300), set a couple of 318 | instance variables, load up the initial data, and fire up the refresh loop. 319 | ''' 320 | self.delay = delay 321 | CursedStringDataAttribute.__init__(self, hostname, 'org.apache.cassandra.net:type=FailureDetector', 'AllEndpointStates') 322 | self[HOSTNAMES] = {} 323 | self[POPS] = [] 324 | # Get the cluster name, it should never change! 325 | self[CLUSTER_NAME] = CursedStringDataAttribute(hostname, 326 | 'org.apache.cassandra.db:type=StorageService', 327 | 'ClusterName')() 328 | self() # Load the data up once! 
        t = threading.Thread(target=self._refresh_loop)
        t.daemon = True
        t.start()
        return None
    def __call__(self):
        '''As well as the standard superclass functionality, we parse the returned
        value into a bunch of hosts, each with a few static attributes, and
        initialize each host with a set of data.  Adding more data items to
        check should be done HERE.

        '''
        data_string = CursedIntDataAttribute.__call__(self)
        new_host_list = {}
        new_pop_list = {}
        for row in self.ENDPOINT_SPLITTER(data_string):
            try:
                debug(row)
                if not row: continue
                if 'STATUS:remov' in row: continue
                row_pieces = row.split('\n ')
                complete_address_set = socket.gethostbyaddr(row_pieces[0])
                debug('get_gossip_information: gethostbyaddr returned %s' % str(complete_address_set))
                endpoint = complete_address_set[0].replace('.cint', '')
                new_host = new_host_list.setdefault(endpoint, {})
                new_host[LIVE] = True
                for line in row_pieces[1:]:
                    key, value = line.split(':', 1)
                    if key == RACK:
                        rack_string = value.strip()
                        new_host[key] = ' '*(5-len(rack_string)) + rack_string
                    if key == DC:
                        new_pop_list[value] = True
                        new_host[key] = value.strip()
                for key in host_attribute_set:
                    function, args = host_attribute_set[key]
                    new_host[key] = function(endpoint, *args)
                new_host[READ_RATE_ONE_MINUTE].default_format = "{VALUE:5.0f}"
                new_host[WRITE_RATE_ONE_MINUTE].default_format = "{VALUE:5.0f}"

            except Exception as e: debug('CursedCluster.__call__: ' + str(e))
        if new_host_list: self[HOSTNAMES] = new_host_list
        if new_pop_list: self[POPS] = new_pop_list.keys()
        return self
    def _refresh_loop(self):
        '''Simple little infinite loop defined on the class because I think it's
        cleaner than a lambda.

        '''
        while 1:
            time.sleep(self.delay)
            self()

class Cluster(object):
    # this is kind of an evil faux-function-definition
    ENDPOINT_SPLITTER = re.compile('^/', re.MULTILINE).split
    refresh_delay = 3
    refresh = True

    def __init__(self, hostname, header_window, data_window, status_window):
        self.compaction_averages = MovingAverages()
        self.header_window = header_window
        self.data_window = data_window
        self.status_window = status_window

        self.last_refresh = 0.0
        self.sort_order = 0
        self.cluster_data = CursedCluster(hostname)
        self.dead_nodes = []
        self.item = SEVERITY
        self.redraw_semaphore = threading.Semaphore()
        self.redraw_lock = threading.Lock()

        # xterm title escape sequence: ESC ]0; title BEL
        sys.stdout.write('\033]0; Cassandra Top - %s \007' % self.cluster_data[CLUSTER_NAME])
        self.draw_data = self.draw_cluster_data
        self.title = 'Cluster Summary'
        self.good = curses.color_pair(0)
        self.warning = curses.color_pair(1)
        self.bad = curses.color_pair(2)
        self.green = curses.color_pair(3)
        curses.init_pair(1, curses.COLOR_YELLOW, curses.COLOR_BLACK)
        curses.init_pair(2, curses.COLOR_RED, curses.COLOR_BLACK)
        curses.init_pair(3, curses.COLOR_GREEN, curses.COLOR_BLACK)
        return

    def dispatch(self, group, function):
        group.append(threading.Thread(target=function))
        group[-1].daemon = True
        group[-1].start()
        return
    def rejoin(self, group):
        for t in group: t.join()
        return
    def __call__(self):
        '''this is the updating loop'''
        self.redraw_semaphore.acquire() # Prevent the drawing routine from
                                        # doing anything until we have
                                        # data.
        drawer = threading.Thread(target = self.draw)
        drawer.daemon = True
        drawer.start()
        while 1:
            now = time.time()
            thread_group = []
            for host in self.cluster_data[HOSTNAMES].values():
                for item in host.values():
                    if isinstance(item, dict): self.dispatch(thread_group, item)
            self.rejoin(thread_group)
            for hostname in self.cluster_data[HOSTNAMES]:
                host = self.cluster_data[HOSTNAMES][hostname]
                host[LIVE] = True
                if host[STATUS][VALUE] != 'NORMAL':
                    host[LIVE] = False
                    debug('%s marked down because "%s" is not "NORMAL"' % (hostname, host[STATUS]))
                if host[LOAD][VALUE] == 0.0:
                    host[LIVE] = False
                    debug('%s marked down because the load is 0.0 (may just be new)' % hostname)
            then = time.time()
            self.stop_refresh()
            self.redraw_semaphore.release()
            self.start_refresh()
            self.last_refresh = then - now
            left = self.refresh_delay - self.last_refresh
            if left > 0: time.sleep(left)
        return

    def draw(self):
        while 1:
            self.redraw_semaphore.acquire()
            try:
                self.draw_header()
                self.draw_data()
                self.draw_status()
            except: debug(traceback.format_exc())
        return

    def draw_labelled_item(self, window, starty, startx, label, value, warning=None, critical=None, fmt=None, hilight=False, length=0):
        if fmt: display_value = fmt % value
        else: display_value = str(value)
        if hilight: window.addstr(starty, startx, label, curses.A_BOLD | self.green)
        else: window.addstr(starty, startx, label, curses.A_BOLD)
        if critical and value > critical: color = self.bad
        elif warning and value > warning: color = self.warning
        else: color = self.good
        if length: window.addnstr(display_value, length, color)
        else: window.addstr(display_value, color)
        return

    def draw_header(self):
        dead_count = len([x for x in self.cluster_data[HOSTNAMES].values() if not x[LIVE]])
        host_count = len(self.cluster_data[HOSTNAMES]) + dead_count
        compaction_data = 0.0
        read_rate_one = 0.0
        read_rate_five = 0.0
        read_rate_fifteen = 0.0
        write_rate_one = 0.0
        write_rate_five = 0.0
        write_rate_fifteen = 0.0
        read_latency_one = 0.0
        read_latency_five = 0.0
        read_latency_fifteen = 0.0
        write_latency_one = 0.0
        write_latency_five = 0.0
        write_latency_fifteen = 0.0
        for x in self.cluster_data[HOSTNAMES].values():
            compaction_data += x[SEVERITY][VALUE]
            read_rate_one += x[READ_RATE_ONE_MINUTE][VALUE]
            read_rate_five += x[READ_RATE_FIVE_MINUTE][VALUE]
            read_rate_fifteen += x[READ_RATE_FIFTEEN_MINUTE][VALUE]
            write_rate_one += x[WRITE_RATE_ONE_MINUTE][VALUE]
            write_rate_five += x[WRITE_RATE_FIVE_MINUTE][VALUE]
            write_rate_fifteen += x[WRITE_RATE_FIFTEEN_MINUTE][VALUE]
            read_latency_one += x[READ_LATENCY_INSTANTANEOUS][ONE]
            read_latency_five += x[READ_LATENCY_INSTANTANEOUS][FIVE]
            read_latency_fifteen += x[READ_LATENCY_INSTANTANEOUS][FIFTEEN]
            write_latency_one += x[WRITE_LATENCY_INSTANTANEOUS][ONE]
            write_latency_five += x[WRITE_LATENCY_INSTANTANEOUS][FIVE]
            write_latency_fifteen += x[WRITE_LATENCY_INSTANTANEOUS][FIFTEEN]
        self.compaction_averages.add(compaction_data)
        self.header_window.clear()
        self.draw_labelled_item(self.header_window, 0, 0, 'Live Nodes: ', len(self.cluster_data[HOSTNAMES]))
        self.draw_labelled_item(self.header_window, 1, 0, 'Dead Nodes: ', dead_count, warning=host_count*0.25, critical=host_count*0.5)
        self.draw_labelled_item(self.header_window, 0, 16, 'Compactions: ', (self.compaction_averages.one, self.compaction_averages.five, self.compaction_averages.fifteen), fmt='%5.2f/%5.2f/%5.2f')
        self.draw_labelled_item(self.header_window, 0, 53, 'Rrate: ', (read_rate_one, read_rate_five, read_rate_fifteen), fmt='%5.0f/%5.0f/%5.0f')
        self.draw_labelled_item(self.header_window, 1, 53, 'Wrate: ', (write_rate_one, write_rate_five, write_rate_fifteen), fmt='%5.0f/%5.0f/%5.0f')
        self.draw_labelled_item(self.header_window, 0, 78, 'Rlatency: ', (read_latency_one/1000, read_latency_five/1000, read_latency_fifteen/1000), fmt='%5.2f/%5.2f/%5.2f')
        self.draw_labelled_item(self.header_window, 1, 78, 'Wlatency: ', (write_latency_one/1000, write_latency_five/1000, write_latency_fifteen/1000), fmt='%5.2f/%5.2f/%5.2f')
        self.refresh and self.header_window.refresh()
        return

    def size_convert(self, value):
        label = 'B'
        for l in ['KB', 'MB', 'GB', 'TB', 'PB']:
            if value < 1024: break
            value = value/1024.0
            label = l
        return (value, label)

    def draw_cluster_data(self):
        (RESTY, RESTX) = self.data_window.getmaxyx()
        self.data_window.clear()
        self.data_window.standout()
        self.draw_labelled_item(self.data_window, 0, 1, DC, '', hilight=(self.sort_order == 0))
        self.draw_labelled_item(self.data_window, 0, 5, 'Nodes', '')
        self.draw_labelled_item(self.data_window, 0, 11, 'Racks', '')
        self.draw_labelled_item(self.data_window, 0, 20, 'Load', '', hilight=(self.sort_order == 1))
        self.draw_labelled_item(self.data_window, 0, 28, 'Comps', '', hilight=(self.sort_order == 2))
        self.draw_labelled_item(self.data_window, 0, 35, 'Rlat', '', hilight=(self.sort_order == 3))
        self.draw_labelled_item(self.data_window, 0, 40, 'Rrate', '', hilight=(self.sort_order == 4))
        self.draw_labelled_item(self.data_window, 0, 47, 'Wlat', '', hilight=(self.sort_order == 5))
        self.draw_labelled_item(self.data_window, 0, 52, 'Wrate', '', hilight=(self.sort_order == 6))

        self.data_window.standend()
        summarized_data = {}
        debug('draw_cluster_data: new summary created')
        for host in self.cluster_data[HOSTNAMES].values():
            dc = host[DC]
            debug('draw_cluster_data: DC is ' + dc)
            if not summarized_data.has_key(dc):
                debug('draw_cluster_data: added DC - ' + dc)
                summarized_data[dc] = host.copy()
                summarized_data[dc][LIVE] = host[LIVE] and 1 or 0
                summarized_data[dc][DEAD] = not host[LIVE] and 1 or 0
                summarized_data[dc][RACK] = {host[RACK]: True}
            else:
                if host[LIVE]: summarized_data[dc][LIVE] += 1
                else: summarized_data[dc][DEAD] += 1
                summarized_data[dc][RACK][host[RACK]] = True
                for key in [LOAD, SEVERITY,
                            READ_LATENCY_INSTANTANEOUS,
                            READ_RATE_ONE_MINUTE,
                            WRITE_LATENCY_INSTANTANEOUS,
                            WRITE_RATE_ONE_MINUTE]:
                    try: summarized_data[dc][key] = host[key] + summarized_data[dc][key]
                    except Exception as e: debug('draw_cluster_data - exception when summarizing: ' + str(e))
        # Do sorting here
        y = 0
        sort_key = [DC, LOAD, SEVERITY, READ_LATENCY_INSTANTANEOUS,
                    READ_RATE_ONE_MINUTE, WRITE_LATENCY_INSTANTANEOUS, WRITE_RATE_ONE_MINUTE][self.sort_order]
        for row in sorted(summarized_data.values(), cmp=lambda x, y: cmp(x[sort_key], y[sort_key]), reverse = (self.sort_order != 0)):
            y += 1
            self.draw_labelled_item(self.data_window, y, 0, '', row[DC])
            self.draw_labelled_item(self.data_window, y, 5, '', row[LIVE] + row[DEAD], fmt='%2d/', length=3)
            self.draw_labelled_item(self.data_window, y, 8, '', row[DEAD], length=2, critical=row[LIVE]/3, warning=0)
            self.draw_labelled_item(self.data_window, y, 12, '', len(row[RACK]), fmt='%2d', length=2)
            self.draw_labelled_item(self.data_window, y, 16, '', '%7.2f %s' % self.size_convert(row[LOAD]), length=10)
            self.draw_labelled_item(self.data_window, y, 27, '', row[SEVERITY], fmt='%6.2f', length=6)
            self.draw_labelled_item(self.data_window, y, 34, '', row[READ_LATENCY_INSTANTANEOUS], fmt='%5.0f', length=6)
            self.draw_labelled_item(self.data_window, y, 40, '', row[READ_RATE_ONE_MINUTE], fmt='%5.0f', length=6)
            self.draw_labelled_item(self.data_window, y, 46, '', row[WRITE_LATENCY_INSTANTANEOUS], fmt='%5.0f', length=6)
            self.draw_labelled_item(self.data_window, y, 52, '', row[WRITE_RATE_ONE_MINUTE], fmt='%5.0f', length=6)

        self.refresh and self.data_window.refresh()
        return

    def draw_data_dict_item(self, y, x, datadict, key, length=0, fmt = None):
        if not fmt:
            if length: fmt = '{0:%d}' % length
            else: fmt = '{0}'
        try:
            value = fmt.format(datadict[key])
        except Exception as e:
            self.data_window.addstr(y, x, 'NODATA', self.bad)
            debug('draw_data_dict_item:' + key + ' ' + str(e))
            return
        if length:
            self.data_window.addnstr(y, x, value, length)
        else: self.data_window.addstr(y, x, value)
        return

    def sorted_host_key_order(self):
        if self.draw_data == self.draw_host_data:
            if self.sort_order == 1: # Sort by DC/host
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][x][DC]+x, self.cluster_data[HOSTNAMES][y][DC]+y))

            if self.sort_order == 2: # Sort by RACK/host
                return sorted(sorted(self.cluster_data[HOSTNAMES].keys()), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][x][RACK], self.cluster_data[HOSTNAMES][y][RACK]))

            if self.sort_order == 3: # Sort by LOAD
                return sorted(sorted(self.cluster_data[HOSTNAMES].keys()), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][LOAD][VALUE], self.cluster_data[HOSTNAMES][x][LOAD][VALUE]))

            if self.sort_order == 4: # Sort by SEVERITY (compactions)
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][SEVERITY][VALUE], self.cluster_data[HOSTNAMES][x][SEVERITY][VALUE]))

            if self.sort_order == 5:
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][READ_LATENCY_INSTANTANEOUS][VALUE], self.cluster_data[HOSTNAMES][x][READ_LATENCY_INSTANTANEOUS][VALUE]))

            if self.sort_order == 6:
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][READ_RATE_ONE_MINUTE][VALUE], self.cluster_data[HOSTNAMES][x][READ_RATE_ONE_MINUTE][VALUE]))

            if self.sort_order == 7:
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][WRITE_LATENCY_INSTANTANEOUS][VALUE], self.cluster_data[HOSTNAMES][x][WRITE_LATENCY_INSTANTANEOUS][VALUE]))

            if self.sort_order == 8:
                return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][y][WRITE_RATE_ONE_MINUTE][VALUE], self.cluster_data[HOSTNAMES][x][WRITE_RATE_ONE_MINUTE][VALUE]))

            # sort_order == 0 is the default, so we'll leave it as a fall-through
            return sorted(self.cluster_data[HOSTNAMES].keys())

        if self.draw_data == self.draw_cluster_item:
            # sorts by DC by Host regardless
            return sorted(self.cluster_data[HOSTNAMES].keys(), cmp=lambda x, y: cmp(self.cluster_data[HOSTNAMES][x][DC]+x, self.cluster_data[HOSTNAMES][y][DC]+y))
        pass


    def draw_host_data(self):
        (RESTY, RESTX) = self.data_window.getmaxyx()
        self.data_window.clear()
        self.draw_labelled_item(self.data_window, 0, 0, HOSTNAME, '', hilight=(self.sort_order == 0))
        self.draw_labelled_item(self.data_window, 0, 10, DC, '', hilight=(self.sort_order == 1))
        self.draw_labelled_item(self.data_window, 0, 16, 'Rack', '', hilight=(self.sort_order == 2))
        self.draw_labelled_item(self.data_window, 0, 22, 'Status', '')
        self.draw_labelled_item(self.data_window, 0, 31, 'Load', '', hilight=(self.sort_order == 3))
        self.draw_labelled_item(self.data_window, 0, 40, 'Comps', '', hilight=(self.sort_order == 4))
        self.draw_labelled_item(self.data_window, 0, 47, 'Rlat', '', hilight=(self.sort_order == 5))
        self.draw_labelled_item(self.data_window, 0, 52, 'Rrate', '', hilight=(self.sort_order == 6))
        self.draw_labelled_item(self.data_window, 0, 59, 'Wlat', '', hilight=(self.sort_order == 7))
        self.draw_labelled_item(self.data_window, 0, 64, 'Wrate', '', hilight=(self.sort_order == 8))

        host_list = self.sorted_host_key_order()
        y = 0
        for host in host_list:
            y += 1
            if not y < RESTY: continue
            data_set = self.cluster_data[HOSTNAMES][host]
            if not data_set[LIVE]: self.data_window.addnstr(y, 0, host.split('.')[0], 10, self.bad)
            else: self.data_window.addnstr(y, 0, host.split('.')[0], 10)
            self.draw_data_dict_item(y, 10, data_set, DC, length=5)
            self.draw_data_dict_item(y, 16, data_set, RACK, length=5)
            data_set[STATUS].draw(self.data_window, y, 22, length=6)
            data_set[LOAD].draw(self.data_window, y, 29, length=9)
            data_set[SEVERITY].draw(self.data_window, y, 40, length=5)
            data_set[READ_LATENCY_INSTANTANEOUS].draw(self.data_window, y, 46, length=5)
            data_set[READ_RATE_ONE_MINUTE].draw(self.data_window, y, 52, length=5)
            data_set[WRITE_LATENCY_INSTANTANEOUS].draw(self.data_window, y, 58, length=5)
            data_set[WRITE_RATE_ONE_MINUTE].draw(self.data_window, y, 64, length=5)
            if data_set.get(STATUS, None) == DEAD: self.draw_data_dict_item(y, 50, data_set, EXTENDED_STATUS)
        self.refresh and self.data_window.refresh()
        return

    def draw_cluster_item(self):
        self.data_window.clear()
        self.draw_labelled_item(self.data_window, 0, 0, HOSTNAME, '')
        self.draw_labelled_item(self.data_window, 0, 30, DC, '')
        self.draw_labelled_item(self.data_window, 0, 62, 'Cluster', '')
        host_list = self.sorted_host_key_order()
        cluster_total = None
        writer = ClusterObject(self.data_window, self.item, self.cluster_data[HOSTNAMES][host_list[0]][DC], 1)

        for host in host_list:
            if not writer.dc == self.cluster_data[HOSTNAMES][host][DC]:
                if cluster_total == None: cluster_total = writer.finish()
                elif getattr(cluster_total, 'count', None): cluster_total = map(sum, zip(cluster_total, writer.finish()))
                else: cluster_total += writer.finish()
                writer = ClusterObject(self.data_window, self.item, self.cluster_data[HOSTNAMES][host][DC], writer.row+1)
            writer.entry(self.cluster_data[HOSTNAMES][host], host.split('.')[0])
        if cluster_total == None: cluster_total = writer.finish()
        elif getattr(cluster_total, 'count', None):
            cluster_total = map(sum, zip(cluster_total, writer.finish()))
        else:
            cluster_total += writer.finish()
        if self.item == LOAD:
            value, label = self.size_convert(cluster_total)
            self.data_window.addstr(1, 60, writer.fmt.format(VALUE=value, LABEL=label))
        else:
            self.data_window.addstr(1, 60, writer.fmt.format(VALUE=cluster_total))

        self.refresh and self.data_window.refresh()
        return


    def draw_status(self):
        (RESTY, RESTX) = self.status_window.getmaxyx()
        status_message = 'Update frequency: %ds (%0.2f)' % (self.refresh_delay, self.last_refresh)
        l = len(status_message)
        t = len(self.title)
        if l+t+5 > RESTX:
            tx = l+1
            tlen = RESTX - l - 4
        else:
            tx = (RESTX - t) - 3
            tlen = t
        self.status_window.clear()
        self.status_window.addstr(status_message)
        try:
            self.status_window.addnstr(0, tx, self.title, tlen, curses.color_pair(3) | curses.A_STANDOUT)
        except:
            raise Exception('tx=%d, RESTX=%d, RESTY=%d' % (tx, RESTX, RESTY))
        self.refresh and self.status_window.refresh()
        return
    def stop_refresh(self):
        return self.redraw_lock.acquire()

    def start_refresh(self):
        try: return self.redraw_lock.release()
        except: return

class ClusterObject(object):
    '''Utility class to make printing a data item for a ring just a
    little neater'''
    def __init__(self, window, item, dc, row):
        self.window = window
        self.item = item
        self.dc = dc
        self.row = self.top_row = row
        self.maxy, self.maxx = self.window.getmaxyx()
        if isinstance(item, basestring): self.total = 0
        else: self.total = [0] * len(item)
        return

    def entry(self, obj, hostname):
        self.fmt = obj[self.item].default_format
        self.total += obj[self.item][VALUE]
        if self.row < self.maxy:
            try:
                self.window.addstr(self.row, 0, hostname)
                obj[self.item].draw(self.window, self.row, 15)
            except: debug(traceback.format_exc())
        self.row += 1
        return

    def finish(self):
        if self.top_row < self.maxy:
            value = self.total
            try:
                self.window.addstr(self.top_row, 30, self.dc)
                self.window.addstr(self.top_row, 35, self.fmt.format(VALUE=value))
            except: debug(traceback.format_exc())
        return self.total

helpstrings = [
    ('', 'Summary information in the first couple lines is for the entire cluster.'),
    ('', ''),
    ('q', 'Exit the program (immediately)'),
    ('c', 'Display cluster summary data'),
    ('h', 'Display host data'),
    ('s', 'Display severity (compaction) data'),
    ('l', 'Display load data'),
    ('r', 'Display read data'),
    ('w', 'Display write data'),
    ('', ''),
    ('+', 'Increase the delay between updates (takes effect after next update)'),
    ('-', 'Decrease the delay between updates (takes effect after next update)'),
    ('', ''),
    ('1-9', 'Column to sort on, OR switch value sets in read/write data'),
    ('<>', 'Previous/next sort column, OR switch value sets in read/write data'),
    ('', ''),
    ('?', 'This help screen'),
    ]


def display_help(topscr):
    (RESTY, RESTX) = topscr.getmaxyx()
    helpscr = topscr.subwin(RESTY-4, RESTX-4, 2, 2)
    (RESTY, RESTX) = helpscr.getmaxyx()
    helpscr.clrtobot()
    helpscr.box()
    y = 1
    for parts in helpstrings:
        y += 1
        helpscr.addnstr(y, 3, '%s: %s' % parts, RESTX-4)
    helpscr.addstr(RESTY-1, 3, 'Press any key to leave help')
    helpscr.refresh()
    key = helpscr.getkey()
    helpscr.erase()
    curses.doupdate()


def display_initial(topscr):
    (RESTY, RESTX) = topscr.getmaxyx()
    message = 'Please wait while I perform the initial data fetch'
    width = len(message) + 4
    helpscr = topscr.subwin(3, width, RESTY/2-1, (RESTX-width)/2)
    helpscr.clrtobot()
    helpscr.box()
    helpscr.addstr(1, 2, message)
    helpscr.refresh()
    helpscr.erase()
    curses.doupdate()


def main(stdscr, hostname):
    display_initial(stdscr)
    (RESTY, RESTX) = stdscr.getmaxyx()
    header_win = stdscr.subwin(5, RESTX, 0, 0)
    data_win = stdscr.subwin(RESTY-6, RESTX, 5, 0)
    status_win = stdscr.subwin(1, RESTX-19, RESTY-1, 0)
    target = Cluster(hostname, header_win, data_win, status_win)
    debug(str(target.cluster_data))
    if not target.cluster_data:
        debug('Unable to contact any seeds')
        raise SystemExit('Unable to contact any seeds')
    target.redraw_semaphore.release()
    cluster = threading.Thread(target = target)
    cluster.daemon = True
    cluster.start()
    stdscr.addstr(RESTY-1, RESTX-19, "Press '?' for help")
    stdscr.refresh()
    read_list = [(READ_RATE_ONE_MINUTE, 'Read Rate (1 minute)'),
                 (READ_RATE_FIVE_MINUTE, 'Read Rate (5 minutes)'),
                 (READ_RATE_FIFTEEN_MINUTE, 'Read Rate (15 minutes)'),
                 (READ_LATENCY_ONE_MINUTE, 'Read Latency (1 minute)'),
                 (READ_LATENCY_FIVE_MINUTE, 'Read Latency (5 minutes)'),
                 (READ_LATENCY_FIFTEEN_MINUTE, 'Read Latency (15 minutes)'),
                 ]
    write_list = [(WRITE_RATE_ONE_MINUTE, 'Write Rate (1 minute)'),
                  (WRITE_RATE_FIVE_MINUTE, 'Write Rate (5 minutes)'),
                  (WRITE_RATE_FIFTEEN_MINUTE, 'Write Rate (15 minutes)'),
                  (WRITE_LATENCY_ONE_MINUTE, 'Write Latency (1 minute)'),
                  (WRITE_LATENCY_FIVE_MINUTE, 'Write Latency (5 minutes)'),
                  (WRITE_LATENCY_FIFTEEN_MINUTE, 'Write Latency (15 minutes)'),
                  ]
    while 1:
        try:
            key = stdscr.getkey()
            if key == 'q': break
            elif key == '+': Cluster.refresh_delay = Cluster.refresh_delay + 1
            elif key == '-':
                if Cluster.refresh_delay > 1: Cluster.refresh_delay = Cluster.refresh_delay - 1
            elif key == 'c':
                target.draw_data = target.draw_cluster_data
                target.title = 'Cluster Summary'
            elif key == 'h':
                target.draw_data = target.draw_host_data
                target.title = 'Hosts Summary'
                target.sort_order = 0
            elif key == 's':
                target.item = SEVERITY
                target.draw_data = target.draw_cluster_item
                target.title = 'Compactions'
            elif key == 'l':
                target.item = LOAD
                target.draw_data = target.draw_cluster_item
                target.title = 'Load'
            elif key == 'r':
                target.sort_order = target.sort_order % len(read_list)
                target.draw_data = target.draw_cluster_item
                target.item, target.title = read_list[target.sort_order]
            elif key == 'w':
                target.sort_order = target.sort_order % len(write_list)
                target.draw_data = target.draw_cluster_item
                target.item, target.title = write_list[target.sort_order]
            elif key in '1234567890':
                value = (int(key) - 1 + 10) % 10
                if target.title in ['Compactions', 'Load']: pass
                elif target.title == 'Cluster Summary':
                    if value < 8 and value > -1: target.sort_order = value
                elif target.title == 'Hosts Summary':
                    if value < 10 and value > -1: target.sort_order = value
                elif target.title in [x[1] for x in read_list]:
                    if value < 6 and value > -1:
                        target.sort_order = value % len(read_list)
                        target.item, target.title = read_list[target.sort_order]
                elif target.title in [x[1] for x in write_list]:
                    if value < 6 and value > -1:
                        target.sort_order = value % len(write_list)
                        target.item, target.title = write_list[target.sort_order]
            elif key in '<>':
                if target.title in ['Compactions', 'Load']: pass
                elif target.title == 'Cluster Summary':
                    if key == '>': target.sort_order = (target.sort_order + 1) % 8
                    else: target.sort_order = (target.sort_order + 7) % 8
                elif target.title == 'Hosts Summary':
                    if key == '>': target.sort_order = (target.sort_order + 1) % 10
                    else: target.sort_order = (target.sort_order + 9) % 10
                elif target.title in [x[1] for x in read_list]:
                    if key == '>': target.sort_order = (target.sort_order + 1) % len(read_list)
                    else: target.sort_order = (target.sort_order + 5) % len(read_list)
                    target.item, target.title = read_list[target.sort_order]
                elif target.title in [x[1] for x in write_list]:
                    if key == '>': target.sort_order = (target.sort_order + 1) % len(write_list)
                    else: target.sort_order = (target.sort_order + 5) % len(write_list)
                    target.item, target.title = write_list[target.sort_order]

            if key == '?':
                target.stop_refresh()
                display_help(stdscr)
                target.start_refresh()
            target.redraw_semaphore.release()
        except KeyboardInterrupt: raise SystemExit
        except: pass
    return target

def one_shot(key, hostname):
    '''Extract the name of a status item, get that item, print it to stdout,
    and exit.

    Parameters:
      key      - Name of a status variable
      hostname - A host name.  Just one, really.

    Return value: does not return

    '''
    if not host_attribute_set.has_key(key):
        logging.fatal('No such status item: %s', key)
        exit(-1)
    function, args = host_attribute_set[key]
    data_item = function(hostname, *args)
    data_item()
    print data_item[VALUE]
    exit()

def tp_stat(key, hostname):
    '''Extract one value from JMX, kind of like nodetool tpstats.
    '''
    data_item = CursedStringDataAttribute(hostname, 'org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=%sStage,name=%sTasks' % key, 'Value')
    data_item()
    print data_item[VALUE]
    exit()

def random_stat(key, hostname, jmx_object):
    '''Extract one attribute value from an arbitrary JMX object.
    '''
    data_item = CursedStringDataAttribute(hostname, jmx_object, key)
    data_item()
    print data_item[VALUE]
    exit()

parser = optparse.OptionParser(description='Top-like program for Cassandra.  '
                               'The seed_host is used as the starting point to discover the cluster.',
                               usage = '%prog [options] seed_host')
parser.add_option('-d', '--debug', dest='debug', default=False, action='store_true')
parser.add_option('-o', '--one-shot', help='Variable name to extract from the server once.  Valid status variables are: ' + ' '.join(host_attribute_set.keys()))
parser.add_option('-t', '--tpstat', nargs=2, help='Variable and status to extract from the server (e.g. --tpstat ReadStage Pending)')

options, args = parser.parse_args()
if not args:
    parser.print_usage()
    exit(-1)
if options.debug: logging.basicConfig(level=logging.DEBUG)
else: logging.basicConfig(level=logging.WARNING)

if options.one_shot: one_shot(options.one_shot, args[0])
if options.tpstat: tp_stat(options.tpstat, args[0])

signal.signal(signal.SIGWINCH, sigwinch_handler)
old_tty = termios.tcgetattr(sys.stdin.fileno())
retdata = object()
try:
    retdata = curses.wrapper(main, args[0])
except KeyboardInterrupt:
    pass
except Exception:
    raise

termios.tcsetattr(sys.stdin.fileno(), termios.TCSANOW, old_tty)
#logging.debug(pprint.pformat(getattr(retdata, 'cluster_data', None)))
logging.debug('%s', 'live nodes')
logging.debug(getattr(retdata, 'hostnames', None))
logging.debug('dead nodes')
logging.debug(pprint.pformat(getattr(retdata, 'dead_nodes', None)))

logging.debug('debuginfo')
logging.debug(pprint.pformat(list(_debuginfo)))
--------------------------------------------------------------------------------
/poison_pill_tester:
--------------------------------------------------------------------------------
#!
/usr/bin/env python 2 | 3 | import json, sys, pprint 4 | 5 | for filename in sys.argv[1:]: 6 | longest = 0 7 | widest = 0 8 | long_item = None 9 | wide_item = None 10 | try: 11 | for row in json.load(open(filename)): 12 | try: 13 | length = len(str(row)) 14 | if length > longest: 15 | longest = length 16 | long_item = row 17 | except: 18 | pass 19 | try: 20 | length = len(row['columns']) 21 | if length > widest: 22 | widest = length 23 | wide_item = row 24 | except: 25 | pass 26 | try: print filename, widest, wide_item['key'], longest, long_item['key'] 27 | except: pass 28 | except: pass 29 | -------------------------------------------------------------------------------- /stop_cassandra_repairs: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | # Author: Brian Gallew or 4 | 5 | if [ -z "${1}" ] ; then 6 | print "Usage: ${0} hostname [hostname ...]" 7 | exit 8 | fi 9 | 10 | while test -n "${1}" ; do 11 | wget -q -O /dev/null "http://${1}:8081/invoke?operation=forceTerminateAllRepairSessions&objectname=org.apache.cassandra.db%3Atype%3DStorageService" 12 | shift 13 | done 14 | --------------------------------------------------------------------------------
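
The `wget` call in `stop_cassandra_repairs` is just MX4J's HTTP bridge invoking a JMX operation: the MBean `ObjectName` is percent-encoded into the query string. A minimal sketch of the same URL construction in Python (the helper name `mx4j_invoke_url` and its defaults are ours, not part of the repo; MX4J is assumed to listen on port 8081 as in the script):

```python
# Sketch only: rebuilds the MX4J "invoke" URL hard-coded in stop_cassandra_repairs.
MBEAN = 'org.apache.cassandra.db:type=StorageService'
OPERATION = 'forceTerminateAllRepairSessions'

def mx4j_invoke_url(host, port=8081, operation=OPERATION, mbean=MBEAN):
    """Return the MX4J HTTP URL that invokes a JMX operation on an MBean."""
    # MX4J expects the ObjectName percent-encoded: ':' -> %3A, '=' -> %3D.
    encoded = mbean.replace(':', '%3A').replace('=', '%3D')
    return 'http://%s:%d/invoke?operation=%s&objectname=%s' % (
        host, port, operation, encoded)
```

For `mx4j_invoke_url('node1')` this reproduces the exact URL the shell script fetches with `wget`; fetching it is what actually terminates the repair sessions on that node.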
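
The wide-row scan in `poison_pill_tester` boils down to tracking two rows per file: the one with the most columns and the one with the longest serialized form. A hypothetical repackaging of that core logic as a reusable function (the name `widest_and_longest` is ours; input rows follow the `sstable2json` shape, a list of dicts with `key` and `columns`):

```python
# Sketch of poison_pill_tester's scan loop as a function; names are our invention.
def widest_and_longest(rows):
    """Return (widest_key, longest_key) for sstable2json-style rows."""
    widest = longest = 0
    wide_key = long_key = None
    for row in rows:
        ncols = len(row.get('columns', []))   # column count -> "wide" row
        if ncols > widest:
            widest, wide_key = ncols, row['key']
        size = len(str(row))                  # serialized size -> "long" row
        if size > longest:
            longest, long_key = size, row['key']
    return wide_key, long_key
```

Note the two winners need not be the same row: a row with one huge column value can be the longest without being the widest, which is why the script reports both.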