├── .gitignore ├── CMakeLists.txt ├── config └── system_monitor.yaml ├── package.xml ├── launch └── system_monitor.launch ├── copyright ├── README.md ├── changelog └── bin ├── ntp_monitor.py ├── mem_monitor.py ├── net_monitor.py ├── hdd_monitor.py └── cpu_monitor.py /.gitignore: -------------------------------------------------------------------------------- 1 | build 2 | -------------------------------------------------------------------------------- /CMakeLists.txt: -------------------------------------------------------------------------------- 1 | cmake_minimum_required(VERSION 2.8.3) 2 | project(system_monitor) 3 | 4 | find_package(catkin) 5 | 6 | catkin_package() 7 | -------------------------------------------------------------------------------- /config/system_monitor.yaml: -------------------------------------------------------------------------------- 1 | cpu_monitor: 2 | check_core_temps: true 3 | cpu_load_warn: 0.9 4 | cpu_load_error: 1.0 5 | cpu_load1_warn: 0.9 6 | cpu_load5_warn: 0.8 7 | cpu_temp_warn: 85.0 8 | cpu_temp_error: 90.0 9 | hdd_monitor: 10 | no_hdd_temp: true 11 | no_hdd_temp_warn: false 12 | hdd_level_warn: 0.95 13 | hdd_level_error: 0.99 14 | hdd_temp_warn: 55.0 15 | hdd_temp_error: 70.0 16 | mem_monitor: 17 | mem_level_warn: 0.95 18 | mem_level_error: 0.99 19 | ntp_monitor: 20 | reference_host: ntp.ubuntu.com 21 | offset_tolerance: 500.0 22 | error_offset_tolerance: 5000000.0 23 | net_monitor: 24 | net_level_warn: 0.95 25 | net_capacity: 128.0 26 | -------------------------------------------------------------------------------- /package.xml: -------------------------------------------------------------------------------- 1 | 2 | system_monitor 3 | 0.1.0 4 | 5 | System monitoring tools for ROS 6 | 7 | Ralf Kaestner 8 | Jerome Maye 9 | Ralf Kaestner 10 | GNU Lesser General Public License (LGPL) 11 | http://github.com/ethz-asl/ros-system-monitor 12 | catkin 13 | rospy 14 | std_msgs 15 | diagnostic_msgs 16 | sysstat 17 | hddtemp 18 | ntpdate 19 | ifstat 20 | 21 | -------------------------------------------------------------------------------- /launch/system_monitor.launch: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 9 | 11 | 13 | 15 | 17 | 18 | 19 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /copyright: -------------------------------------------------------------------------------- 1 | Copyright (C) 2009, Willow Garage, Inc. 2 | Copyright (C) 2013 by Ralf Kaestner 3 | Copyright (C) 2013 by Jerome Maye 4 | 5 | All rights reserved. 6 | 7 | Redistribution and use in source and binary forms, with or without 8 | modification, are permitted provided that the following conditions 9 | are met: 10 | 11 | 1. Redistributions of source code must retain the above copyright 12 | notice, this list of conditions and the following disclaimer. 13 | 14 | 2. Redistributions in binary form must reproduce the above copyright 15 | notice, this list of conditions and the following disclaimer in 16 | the documentation and/or other materials provided with the 17 | distribution. 18 | 19 | 3. The name of the copyright holders may be used to endorse or 20 | promote products derived from this software without specific 21 | prior written permission. 22 | 23 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 24 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 25 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 26 | FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 27 | COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 28 | INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 29 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 30 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 31 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32 | LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 33 | ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 34 | POSSIBILITY OF SUCH DAMAGE. 35 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ros-system-monitor 2 | 3 | ## Synopsis 4 | 5 | System monitoring tools for ROS. 6 | 7 | **Author(s):** Willow Garage, Inc., Jerome Maye, Ralf Kaestner 8 | 9 | **Maintainer:** Ralf Kaestner 10 | 11 | **License:** BSD License (BSD) 12 | 13 | **Operating system(s):** Debian-based Linux 14 | 15 | **Package PPA:** ppa:ethz-asl/ros 16 | 17 | ## Description 18 | 19 | This project provides system monitoring tools for ROS in the form of the 20 | following ROS nodes: 21 | 22 | * CPU monitor 23 | * HDD monitor 24 | * Memory monitor 25 | * Network monitor 26 | * NTP monitor 27 | 28 | Each node publishes ROS diagnostics which can conveniently be visualized 29 | in the runtime monitor. 30 | 31 | ## Installation 32 | 33 | Here, we assume you intend to build/install the project for the ROS 34 | distribution named `ROS_DISTRO`. 35 | 36 | ### Installing from packages (recommended for Ubuntu LTS users) 37 | 38 | The maintainers of this project provide binary packages for the latest Ubuntu 39 | LTS releases and commonly used system architectures. To install these packages, 40 | you may follow these instructions: 41 | 42 | * Add the project PPA to your APT sources by issuing 43 | 44 | ``` 45 | sudo add-apt-repository ppa:ethz-asl/ros 46 | ``` 47 | 48 | on the command line 49 | 50 | * To re-synchronize your package index files, run 51 | 52 | ``` 53 | sudo apt-get update 54 | ``` 55 | 56 | * Install all project packages and their dependencies through 57 | 58 | ``` 59 | sudo apt-get install ros-ROS_DISTRO-system-monitor 60 | ``` 61 | 62 | or using your favorite package management tool 63 | 64 | ### Building from source 65 | 66 | This project may be built using the CMake build system with an open-source 67 | macro extension called ReMake. 68 | 69 | #### Installing build dependencies 70 | 71 | The build dependencies of this project are available from the standard 72 | package repositories of recent Ubuntu and ROS releases. To install them, 73 | simply use the command 74 | 75 | ``` 76 | sudo apt-get install ros-ROS_DISTRO-rospy, ros-ROS_DISTRO-message-generation, ros-ROS_DISTRO-std-msgs, ros-ROS_DISTRO-diagnostic-msgs 77 | 78 | ``` 79 | #### Building with ReMake 80 | 81 | ##### Preparing the build system 82 | 83 | If you already have installed ReMake on your build system, you may 84 | skip this step. Otherwise, before attempting to build this project the 85 | traditional CMake way, you must install ReMake following 86 | [these instructions](https://github.com/kralf/remake). 87 | 88 | ##### Building with CMake 89 | 90 | Once ReMake is available on your build system, you may attempt to build this 91 | project the CMake way. Assuming that you have cloned the project sources into 92 | `PROJECT_DIR`, a typical out-of-source build might look like this: 93 | 94 | * Create a build directory using 95 | 96 | ``` 97 | mkdir -p PROJECT_DIR/build 98 | ``` 99 | 100 | * Switch into the build directoy by 101 | 102 | ``` 103 | cd PROJECT_DIR/build 104 | ``` 105 | 106 | * In the build directory, run 107 | 108 | ``` 109 | cmake -DROS_DISTRIBUTION=ROS_DISTRO PROJECT_DIR 110 | ``` 111 | 112 | to configure the build 113 | 114 | * If you want to inspect or modify the build configuration, issue 115 | 116 | ``` 117 | ccmake PROJECT_DIR 118 | ``` 119 | 120 | * Build the project using 121 | 122 | ``` 123 | make 124 | ``` 125 | 126 | * If you intend to install the project, call 127 | 128 | ``` 129 | make packages_install 130 | ``` 131 | 132 | (from packages on Debian-based Linux only) or 133 | 134 | ``` 135 | make install 136 | ``` 137 | 138 | ## API documentation 139 | 140 | This project does not yet provide any API documentation. 141 | 142 | ## Feature requests and bug reports 143 | 144 | If you would like to propose a feature for this project, please consider 145 | contributing or send a feature request to the project authors. Bugs may be 146 | reported through the project's issue page. 147 | 148 | ## Further reading 149 | 150 | For additional information of the Robot Operating System (ROS), please refer 151 | to the official [ROS documentation](http://wiki.ros.org). 152 | -------------------------------------------------------------------------------- /changelog: -------------------------------------------------------------------------------- 1 | ros-system-monitor (0.1-22) unstable; urgency=low 2 | 3 | * Updated changelog 4 | 5 | -- Ralf Kaestner Tue, 21 Jul 2015 10:34:00 +0200 6 | 7 | ros-system-monitor (0.1-21) unstable; urgency=low 8 | 9 | * Fixed changelog 10 | 11 | -- Ralf Kaestner Tue, 21 Jul 2015 10:32:05 +0200 12 | 13 | ros-system-monitor (0.1-20) unstable; urgency=low 14 | 15 | * Fixed bug in net_monitor's check_network() which was caused by an invalid 16 | return value after a failure to call ifstat 17 | 18 | -- Ralf Kaestner Tue, 21 Jul 2015 10:29:56 +0200 19 | 20 | ros-system-monitor (0.1-19) unstable; urgency=low 21 | 22 | * Removed debugging output 23 | 24 | -- Ralf Kaestner Wed, 17 Jun 2015 09:09:57 +0200 25 | 26 | ros-system-monitor (0.1-18) unstable; urgency=low 27 | 28 | * Fixed launch namespaces 29 | 30 | -- Ralf Kaestner Wed, 17 Jun 2015 09:08:33 +0200 31 | 32 | ros-system-monitor (0.1-17) unstable; urgency=low 33 | 34 | * Added config file argument to launch file 35 | 36 | -- Ralf Kaestner Wed, 17 Jun 2015 08:47:41 +0200 37 | 38 | ros-system-monitor (0.1-16) unstable; urgency=low 39 | 40 | * Logging hddtemp errors to info 41 | 42 | -- Ralf Kaestner Wed, 17 Jun 2015 08:43:50 +0200 43 | 44 | ros-system-monitor (0.1-15) unstable; urgency=low 45 | 46 | * Added output argument to launch file 47 | 48 | -- Ralf Kaestner Wed, 17 Jun 2015 08:36:26 +0200 49 | 50 | ros-system-monitor (0.1-14) unstable; urgency=low 51 | 52 | * Catkinization 53 | 54 | -- Ralf Kaestner Wed, 17 Jun 2015 07:46:27 +0200 55 | 56 | ros-system-monitor (0.1-13) unstable; urgency=low 57 | 58 | * Fixed ROS package dependencies 59 | 60 | -- Ralf Kaestner Wed, 18 Feb 2015 12:34:10 +0100 61 | 62 | ros-system-monitor (0.1-12) unstable; urgency=low 63 | 64 | * Changed PPA 65 | 66 | -- Ralf Kaestner Wed, 18 Feb 2015 08:05:03 +0100 67 | 68 | ros-system-monitor (0.1-11) unstable; urgency=low 69 | 70 | * Updated README.md 71 | 72 | -- Ralf Kaestner Wed, 18 Feb 2015 08:03:35 +0100 73 | 74 | ros-system-monitor (0.1-10) unstable; urgency=low 75 | 76 | * Fixed project information 77 | * Removed system-wide ROS configuration file 78 | * Adapted configuration file path in launch files 79 | * removing wlan from status 80 | 81 | -- Ralf Kaestner Wed, 18 Feb 2015 07:37:06 +0100 82 | 83 | ros-system-monitor (0.1-9) unstable; urgency=low 84 | 85 | * removing wlan from status 86 | 87 | -- Jerome Maye Tue, 14 Oct 2014 14:28:23 +0200 88 | 89 | ros-system-monitor (0.1-8) unstable; urgency=low 90 | 91 | * adding new computers 92 | 93 | -- Jerome Maye Tue, 14 Oct 2014 13:23:07 +0200 94 | 95 | ros-system-monitor (0.1-7) unstable; urgency=low 96 | 97 | * modifying configuration for ntp. 98 | 99 | -- Jerome Maye Thu, 18 Sep 2014 11:25:29 +0200 100 | 101 | ros-system-monitor (0.1-6) unstable; urgency=low 102 | 103 | * porting to PPA. 104 | 105 | -- Jerome Maye Tue, 16 Sep 2014 19:38:14 +0200 106 | 107 | ros-system-monitor (0.1-5) unstable; urgency=low 108 | 109 | * adding teamlio 110 | 111 | -- Jerome Maye Tue, 16 Sep 2014 12:29:14 +0200 112 | 113 | ros-system-monitor (0.1-4) unstable; urgency=low 114 | 115 | * adding teamlio laptop configuration. 116 | 117 | -- Jerome Maye Tue, 16 Sep 2014 12:21:55 +0200 118 | 119 | ros-system-monitor (0.1-3) unstable; urgency=low 120 | 121 | * hack for hdd monitor because of lightdm permissions. 122 | 123 | -- Jerome Maye Mon, 15 Sep 2014 21:06:35 +0200 124 | 125 | ros-system-monitor (0.1-2) unstable; urgency=low 126 | 127 | * Bugs corrections. 128 | 129 | -- Jerome Maye Fri, 12 Sep 2014 13:33:23 +0200 130 | 131 | ros-system-monitor (0.1-1) unstable; urgency=low 132 | 133 | * First draft. 134 | 135 | -- Jerome Maye Fri, 12 Sep 2014 12:55:02 +0200 136 | -------------------------------------------------------------------------------- /bin/ntp_monitor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ############################################################################ 3 | # Copyright (C) 2009, Willow Garage, Inc. # 4 | # Copyright (C) 2013 by Jerome Maye # 5 | # jerome.maye@mavt.ethz.ch # 6 | # # 7 | # All rights reserved. # 8 | # # 9 | # Redistribution and use in source and binary forms, with or without # 10 | # modification, are permitted provided that the following conditions # 11 | # are met: # 12 | # # 13 | # 1. Redistributions of source code must retain the above copyright # 14 | # notice, this list of conditions and the following disclaimer. # 15 | # # 16 | # 2. Redistributions in binary form must reproduce the above copyright # 17 | # notice, this list of conditions and the following disclaimer in # 18 | # the documentation and/or other materials provided with the # 19 | # distribution. # 20 | # # 21 | # 3. The name of the copyright holders may be used to endorse or # 22 | # promote products derived from this software without specific # 23 | # prior written permission. # 24 | # # 25 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS # 26 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT # 27 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS # 28 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE # 29 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, # 30 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, # 31 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; # 32 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER # 33 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # 34 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN # 35 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE # 36 | # POSSIBILITY OF SUCH DAMAGE. # 37 | ############################################################################ 38 | 39 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue 40 | 41 | import sys 42 | import rospy 43 | import socket 44 | from subprocess import Popen, PIPE 45 | 46 | import time 47 | 48 | import re 49 | 50 | NAME = 'ntp_monitor' 51 | 52 | def ntp_monitor(offset=500, self_offset=500, diag_hostname = None, error_offset = 5000000): 53 | pub = rospy.Publisher("/diagnostics", DiagnosticArray, queue_size = 100) 54 | rospy.init_node(NAME, anonymous=True) 55 | 56 | hostname = socket.gethostname() 57 | if diag_hostname is None: 58 | diag_hostname = hostname 59 | 60 | ntp_hostname = rospy.get_param('~reference_host', 'ntp.ubuntu.com') 61 | offset = rospy.get_param('~offset_tolerance', 500) 62 | error_offset = rospy.get_param('~error_offset_tolerance', 5000000) 63 | 64 | stat = DiagnosticStatus() 65 | stat.level = 0 66 | stat.name = "NTP offset from "+ diag_hostname + " to " + ntp_hostname 67 | stat.message = "OK" 68 | stat.hardware_id = hostname 69 | stat.values = [] 70 | 71 | # self_stat = DiagnosticStatus() 72 | # self_stat.level = DiagnosticStatus.OK 73 | # self_stat.name = "NTP self-offset for "+ diag_hostname 74 | # self_stat.message = "OK" 75 | # self_stat.hardware_id = hostname 76 | # self_stat.values = [] 77 | 78 | while not rospy.is_shutdown(): 79 | for st,host,off in [(stat,ntp_hostname,offset)]: 80 | try: 81 | p = Popen(["ntpdate", "-q", host], stdout=PIPE, stdin=PIPE, stderr=PIPE) 82 | res = p.wait() 83 | (o,e) = p.communicate() 84 | except OSError, (errno, msg): 85 | if errno == 4: 86 | break #ctrl-c interrupt 87 | else: 88 | raise 89 | if (res == 0): 90 | measured_offset = float(re.search("offset (.*),", o).group(1))*1000000 91 | st.level = DiagnosticStatus.OK 92 | st.message = "OK" 93 | st.values = [ KeyValue("Offset (us)", str(measured_offset)), 94 | KeyValue("Offset tolerance (us)", str(off)), 95 | KeyValue("Offset tolerance (us) for Error", str(error_offset)) ] 96 | 97 | if (abs(measured_offset) > off): 98 | st.level = DiagnosticStatus.WARN 99 | st.message = "NTP Offset Too High" 100 | if (abs(measured_offset) > error_offset): 101 | st.level = DiagnosticStatus.ERROR 102 | st.message = "NTP Offset Too High" 103 | 104 | else: 105 | st.level = DiagnosticStatus.ERROR 106 | st.message = "Error Running ntpdate. Returned %d" % res 107 | st.values = [ KeyValue("Offset (us)", "N/A"), 108 | KeyValue("Offset tolerance (us)", str(off)), 109 | KeyValue("Offset tolerance (us) for Error", str(error_offset)), 110 | KeyValue("Output", o), 111 | KeyValue("Errors", e) ] 112 | 113 | 114 | msg = DiagnosticArray() 115 | msg.header.stamp = rospy.get_rostime() 116 | msg.status = [stat] 117 | pub.publish(msg) 118 | time.sleep(1) 119 | 120 | def ntp_monitor_main(argv=sys.argv): 121 | import optparse 122 | parser = optparse.OptionParser(usage="usage: ntp_monitor ntp-hostname []") 123 | parser.add_option("--offset-tolerance", dest="offset_tol", 124 | action="store", default=500, 125 | help="Offset from NTP host", metavar="OFFSET-TOL") 126 | parser.add_option("--error-offset-tolerance", dest="error_offset_tol", 127 | action="store", default=5000000, 128 | help="Offset from NTP host. Above this is error", metavar="OFFSET-TOL") 129 | parser.add_option("--self_offset-tolerance", dest="self_offset_tol", 130 | action="store", default=500, 131 | help="Offset from self", metavar="SELF_OFFSET-TOL") 132 | parser.add_option("--diag-hostname", dest="diag_hostname", 133 | help="Computer name in diagnostics output (ex: 'c1')", 134 | metavar="DIAG_HOSTNAME", 135 | action="store", default=None) 136 | options, args = parser.parse_args(rospy.myargv()) 137 | 138 | # if (len(args) != 2): 139 | # parser.error("Invalid arguments. Must have HOSTNAME [args]. %s" % args) 140 | 141 | 142 | try: 143 | offset = int(options.offset_tol) 144 | self_offset = int(options.self_offset_tol) 145 | error_offset = int(options.error_offset_tol) 146 | except: 147 | parser.error("Offsets must be numbers") 148 | 149 | ntp_monitor(offset, self_offset, options.diag_hostname, error_offset) 150 | 151 | 152 | if __name__ == "__main__": 153 | try: 154 | ntp_monitor_main(rospy.myargv()) 155 | except KeyboardInterrupt: pass 156 | except SystemExit: pass 157 | except: 158 | import traceback 159 | traceback.print_exc() 160 | -------------------------------------------------------------------------------- /bin/mem_monitor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ############################################################################ 3 | # Copyright (C) 2009, Willow Garage, Inc. # 4 | # Copyright (C) 2013 by Ralf Kaestner # 5 | # ralf.kaestner@gmail.com # 6 | # Copyright (C) 2013 by Jerome Maye # 7 | # jerome.maye@mavt.ethz.ch # 8 | # # 9 | # All rights reserved. # 10 | # # 11 | # Redistribution and use in source and binary forms, with or without # 12 | # modification, are permitted provided that the following conditions # 13 | # are met: # 14 | # # 15 | # 1. Redistributions of source code must retain the above copyright # 16 | # notice, this list of conditions and the following disclaimer. # 17 | # # 18 | # 2. Redistributions in binary form must reproduce the above copyright # 19 | # notice, this list of conditions and the following disclaimer in # 20 | # the documentation and/or other materials provided with the # 21 | # distribution. # 22 | # # 23 | # 3. The name of the copyright holders may be used to endorse or # 24 | # promote products derived from this software without specific # 25 | # prior written permission. # 26 | # # 27 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS # 28 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT # 29 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS # 30 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE # 31 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, # 32 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, # 33 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; # 34 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER # 35 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # 36 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN # 37 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE # 38 | # POSSIBILITY OF SUCH DAMAGE. # 39 | ############################################################################ 40 | 41 | from __future__ import with_statement 42 | 43 | import rospy 44 | 45 | import traceback 46 | import threading 47 | from threading import Timer 48 | import sys, os, time 49 | from time import sleep 50 | import subprocess 51 | import string 52 | import socket 53 | 54 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue 55 | 56 | mem_level_warn = 0.95 57 | mem_level_error = 0.99 58 | 59 | stat_dict = { 0: 'OK', 1: 'Warning', 2: 'Error' } 60 | 61 | def update_status_stale(stat, last_update_time): 62 | time_since_update = rospy.get_time() - last_update_time 63 | 64 | stale_status = 'OK' 65 | if time_since_update > 20 and time_since_update <= 35: 66 | stale_status = 'Lagging' 67 | if stat.level == DiagnosticStatus.OK: 68 | stat.message = stale_status 69 | elif stat.message.find(stale_status) < 0: 70 | stat.message = ', '.join([stat.message, stale_status]) 71 | stat.level = max(stat.level, DiagnosticStatus.WARN) 72 | if time_since_update > 35: 73 | stale_status = 'Stale' 74 | if stat.level == DiagnosticStatus.OK: 75 | stat.message = stale_status 76 | elif stat.message.find(stale_status) < 0: 77 | stat.message = ', '.join([stat.message, stale_status]) 78 | stat.level = max(stat.level, DiagnosticStatus.ERROR) 79 | 80 | 81 | stat.values.pop(0) 82 | stat.values.pop(0) 83 | stat.values.insert(0, KeyValue(key = 'Update Status', value = stale_status)) 84 | stat.values.insert(1, KeyValue(key = 'Time Since Update', value = str(time_since_update))) 85 | 86 | 87 | class MemMonitor(): 88 | def __init__(self, hostname, diag_hostname): 89 | self._diag_pub = rospy.Publisher('/diagnostics', DiagnosticArray, queue_size = 100) 90 | 91 | self._mutex = threading.Lock() 92 | 93 | self._mem_level_warn = rospy.get_param('~mem_level_warn', mem_level_warn) 94 | self._mem_level_error = rospy.get_param('~mem_level_error', mem_level_error) 95 | 96 | self._usage_timer = None 97 | 98 | self._usage_stat = DiagnosticStatus() 99 | self._usage_stat.name = 'Memory Usage (%s)' % diag_hostname 100 | self._usage_stat.level = 1 101 | self._usage_stat.hardware_id = hostname 102 | self._usage_stat.message = 'No Data' 103 | self._usage_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data' ), 104 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ] 105 | 106 | self._last_usage_time = 0 107 | self._last_publish_time = 0 108 | 109 | # Start checking everything 110 | self.check_usage() 111 | 112 | ## Must have the lock to cancel everything 113 | def cancel_timers(self): 114 | if self._usage_timer: 115 | self._usage_timer.cancel() 116 | 117 | def check_memory(self): 118 | values = [] 119 | level = DiagnosticStatus.OK 120 | msg = '' 121 | 122 | mem_dict = { 0: 'OK', 1: 'Low Memory', 2: 'Very Low Memory' } 123 | 124 | try: 125 | p = subprocess.Popen('free -tm', 126 | stdout = subprocess.PIPE, 127 | stderr = subprocess.PIPE, shell = True) 128 | stdout, stderr = p.communicate() 129 | retcode = p.returncode 130 | 131 | if retcode != 0: 132 | values.append(KeyValue(key = "\"free -tm\" Call Error", value = str(retcode))) 133 | return DiagnosticStatus.ERROR, values 134 | 135 | rows = stdout.split('\n') 136 | data = rows[1].split() 137 | total_mem_physical = data[1] 138 | used_mem_physical = data[2] 139 | free_mem_physical = data[3] 140 | data = rows[2].split() 141 | total_mem_swap = data[1] 142 | used_mem_swap = data[2] 143 | free_mem_swap = data[3] 144 | data = rows[3].split() 145 | total_mem = data[1] 146 | used_mem = data[2] 147 | free_mem = data[3] 148 | 149 | level = DiagnosticStatus.OK 150 | mem_usage = float(used_mem_physical)/float(total_mem_physical) 151 | if (mem_usage < self._mem_level_warn): 152 | level = DiagnosticStatus.OK 153 | elif (mem_usage < self._mem_level_error): 154 | level = DiagnosticStatus.WARN 155 | else: 156 | level = DiagnosticStatus.ERROR 157 | 158 | values.append(KeyValue(key = 'Memory Status', value = mem_dict[level])) 159 | values.append(KeyValue(key = 'Total Memory (Physical)', value = total_mem_physical+"M")) 160 | values.append(KeyValue(key = 'Used Memory (Physical)', value = used_mem_physical+"M")) 161 | values.append(KeyValue(key = 'Free Memory (Physical)', value = free_mem_physical+"M")) 162 | values.append(KeyValue(key = 'Total Memory (Swap)', value = total_mem_swap+"M")) 163 | values.append(KeyValue(key = 'Used Memory (Swap)', value = used_mem_swap+"M")) 164 | values.append(KeyValue(key = 'Free Memory (Swap)', value = free_mem_swap+"M")) 165 | values.append(KeyValue(key = 'Total Memory', value = total_mem+"M")) 166 | values.append(KeyValue(key = 'Used Memory', value = used_mem+"M")) 167 | values.append(KeyValue(key = 'Free Memory', value = free_mem+"M")) 168 | 169 | msg = mem_dict[level] 170 | except Exception, e: 171 | rospy.logerr(traceback.format_exc()) 172 | msg = 'Memory Usage Check Error' 173 | values.append(KeyValue(key = msg, value = str(e))) 174 | level = DiagnosticStatus.ERROR 175 | 176 | return level, mem_dict[level], values 177 | 178 | def check_usage(self): 179 | if rospy.is_shutdown(): 180 | with self._mutex: 181 | self.cancel_timers() 182 | return 183 | 184 | diag_level = 0 185 | diag_vals = [ KeyValue(key = 'Update Status', value = 'OK' ), 186 | KeyValue(key = 'Time Since Last Update', value = 0 )] 187 | diag_msgs = [] 188 | 189 | # Check memory 190 | mem_level, mem_msg, mem_vals = self.check_memory() 191 | diag_vals.extend(mem_vals) 192 | if mem_level > 0: 193 | diag_msgs.append(mem_msg) 194 | diag_level = max(diag_level, mem_level) 195 | 196 | if diag_msgs and diag_level > 0: 197 | usage_msg = ', '.join(set(diag_msgs)) 198 | else: 199 | usage_msg = stat_dict[diag_level] 200 | 201 | # Update status 202 | with self._mutex: 203 | self._last_usage_time = rospy.get_time() 204 | self._usage_stat.level = diag_level 205 | self._usage_stat.values = diag_vals 206 | 207 | self._usage_stat.message = usage_msg 208 | 209 | if not rospy.is_shutdown(): 210 | self._usage_timer = threading.Timer(5.0, self.check_usage) 211 | self._usage_timer.start() 212 | else: 213 | self.cancel_timers() 214 | 215 | def publish_stats(self): 216 | with self._mutex: 217 | # Update everything with last update times 218 | update_status_stale(self._usage_stat, self._last_usage_time) 219 | 220 | msg = DiagnosticArray() 221 | msg.header.stamp = rospy.get_rostime() 222 | msg.status.append(self._usage_stat) 223 | 224 | if rospy.get_time() - self._last_publish_time > 0.5: 225 | self._diag_pub.publish(msg) 226 | self._last_publish_time = rospy.get_time() 227 | 228 | 229 | if __name__ == '__main__': 230 | hostname = socket.gethostname() 231 | hostname = hostname.replace('-', '_') 232 | 233 | import optparse 234 | parser = optparse.OptionParser(usage="usage: mem_monitor.py [--diag-hostname=cX]") 235 | parser.add_option("--diag-hostname", dest="diag_hostname", 236 | help="Computer name in diagnostics output (ex: 'c1')", 237 | metavar="DIAG_HOSTNAME", 238 | action="store", default = hostname) 239 | options, args = parser.parse_args(rospy.myargv()) 240 | 241 | try: 242 | rospy.init_node('mem_monitor_%s' % hostname) 243 | except rospy.exceptions.ROSInitException: 244 | print >> sys.stderr, 'Memory monitor is unable to initialize node. Master may not be running.' 245 | sys.exit(0) 246 | 247 | mem_node = MemMonitor(hostname, options.diag_hostname) 248 | 249 | rate = rospy.Rate(1.0) 250 | try: 251 | while not rospy.is_shutdown(): 252 | rate.sleep() 253 | mem_node.publish_stats() 254 | except KeyboardInterrupt: 255 | pass 256 | except Exception, e: 257 | traceback.print_exc() 258 | rospy.logerr(traceback.format_exc()) 259 | 260 | mem_node.cancel_timers() 261 | sys.exit(0) 262 | -------------------------------------------------------------------------------- /bin/net_monitor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ############################################################################ 3 | # Copyright (C) 2009, Willow Garage, Inc. # 4 | # Copyright (C) 2013 by Ralf Kaestner # 5 | # ralf.kaestner@gmail.com # 6 | # Copyright (C) 2013 by Jerome Maye # 7 | # jerome.maye@mavt.ethz.ch # 8 | # # 9 | # All rights reserved. # 10 | # # 11 | # Redistribution and use in source and binary forms, with or without # 12 | # modification, are permitted provided that the following conditions # 13 | # are met: # 14 | # # 15 | # 1. Redistributions of source code must retain the above copyright # 16 | # notice, this list of conditions and the following disclaimer. # 17 | # # 18 | # 2. Redistributions in binary form must reproduce the above copyright # 19 | # notice, this list of conditions and the following disclaimer in # 20 | # the documentation and/or other materials provided with the # 21 | # distribution. # 22 | # # 23 | # 3. The name of the copyright holders may be used to endorse or # 24 | # promote products derived from this software without specific # 25 | # prior written permission. # 26 | # # 27 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS # 28 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT # 29 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS # 30 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE # 31 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, # 32 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, # 33 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; # 34 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER # 35 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # 36 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN # 37 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE # 38 | # POSSIBILITY OF SUCH DAMAGE. # 39 | ############################################################################ 40 | 41 | from __future__ import with_statement 42 | 43 | import rospy 44 | 45 | import traceback 46 | import threading 47 | from threading import Timer 48 | import sys, os, time 49 | from time import sleep 50 | import subprocess 51 | import string 52 | import re 53 | 54 | import socket 55 | 56 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue 57 | 58 | net_level_warn = 0.95 59 | net_capacity = 128 60 | 61 | stat_dict = {0: 'OK', 1: 'Warning', 2: 'Error'} 62 | 63 | def update_status_stale(stat, last_update_time): 64 | time_since_update = rospy.get_time() - last_update_time 65 | stale_status = 'OK' 66 | if time_since_update > 20 and time_since_update <= 35: 67 | stale_status = 'Lagging' 68 | if stat.level == DiagnosticStatus.OK: 69 | stat.message = stale_status 70 | elif stat.message.find(stale_status) < 0: 71 | stat.message = ', '.join([stat.message, stale_status]) 72 | stat.level = max(stat.level, DiagnosticStatus.WARN) 73 | if time_since_update > 35: 74 | stale_status = 'Stale' 75 | if stat.level == DiagnosticStatus.OK: 76 | stat.message = stale_status 77 | elif stat.message.find(stale_status) < 0: 78 | stat.message = ', '.join([stat.message, stale_status]) 79 | stat.level = max(stat.level, DiagnosticStatus.ERROR) 80 | stat.values.pop(0) 81 | stat.values.pop(0) 82 | stat.values.insert(0, KeyValue(key = 'Update Status', value = stale_status)) 83 | stat.values.insert(1, KeyValue(key = 'Time Since Update', 84 | value = str(time_since_update))) 85 | 86 | def get_sys_net_stat(iface, sys): 87 | cmd = 'cat /sys/class/net/%s/statistics/%s' %(iface, sys) 88 | p = subprocess.Popen(cmd, 89 | stdout = subprocess.PIPE, 90 | stderr = subprocess.PIPE, shell = True) 91 | stdout, stderr = p.communicate() 92 | return (p.returncode, stdout.strip()) 93 | 94 | def get_sys_net(iface, sys): 95 | cmd = 'cat /sys/class/net/%s/%s' %(iface, sys) 96 | p = subprocess.Popen(cmd, 97 | stdout = subprocess.PIPE, 98 | stderr = subprocess.PIPE, shell = True) 99 | stdout, stderr = p.communicate() 100 | return (p.returncode, stdout.strip()) 101 | 102 | class NetMonitor(): 103 | def __init__(self, hostname, diag_hostname): 104 | self._diag_pub = rospy.Publisher('/diagnostics', DiagnosticArray, queue_size = 100) 105 | self._mutex = threading.Lock() 106 | self._net_level_warn = rospy.get_param('~net_level_warn', net_level_warn) 107 | self._net_capacity = rospy.get_param('~net_capacity', net_capacity) 108 | self._usage_timer = None 109 | self._usage_stat = DiagnosticStatus() 110 | self._usage_stat.name = 'Network Usage (%s)' % diag_hostname 111 | self._usage_stat.level = 1 112 | self._usage_stat.hardware_id = hostname 113 | self._usage_stat.message = 'No Data' 114 | self._usage_stat.values = [KeyValue(key = 'Update Status', 115 | value = 'No Data' ), 116 | KeyValue(key = 'Time Since Last Update', 117 | value = 'N/A') ] 118 | self._last_usage_time = 0 119 | self._last_publish_time = 0 120 | self.check_usage() 121 | 122 | def cancel_timers(self): 123 | if self._usage_timer: 124 | self._usage_timer.cancel() 125 | 126 | def check_network(self): 127 | values = [] 128 | net_dict = {0: 'OK', 1: 'High Network Usage', 2: 'Network Down', 3: 'Call Error'} 129 | try: 130 | p = subprocess.Popen('ifstat -q -S 1 1', 131 | stdout = subprocess.PIPE, 132 | stderr = subprocess.PIPE, shell = True) 133 | stdout, stderr = p.communicate() 134 | retcode = p.returncode 135 | if retcode != 0: 136 | values.append(KeyValue(key = "\"ifstat -q -S 1 1\" Call Error", 137 | value = str(retcode))) 138 | return DiagnosticStatus.ERROR, net_dict[3], values 139 | rows = stdout.split('\n') 140 | data = rows[0].split() 141 | ifaces = [] 142 | for i in range(0, len(data)): 143 | ifaces.append(data[i]) 144 | data = rows[2].split() 145 | kb_in = [] 146 | kb_out = [] 147 | for i in range(0, len(data), 2): 148 | kb_in.append(data[i]) 149 | kb_out.append(data[i + 1]) 150 | level = DiagnosticStatus.OK 151 | for i in range(0, len(ifaces)): 152 | values.append(KeyValue(key = 'Interface Name', 153 | value = ifaces[i])) 154 | (retcode, cmd_out) = get_sys_net(ifaces[i], 'operstate') 155 | if retcode == 0: 156 | values.append(KeyValue(key = 'State', value = cmd_out)) 157 | ifacematch = re.match('eth[0-9]+', ifaces[i]) 158 | if ifacematch and (cmd_out == 'down' or cmd_out == 'dormant'): 159 | level = DiagnosticStatus.ERROR 160 | values.append(KeyValue(key = 'Input Traffic', 161 | value = str(float(kb_in[i]) / 1024) + " (MB/s)")) 162 | values.append(KeyValue(key = 'Output Traffic', 163 | value = str(float(kb_out[i]) / 1024) + " (MB/s)")) 164 | net_usage_in = float(kb_in[i]) / 1024 / self._net_capacity 165 | net_usage_out = float(kb_out[i]) / 1024 / self._net_capacity 166 | if net_usage_in > self._net_level_warn or\ 167 | net_usage_out > self._net_level_warn: 168 | level = DiagnosticStatus.WARN 169 | (retcode, cmd_out) = get_sys_net(ifaces[i], 'mtu') 170 | if retcode == 0: 171 | values.append(KeyValue(key = 'MTU', value = cmd_out)) 172 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'rx_bytes') 173 | if retcode == 0: 174 | values.append(KeyValue(key = 'Total received MB', 175 | value = str(float(cmd_out) / 1024 / 1024))) 176 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'tx_bytes') 177 | if retcode == 0: 178 | values.append(KeyValue(key = 'Total transmitted MB', 179 | value = str(float(cmd_out) / 1024 / 1024))) 180 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'collisions') 181 | if retcode == 0: 182 | values.append(KeyValue(key = 'Collisions', value = cmd_out)) 183 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'rx_errors') 184 | if retcode == 0: 185 | values.append(KeyValue(key = 'Rx Errors', value = cmd_out)) 186 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'tx_errors') 187 | if retcode == 0: 188 | values.append(KeyValue(key = 'Tx Errors', value = cmd_out)) 189 | except Exception, e: 190 | rospy.logerr(traceback.format_exc()) 191 | msg = 'Network Usage Check Error' 192 | values.append(KeyValue(key = msg, value = str(e))) 193 | level = DiagnosticStatus.ERROR 194 | return level, net_dict[level], values 195 | 196 | def check_usage(self): 197 | if rospy.is_shutdown(): 198 | with self._mutex: 199 | self.cancel_timers() 200 | return 201 | diag_level = 0 202 | diag_vals = [KeyValue(key = 'Update Status', value = 'OK'), 203 | KeyValue(key = 'Time Since Last Update', value = 0)] 204 | diag_msgs = [] 205 | net_level, net_msg, net_vals = self.check_network() 206 | diag_vals.extend(net_vals) 207 | if net_level > 0: 208 | diag_msgs.append(net_msg) 209 | diag_level = max(diag_level, net_level) 210 | if diag_msgs and diag_level > 0: 211 | usage_msg = ', '.join(set(diag_msgs)) 212 | else: 213 | usage_msg = stat_dict[diag_level] 214 | with self._mutex: 215 | self._last_usage_time = rospy.get_time() 216 | self._usage_stat.level = diag_level 217 | self._usage_stat.values = diag_vals 218 | self._usage_stat.message = usage_msg 219 | if not rospy.is_shutdown(): 220 | self._usage_timer = threading.Timer(5.0, self.check_usage) 221 | self._usage_timer.start() 222 | else: 223 | self.cancel_timers() 224 | 225 | def publish_stats(self): 226 | with self._mutex: 227 | update_status_stale(self._usage_stat, self._last_usage_time) 228 | msg = DiagnosticArray() 229 | msg.header.stamp = rospy.get_rostime() 230 | msg.status.append(self._usage_stat) 231 | if rospy.get_time() - self._last_publish_time > 0.5: 232 | self._diag_pub.publish(msg) 233 | self._last_publish_time = rospy.get_time() 234 | 235 | if __name__ == '__main__': 236 | hostname = socket.gethostname() 237 | hostname = hostname.replace('-', '_') 238 | 239 | import optparse 240 | parser =\ 241 | optparse.OptionParser( 242 | usage="usage: net_monitor.py [--diag-hostname=cX]") 243 | parser.add_option("--diag-hostname", dest="diag_hostname", 244 | help="Computer name in diagnostics output (ex: 'c1')", 245 | metavar="DIAG_HOSTNAME", 246 | action="store", default = hostname) 247 | options, args = parser.parse_args(rospy.myargv()) 248 | try: 249 | rospy.init_node('net_monitor_%s' % hostname) 250 | except rospy.exceptions.ROSInitException: 251 | print >> sys.stderr,\ 252 | 'Network monitor is unable to initialize node. Master may not be running.' 253 | sys.exit(0) 254 | net_node = NetMonitor(hostname, options.diag_hostname) 255 | rate = rospy.Rate(1.0) 256 | try: 257 | while not rospy.is_shutdown(): 258 | rate.sleep() 259 | net_node.publish_stats() 260 | except KeyboardInterrupt: 261 | pass 262 | except Exception, e: 263 | traceback.print_exc() 264 | rospy.logerr(traceback.format_exc()) 265 | net_node.cancel_timers() 266 | sys.exit(0) 267 | -------------------------------------------------------------------------------- /bin/hdd_monitor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ############################################################################ 3 | # Copyright (C) 2009, Willow Garage, Inc. # 4 | # Copyright (C) 2013 by Ralf Kaestner # 5 | # ralf.kaestner@gmail.com # 6 | # Copyright (C) 2013 by Jerome Maye # 7 | # jerome.maye@mavt.ethz.ch # 8 | # # 9 | # All rights reserved. # 10 | # # 11 | # Redistribution and use in source and binary forms, with or without # 12 | # modification, are permitted provided that the following conditions # 13 | # are met: # 14 | # # 15 | # 1. Redistributions of source code must retain the above copyright # 16 | # notice, this list of conditions and the following disclaimer. # 17 | # # 18 | # 2. Redistributions in binary form must reproduce the above copyright # 19 | # notice, this list of conditions and the following disclaimer in # 20 | # the documentation and/or other materials provided with the # 21 | # distribution. # 22 | # # 23 | # 3. The name of the copyright holders may be used to endorse or # 24 | # promote products derived from this software without specific # 25 | # prior written permission. # 26 | # # 27 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS # 28 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT # 29 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS # 30 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE # 31 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, # 32 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, # 33 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; # 34 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER # 35 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # 36 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN # 37 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE # 38 | # POSSIBILITY OF SUCH DAMAGE. # 39 | ############################################################################ 40 | 41 | from __future__ import with_statement 42 | 43 | import rospy 44 | 45 | import traceback 46 | import threading 47 | from threading import Timer 48 | import sys, os, time 49 | from time import sleep 50 | import subprocess 51 | 52 | import socket 53 | 54 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue 55 | 56 | hdd_level_warn = 0.95 57 | hdd_level_error = 0.99 58 | hdd_temp_warn = 55.0 59 | hdd_temp_error = 70.0 60 | 61 | stat_dict = { 0: 'OK', 1: 'Warning', 2: 'Error' } 62 | temp_dict = { 0: 'OK', 1: 'Hot', 2: 'Critical Hot' } 63 | usage_dict = { 0: 'OK', 1: 'Low Disk Space', 2: 'Very Low Disk Space' } 64 | 65 | REMOVABLE = ['/dev/sg1', '/dev/sdb'] # Store removable drives so we can ignore if removed 66 | 67 | ## Connects to hddtemp daemon to get temp, HDD make. 68 | def get_hddtemp_data(hostname = 'localhost', port = 7634): 69 | try: 70 | hdd_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 71 | hdd_sock.connect((hostname, port)) 72 | sock_data = '' 73 | while True: 74 | newdat = hdd_sock.recv(1024) 75 | if len(newdat) == 0: 76 | break 77 | sock_data = sock_data + newdat 78 | hdd_sock.close() 79 | 80 | sock_vals = sock_data.split('|') 81 | 82 | # Format of output looks like ' | DRIVE | MAKE | TEMP | ' 83 | idx = 0 84 | 85 | drives = [] 86 | makes = [] 87 | temps = [] 88 | while idx + 5 < len(sock_vals): 89 | this_drive = sock_vals[idx + 1] 90 | this_make = sock_vals[idx + 2] 91 | this_temp = sock_vals[idx + 3] 92 | 93 | # Sometimes we get duplicate makes if hard drives are mounted 94 | # to two different points 95 | if this_make in makes: 96 | idx += 5 97 | continue 98 | 99 | drives.append(this_drive) 100 | makes.append(this_make) 101 | temps.append(this_temp) 102 | 103 | idx += 5 104 | 105 | return True, drives, makes, temps 106 | except: 107 | rospy.loginfo(traceback.format_exc()) 108 | return False, [ 'Exception' ], [ traceback.format_exc() ], [ 0 ] 109 | 110 | def update_status_stale(stat, last_update_time): 111 | time_since_update = rospy.get_time() - last_update_time 112 | 113 | stale_status = 'OK' 114 | if time_since_update > 20 and time_since_update <= 35: 115 | stale_status = 'Lagging' 116 | if stat.level == DiagnosticStatus.OK: 117 | stat.message = stale_status 118 | elif stat.message.find(stale_status) < 0: 119 | stat.message = ', '.join([stat.message, stale_status]) 120 | stat.level = max(stat.level, DiagnosticStatus.WARN) 121 | if time_since_update > 35: 122 | stale_status = 'Stale' 123 | if stat.level == DiagnosticStatus.OK: 124 | stat.message = stale_status 125 | elif stat.message.find(stale_status) < 0: 126 | stat.message = ', '.join([stat.message, stale_status]) 127 | stat.level = max(stat.level, DiagnosticStatus.ERROR) 128 | 129 | stat.values.pop(0) 130 | stat.values.pop(0) 131 | stat.values.insert(0, KeyValue(key = 'Update Status', value = stale_status)) 132 | stat.values.insert(1, KeyValue(key = 'Time Since Update', value = str(time_since_update))) 133 | 134 | class hdd_monitor(): 135 | def __init__(self, hostname, diag_hostname): 136 | self._mutex = threading.Lock() 137 | 138 | self._hostname = hostname 139 | self._no_temp = rospy.get_param('~no_hdd_temp', False) 140 | self._no_temp_warn = rospy.get_param('~no_hdd_temp_warn', False) 141 | self._hdd_level_warn = rospy.get_param('~hdd_level_warn', hdd_level_warn) 142 | self._hdd_level_error = rospy.get_param('~hdd_level_error', hdd_level_error) 143 | self._hdd_temp_warn = rospy.get_param('~hdd_temp_warn', hdd_temp_warn) 144 | self._hdd_temp_error = rospy.get_param('~hdd_temp_error', hdd_temp_error) 145 | 146 | self._diag_pub = rospy.Publisher('/diagnostics', DiagnosticArray, queue_size = 100) 147 | 148 | self._last_publish_time = 0 149 | 150 | self._last_temp_time = 0 151 | self._temp_timer = None 152 | if not self._no_temp: 153 | self._temp_stat = DiagnosticStatus() 154 | self._temp_stat.name = "HDD Temperature (%s)" % diag_hostname 155 | self._temp_stat.level = DiagnosticStatus.ERROR 156 | self._temp_stat.hardware_id = hostname 157 | self._temp_stat.message = 'No Data' 158 | self._temp_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data'), 159 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ] 160 | self.check_temps() 161 | 162 | self._last_usage_time = 0 163 | self._usage_timer = None 164 | self._usage_stat = DiagnosticStatus() 165 | self._usage_stat.level = DiagnosticStatus.ERROR 166 | self._usage_stat.hardware_id = hostname 167 | self._usage_stat.name = 'HDD Usage (%s)' % diag_hostname 168 | self._usage_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data' ), 169 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ] 170 | self.check_disk_usage() 171 | 172 | ## Must have the lock to cancel everything 173 | def cancel_timers(self): 174 | if self._temp_timer: 175 | self._temp_timer.cancel() 176 | self._temp_timer = None 177 | 178 | if self._usage_timer: 179 | self._usage_timer.cancel() 180 | self._usage_timer = None 181 | 182 | def check_temps(self): 183 | if rospy.is_shutdown(): 184 | with self._mutex: 185 | self.cancel_timers() 186 | return 187 | 188 | diag_strs = [ KeyValue(key = 'Update Status', value = 'OK' ) , 189 | KeyValue(key = 'Time Since Last Update', value = '0' ) ] 190 | diag_level = DiagnosticStatus.OK 191 | diag_message = 'OK' 192 | 193 | temp_ok, drives, makes, temps = get_hddtemp_data() 194 | 195 | for index in range(0, len(drives)): 196 | temp = temps[index] 197 | 198 | if not unicode(temp).isnumeric() and drives[index] not in REMOVABLE: 199 | temp_level = DiagnosticStatus.ERROR 200 | temp_ok = False 201 | elif not unicode(temp).isnumeric() and drives[index] in REMOVABLE: 202 | temp_level = DiagnosticStatus.OK 203 | temp = "Removed" 204 | else: 205 | temp_level = DiagnosticStatus.OK 206 | if float(temp) >= self._hdd_temp_warn: 207 | temp_level = DiagnosticStatus.WARN 208 | if float(temp) >= self._hdd_temp_error: 209 | temp_level = DiagnosticStatus.ERROR 210 | 211 | diag_level = max(diag_level, temp_level) 212 | 213 | diag_strs.append(KeyValue(key = 'Disk %d Temperature Status' % index, value = temp_dict[temp_level])) 214 | diag_strs.append(KeyValue(key = 'Disk %d Mount Pt.' % index, value = drives[index])) 215 | diag_strs.append(KeyValue(key = 'Disk %d Device ID' % index, value = makes[index])) 216 | diag_strs.append(KeyValue(key = 'Disk %d Temperature' % index, value = str(temp)+"DegC")) 217 | 218 | if not temp_ok: 219 | diag_level = DiagnosticStatus.ERROR 220 | 221 | with self._mutex: 222 | self._last_temp_time = rospy.get_time() 223 | self._temp_stat.values = diag_strs 224 | self._temp_stat.level = diag_level 225 | 226 | # Give No Data message if we have no reading 227 | self._temp_stat.message = temp_dict[diag_level] 228 | if not temp_ok: 229 | self._temp_stat.message = 'Error' 230 | 231 | if self._no_temp_warn and temp_ok: 232 | self._temp_stat.level = DiagnosticStatus.OK 233 | 234 | if not rospy.is_shutdown(): 235 | self._temp_timer = threading.Timer(10.0, self.check_temps) 236 | self._temp_timer.start() 237 | else: 238 | self.cancel_timers() 239 | 240 | def check_disk_usage(self): 241 | if rospy.is_shutdown(): 242 | with self._mutex: 243 | self.cancel_timers() 244 | return 245 | 246 | diag_vals = [ KeyValue(key = 'Update Status', value = 'OK' ), 247 | KeyValue(key = 'Time Since Last Update', value = '0' ) ] 248 | diag_level = DiagnosticStatus.OK 249 | diag_message = 'OK' 250 | 251 | try: 252 | p = subprocess.Popen(["df", "-Pht", "ext4"], stdout=subprocess.PIPE, stderr=subprocess.PIPE) 253 | stdout, stderr = p.communicate() 254 | retcode = p.returncode 255 | 256 | if (retcode == 0 or retcode == 1): 257 | diag_vals.append(KeyValue(key = 'Disk Space Reading', value = 'OK')) 258 | rows = stdout.split('\n') 259 | del rows[0] 260 | row_count = 0 261 | 262 | for row in rows: 263 | if len(row.split()) < 2: 264 | continue 265 | if unicode(row.split()[0]) == "none": 266 | continue 267 | 268 | row_count += 1 269 | g_available = row.split()[-3] 270 | g_use = row.split()[-2] 271 | name = row.split()[0] 272 | size = row.split()[1] 273 | mount_pt = row.split()[-1] 274 | 275 | hdd_usage = float(g_use.replace("%", ""))*1e-2 276 | if (hdd_usage < self._hdd_level_warn): 277 | level = DiagnosticStatus.OK 278 | elif (hdd_usage < self._hdd_level_error): 279 | level = DiagnosticStatus.WARN 280 | else: 281 | level = DiagnosticStatus.ERROR 282 | 283 | diag_vals.append(KeyValue( 284 | key = 'Disk %d Name' % row_count, value = name)) 285 | diag_vals.append(KeyValue( 286 | key = 'Disk %d Size' % row_count, value = size)) 287 | diag_vals.append(KeyValue( 288 | key = 'Disk %d Available' % row_count, value = g_available)) 289 | diag_vals.append(KeyValue( 290 | key = 'Disk %d Use' % row_count, value = g_use)) 291 | diag_vals.append(KeyValue( 292 | key = 'Disk %d Status' % row_count, value = stat_dict[level])) 293 | diag_vals.append(KeyValue( 294 | key = 'Disk %d Mount Point' % row_count, value = mount_pt)) 295 | 296 | diag_level = max(diag_level, level) 297 | diag_message = usage_dict[diag_level] 298 | 299 | else: 300 | diag_vals.append(KeyValue(key = 'Disk Space Reading', value = 'Failed')) 301 | diag_level = DiagnosticStatus.ERROR 302 | diag_message = stat_dict[diag_level] 303 | 304 | 305 | except: 306 | rospy.logerr(traceback.format_exc()) 307 | 308 | diag_vals.append(KeyValue(key = 'Disk Space Reading', value = 'Exception')) 309 | diag_vals.append(KeyValue(key = 'Disk Space Ex', value = traceback.format_exc())) 310 | 311 | diag_level = DiagnosticStatus.ERROR 312 | diag_message = stat_dict[diag_level] 313 | 314 | # Update status 315 | with self._mutex: 316 | self._last_usage_time = rospy.get_time() 317 | self._usage_stat.values = diag_vals 318 | self._usage_stat.message = diag_message 319 | self._usage_stat.level = diag_level 320 | 321 | if not rospy.is_shutdown(): 322 | self._usage_timer = threading.Timer(5.0, self.check_disk_usage) 323 | self._usage_timer.start() 324 | else: 325 | self.cancel_timers() 326 | 327 | 328 | def publish_stats(self): 329 | with self._mutex: 330 | msg = DiagnosticArray() 331 | msg.header.stamp = rospy.get_rostime() 332 | 333 | if not self._no_temp: 334 | update_status_stale(self._temp_stat, self._last_temp_time) 335 | msg.status.append(self._temp_stat) 336 | 337 | update_status_stale(self._usage_stat, self._last_usage_time) 338 | msg.status.append(self._usage_stat) 339 | 340 | if rospy.get_time() - self._last_publish_time > 0.5: 341 | self._diag_pub.publish(msg) 342 | self._last_publish_time = rospy.get_time() 343 | 344 | 345 | 346 | 347 | ##\todo Need to check HDD input/output too using iostat 348 | 349 | if __name__ == '__main__': 350 | hostname = socket.gethostname() 351 | hostname = hostname.replace('-', '_') 352 | 353 | import optparse 354 | parser = optparse.OptionParser(usage="usage: hdd_monitor.py [--diag-hostname=cX]") 355 | parser.add_option("--diag-hostname", dest="diag_hostname", 356 | help="Computer name in diagnostics output (ex: 'c1')", 357 | metavar="DIAG_HOSTNAME", 358 | action="store", default = hostname) 359 | options, args = parser.parse_args(rospy.myargv()) 360 | 361 | try: 362 | rospy.init_node('hdd_monitor_%s' % hostname) 363 | except rospy.exceptions.ROSInitException: 364 | print 'HDD monitor is unable to initialize node. Master may not be running.' 365 | sys.exit(0) 366 | 367 | hdd_monitor = hdd_monitor(hostname, options.diag_hostname) 368 | rate = rospy.Rate(1.0) 369 | 370 | try: 371 | while not rospy.is_shutdown(): 372 | rate.sleep() 373 | hdd_monitor.publish_stats() 374 | except KeyboardInterrupt: 375 | pass 376 | except Exception, e: 377 | traceback.print_exc() 378 | 379 | hdd_monitor.cancel_timers() 380 | sys.exit(0) 381 | -------------------------------------------------------------------------------- /bin/cpu_monitor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ############################################################################ 3 | # Copyright (C) 2009, Willow Garage, Inc. # 4 | # Copyright (C) 2013 by Ralf Kaestner # 5 | # ralf.kaestner@gmail.com # 6 | # Copyright (C) 2013 by Jerome Maye # 7 | # jerome.maye@mavt.ethz.ch # 8 | # # 9 | # All rights reserved. # 10 | # # 11 | # Redistribution and use in source and binary forms, with or without # 12 | # modification, are permitted provided that the following conditions # 13 | # are met: # 14 | # # 15 | # 1. Redistributions of source code must retain the above copyright # 16 | # notice, this list of conditions and the following disclaimer. # 17 | # # 18 | # 2. Redistributions in binary form must reproduce the above copyright # 19 | # notice, this list of conditions and the following disclaimer in # 20 | # the documentation and/or other materials provided with the # 21 | # distribution. # 22 | # # 23 | # 3. The name of the copyright holders may be used to endorse or # 24 | # promote products derived from this software without specific # 25 | # prior written permission. # 26 | # # 27 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS # 28 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT # 29 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS # 30 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE # 31 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, # 32 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, # 33 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; # 34 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER # 35 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # 36 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN # 37 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE # 38 | # POSSIBILITY OF SUCH DAMAGE. # 39 | ############################################################################ 40 | 41 | from __future__ import with_statement 42 | 43 | import rospy 44 | 45 | import traceback 46 | import threading 47 | from threading import Timer 48 | import sys, os, time 49 | from time import sleep 50 | import subprocess 51 | import string 52 | import multiprocessing 53 | 54 | import socket 55 | 56 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue 57 | 58 | cpu_load_warn = 0.9 59 | cpu_load_error = 1.1 60 | cpu_load1_warn = 0.9 61 | cpu_load5_warn = 0.8 62 | cpu_temp_warn = 85.0 63 | cpu_temp_error = 90.0 64 | 65 | stat_dict = { 0: 'OK', 1: 'Warning', 2: 'Error' } 66 | 67 | def update_status_stale(stat, last_update_time): 68 | time_since_update = rospy.get_time() - last_update_time 69 | 70 | stale_status = 'OK' 71 | if time_since_update > 20 and time_since_update <= 35: 72 | stale_status = 'Lagging' 73 | if stat.level == DiagnosticStatus.OK: 74 | stat.message = stale_status 75 | elif stat.message.find(stale_status) < 0: 76 | stat.message = ', '.join([stat.message, stale_status]) 77 | stat.level = max(stat.level, DiagnosticStatus.WARN) 78 | if time_since_update > 35: 79 | stale_status = 'Stale' 80 | if stat.level == DiagnosticStatus.OK: 81 | stat.message = stale_status 82 | elif stat.message.find(stale_status) < 0: 83 | stat.message = ', '.join([stat.message, stale_status]) 84 | stat.level = max(stat.level, DiagnosticStatus.ERROR) 85 | 86 | 87 | stat.values.pop(0) 88 | stat.values.pop(0) 89 | stat.values.insert(0, KeyValue(key = 'Update Status', value = stale_status)) 90 | stat.values.insert(1, KeyValue(key = 'Time Since Update', value = str(time_since_update))) 91 | 92 | 93 | class CPUMonitor(): 94 | def __init__(self, hostname, diag_hostname): 95 | self._diag_pub = rospy.Publisher('/diagnostics', DiagnosticArray, queue_size = 100) 96 | 97 | self._mutex = threading.Lock() 98 | 99 | self._check_core_temps = rospy.get_param('~check_core_temps', True) 100 | 101 | self._cpu_load_warn = rospy.get_param('~cpu_load_warn', cpu_load_warn) 102 | self._cpu_load_error = rospy.get_param('~cpu_load_error', cpu_load_error) 103 | self._cpu_load1_warn = rospy.get_param('~cpu_load1_warn', cpu_load1_warn) 104 | self._cpu_load5_warn = rospy.get_param('~cpu_load5_warn', cpu_load5_warn) 105 | self._cpu_temp_warn = rospy.get_param('~cpu_temp_warn', cpu_temp_warn) 106 | self._cpu_temp_error = rospy.get_param('~cpu_temp_error', cpu_temp_error) 107 | 108 | self._num_cores = multiprocessing.cpu_count() 109 | 110 | self._temps_timer = None 111 | self._usage_timer = None 112 | 113 | # Get temp_input files 114 | self._temp_vals = self.get_core_temp_names() 115 | 116 | # CPU stats 117 | self._temp_stat = DiagnosticStatus() 118 | self._temp_stat.name = 'CPU Temperature (%s)' % diag_hostname 119 | self._temp_stat.level = 1 120 | self._temp_stat.hardware_id = hostname 121 | self._temp_stat.message = 'No Data' 122 | self._temp_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data' ), 123 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ] 124 | 125 | self._usage_stat = DiagnosticStatus() 126 | self._usage_stat.name = 'CPU Usage (%s)' % diag_hostname 127 | self._usage_stat.level = 1 128 | self._usage_stat.hardware_id = hostname 129 | self._usage_stat.message = 'No Data' 130 | self._usage_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data' ), 131 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ] 132 | 133 | self._last_temp_time = 0 134 | self._last_usage_time = 0 135 | self._last_publish_time = 0 136 | 137 | self._usage_old = 0 138 | self._has_warned_mpstat = False 139 | self._has_error_core_count = False 140 | 141 | # Start checking everything 142 | self.check_temps() 143 | self.check_usage() 144 | 145 | # Restart temperature checking 146 | def _restart_temp_check(self): 147 | rospy.logerr('Restarting temperature check thread in cpu_monitor. This should not happen') 148 | try: 149 | with self._mutex: 150 | if self._temps_timer: 151 | self._temps_timer.cancel() 152 | 153 | self.check_temps() 154 | except Exception, e: 155 | rospy.logerr('Unable to restart temp thread. Error: %s' % traceback.format_exc()) 156 | 157 | 158 | ## Must have the lock to cancel everything 159 | def cancel_timers(self): 160 | if self._temps_timer: 161 | self._temps_timer.cancel() 162 | 163 | if self._usage_timer: 164 | self._usage_timer.cancel() 165 | 166 | ##\brief Check CPU core temps 167 | ## 168 | ## Use 'find /sys -name temp1_input' to find cores 169 | ## Read from every core, divide by 1000 170 | def check_core_temps(self, sys_temp_strings): 171 | diag_vals = [] 172 | diag_level = 0 173 | diag_msgs = [] 174 | 175 | for index, temp_str in enumerate(sys_temp_strings): 176 | if len(temp_str) < 5: 177 | continue 178 | 179 | cmd = 'cat %s' % temp_str 180 | p = subprocess.Popen(cmd, stdout = subprocess.PIPE, 181 | stderr = subprocess.PIPE, shell = True) 182 | stdout, stderr = p.communicate() 183 | retcode = p.returncode 184 | 185 | if retcode != 0: 186 | diag_level = DiagnosticStatus.ERROR 187 | diag_msg = [ 'Core Temperature Error' ] 188 | diag_vals = [ KeyValue(key = 'Core Temperature Error', value = stderr), 189 | KeyValue(key = 'Output', value = stdout) ] 190 | return diag_vals, diag_msgs, diag_level 191 | 192 | tmp = stdout.strip() 193 | if unicode(tmp).isnumeric(): 194 | temp = float(tmp) / 1000 195 | diag_vals.append(KeyValue(key = 'Core %d Temperature' % index, value = str(temp)+"DegC")) 196 | 197 | if temp >= self._cpu_temp_warn: 198 | diag_level = max(diag_level, DiagnosticStatus.WARN) 199 | diag_msgs.append('Warm') 200 | elif temp >= self._cpu_temp_error: 201 | diag_level = max(diag_level, DiagnosticStatus.ERROR) 202 | diag_msgs.append('Hot') 203 | else: 204 | diag_level = max(diag_level, DiagnosticStatus.ERROR) # Error if not numeric value 205 | diag_vals.append(KeyValue(key = 'Core %s Temperature' % index, value = tmp)) 206 | 207 | return diag_vals, diag_msgs, diag_level 208 | 209 | ## Checks clock speed from reading from CPU info 210 | def check_clock_speed(self): 211 | vals = [] 212 | msgs = [] 213 | lvl = DiagnosticStatus.OK 214 | 215 | try: 216 | p = subprocess.Popen('cat /proc/cpuinfo | grep MHz', 217 | stdout = subprocess.PIPE, 218 | stderr = subprocess.PIPE, shell = True) 219 | stdout, stderr = p.communicate() 220 | retcode = p.returncode 221 | 222 | if retcode != 0: 223 | lvl = DiagnosticStatus.ERROR 224 | msgs = [ 'Clock speed error' ] 225 | vals = [ KeyValue(key = 'Clock speed error', value = stderr), 226 | KeyValue(key = 'Output', value = stdout) ] 227 | 228 | return (vals, msgs, lvl) 229 | 230 | for index, ln in enumerate(stdout.split('\n')): 231 | words = ln.split(':') 232 | if len(words) < 2: 233 | continue 234 | 235 | speed = words[1].strip().split('.')[0] # Conversion to float doesn't work with decimal 236 | vals.append(KeyValue(key = 'Core %d Clock Speed' % index, value = speed+"MHz")) 237 | 238 | except Exception, e: 239 | rospy.logerr(traceback.format_exc()) 240 | lvl = DiagnosticStatus.ERROR 241 | msgs.append('Exception') 242 | vals.append(KeyValue(key = 'Exception', value = traceback.format_exc())) 243 | 244 | return vals, msgs, lvl 245 | 246 | 247 | # Add msgs output, too 248 | ##\brief Uses 'uptime' to see load average 249 | def check_uptime(self): 250 | level = DiagnosticStatus.OK 251 | vals = [] 252 | 253 | load_dict = { 0: 'OK', 1: 'High Load', 2: 'Very High Load' } 254 | 255 | try: 256 | p = subprocess.Popen('uptime', stdout = subprocess.PIPE, 257 | stderr = subprocess.PIPE, shell = True) 258 | stdout, stderr = p.communicate() 259 | retcode = p.returncode 260 | 261 | if retcode != 0: 262 | vals.append(KeyValue(key = 'uptime Failed', value = stderr)) 263 | return DiagnosticStatus.ERROR, vals 264 | 265 | upvals = stdout.split() 266 | load1 = float(upvals[-3].rstrip(','))/self._num_cores 267 | load5 = float(upvals[-2].rstrip(','))/self._num_cores 268 | load15 = float(upvals[-1])/self._num_cores 269 | 270 | # Give warning if we go over load limit 271 | if load1 > self._cpu_load1_warn or load5 > self._cpu_load5_warn: 272 | level = DiagnosticStatus.WARN 273 | 274 | vals.append(KeyValue(key = 'Load Average Status', value = load_dict[level])) 275 | vals.append(KeyValue(key = 'Load Average (1min)', value = str(load1*1e2)+"%")) 276 | vals.append(KeyValue(key = 'Load Average (5min)', value = str(load5*1e2)+"%")) 277 | vals.append(KeyValue(key = 'Load Average (15min)', value = str(load15*1e2)+"%")) 278 | 279 | except Exception, e: 280 | rospy.logerr(traceback.format_exc()) 281 | level = DiagnosticStatus.ERROR 282 | vals.append(KeyValue(key = 'Load Average Status', value = traceback.format_exc())) 283 | 284 | return level, load_dict[level], vals 285 | 286 | ##\brief Use mpstat to find CPU usage 287 | ## 288 | def check_mpstat(self): 289 | vals = [] 290 | mp_level = DiagnosticStatus.OK 291 | 292 | load_dict = { 0: 'OK', 1: 'High Load', 2: 'Error' } 293 | 294 | try: 295 | p = subprocess.Popen('mpstat -P ALL 1 1', 296 | stdout = subprocess.PIPE, 297 | stderr = subprocess.PIPE, shell = True) 298 | stdout, stderr = p.communicate() 299 | retcode = p.returncode 300 | 301 | if retcode != 0: 302 | if not self._has_warned_mpstat: 303 | rospy.logerr("mpstat failed to run for cpu_monitor. Return code %d.", retcode) 304 | self._has_warned_mpstat = True 305 | 306 | mp_level = DiagnosticStatus.ERROR 307 | vals.append(KeyValue(key = '\"mpstat\" Call Error', value = str(retcode))) 308 | return mp_level, 'Unable to Check CPU Usage', vals 309 | 310 | # Check which column '%idle' is, #4539 311 | # mpstat output changed between 8.06 and 8.1 312 | rows = stdout.split('\n') 313 | col_names = rows[2].split() 314 | idle_col = -1 if (len(col_names) > 2 and col_names[-1] == '%idle') else -2 315 | 316 | num_cores = 0 317 | cores_loaded = 0 318 | for index, row in enumerate(stdout.split('\n')): 319 | if index < 3: 320 | continue 321 | 322 | # Skip row containing 'all' data 323 | if row.find('all') > -1: 324 | continue 325 | 326 | lst = row.split() 327 | if len(lst) < 8: 328 | continue 329 | 330 | ## Ignore 'Average: ...' data 331 | if lst[0].startswith('Average'): 332 | continue 333 | 334 | cpu_name = '%d' % (num_cores) 335 | idle = lst[idle_col] 336 | user = lst[3] 337 | nice = lst[4] 338 | system = lst[5] 339 | 340 | core_level = 0 341 | usage = (float(user)+float(nice))*1e-2 342 | if usage > 10.0: # wrong reading, use old reading instead 343 | rospy.logwarn('Read CPU usage of %f percent. Reverting to previous reading of %f percent'%(usage, self._usage_old)) 344 | usage = self._usage_old 345 | self._usage_old = usage 346 | 347 | if usage >= self._cpu_load_warn: 348 | cores_loaded += 1 349 | core_level = DiagnosticStatus.WARN 350 | elif usage >= self._cpu_load_error: 351 | core_level = DiagnosticStatus.ERROR 352 | 353 | vals.append(KeyValue(key = 'Core %s Status' % cpu_name, value = load_dict[core_level])) 354 | vals.append(KeyValue(key = 'Core %s User' % cpu_name, value = user+"%")) 355 | vals.append(KeyValue(key = 'Core %s Nice' % cpu_name, value = nice+"%")) 356 | vals.append(KeyValue(key = 'Core %s System' % cpu_name, value = system+"%")) 357 | vals.append(KeyValue(key = 'Core %s Idle' % cpu_name, value = idle+"%")) 358 | 359 | num_cores += 1 360 | 361 | # Warn for high load only if we have <= 2 cores that aren't loaded 362 | if num_cores - cores_loaded <= 2 and num_cores > 2: 363 | mp_level = DiagnosticStatus.WARN 364 | 365 | if not self._num_cores: 366 | self._num_cores = num_cores 367 | 368 | # Check the number of cores if self._num_cores > 0, #4850 369 | if self._num_cores != num_cores: 370 | mp_level = DiagnosticStatus.ERROR 371 | if not self._has_error_core_count: 372 | rospy.logerr('Error checking number of cores. Expected %d, got %d. Computer may have not booted properly.', 373 | self._num_cores, num_cores) 374 | self._has_error_core_count = True 375 | return DiagnosticStatus.ERROR, 'Incorrect number of CPU cores', vals 376 | 377 | except Exception, e: 378 | mp_level = DiagnosticStatus.ERROR 379 | vals.append(KeyValue(key = 'mpstat Exception', value = str(e))) 380 | 381 | return mp_level, load_dict[mp_level], vals 382 | 383 | ## Returns names for core temperature files 384 | ## Returns list of names, each name can be read like file 385 | def get_core_temp_names(self): 386 | temp_vals = [] 387 | try: 388 | p = subprocess.Popen('find /sys/devices -name temp1_input', 389 | stdout = subprocess.PIPE, 390 | stderr = subprocess.PIPE, shell = True) 391 | stdout, stderr = p.communicate() 392 | retcode = p.returncode 393 | 394 | if retcode != 0: 395 | rospy.logerr('Error find core temp locations: %s' % stderr) 396 | return [] 397 | 398 | for ln in stdout.split('\n'): 399 | temp_vals.append(ln.strip()) 400 | 401 | return temp_vals 402 | except: 403 | rospy.logerr('Exception finding temp vals: %s' % traceback.format_exc()) 404 | return [] 405 | 406 | ## Call every 10sec at minimum 407 | def check_temps(self): 408 | if rospy.is_shutdown(): 409 | with self._mutex: 410 | self.cancel_timers() 411 | return 412 | 413 | diag_vals = [ KeyValue(key = 'Update Status', value = 'OK' ), 414 | KeyValue(key = 'Time Since Last Update', value = str(0) ) ] 415 | diag_msgs = [] 416 | diag_level = 0 417 | 418 | if self._check_core_temps: 419 | core_vals, core_msgs, core_level = self.check_core_temps(self._temp_vals) 420 | diag_vals.extend(core_vals) 421 | diag_msgs.extend(core_msgs) 422 | diag_level = max(diag_level, core_level) 423 | 424 | diag_log = set(diag_msgs) 425 | if len(diag_log) > 0: 426 | message = ', '.join(diag_log) 427 | else: 428 | message = stat_dict[diag_level] 429 | 430 | with self._mutex: 431 | self._last_temp_time = rospy.get_time() 432 | 433 | self._temp_stat.level = diag_level 434 | self._temp_stat.message = message 435 | self._temp_stat.values = diag_vals 436 | 437 | if not rospy.is_shutdown(): 438 | self._temps_timer = threading.Timer(5.0, self.check_temps) 439 | self._temps_timer.start() 440 | else: 441 | self.cancel_timers() 442 | 443 | def check_usage(self): 444 | if rospy.is_shutdown(): 445 | with self._mutex: 446 | self.cancel_timers() 447 | return 448 | 449 | diag_level = 0 450 | diag_vals = [ KeyValue(key = 'Update Status', value = 'OK' ), 451 | KeyValue(key = 'Time Since Last Update', value = 0 )] 452 | diag_msgs = [] 453 | 454 | # Check clock speed 455 | clock_vals, clock_msgs, clock_level = self.check_clock_speed() 456 | diag_vals.extend(clock_vals) 457 | diag_msgs.extend(clock_msgs) 458 | diag_level = max(diag_level, clock_level) 459 | 460 | # Check mpstat 461 | mp_level, mp_msg, mp_vals = self.check_mpstat() 462 | diag_vals.extend(mp_vals) 463 | if mp_level > 0: 464 | diag_msgs.append(mp_msg) 465 | diag_level = max(diag_level, mp_level) 466 | 467 | # Check uptime 468 | uptime_level, up_msg, up_vals = self.check_uptime() 469 | diag_vals.extend(up_vals) 470 | if uptime_level > 0: 471 | diag_msgs.append(up_msg) 472 | diag_level = max(diag_level, uptime_level) 473 | 474 | if diag_msgs and diag_level > 0: 475 | usage_msg = ', '.join(set(diag_msgs)) 476 | else: 477 | usage_msg = stat_dict[diag_level] 478 | 479 | # Update status 480 | with self._mutex: 481 | self._last_usage_time = rospy.get_time() 482 | self._usage_stat.level = diag_level 483 | self._usage_stat.values = diag_vals 484 | 485 | self._usage_stat.message = usage_msg 486 | 487 | if not rospy.is_shutdown(): 488 | self._usage_timer = threading.Timer(5.0, self.check_usage) 489 | self._usage_timer.start() 490 | else: 491 | self.cancel_timers() 492 | 493 | def publish_stats(self): 494 | with self._mutex: 495 | # Update everything with last update times 496 | update_status_stale(self._temp_stat, self._last_temp_time) 497 | update_status_stale(self._usage_stat, self._last_usage_time) 498 | 499 | msg = DiagnosticArray() 500 | msg.header.stamp = rospy.get_rostime() 501 | msg.status.append(self._temp_stat) 502 | msg.status.append(self._usage_stat) 503 | 504 | if rospy.get_time() - self._last_publish_time > 0.5: 505 | self._diag_pub.publish(msg) 506 | self._last_publish_time = rospy.get_time() 507 | 508 | 509 | # Restart temperature checking if it goes stale, #4171 510 | # Need to run this without mutex 511 | if rospy.get_time() - self._last_temp_time > 90: 512 | self._restart_temp_check() 513 | 514 | 515 | if __name__ == '__main__': 516 | hostname = socket.gethostname() 517 | hostname = hostname.replace('-', '_') 518 | 519 | import optparse 520 | parser = optparse.OptionParser(usage="usage: cpu_monitor.py [--diag-hostname=cX]") 521 | parser.add_option("--diag-hostname", dest="diag_hostname", 522 | help="Computer name in diagnostics output (ex: 'c1')", 523 | metavar="DIAG_HOSTNAME", 524 | action="store", default = hostname) 525 | options, args = parser.parse_args(rospy.myargv()) 526 | 527 | try: 528 | rospy.init_node('cpu_monitor_%s' % hostname) 529 | except rospy.exceptions.ROSInitException: 530 | print >> sys.stderr, 'CPU monitor is unable to initialize node. Master may not be running.' 531 | sys.exit(0) 532 | 533 | cpu_node = CPUMonitor(hostname, options.diag_hostname) 534 | 535 | rate = rospy.Rate(1.0) 536 | try: 537 | while not rospy.is_shutdown(): 538 | rate.sleep() 539 | cpu_node.publish_stats() 540 | except KeyboardInterrupt: 541 | pass 542 | except Exception, e: 543 | traceback.print_exc() 544 | rospy.logerr(traceback.format_exc()) 545 | 546 | cpu_node.cancel_timers() 547 | sys.exit(0) 548 | --------------------------------------------------------------------------------