├── .gitignore
├── CMakeLists.txt
├── config
└── system_monitor.yaml
├── package.xml
├── launch
└── system_monitor.launch
├── copyright
├── README.md
├── changelog
└── bin
├── ntp_monitor.py
├── mem_monitor.py
├── net_monitor.py
├── hdd_monitor.py
└── cpu_monitor.py
/.gitignore:
--------------------------------------------------------------------------------
1 | build
2 |
--------------------------------------------------------------------------------
/CMakeLists.txt:
--------------------------------------------------------------------------------
1 | cmake_minimum_required(VERSION 2.8.3)
2 | project(system_monitor)
3 |
4 | find_package(catkin)
5 |
6 | catkin_package()
7 |
--------------------------------------------------------------------------------
/config/system_monitor.yaml:
--------------------------------------------------------------------------------
1 | cpu_monitor:
2 | check_core_temps: true
3 | cpu_load_warn: 0.9
4 | cpu_load_error: 1.0
5 | cpu_load1_warn: 0.9
6 | cpu_load5_warn: 0.8
7 | cpu_temp_warn: 85.0
8 | cpu_temp_error: 90.0
9 | hdd_monitor:
10 | no_hdd_temp: true
11 | no_hdd_temp_warn: false
12 | hdd_level_warn: 0.95
13 | hdd_level_error: 0.99
14 | hdd_temp_warn: 55.0
15 | hdd_temp_error: 70.0
16 | mem_monitor:
17 | mem_level_warn: 0.95
18 | mem_level_error: 0.99
19 | ntp_monitor:
20 | reference_host: ntp.ubuntu.com
21 | offset_tolerance: 500.0
22 | error_offset_tolerance: 5000000.0
23 | net_monitor:
24 | net_level_warn: 0.95
25 | net_capacity: 128.0
26 |
--------------------------------------------------------------------------------
/package.xml:
--------------------------------------------------------------------------------
1 |
2 | system_monitor
3 | 0.1.0
4 |
5 | System monitoring tools for ROS
6 |
7 | Ralf Kaestner
8 | Jerome Maye
9 | Ralf Kaestner
10 | GNU Lesser General Public License (LGPL)
11 | http://github.com/ethz-asl/ros-system-monitor
12 | catkin
13 | rospy
14 | std_msgs
15 | diagnostic_msgs
16 | sysstat
17 | hddtemp
18 | ntpdate
19 | ifstat
20 |
21 |
--------------------------------------------------------------------------------
/launch/system_monitor.launch:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
9 |
11 |
13 |
15 |
17 |
18 |
19 |
21 |
22 |
23 |
--------------------------------------------------------------------------------
/copyright:
--------------------------------------------------------------------------------
1 | Copyright (C) 2009, Willow Garage, Inc.
2 | Copyright (C) 2013 by Ralf Kaestner
3 | Copyright (C) 2013 by Jerome Maye
4 |
5 | All rights reserved.
6 |
7 | Redistribution and use in source and binary forms, with or without
8 | modification, are permitted provided that the following conditions
9 | are met:
10 |
11 | 1. Redistributions of source code must retain the above copyright
12 | notice, this list of conditions and the following disclaimer.
13 |
14 | 2. Redistributions in binary form must reproduce the above copyright
15 | notice, this list of conditions and the following disclaimer in
16 | the documentation and/or other materials provided with the
17 | distribution.
18 |
19 | 3. The name of the copyright holders may be used to endorse or
20 | promote products derived from this software without specific
21 | prior written permission.
22 |
23 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
24 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
25 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
26 | FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
27 | COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
28 | INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
29 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
30 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
31 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32 | LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
33 | ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
34 | POSSIBILITY OF SUCH DAMAGE.
35 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ros-system-monitor
2 |
3 | ## Synopsis
4 |
5 | System monitoring tools for ROS.
6 |
7 | **Author(s):** Willow Garage, Inc., Jerome Maye, Ralf Kaestner
8 |
9 | **Maintainer:** Ralf Kaestner
10 |
11 | **License:** BSD License (BSD)
12 |
13 | **Operating system(s):** Debian-based Linux
14 |
15 | **Package PPA:** ppa:ethz-asl/ros
16 |
17 | ## Description
18 |
19 | This project provides system monitoring tools for ROS in the form of the
20 | following ROS nodes:
21 |
22 | * CPU monitor
23 | * HDD monitor
24 | * Memory monitor
25 | * Network monitor
26 | * NTP monitor
27 |
28 | Each node publishes ROS diagnostics which can conveniently be visualized
29 | in the runtime monitor.
30 |
31 | ## Installation
32 |
33 | Here, we assume you intend to build/install the project for the ROS
34 | distribution named `ROS_DISTRO`.
35 |
36 | ### Installing from packages (recommended for Ubuntu LTS users)
37 |
38 | The maintainers of this project provide binary packages for the latest Ubuntu
39 | LTS releases and commonly used system architectures. To install these packages,
40 | you may follow these instructions:
41 |
42 | * Add the project PPA to your APT sources by issuing
43 |
44 | ```
45 | sudo add-apt-repository ppa:ethz-asl/ros
46 | ```
47 |
48 | on the command line
49 |
50 | * To re-synchronize your package index files, run
51 |
52 | ```
53 | sudo apt-get update
54 | ```
55 |
56 | * Install all project packages and their dependencies through
57 |
58 | ```
59 | sudo apt-get install ros-ROS_DISTRO-system-monitor
60 | ```
61 |
62 | or using your favorite package management tool
63 |
64 | ### Building from source
65 |
66 | This project may be built using the CMake build system with an open-source
67 | macro extension called ReMake.
68 |
69 | #### Installing build dependencies
70 |
71 | The build dependencies of this project are available from the standard
72 | package repositories of recent Ubuntu and ROS releases. To install them,
73 | simply use the command
74 |
75 | ```
76 | sudo apt-get install ros-ROS_DISTRO-rospy, ros-ROS_DISTRO-message-generation, ros-ROS_DISTRO-std-msgs, ros-ROS_DISTRO-diagnostic-msgs
77 |
78 | ```
79 | #### Building with ReMake
80 |
81 | ##### Preparing the build system
82 |
83 | If you already have installed ReMake on your build system, you may
84 | skip this step. Otherwise, before attempting to build this project the
85 | traditional CMake way, you must install ReMake following
86 | [these instructions](https://github.com/kralf/remake).
87 |
88 | ##### Building with CMake
89 |
90 | Once ReMake is available on your build system, you may attempt to build this
91 | project the CMake way. Assuming that you have cloned the project sources into
92 | `PROJECT_DIR`, a typical out-of-source build might look like this:
93 |
94 | * Create a build directory using
95 |
96 | ```
97 | mkdir -p PROJECT_DIR/build
98 | ```
99 |
100 | * Switch into the build directoy by
101 |
102 | ```
103 | cd PROJECT_DIR/build
104 | ```
105 |
106 | * In the build directory, run
107 |
108 | ```
109 | cmake -DROS_DISTRIBUTION=ROS_DISTRO PROJECT_DIR
110 | ```
111 |
112 | to configure the build
113 |
114 | * If you want to inspect or modify the build configuration, issue
115 |
116 | ```
117 | ccmake PROJECT_DIR
118 | ```
119 |
120 | * Build the project using
121 |
122 | ```
123 | make
124 | ```
125 |
126 | * If you intend to install the project, call
127 |
128 | ```
129 | make packages_install
130 | ```
131 |
132 | (from packages on Debian-based Linux only) or
133 |
134 | ```
135 | make install
136 | ```
137 |
138 | ## API documentation
139 |
140 | This project does not yet provide any API documentation.
141 |
142 | ## Feature requests and bug reports
143 |
144 | If you would like to propose a feature for this project, please consider
145 | contributing or send a feature request to the project authors. Bugs may be
146 | reported through the project's issue page.
147 |
148 | ## Further reading
149 |
150 | For additional information of the Robot Operating System (ROS), please refer
151 | to the official [ROS documentation](http://wiki.ros.org).
152 |
--------------------------------------------------------------------------------
/changelog:
--------------------------------------------------------------------------------
1 | ros-system-monitor (0.1-22) unstable; urgency=low
2 |
3 | * Updated changelog
4 |
5 | -- Ralf Kaestner Tue, 21 Jul 2015 10:34:00 +0200
6 |
7 | ros-system-monitor (0.1-21) unstable; urgency=low
8 |
9 | * Fixed changelog
10 |
11 | -- Ralf Kaestner Tue, 21 Jul 2015 10:32:05 +0200
12 |
13 | ros-system-monitor (0.1-20) unstable; urgency=low
14 |
15 | * Fixed bug in net_monitor's check_network() which was caused by an invalid
16 | return value after a failure to call ifstat
17 |
18 | -- Ralf Kaestner Tue, 21 Jul 2015 10:29:56 +0200
19 |
20 | ros-system-monitor (0.1-19) unstable; urgency=low
21 |
22 | * Removed debugging output
23 |
24 | -- Ralf Kaestner Wed, 17 Jun 2015 09:09:57 +0200
25 |
26 | ros-system-monitor (0.1-18) unstable; urgency=low
27 |
28 | * Fixed launch namespaces
29 |
30 | -- Ralf Kaestner Wed, 17 Jun 2015 09:08:33 +0200
31 |
32 | ros-system-monitor (0.1-17) unstable; urgency=low
33 |
34 | * Added config file argument to launch file
35 |
36 | -- Ralf Kaestner Wed, 17 Jun 2015 08:47:41 +0200
37 |
38 | ros-system-monitor (0.1-16) unstable; urgency=low
39 |
40 | * Logging hddtemp errors to info
41 |
42 | -- Ralf Kaestner Wed, 17 Jun 2015 08:43:50 +0200
43 |
44 | ros-system-monitor (0.1-15) unstable; urgency=low
45 |
46 | * Added output argument to launch file
47 |
48 | -- Ralf Kaestner Wed, 17 Jun 2015 08:36:26 +0200
49 |
50 | ros-system-monitor (0.1-14) unstable; urgency=low
51 |
52 | * Catkinization
53 |
54 | -- Ralf Kaestner Wed, 17 Jun 2015 07:46:27 +0200
55 |
56 | ros-system-monitor (0.1-13) unstable; urgency=low
57 |
58 | * Fixed ROS package dependencies
59 |
60 | -- Ralf Kaestner Wed, 18 Feb 2015 12:34:10 +0100
61 |
62 | ros-system-monitor (0.1-12) unstable; urgency=low
63 |
64 | * Changed PPA
65 |
66 | -- Ralf Kaestner Wed, 18 Feb 2015 08:05:03 +0100
67 |
68 | ros-system-monitor (0.1-11) unstable; urgency=low
69 |
70 | * Updated README.md
71 |
72 | -- Ralf Kaestner Wed, 18 Feb 2015 08:03:35 +0100
73 |
74 | ros-system-monitor (0.1-10) unstable; urgency=low
75 |
76 | * Fixed project information
77 | * Removed system-wide ROS configuration file
78 | * Adapted configuration file path in launch files
79 | * removing wlan from status
80 |
81 | -- Ralf Kaestner Wed, 18 Feb 2015 07:37:06 +0100
82 |
83 | ros-system-monitor (0.1-9) unstable; urgency=low
84 |
85 | * removing wlan from status
86 |
87 | -- Jerome Maye Tue, 14 Oct 2014 14:28:23 +0200
88 |
89 | ros-system-monitor (0.1-8) unstable; urgency=low
90 |
91 | * adding new computers
92 |
93 | -- Jerome Maye Tue, 14 Oct 2014 13:23:07 +0200
94 |
95 | ros-system-monitor (0.1-7) unstable; urgency=low
96 |
97 | * modifying configuration for ntp.
98 |
99 | -- Jerome Maye Thu, 18 Sep 2014 11:25:29 +0200
100 |
101 | ros-system-monitor (0.1-6) unstable; urgency=low
102 |
103 | * porting to PPA.
104 |
105 | -- Jerome Maye Tue, 16 Sep 2014 19:38:14 +0200
106 |
107 | ros-system-monitor (0.1-5) unstable; urgency=low
108 |
109 | * adding teamlio
110 |
111 | -- Jerome Maye Tue, 16 Sep 2014 12:29:14 +0200
112 |
113 | ros-system-monitor (0.1-4) unstable; urgency=low
114 |
115 | * adding teamlio laptop configuration.
116 |
117 | -- Jerome Maye Tue, 16 Sep 2014 12:21:55 +0200
118 |
119 | ros-system-monitor (0.1-3) unstable; urgency=low
120 |
121 | * hack for hdd monitor because of lightdm permissions.
122 |
123 | -- Jerome Maye Mon, 15 Sep 2014 21:06:35 +0200
124 |
125 | ros-system-monitor (0.1-2) unstable; urgency=low
126 |
127 | * Bugs corrections.
128 |
129 | -- Jerome Maye Fri, 12 Sep 2014 13:33:23 +0200
130 |
131 | ros-system-monitor (0.1-1) unstable; urgency=low
132 |
133 | * First draft.
134 |
135 | -- Jerome Maye Fri, 12 Sep 2014 12:55:02 +0200
136 |
--------------------------------------------------------------------------------
/bin/ntp_monitor.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ############################################################################
3 | # Copyright (C) 2009, Willow Garage, Inc. #
4 | # Copyright (C) 2013 by Jerome Maye #
5 | # jerome.maye@mavt.ethz.ch #
6 | # #
7 | # All rights reserved. #
8 | # #
9 | # Redistribution and use in source and binary forms, with or without #
10 | # modification, are permitted provided that the following conditions #
11 | # are met: #
12 | # #
13 | # 1. Redistributions of source code must retain the above copyright #
14 | # notice, this list of conditions and the following disclaimer. #
15 | # #
16 | # 2. Redistributions in binary form must reproduce the above copyright #
17 | # notice, this list of conditions and the following disclaimer in #
18 | # the documentation and/or other materials provided with the #
19 | # distribution. #
20 | # #
21 | # 3. The name of the copyright holders may be used to endorse or #
22 | # promote products derived from this software without specific #
23 | # prior written permission. #
24 | # #
25 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS #
26 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT #
27 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS #
28 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE #
29 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, #
30 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, #
31 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; #
32 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER #
33 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT #
34 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN #
35 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE #
36 | # POSSIBILITY OF SUCH DAMAGE. #
37 | ############################################################################
38 |
39 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue
40 |
41 | import sys
42 | import rospy
43 | import socket
44 | from subprocess import Popen, PIPE
45 |
46 | import time
47 |
48 | import re
49 |
50 | NAME = 'ntp_monitor'
51 |
52 | def ntp_monitor(offset=500, self_offset=500, diag_hostname = None, error_offset = 5000000):
53 | pub = rospy.Publisher("/diagnostics", DiagnosticArray, queue_size = 100)
54 | rospy.init_node(NAME, anonymous=True)
55 |
56 | hostname = socket.gethostname()
57 | if diag_hostname is None:
58 | diag_hostname = hostname
59 |
60 | ntp_hostname = rospy.get_param('~reference_host', 'ntp.ubuntu.com')
61 | offset = rospy.get_param('~offset_tolerance', 500)
62 | error_offset = rospy.get_param('~error_offset_tolerance', 5000000)
63 |
64 | stat = DiagnosticStatus()
65 | stat.level = 0
66 | stat.name = "NTP offset from "+ diag_hostname + " to " + ntp_hostname
67 | stat.message = "OK"
68 | stat.hardware_id = hostname
69 | stat.values = []
70 |
71 | # self_stat = DiagnosticStatus()
72 | # self_stat.level = DiagnosticStatus.OK
73 | # self_stat.name = "NTP self-offset for "+ diag_hostname
74 | # self_stat.message = "OK"
75 | # self_stat.hardware_id = hostname
76 | # self_stat.values = []
77 |
78 | while not rospy.is_shutdown():
79 | for st,host,off in [(stat,ntp_hostname,offset)]:
80 | try:
81 | p = Popen(["ntpdate", "-q", host], stdout=PIPE, stdin=PIPE, stderr=PIPE)
82 | res = p.wait()
83 | (o,e) = p.communicate()
84 | except OSError, (errno, msg):
85 | if errno == 4:
86 | break #ctrl-c interrupt
87 | else:
88 | raise
89 | if (res == 0):
90 | measured_offset = float(re.search("offset (.*),", o).group(1))*1000000
91 | st.level = DiagnosticStatus.OK
92 | st.message = "OK"
93 | st.values = [ KeyValue("Offset (us)", str(measured_offset)),
94 | KeyValue("Offset tolerance (us)", str(off)),
95 | KeyValue("Offset tolerance (us) for Error", str(error_offset)) ]
96 |
97 | if (abs(measured_offset) > off):
98 | st.level = DiagnosticStatus.WARN
99 | st.message = "NTP Offset Too High"
100 | if (abs(measured_offset) > error_offset):
101 | st.level = DiagnosticStatus.ERROR
102 | st.message = "NTP Offset Too High"
103 |
104 | else:
105 | st.level = DiagnosticStatus.ERROR
106 | st.message = "Error Running ntpdate. Returned %d" % res
107 | st.values = [ KeyValue("Offset (us)", "N/A"),
108 | KeyValue("Offset tolerance (us)", str(off)),
109 | KeyValue("Offset tolerance (us) for Error", str(error_offset)),
110 | KeyValue("Output", o),
111 | KeyValue("Errors", e) ]
112 |
113 |
114 | msg = DiagnosticArray()
115 | msg.header.stamp = rospy.get_rostime()
116 | msg.status = [stat]
117 | pub.publish(msg)
118 | time.sleep(1)
119 |
120 | def ntp_monitor_main(argv=sys.argv):
121 | import optparse
122 | parser = optparse.OptionParser(usage="usage: ntp_monitor ntp-hostname []")
123 | parser.add_option("--offset-tolerance", dest="offset_tol",
124 | action="store", default=500,
125 | help="Offset from NTP host", metavar="OFFSET-TOL")
126 | parser.add_option("--error-offset-tolerance", dest="error_offset_tol",
127 | action="store", default=5000000,
128 | help="Offset from NTP host. Above this is error", metavar="OFFSET-TOL")
129 | parser.add_option("--self_offset-tolerance", dest="self_offset_tol",
130 | action="store", default=500,
131 | help="Offset from self", metavar="SELF_OFFSET-TOL")
132 | parser.add_option("--diag-hostname", dest="diag_hostname",
133 | help="Computer name in diagnostics output (ex: 'c1')",
134 | metavar="DIAG_HOSTNAME",
135 | action="store", default=None)
136 | options, args = parser.parse_args(rospy.myargv())
137 |
138 | # if (len(args) != 2):
139 | # parser.error("Invalid arguments. Must have HOSTNAME [args]. %s" % args)
140 |
141 |
142 | try:
143 | offset = int(options.offset_tol)
144 | self_offset = int(options.self_offset_tol)
145 | error_offset = int(options.error_offset_tol)
146 | except:
147 | parser.error("Offsets must be numbers")
148 |
149 | ntp_monitor(offset, self_offset, options.diag_hostname, error_offset)
150 |
151 |
152 | if __name__ == "__main__":
153 | try:
154 | ntp_monitor_main(rospy.myargv())
155 | except KeyboardInterrupt: pass
156 | except SystemExit: pass
157 | except:
158 | import traceback
159 | traceback.print_exc()
160 |
--------------------------------------------------------------------------------
/bin/mem_monitor.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ############################################################################
3 | # Copyright (C) 2009, Willow Garage, Inc. #
4 | # Copyright (C) 2013 by Ralf Kaestner #
5 | # ralf.kaestner@gmail.com #
6 | # Copyright (C) 2013 by Jerome Maye #
7 | # jerome.maye@mavt.ethz.ch #
8 | # #
9 | # All rights reserved. #
10 | # #
11 | # Redistribution and use in source and binary forms, with or without #
12 | # modification, are permitted provided that the following conditions #
13 | # are met: #
14 | # #
15 | # 1. Redistributions of source code must retain the above copyright #
16 | # notice, this list of conditions and the following disclaimer. #
17 | # #
18 | # 2. Redistributions in binary form must reproduce the above copyright #
19 | # notice, this list of conditions and the following disclaimer in #
20 | # the documentation and/or other materials provided with the #
21 | # distribution. #
22 | # #
23 | # 3. The name of the copyright holders may be used to endorse or #
24 | # promote products derived from this software without specific #
25 | # prior written permission. #
26 | # #
27 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS #
28 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT #
29 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS #
30 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE #
31 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, #
32 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, #
33 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; #
34 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER #
35 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT #
36 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN #
37 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE #
38 | # POSSIBILITY OF SUCH DAMAGE. #
39 | ############################################################################
40 |
41 | from __future__ import with_statement
42 |
43 | import rospy
44 |
45 | import traceback
46 | import threading
47 | from threading import Timer
48 | import sys, os, time
49 | from time import sleep
50 | import subprocess
51 | import string
52 | import socket
53 |
54 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue
55 |
56 | mem_level_warn = 0.95
57 | mem_level_error = 0.99
58 |
59 | stat_dict = { 0: 'OK', 1: 'Warning', 2: 'Error' }
60 |
61 | def update_status_stale(stat, last_update_time):
62 | time_since_update = rospy.get_time() - last_update_time
63 |
64 | stale_status = 'OK'
65 | if time_since_update > 20 and time_since_update <= 35:
66 | stale_status = 'Lagging'
67 | if stat.level == DiagnosticStatus.OK:
68 | stat.message = stale_status
69 | elif stat.message.find(stale_status) < 0:
70 | stat.message = ', '.join([stat.message, stale_status])
71 | stat.level = max(stat.level, DiagnosticStatus.WARN)
72 | if time_since_update > 35:
73 | stale_status = 'Stale'
74 | if stat.level == DiagnosticStatus.OK:
75 | stat.message = stale_status
76 | elif stat.message.find(stale_status) < 0:
77 | stat.message = ', '.join([stat.message, stale_status])
78 | stat.level = max(stat.level, DiagnosticStatus.ERROR)
79 |
80 |
81 | stat.values.pop(0)
82 | stat.values.pop(0)
83 | stat.values.insert(0, KeyValue(key = 'Update Status', value = stale_status))
84 | stat.values.insert(1, KeyValue(key = 'Time Since Update', value = str(time_since_update)))
85 |
86 |
87 | class MemMonitor():
88 | def __init__(self, hostname, diag_hostname):
89 | self._diag_pub = rospy.Publisher('/diagnostics', DiagnosticArray, queue_size = 100)
90 |
91 | self._mutex = threading.Lock()
92 |
93 | self._mem_level_warn = rospy.get_param('~mem_level_warn', mem_level_warn)
94 | self._mem_level_error = rospy.get_param('~mem_level_error', mem_level_error)
95 |
96 | self._usage_timer = None
97 |
98 | self._usage_stat = DiagnosticStatus()
99 | self._usage_stat.name = 'Memory Usage (%s)' % diag_hostname
100 | self._usage_stat.level = 1
101 | self._usage_stat.hardware_id = hostname
102 | self._usage_stat.message = 'No Data'
103 | self._usage_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data' ),
104 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ]
105 |
106 | self._last_usage_time = 0
107 | self._last_publish_time = 0
108 |
109 | # Start checking everything
110 | self.check_usage()
111 |
112 | ## Must have the lock to cancel everything
113 | def cancel_timers(self):
114 | if self._usage_timer:
115 | self._usage_timer.cancel()
116 |
117 | def check_memory(self):
118 | values = []
119 | level = DiagnosticStatus.OK
120 | msg = ''
121 |
122 | mem_dict = { 0: 'OK', 1: 'Low Memory', 2: 'Very Low Memory' }
123 |
124 | try:
125 | p = subprocess.Popen('free -tm',
126 | stdout = subprocess.PIPE,
127 | stderr = subprocess.PIPE, shell = True)
128 | stdout, stderr = p.communicate()
129 | retcode = p.returncode
130 |
131 | if retcode != 0:
132 | values.append(KeyValue(key = "\"free -tm\" Call Error", value = str(retcode)))
133 | return DiagnosticStatus.ERROR, values
134 |
135 | rows = stdout.split('\n')
136 | data = rows[1].split()
137 | total_mem_physical = data[1]
138 | used_mem_physical = data[2]
139 | free_mem_physical = data[3]
140 | data = rows[2].split()
141 | total_mem_swap = data[1]
142 | used_mem_swap = data[2]
143 | free_mem_swap = data[3]
144 | data = rows[3].split()
145 | total_mem = data[1]
146 | used_mem = data[2]
147 | free_mem = data[3]
148 |
149 | level = DiagnosticStatus.OK
150 | mem_usage = float(used_mem_physical)/float(total_mem_physical)
151 | if (mem_usage < self._mem_level_warn):
152 | level = DiagnosticStatus.OK
153 | elif (mem_usage < self._mem_level_error):
154 | level = DiagnosticStatus.WARN
155 | else:
156 | level = DiagnosticStatus.ERROR
157 |
158 | values.append(KeyValue(key = 'Memory Status', value = mem_dict[level]))
159 | values.append(KeyValue(key = 'Total Memory (Physical)', value = total_mem_physical+"M"))
160 | values.append(KeyValue(key = 'Used Memory (Physical)', value = used_mem_physical+"M"))
161 | values.append(KeyValue(key = 'Free Memory (Physical)', value = free_mem_physical+"M"))
162 | values.append(KeyValue(key = 'Total Memory (Swap)', value = total_mem_swap+"M"))
163 | values.append(KeyValue(key = 'Used Memory (Swap)', value = used_mem_swap+"M"))
164 | values.append(KeyValue(key = 'Free Memory (Swap)', value = free_mem_swap+"M"))
165 | values.append(KeyValue(key = 'Total Memory', value = total_mem+"M"))
166 | values.append(KeyValue(key = 'Used Memory', value = used_mem+"M"))
167 | values.append(KeyValue(key = 'Free Memory', value = free_mem+"M"))
168 |
169 | msg = mem_dict[level]
170 | except Exception, e:
171 | rospy.logerr(traceback.format_exc())
172 | msg = 'Memory Usage Check Error'
173 | values.append(KeyValue(key = msg, value = str(e)))
174 | level = DiagnosticStatus.ERROR
175 |
176 | return level, mem_dict[level], values
177 |
178 | def check_usage(self):
179 | if rospy.is_shutdown():
180 | with self._mutex:
181 | self.cancel_timers()
182 | return
183 |
184 | diag_level = 0
185 | diag_vals = [ KeyValue(key = 'Update Status', value = 'OK' ),
186 | KeyValue(key = 'Time Since Last Update', value = 0 )]
187 | diag_msgs = []
188 |
189 | # Check memory
190 | mem_level, mem_msg, mem_vals = self.check_memory()
191 | diag_vals.extend(mem_vals)
192 | if mem_level > 0:
193 | diag_msgs.append(mem_msg)
194 | diag_level = max(diag_level, mem_level)
195 |
196 | if diag_msgs and diag_level > 0:
197 | usage_msg = ', '.join(set(diag_msgs))
198 | else:
199 | usage_msg = stat_dict[diag_level]
200 |
201 | # Update status
202 | with self._mutex:
203 | self._last_usage_time = rospy.get_time()
204 | self._usage_stat.level = diag_level
205 | self._usage_stat.values = diag_vals
206 |
207 | self._usage_stat.message = usage_msg
208 |
209 | if not rospy.is_shutdown():
210 | self._usage_timer = threading.Timer(5.0, self.check_usage)
211 | self._usage_timer.start()
212 | else:
213 | self.cancel_timers()
214 |
215 | def publish_stats(self):
216 | with self._mutex:
217 | # Update everything with last update times
218 | update_status_stale(self._usage_stat, self._last_usage_time)
219 |
220 | msg = DiagnosticArray()
221 | msg.header.stamp = rospy.get_rostime()
222 | msg.status.append(self._usage_stat)
223 |
224 | if rospy.get_time() - self._last_publish_time > 0.5:
225 | self._diag_pub.publish(msg)
226 | self._last_publish_time = rospy.get_time()
227 |
228 |
229 | if __name__ == '__main__':
230 | hostname = socket.gethostname()
231 | hostname = hostname.replace('-', '_')
232 |
233 | import optparse
234 | parser = optparse.OptionParser(usage="usage: mem_monitor.py [--diag-hostname=cX]")
235 | parser.add_option("--diag-hostname", dest="diag_hostname",
236 | help="Computer name in diagnostics output (ex: 'c1')",
237 | metavar="DIAG_HOSTNAME",
238 | action="store", default = hostname)
239 | options, args = parser.parse_args(rospy.myargv())
240 |
241 | try:
242 | rospy.init_node('mem_monitor_%s' % hostname)
243 | except rospy.exceptions.ROSInitException:
244 | print >> sys.stderr, 'Memory monitor is unable to initialize node. Master may not be running.'
245 | sys.exit(0)
246 |
247 | mem_node = MemMonitor(hostname, options.diag_hostname)
248 |
249 | rate = rospy.Rate(1.0)
250 | try:
251 | while not rospy.is_shutdown():
252 | rate.sleep()
253 | mem_node.publish_stats()
254 | except KeyboardInterrupt:
255 | pass
256 | except Exception, e:
257 | traceback.print_exc()
258 | rospy.logerr(traceback.format_exc())
259 |
260 | mem_node.cancel_timers()
261 | sys.exit(0)
262 |
--------------------------------------------------------------------------------
/bin/net_monitor.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ############################################################################
3 | # Copyright (C) 2009, Willow Garage, Inc. #
4 | # Copyright (C) 2013 by Ralf Kaestner #
5 | # ralf.kaestner@gmail.com #
6 | # Copyright (C) 2013 by Jerome Maye #
7 | # jerome.maye@mavt.ethz.ch #
8 | # #
9 | # All rights reserved. #
10 | # #
11 | # Redistribution and use in source and binary forms, with or without #
12 | # modification, are permitted provided that the following conditions #
13 | # are met: #
14 | # #
15 | # 1. Redistributions of source code must retain the above copyright #
16 | # notice, this list of conditions and the following disclaimer. #
17 | # #
18 | # 2. Redistributions in binary form must reproduce the above copyright #
19 | # notice, this list of conditions and the following disclaimer in #
20 | # the documentation and/or other materials provided with the #
21 | # distribution. #
22 | # #
23 | # 3. The name of the copyright holders may be used to endorse or #
24 | # promote products derived from this software without specific #
25 | # prior written permission. #
26 | # #
27 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS #
28 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT #
29 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS #
30 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE #
31 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, #
32 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, #
33 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; #
34 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER #
35 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT #
36 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN #
37 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE #
38 | # POSSIBILITY OF SUCH DAMAGE. #
39 | ############################################################################
40 |
41 | from __future__ import with_statement
42 |
43 | import rospy
44 |
45 | import traceback
46 | import threading
47 | from threading import Timer
48 | import sys, os, time
49 | from time import sleep
50 | import subprocess
51 | import string
52 | import re
53 |
54 | import socket
55 |
56 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue
57 |
58 | net_level_warn = 0.95
59 | net_capacity = 128
60 |
61 | stat_dict = {0: 'OK', 1: 'Warning', 2: 'Error'}
62 |
63 | def update_status_stale(stat, last_update_time):
64 | time_since_update = rospy.get_time() - last_update_time
65 | stale_status = 'OK'
66 | if time_since_update > 20 and time_since_update <= 35:
67 | stale_status = 'Lagging'
68 | if stat.level == DiagnosticStatus.OK:
69 | stat.message = stale_status
70 | elif stat.message.find(stale_status) < 0:
71 | stat.message = ', '.join([stat.message, stale_status])
72 | stat.level = max(stat.level, DiagnosticStatus.WARN)
73 | if time_since_update > 35:
74 | stale_status = 'Stale'
75 | if stat.level == DiagnosticStatus.OK:
76 | stat.message = stale_status
77 | elif stat.message.find(stale_status) < 0:
78 | stat.message = ', '.join([stat.message, stale_status])
79 | stat.level = max(stat.level, DiagnosticStatus.ERROR)
80 | stat.values.pop(0)
81 | stat.values.pop(0)
82 | stat.values.insert(0, KeyValue(key = 'Update Status', value = stale_status))
83 | stat.values.insert(1, KeyValue(key = 'Time Since Update',
84 | value = str(time_since_update)))
85 |
86 | def get_sys_net_stat(iface, sys):
87 | cmd = 'cat /sys/class/net/%s/statistics/%s' %(iface, sys)
88 | p = subprocess.Popen(cmd,
89 | stdout = subprocess.PIPE,
90 | stderr = subprocess.PIPE, shell = True)
91 | stdout, stderr = p.communicate()
92 | return (p.returncode, stdout.strip())
93 |
94 | def get_sys_net(iface, sys):
95 | cmd = 'cat /sys/class/net/%s/%s' %(iface, sys)
96 | p = subprocess.Popen(cmd,
97 | stdout = subprocess.PIPE,
98 | stderr = subprocess.PIPE, shell = True)
99 | stdout, stderr = p.communicate()
100 | return (p.returncode, stdout.strip())
101 |
102 | class NetMonitor():
103 | def __init__(self, hostname, diag_hostname):
104 | self._diag_pub = rospy.Publisher('/diagnostics', DiagnosticArray, queue_size = 100)
105 | self._mutex = threading.Lock()
106 | self._net_level_warn = rospy.get_param('~net_level_warn', net_level_warn)
107 | self._net_capacity = rospy.get_param('~net_capacity', net_capacity)
108 | self._usage_timer = None
109 | self._usage_stat = DiagnosticStatus()
110 | self._usage_stat.name = 'Network Usage (%s)' % diag_hostname
111 | self._usage_stat.level = 1
112 | self._usage_stat.hardware_id = hostname
113 | self._usage_stat.message = 'No Data'
114 | self._usage_stat.values = [KeyValue(key = 'Update Status',
115 | value = 'No Data' ),
116 | KeyValue(key = 'Time Since Last Update',
117 | value = 'N/A') ]
118 | self._last_usage_time = 0
119 | self._last_publish_time = 0
120 | self.check_usage()
121 |
122 | def cancel_timers(self):
123 | if self._usage_timer:
124 | self._usage_timer.cancel()
125 |
126 | def check_network(self):
127 | values = []
128 | net_dict = {0: 'OK', 1: 'High Network Usage', 2: 'Network Down', 3: 'Call Error'}
129 | try:
130 | p = subprocess.Popen('ifstat -q -S 1 1',
131 | stdout = subprocess.PIPE,
132 | stderr = subprocess.PIPE, shell = True)
133 | stdout, stderr = p.communicate()
134 | retcode = p.returncode
135 | if retcode != 0:
136 | values.append(KeyValue(key = "\"ifstat -q -S 1 1\" Call Error",
137 | value = str(retcode)))
138 | return DiagnosticStatus.ERROR, net_dict[3], values
139 | rows = stdout.split('\n')
140 | data = rows[0].split()
141 | ifaces = []
142 | for i in range(0, len(data)):
143 | ifaces.append(data[i])
144 | data = rows[2].split()
145 | kb_in = []
146 | kb_out = []
147 | for i in range(0, len(data), 2):
148 | kb_in.append(data[i])
149 | kb_out.append(data[i + 1])
150 | level = DiagnosticStatus.OK
151 | for i in range(0, len(ifaces)):
152 | values.append(KeyValue(key = 'Interface Name',
153 | value = ifaces[i]))
154 | (retcode, cmd_out) = get_sys_net(ifaces[i], 'operstate')
155 | if retcode == 0:
156 | values.append(KeyValue(key = 'State', value = cmd_out))
157 | ifacematch = re.match('eth[0-9]+', ifaces[i])
158 | if ifacematch and (cmd_out == 'down' or cmd_out == 'dormant'):
159 | level = DiagnosticStatus.ERROR
160 | values.append(KeyValue(key = 'Input Traffic',
161 | value = str(float(kb_in[i]) / 1024) + " (MB/s)"))
162 | values.append(KeyValue(key = 'Output Traffic',
163 | value = str(float(kb_out[i]) / 1024) + " (MB/s)"))
164 | net_usage_in = float(kb_in[i]) / 1024 / self._net_capacity
165 | net_usage_out = float(kb_out[i]) / 1024 / self._net_capacity
166 | if net_usage_in > self._net_level_warn or\
167 | net_usage_out > self._net_level_warn:
168 | level = DiagnosticStatus.WARN
169 | (retcode, cmd_out) = get_sys_net(ifaces[i], 'mtu')
170 | if retcode == 0:
171 | values.append(KeyValue(key = 'MTU', value = cmd_out))
172 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'rx_bytes')
173 | if retcode == 0:
174 | values.append(KeyValue(key = 'Total received MB',
175 | value = str(float(cmd_out) / 1024 / 1024)))
176 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'tx_bytes')
177 | if retcode == 0:
178 | values.append(KeyValue(key = 'Total transmitted MB',
179 | value = str(float(cmd_out) / 1024 / 1024)))
180 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'collisions')
181 | if retcode == 0:
182 | values.append(KeyValue(key = 'Collisions', value = cmd_out))
183 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'rx_errors')
184 | if retcode == 0:
185 | values.append(KeyValue(key = 'Rx Errors', value = cmd_out))
186 | (retcode, cmd_out) = get_sys_net_stat(ifaces[i], 'tx_errors')
187 | if retcode == 0:
188 | values.append(KeyValue(key = 'Tx Errors', value = cmd_out))
189 | except Exception, e:
190 | rospy.logerr(traceback.format_exc())
191 | msg = 'Network Usage Check Error'
192 | values.append(KeyValue(key = msg, value = str(e)))
193 | level = DiagnosticStatus.ERROR
194 | return level, net_dict[level], values
195 |
196 | def check_usage(self):
197 | if rospy.is_shutdown():
198 | with self._mutex:
199 | self.cancel_timers()
200 | return
201 | diag_level = 0
202 | diag_vals = [KeyValue(key = 'Update Status', value = 'OK'),
203 | KeyValue(key = 'Time Since Last Update', value = 0)]
204 | diag_msgs = []
205 | net_level, net_msg, net_vals = self.check_network()
206 | diag_vals.extend(net_vals)
207 | if net_level > 0:
208 | diag_msgs.append(net_msg)
209 | diag_level = max(diag_level, net_level)
210 | if diag_msgs and diag_level > 0:
211 | usage_msg = ', '.join(set(diag_msgs))
212 | else:
213 | usage_msg = stat_dict[diag_level]
214 | with self._mutex:
215 | self._last_usage_time = rospy.get_time()
216 | self._usage_stat.level = diag_level
217 | self._usage_stat.values = diag_vals
218 | self._usage_stat.message = usage_msg
219 | if not rospy.is_shutdown():
220 | self._usage_timer = threading.Timer(5.0, self.check_usage)
221 | self._usage_timer.start()
222 | else:
223 | self.cancel_timers()
224 |
225 | def publish_stats(self):
226 | with self._mutex:
227 | update_status_stale(self._usage_stat, self._last_usage_time)
228 | msg = DiagnosticArray()
229 | msg.header.stamp = rospy.get_rostime()
230 | msg.status.append(self._usage_stat)
231 | if rospy.get_time() - self._last_publish_time > 0.5:
232 | self._diag_pub.publish(msg)
233 | self._last_publish_time = rospy.get_time()
234 |
235 | if __name__ == '__main__':
236 | hostname = socket.gethostname()
237 | hostname = hostname.replace('-', '_')
238 |
239 | import optparse
240 | parser =\
241 | optparse.OptionParser(
242 | usage="usage: net_monitor.py [--diag-hostname=cX]")
243 | parser.add_option("--diag-hostname", dest="diag_hostname",
244 | help="Computer name in diagnostics output (ex: 'c1')",
245 | metavar="DIAG_HOSTNAME",
246 | action="store", default = hostname)
247 | options, args = parser.parse_args(rospy.myargv())
248 | try:
249 | rospy.init_node('net_monitor_%s' % hostname)
250 | except rospy.exceptions.ROSInitException:
251 | print >> sys.stderr,\
252 | 'Network monitor is unable to initialize node. Master may not be running.'
253 | sys.exit(0)
254 | net_node = NetMonitor(hostname, options.diag_hostname)
255 | rate = rospy.Rate(1.0)
256 | try:
257 | while not rospy.is_shutdown():
258 | rate.sleep()
259 | net_node.publish_stats()
260 | except KeyboardInterrupt:
261 | pass
262 | except Exception, e:
263 | traceback.print_exc()
264 | rospy.logerr(traceback.format_exc())
265 | net_node.cancel_timers()
266 | sys.exit(0)
267 |
--------------------------------------------------------------------------------
/bin/hdd_monitor.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ############################################################################
3 | # Copyright (C) 2009, Willow Garage, Inc. #
4 | # Copyright (C) 2013 by Ralf Kaestner #
5 | # ralf.kaestner@gmail.com #
6 | # Copyright (C) 2013 by Jerome Maye #
7 | # jerome.maye@mavt.ethz.ch #
8 | # #
9 | # All rights reserved. #
10 | # #
11 | # Redistribution and use in source and binary forms, with or without #
12 | # modification, are permitted provided that the following conditions #
13 | # are met: #
14 | # #
15 | # 1. Redistributions of source code must retain the above copyright #
16 | # notice, this list of conditions and the following disclaimer. #
17 | # #
18 | # 2. Redistributions in binary form must reproduce the above copyright #
19 | # notice, this list of conditions and the following disclaimer in #
20 | # the documentation and/or other materials provided with the #
21 | # distribution. #
22 | # #
23 | # 3. The name of the copyright holders may be used to endorse or #
24 | # promote products derived from this software without specific #
25 | # prior written permission. #
26 | # #
27 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS #
28 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT #
29 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS #
30 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE #
31 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, #
32 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, #
33 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; #
34 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER #
35 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT #
36 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN #
37 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE #
38 | # POSSIBILITY OF SUCH DAMAGE. #
39 | ############################################################################
40 |
41 | from __future__ import with_statement
42 |
43 | import rospy
44 |
45 | import traceback
46 | import threading
47 | from threading import Timer
48 | import sys, os, time
49 | from time import sleep
50 | import subprocess
51 |
52 | import socket
53 |
54 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue
55 |
56 | hdd_level_warn = 0.95
57 | hdd_level_error = 0.99
58 | hdd_temp_warn = 55.0
59 | hdd_temp_error = 70.0
60 |
61 | stat_dict = { 0: 'OK', 1: 'Warning', 2: 'Error' }
62 | temp_dict = { 0: 'OK', 1: 'Hot', 2: 'Critical Hot' }
63 | usage_dict = { 0: 'OK', 1: 'Low Disk Space', 2: 'Very Low Disk Space' }
64 |
65 | REMOVABLE = ['/dev/sg1', '/dev/sdb'] # Store removable drives so we can ignore if removed
66 |
67 | ## Connects to hddtemp daemon to get temp, HDD make.
68 | def get_hddtemp_data(hostname = 'localhost', port = 7634):
69 | try:
70 | hdd_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
71 | hdd_sock.connect((hostname, port))
72 | sock_data = ''
73 | while True:
74 | newdat = hdd_sock.recv(1024)
75 | if len(newdat) == 0:
76 | break
77 | sock_data = sock_data + newdat
78 | hdd_sock.close()
79 |
80 | sock_vals = sock_data.split('|')
81 |
82 | # Format of output looks like ' | DRIVE | MAKE | TEMP | '
83 | idx = 0
84 |
85 | drives = []
86 | makes = []
87 | temps = []
88 | while idx + 5 < len(sock_vals):
89 | this_drive = sock_vals[idx + 1]
90 | this_make = sock_vals[idx + 2]
91 | this_temp = sock_vals[idx + 3]
92 |
93 | # Sometimes we get duplicate makes if hard drives are mounted
94 | # to two different points
95 | if this_make in makes:
96 | idx += 5
97 | continue
98 |
99 | drives.append(this_drive)
100 | makes.append(this_make)
101 | temps.append(this_temp)
102 |
103 | idx += 5
104 |
105 | return True, drives, makes, temps
106 | except:
107 | rospy.loginfo(traceback.format_exc())
108 | return False, [ 'Exception' ], [ traceback.format_exc() ], [ 0 ]
109 |
110 | def update_status_stale(stat, last_update_time):
111 | time_since_update = rospy.get_time() - last_update_time
112 |
113 | stale_status = 'OK'
114 | if time_since_update > 20 and time_since_update <= 35:
115 | stale_status = 'Lagging'
116 | if stat.level == DiagnosticStatus.OK:
117 | stat.message = stale_status
118 | elif stat.message.find(stale_status) < 0:
119 | stat.message = ', '.join([stat.message, stale_status])
120 | stat.level = max(stat.level, DiagnosticStatus.WARN)
121 | if time_since_update > 35:
122 | stale_status = 'Stale'
123 | if stat.level == DiagnosticStatus.OK:
124 | stat.message = stale_status
125 | elif stat.message.find(stale_status) < 0:
126 | stat.message = ', '.join([stat.message, stale_status])
127 | stat.level = max(stat.level, DiagnosticStatus.ERROR)
128 |
129 | stat.values.pop(0)
130 | stat.values.pop(0)
131 | stat.values.insert(0, KeyValue(key = 'Update Status', value = stale_status))
132 | stat.values.insert(1, KeyValue(key = 'Time Since Update', value = str(time_since_update)))
133 |
134 | class hdd_monitor():
135 | def __init__(self, hostname, diag_hostname):
136 | self._mutex = threading.Lock()
137 |
138 | self._hostname = hostname
139 | self._no_temp = rospy.get_param('~no_hdd_temp', False)
140 | self._no_temp_warn = rospy.get_param('~no_hdd_temp_warn', False)
141 | self._hdd_level_warn = rospy.get_param('~hdd_level_warn', hdd_level_warn)
142 | self._hdd_level_error = rospy.get_param('~hdd_level_error', hdd_level_error)
143 | self._hdd_temp_warn = rospy.get_param('~hdd_temp_warn', hdd_temp_warn)
144 | self._hdd_temp_error = rospy.get_param('~hdd_temp_error', hdd_temp_error)
145 |
146 | self._diag_pub = rospy.Publisher('/diagnostics', DiagnosticArray, queue_size = 100)
147 |
148 | self._last_publish_time = 0
149 |
150 | self._last_temp_time = 0
151 | self._temp_timer = None
152 | if not self._no_temp:
153 | self._temp_stat = DiagnosticStatus()
154 | self._temp_stat.name = "HDD Temperature (%s)" % diag_hostname
155 | self._temp_stat.level = DiagnosticStatus.ERROR
156 | self._temp_stat.hardware_id = hostname
157 | self._temp_stat.message = 'No Data'
158 | self._temp_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data'),
159 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ]
160 | self.check_temps()
161 |
162 | self._last_usage_time = 0
163 | self._usage_timer = None
164 | self._usage_stat = DiagnosticStatus()
165 | self._usage_stat.level = DiagnosticStatus.ERROR
166 | self._usage_stat.hardware_id = hostname
167 | self._usage_stat.name = 'HDD Usage (%s)' % diag_hostname
168 | self._usage_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data' ),
169 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ]
170 | self.check_disk_usage()
171 |
172 | ## Must have the lock to cancel everything
173 | def cancel_timers(self):
174 | if self._temp_timer:
175 | self._temp_timer.cancel()
176 | self._temp_timer = None
177 |
178 | if self._usage_timer:
179 | self._usage_timer.cancel()
180 | self._usage_timer = None
181 |
182 | def check_temps(self):
183 | if rospy.is_shutdown():
184 | with self._mutex:
185 | self.cancel_timers()
186 | return
187 |
188 | diag_strs = [ KeyValue(key = 'Update Status', value = 'OK' ) ,
189 | KeyValue(key = 'Time Since Last Update', value = '0' ) ]
190 | diag_level = DiagnosticStatus.OK
191 | diag_message = 'OK'
192 |
193 | temp_ok, drives, makes, temps = get_hddtemp_data()
194 |
195 | for index in range(0, len(drives)):
196 | temp = temps[index]
197 |
198 | if not unicode(temp).isnumeric() and drives[index] not in REMOVABLE:
199 | temp_level = DiagnosticStatus.ERROR
200 | temp_ok = False
201 | elif not unicode(temp).isnumeric() and drives[index] in REMOVABLE:
202 | temp_level = DiagnosticStatus.OK
203 | temp = "Removed"
204 | else:
205 | temp_level = DiagnosticStatus.OK
206 | if float(temp) >= self._hdd_temp_warn:
207 | temp_level = DiagnosticStatus.WARN
208 | if float(temp) >= self._hdd_temp_error:
209 | temp_level = DiagnosticStatus.ERROR
210 |
211 | diag_level = max(diag_level, temp_level)
212 |
213 | diag_strs.append(KeyValue(key = 'Disk %d Temperature Status' % index, value = temp_dict[temp_level]))
214 | diag_strs.append(KeyValue(key = 'Disk %d Mount Pt.' % index, value = drives[index]))
215 | diag_strs.append(KeyValue(key = 'Disk %d Device ID' % index, value = makes[index]))
216 | diag_strs.append(KeyValue(key = 'Disk %d Temperature' % index, value = str(temp)+"DegC"))
217 |
218 | if not temp_ok:
219 | diag_level = DiagnosticStatus.ERROR
220 |
221 | with self._mutex:
222 | self._last_temp_time = rospy.get_time()
223 | self._temp_stat.values = diag_strs
224 | self._temp_stat.level = diag_level
225 |
226 | # Give No Data message if we have no reading
227 | self._temp_stat.message = temp_dict[diag_level]
228 | if not temp_ok:
229 | self._temp_stat.message = 'Error'
230 |
231 | if self._no_temp_warn and temp_ok:
232 | self._temp_stat.level = DiagnosticStatus.OK
233 |
234 | if not rospy.is_shutdown():
235 | self._temp_timer = threading.Timer(10.0, self.check_temps)
236 | self._temp_timer.start()
237 | else:
238 | self.cancel_timers()
239 |
240 | def check_disk_usage(self):
241 | if rospy.is_shutdown():
242 | with self._mutex:
243 | self.cancel_timers()
244 | return
245 |
246 | diag_vals = [ KeyValue(key = 'Update Status', value = 'OK' ),
247 | KeyValue(key = 'Time Since Last Update', value = '0' ) ]
248 | diag_level = DiagnosticStatus.OK
249 | diag_message = 'OK'
250 |
251 | try:
252 | p = subprocess.Popen(["df", "-Pht", "ext4"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
253 | stdout, stderr = p.communicate()
254 | retcode = p.returncode
255 |
256 | if (retcode == 0 or retcode == 1):
257 | diag_vals.append(KeyValue(key = 'Disk Space Reading', value = 'OK'))
258 | rows = stdout.split('\n')
259 | del rows[0]
260 | row_count = 0
261 |
262 | for row in rows:
263 | if len(row.split()) < 2:
264 | continue
265 | if unicode(row.split()[0]) == "none":
266 | continue
267 |
268 | row_count += 1
269 | g_available = row.split()[-3]
270 | g_use = row.split()[-2]
271 | name = row.split()[0]
272 | size = row.split()[1]
273 | mount_pt = row.split()[-1]
274 |
275 | hdd_usage = float(g_use.replace("%", ""))*1e-2
276 | if (hdd_usage < self._hdd_level_warn):
277 | level = DiagnosticStatus.OK
278 | elif (hdd_usage < self._hdd_level_error):
279 | level = DiagnosticStatus.WARN
280 | else:
281 | level = DiagnosticStatus.ERROR
282 |
283 | diag_vals.append(KeyValue(
284 | key = 'Disk %d Name' % row_count, value = name))
285 | diag_vals.append(KeyValue(
286 | key = 'Disk %d Size' % row_count, value = size))
287 | diag_vals.append(KeyValue(
288 | key = 'Disk %d Available' % row_count, value = g_available))
289 | diag_vals.append(KeyValue(
290 | key = 'Disk %d Use' % row_count, value = g_use))
291 | diag_vals.append(KeyValue(
292 | key = 'Disk %d Status' % row_count, value = stat_dict[level]))
293 | diag_vals.append(KeyValue(
294 | key = 'Disk %d Mount Point' % row_count, value = mount_pt))
295 |
296 | diag_level = max(diag_level, level)
297 | diag_message = usage_dict[diag_level]
298 |
299 | else:
300 | diag_vals.append(KeyValue(key = 'Disk Space Reading', value = 'Failed'))
301 | diag_level = DiagnosticStatus.ERROR
302 | diag_message = stat_dict[diag_level]
303 |
304 |
305 | except:
306 | rospy.logerr(traceback.format_exc())
307 |
308 | diag_vals.append(KeyValue(key = 'Disk Space Reading', value = 'Exception'))
309 | diag_vals.append(KeyValue(key = 'Disk Space Ex', value = traceback.format_exc()))
310 |
311 | diag_level = DiagnosticStatus.ERROR
312 | diag_message = stat_dict[diag_level]
313 |
314 | # Update status
315 | with self._mutex:
316 | self._last_usage_time = rospy.get_time()
317 | self._usage_stat.values = diag_vals
318 | self._usage_stat.message = diag_message
319 | self._usage_stat.level = diag_level
320 |
321 | if not rospy.is_shutdown():
322 | self._usage_timer = threading.Timer(5.0, self.check_disk_usage)
323 | self._usage_timer.start()
324 | else:
325 | self.cancel_timers()
326 |
327 |
328 | def publish_stats(self):
329 | with self._mutex:
330 | msg = DiagnosticArray()
331 | msg.header.stamp = rospy.get_rostime()
332 |
333 | if not self._no_temp:
334 | update_status_stale(self._temp_stat, self._last_temp_time)
335 | msg.status.append(self._temp_stat)
336 |
337 | update_status_stale(self._usage_stat, self._last_usage_time)
338 | msg.status.append(self._usage_stat)
339 |
340 | if rospy.get_time() - self._last_publish_time > 0.5:
341 | self._diag_pub.publish(msg)
342 | self._last_publish_time = rospy.get_time()
343 |
344 |
345 |
346 |
347 | ##\todo Need to check HDD input/output too using iostat
348 |
349 | if __name__ == '__main__':
350 | hostname = socket.gethostname()
351 | hostname = hostname.replace('-', '_')
352 |
353 | import optparse
354 | parser = optparse.OptionParser(usage="usage: hdd_monitor.py [--diag-hostname=cX]")
355 | parser.add_option("--diag-hostname", dest="diag_hostname",
356 | help="Computer name in diagnostics output (ex: 'c1')",
357 | metavar="DIAG_HOSTNAME",
358 | action="store", default = hostname)
359 | options, args = parser.parse_args(rospy.myargv())
360 |
361 | try:
362 | rospy.init_node('hdd_monitor_%s' % hostname)
363 | except rospy.exceptions.ROSInitException:
364 | print 'HDD monitor is unable to initialize node. Master may not be running.'
365 | sys.exit(0)
366 |
367 | hdd_monitor = hdd_monitor(hostname, options.diag_hostname)
368 | rate = rospy.Rate(1.0)
369 |
370 | try:
371 | while not rospy.is_shutdown():
372 | rate.sleep()
373 | hdd_monitor.publish_stats()
374 | except KeyboardInterrupt:
375 | pass
376 | except Exception, e:
377 | traceback.print_exc()
378 |
379 | hdd_monitor.cancel_timers()
380 | sys.exit(0)
381 |
--------------------------------------------------------------------------------
/bin/cpu_monitor.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ############################################################################
3 | # Copyright (C) 2009, Willow Garage, Inc. #
4 | # Copyright (C) 2013 by Ralf Kaestner #
5 | # ralf.kaestner@gmail.com #
6 | # Copyright (C) 2013 by Jerome Maye #
7 | # jerome.maye@mavt.ethz.ch #
8 | # #
9 | # All rights reserved. #
10 | # #
11 | # Redistribution and use in source and binary forms, with or without #
12 | # modification, are permitted provided that the following conditions #
13 | # are met: #
14 | # #
15 | # 1. Redistributions of source code must retain the above copyright #
16 | # notice, this list of conditions and the following disclaimer. #
17 | # #
18 | # 2. Redistributions in binary form must reproduce the above copyright #
19 | # notice, this list of conditions and the following disclaimer in #
20 | # the documentation and/or other materials provided with the #
21 | # distribution. #
22 | # #
23 | # 3. The name of the copyright holders may be used to endorse or #
24 | # promote products derived from this software without specific #
25 | # prior written permission. #
26 | # #
27 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS #
28 | # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT #
29 | # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS #
30 | # FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE #
31 | # COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, #
32 | # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, #
33 | # BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; #
34 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER #
35 | # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT #
36 | # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN #
37 | # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE #
38 | # POSSIBILITY OF SUCH DAMAGE. #
39 | ############################################################################
40 |
41 | from __future__ import with_statement
42 |
43 | import rospy
44 |
45 | import traceback
46 | import threading
47 | from threading import Timer
48 | import sys, os, time
49 | from time import sleep
50 | import subprocess
51 | import string
52 | import multiprocessing
53 |
54 | import socket
55 |
56 | from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue
57 |
58 | cpu_load_warn = 0.9
59 | cpu_load_error = 1.1
60 | cpu_load1_warn = 0.9
61 | cpu_load5_warn = 0.8
62 | cpu_temp_warn = 85.0
63 | cpu_temp_error = 90.0
64 |
65 | stat_dict = { 0: 'OK', 1: 'Warning', 2: 'Error' }
66 |
67 | def update_status_stale(stat, last_update_time):
68 | time_since_update = rospy.get_time() - last_update_time
69 |
70 | stale_status = 'OK'
71 | if time_since_update > 20 and time_since_update <= 35:
72 | stale_status = 'Lagging'
73 | if stat.level == DiagnosticStatus.OK:
74 | stat.message = stale_status
75 | elif stat.message.find(stale_status) < 0:
76 | stat.message = ', '.join([stat.message, stale_status])
77 | stat.level = max(stat.level, DiagnosticStatus.WARN)
78 | if time_since_update > 35:
79 | stale_status = 'Stale'
80 | if stat.level == DiagnosticStatus.OK:
81 | stat.message = stale_status
82 | elif stat.message.find(stale_status) < 0:
83 | stat.message = ', '.join([stat.message, stale_status])
84 | stat.level = max(stat.level, DiagnosticStatus.ERROR)
85 |
86 |
87 | stat.values.pop(0)
88 | stat.values.pop(0)
89 | stat.values.insert(0, KeyValue(key = 'Update Status', value = stale_status))
90 | stat.values.insert(1, KeyValue(key = 'Time Since Update', value = str(time_since_update)))
91 |
92 |
93 | class CPUMonitor():
94 | def __init__(self, hostname, diag_hostname):
95 | self._diag_pub = rospy.Publisher('/diagnostics', DiagnosticArray, queue_size = 100)
96 |
97 | self._mutex = threading.Lock()
98 |
99 | self._check_core_temps = rospy.get_param('~check_core_temps', True)
100 |
101 | self._cpu_load_warn = rospy.get_param('~cpu_load_warn', cpu_load_warn)
102 | self._cpu_load_error = rospy.get_param('~cpu_load_error', cpu_load_error)
103 | self._cpu_load1_warn = rospy.get_param('~cpu_load1_warn', cpu_load1_warn)
104 | self._cpu_load5_warn = rospy.get_param('~cpu_load5_warn', cpu_load5_warn)
105 | self._cpu_temp_warn = rospy.get_param('~cpu_temp_warn', cpu_temp_warn)
106 | self._cpu_temp_error = rospy.get_param('~cpu_temp_error', cpu_temp_error)
107 |
108 | self._num_cores = multiprocessing.cpu_count()
109 |
110 | self._temps_timer = None
111 | self._usage_timer = None
112 |
113 | # Get temp_input files
114 | self._temp_vals = self.get_core_temp_names()
115 |
116 | # CPU stats
117 | self._temp_stat = DiagnosticStatus()
118 | self._temp_stat.name = 'CPU Temperature (%s)' % diag_hostname
119 | self._temp_stat.level = 1
120 | self._temp_stat.hardware_id = hostname
121 | self._temp_stat.message = 'No Data'
122 | self._temp_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data' ),
123 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ]
124 |
125 | self._usage_stat = DiagnosticStatus()
126 | self._usage_stat.name = 'CPU Usage (%s)' % diag_hostname
127 | self._usage_stat.level = 1
128 | self._usage_stat.hardware_id = hostname
129 | self._usage_stat.message = 'No Data'
130 | self._usage_stat.values = [ KeyValue(key = 'Update Status', value = 'No Data' ),
131 | KeyValue(key = 'Time Since Last Update', value = 'N/A') ]
132 |
133 | self._last_temp_time = 0
134 | self._last_usage_time = 0
135 | self._last_publish_time = 0
136 |
137 | self._usage_old = 0
138 | self._has_warned_mpstat = False
139 | self._has_error_core_count = False
140 |
141 | # Start checking everything
142 | self.check_temps()
143 | self.check_usage()
144 |
145 | # Restart temperature checking
146 | def _restart_temp_check(self):
147 | rospy.logerr('Restarting temperature check thread in cpu_monitor. This should not happen')
148 | try:
149 | with self._mutex:
150 | if self._temps_timer:
151 | self._temps_timer.cancel()
152 |
153 | self.check_temps()
154 | except Exception, e:
155 | rospy.logerr('Unable to restart temp thread. Error: %s' % traceback.format_exc())
156 |
157 |
158 | ## Must have the lock to cancel everything
159 | def cancel_timers(self):
160 | if self._temps_timer:
161 | self._temps_timer.cancel()
162 |
163 | if self._usage_timer:
164 | self._usage_timer.cancel()
165 |
166 | ##\brief Check CPU core temps
167 | ##
168 | ## Use 'find /sys -name temp1_input' to find cores
169 | ## Read from every core, divide by 1000
170 | def check_core_temps(self, sys_temp_strings):
171 | diag_vals = []
172 | diag_level = 0
173 | diag_msgs = []
174 |
175 | for index, temp_str in enumerate(sys_temp_strings):
176 | if len(temp_str) < 5:
177 | continue
178 |
179 | cmd = 'cat %s' % temp_str
180 | p = subprocess.Popen(cmd, stdout = subprocess.PIPE,
181 | stderr = subprocess.PIPE, shell = True)
182 | stdout, stderr = p.communicate()
183 | retcode = p.returncode
184 |
185 | if retcode != 0:
186 | diag_level = DiagnosticStatus.ERROR
187 | diag_msg = [ 'Core Temperature Error' ]
188 | diag_vals = [ KeyValue(key = 'Core Temperature Error', value = stderr),
189 | KeyValue(key = 'Output', value = stdout) ]
190 | return diag_vals, diag_msgs, diag_level
191 |
192 | tmp = stdout.strip()
193 | if unicode(tmp).isnumeric():
194 | temp = float(tmp) / 1000
195 | diag_vals.append(KeyValue(key = 'Core %d Temperature' % index, value = str(temp)+"DegC"))
196 |
197 | if temp >= self._cpu_temp_warn:
198 | diag_level = max(diag_level, DiagnosticStatus.WARN)
199 | diag_msgs.append('Warm')
200 | elif temp >= self._cpu_temp_error:
201 | diag_level = max(diag_level, DiagnosticStatus.ERROR)
202 | diag_msgs.append('Hot')
203 | else:
204 | diag_level = max(diag_level, DiagnosticStatus.ERROR) # Error if not numeric value
205 | diag_vals.append(KeyValue(key = 'Core %s Temperature' % index, value = tmp))
206 |
207 | return diag_vals, diag_msgs, diag_level
208 |
209 | ## Checks clock speed from reading from CPU info
210 | def check_clock_speed(self):
211 | vals = []
212 | msgs = []
213 | lvl = DiagnosticStatus.OK
214 |
215 | try:
216 | p = subprocess.Popen('cat /proc/cpuinfo | grep MHz',
217 | stdout = subprocess.PIPE,
218 | stderr = subprocess.PIPE, shell = True)
219 | stdout, stderr = p.communicate()
220 | retcode = p.returncode
221 |
222 | if retcode != 0:
223 | lvl = DiagnosticStatus.ERROR
224 | msgs = [ 'Clock speed error' ]
225 | vals = [ KeyValue(key = 'Clock speed error', value = stderr),
226 | KeyValue(key = 'Output', value = stdout) ]
227 |
228 | return (vals, msgs, lvl)
229 |
230 | for index, ln in enumerate(stdout.split('\n')):
231 | words = ln.split(':')
232 | if len(words) < 2:
233 | continue
234 |
235 | speed = words[1].strip().split('.')[0] # Conversion to float doesn't work with decimal
236 | vals.append(KeyValue(key = 'Core %d Clock Speed' % index, value = speed+"MHz"))
237 |
238 | except Exception, e:
239 | rospy.logerr(traceback.format_exc())
240 | lvl = DiagnosticStatus.ERROR
241 | msgs.append('Exception')
242 | vals.append(KeyValue(key = 'Exception', value = traceback.format_exc()))
243 |
244 | return vals, msgs, lvl
245 |
246 |
247 | # Add msgs output, too
248 | ##\brief Uses 'uptime' to see load average
249 | def check_uptime(self):
250 | level = DiagnosticStatus.OK
251 | vals = []
252 |
253 | load_dict = { 0: 'OK', 1: 'High Load', 2: 'Very High Load' }
254 |
255 | try:
256 | p = subprocess.Popen('uptime', stdout = subprocess.PIPE,
257 | stderr = subprocess.PIPE, shell = True)
258 | stdout, stderr = p.communicate()
259 | retcode = p.returncode
260 |
261 | if retcode != 0:
262 | vals.append(KeyValue(key = 'uptime Failed', value = stderr))
263 | return DiagnosticStatus.ERROR, vals
264 |
265 | upvals = stdout.split()
266 | load1 = float(upvals[-3].rstrip(','))/self._num_cores
267 | load5 = float(upvals[-2].rstrip(','))/self._num_cores
268 | load15 = float(upvals[-1])/self._num_cores
269 |
270 | # Give warning if we go over load limit
271 | if load1 > self._cpu_load1_warn or load5 > self._cpu_load5_warn:
272 | level = DiagnosticStatus.WARN
273 |
274 | vals.append(KeyValue(key = 'Load Average Status', value = load_dict[level]))
275 | vals.append(KeyValue(key = 'Load Average (1min)', value = str(load1*1e2)+"%"))
276 | vals.append(KeyValue(key = 'Load Average (5min)', value = str(load5*1e2)+"%"))
277 | vals.append(KeyValue(key = 'Load Average (15min)', value = str(load15*1e2)+"%"))
278 |
279 | except Exception, e:
280 | rospy.logerr(traceback.format_exc())
281 | level = DiagnosticStatus.ERROR
282 | vals.append(KeyValue(key = 'Load Average Status', value = traceback.format_exc()))
283 |
284 | return level, load_dict[level], vals
285 |
286 | ##\brief Use mpstat to find CPU usage
287 | ##
288 | def check_mpstat(self):
289 | vals = []
290 | mp_level = DiagnosticStatus.OK
291 |
292 | load_dict = { 0: 'OK', 1: 'High Load', 2: 'Error' }
293 |
294 | try:
295 | p = subprocess.Popen('mpstat -P ALL 1 1',
296 | stdout = subprocess.PIPE,
297 | stderr = subprocess.PIPE, shell = True)
298 | stdout, stderr = p.communicate()
299 | retcode = p.returncode
300 |
301 | if retcode != 0:
302 | if not self._has_warned_mpstat:
303 | rospy.logerr("mpstat failed to run for cpu_monitor. Return code %d.", retcode)
304 | self._has_warned_mpstat = True
305 |
306 | mp_level = DiagnosticStatus.ERROR
307 | vals.append(KeyValue(key = '\"mpstat\" Call Error', value = str(retcode)))
308 | return mp_level, 'Unable to Check CPU Usage', vals
309 |
310 | # Check which column '%idle' is, #4539
311 | # mpstat output changed between 8.06 and 8.1
312 | rows = stdout.split('\n')
313 | col_names = rows[2].split()
314 | idle_col = -1 if (len(col_names) > 2 and col_names[-1] == '%idle') else -2
315 |
316 | num_cores = 0
317 | cores_loaded = 0
318 | for index, row in enumerate(stdout.split('\n')):
319 | if index < 3:
320 | continue
321 |
322 | # Skip row containing 'all' data
323 | if row.find('all') > -1:
324 | continue
325 |
326 | lst = row.split()
327 | if len(lst) < 8:
328 | continue
329 |
330 | ## Ignore 'Average: ...' data
331 | if lst[0].startswith('Average'):
332 | continue
333 |
334 | cpu_name = '%d' % (num_cores)
335 | idle = lst[idle_col]
336 | user = lst[3]
337 | nice = lst[4]
338 | system = lst[5]
339 |
340 | core_level = 0
341 | usage = (float(user)+float(nice))*1e-2
342 | if usage > 10.0: # wrong reading, use old reading instead
343 | rospy.logwarn('Read CPU usage of %f percent. Reverting to previous reading of %f percent'%(usage, self._usage_old))
344 | usage = self._usage_old
345 | self._usage_old = usage
346 |
347 | if usage >= self._cpu_load_warn:
348 | cores_loaded += 1
349 | core_level = DiagnosticStatus.WARN
350 | elif usage >= self._cpu_load_error:
351 | core_level = DiagnosticStatus.ERROR
352 |
353 | vals.append(KeyValue(key = 'Core %s Status' % cpu_name, value = load_dict[core_level]))
354 | vals.append(KeyValue(key = 'Core %s User' % cpu_name, value = user+"%"))
355 | vals.append(KeyValue(key = 'Core %s Nice' % cpu_name, value = nice+"%"))
356 | vals.append(KeyValue(key = 'Core %s System' % cpu_name, value = system+"%"))
357 | vals.append(KeyValue(key = 'Core %s Idle' % cpu_name, value = idle+"%"))
358 |
359 | num_cores += 1
360 |
361 | # Warn for high load only if we have <= 2 cores that aren't loaded
362 | if num_cores - cores_loaded <= 2 and num_cores > 2:
363 | mp_level = DiagnosticStatus.WARN
364 |
365 | if not self._num_cores:
366 | self._num_cores = num_cores
367 |
368 | # Check the number of cores if self._num_cores > 0, #4850
369 | if self._num_cores != num_cores:
370 | mp_level = DiagnosticStatus.ERROR
371 | if not self._has_error_core_count:
372 | rospy.logerr('Error checking number of cores. Expected %d, got %d. Computer may have not booted properly.',
373 | self._num_cores, num_cores)
374 | self._has_error_core_count = True
375 | return DiagnosticStatus.ERROR, 'Incorrect number of CPU cores', vals
376 |
377 | except Exception, e:
378 | mp_level = DiagnosticStatus.ERROR
379 | vals.append(KeyValue(key = 'mpstat Exception', value = str(e)))
380 |
381 | return mp_level, load_dict[mp_level], vals
382 |
383 | ## Returns names for core temperature files
384 | ## Returns list of names, each name can be read like file
385 | def get_core_temp_names(self):
386 | temp_vals = []
387 | try:
388 | p = subprocess.Popen('find /sys/devices -name temp1_input',
389 | stdout = subprocess.PIPE,
390 | stderr = subprocess.PIPE, shell = True)
391 | stdout, stderr = p.communicate()
392 | retcode = p.returncode
393 |
394 | if retcode != 0:
395 | rospy.logerr('Error find core temp locations: %s' % stderr)
396 | return []
397 |
398 | for ln in stdout.split('\n'):
399 | temp_vals.append(ln.strip())
400 |
401 | return temp_vals
402 | except:
403 | rospy.logerr('Exception finding temp vals: %s' % traceback.format_exc())
404 | return []
405 |
406 | ## Call every 10sec at minimum
407 | def check_temps(self):
408 | if rospy.is_shutdown():
409 | with self._mutex:
410 | self.cancel_timers()
411 | return
412 |
413 | diag_vals = [ KeyValue(key = 'Update Status', value = 'OK' ),
414 | KeyValue(key = 'Time Since Last Update', value = str(0) ) ]
415 | diag_msgs = []
416 | diag_level = 0
417 |
418 | if self._check_core_temps:
419 | core_vals, core_msgs, core_level = self.check_core_temps(self._temp_vals)
420 | diag_vals.extend(core_vals)
421 | diag_msgs.extend(core_msgs)
422 | diag_level = max(diag_level, core_level)
423 |
424 | diag_log = set(diag_msgs)
425 | if len(diag_log) > 0:
426 | message = ', '.join(diag_log)
427 | else:
428 | message = stat_dict[diag_level]
429 |
430 | with self._mutex:
431 | self._last_temp_time = rospy.get_time()
432 |
433 | self._temp_stat.level = diag_level
434 | self._temp_stat.message = message
435 | self._temp_stat.values = diag_vals
436 |
437 | if not rospy.is_shutdown():
438 | self._temps_timer = threading.Timer(5.0, self.check_temps)
439 | self._temps_timer.start()
440 | else:
441 | self.cancel_timers()
442 |
443 | def check_usage(self):
444 | if rospy.is_shutdown():
445 | with self._mutex:
446 | self.cancel_timers()
447 | return
448 |
449 | diag_level = 0
450 | diag_vals = [ KeyValue(key = 'Update Status', value = 'OK' ),
451 | KeyValue(key = 'Time Since Last Update', value = 0 )]
452 | diag_msgs = []
453 |
454 | # Check clock speed
455 | clock_vals, clock_msgs, clock_level = self.check_clock_speed()
456 | diag_vals.extend(clock_vals)
457 | diag_msgs.extend(clock_msgs)
458 | diag_level = max(diag_level, clock_level)
459 |
460 | # Check mpstat
461 | mp_level, mp_msg, mp_vals = self.check_mpstat()
462 | diag_vals.extend(mp_vals)
463 | if mp_level > 0:
464 | diag_msgs.append(mp_msg)
465 | diag_level = max(diag_level, mp_level)
466 |
467 | # Check uptime
468 | uptime_level, up_msg, up_vals = self.check_uptime()
469 | diag_vals.extend(up_vals)
470 | if uptime_level > 0:
471 | diag_msgs.append(up_msg)
472 | diag_level = max(diag_level, uptime_level)
473 |
474 | if diag_msgs and diag_level > 0:
475 | usage_msg = ', '.join(set(diag_msgs))
476 | else:
477 | usage_msg = stat_dict[diag_level]
478 |
479 | # Update status
480 | with self._mutex:
481 | self._last_usage_time = rospy.get_time()
482 | self._usage_stat.level = diag_level
483 | self._usage_stat.values = diag_vals
484 |
485 | self._usage_stat.message = usage_msg
486 |
487 | if not rospy.is_shutdown():
488 | self._usage_timer = threading.Timer(5.0, self.check_usage)
489 | self._usage_timer.start()
490 | else:
491 | self.cancel_timers()
492 |
493 | def publish_stats(self):
494 | with self._mutex:
495 | # Update everything with last update times
496 | update_status_stale(self._temp_stat, self._last_temp_time)
497 | update_status_stale(self._usage_stat, self._last_usage_time)
498 |
499 | msg = DiagnosticArray()
500 | msg.header.stamp = rospy.get_rostime()
501 | msg.status.append(self._temp_stat)
502 | msg.status.append(self._usage_stat)
503 |
504 | if rospy.get_time() - self._last_publish_time > 0.5:
505 | self._diag_pub.publish(msg)
506 | self._last_publish_time = rospy.get_time()
507 |
508 |
509 | # Restart temperature checking if it goes stale, #4171
510 | # Need to run this without mutex
511 | if rospy.get_time() - self._last_temp_time > 90:
512 | self._restart_temp_check()
513 |
514 |
515 | if __name__ == '__main__':
516 | hostname = socket.gethostname()
517 | hostname = hostname.replace('-', '_')
518 |
519 | import optparse
520 | parser = optparse.OptionParser(usage="usage: cpu_monitor.py [--diag-hostname=cX]")
521 | parser.add_option("--diag-hostname", dest="diag_hostname",
522 | help="Computer name in diagnostics output (ex: 'c1')",
523 | metavar="DIAG_HOSTNAME",
524 | action="store", default = hostname)
525 | options, args = parser.parse_args(rospy.myargv())
526 |
527 | try:
528 | rospy.init_node('cpu_monitor_%s' % hostname)
529 | except rospy.exceptions.ROSInitException:
530 | print >> sys.stderr, 'CPU monitor is unable to initialize node. Master may not be running.'
531 | sys.exit(0)
532 |
533 | cpu_node = CPUMonitor(hostname, options.diag_hostname)
534 |
535 | rate = rospy.Rate(1.0)
536 | try:
537 | while not rospy.is_shutdown():
538 | rate.sleep()
539 | cpu_node.publish_stats()
540 | except KeyboardInterrupt:
541 | pass
542 | except Exception, e:
543 | traceback.print_exc()
544 | rospy.logerr(traceback.format_exc())
545 |
546 | cpu_node.cancel_timers()
547 | sys.exit(0)
548 |
--------------------------------------------------------------------------------