├── .gitignore ├── LICENSE ├── Makefile ├── README.md ├── bin └── run.sh ├── etc ├── master.config ├── slave.config └── sys.config ├── include └── netmon.hrl ├── rebar.config ├── rebar.config.script └── src ├── Makefile ├── edoc.css ├── netmon.app.in ├── netmon.app.src ├── netmon.erl ├── netmon.rel.src ├── netmon_app.erl ├── netmon_instance.erl └── overview.edoc /.gitignore: -------------------------------------------------------------------------------- 1 | *.beam 2 | *.o 3 | *.d 4 | *.swp 5 | *.tgz 6 | doc 7 | priv 8 | ebin 9 | .rebar/* 10 | .eunit 11 | _build 12 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD LICENSE 2 | =========== 3 | 4 | Copyright (C) 2006 Serge Aleynikov 5 | 6 | All rights reserved. 7 | 8 | Redistribution and use in source and binary forms, with or without 9 | modification, are permitted provided that the following conditions are met: 10 | 11 | 1. Redistributions of source code must retain the above copyright notice, 12 | this list of conditions and the following disclaimer. 13 | 14 | 2. Redistributions in binary form must reproduce the above copyright 15 | notice, this list of conditions and the following disclaimer in 16 | the documentation and/or other materials provided with the distribution. 17 | 18 | 3. The names of the authors may not be used to endorse or promote products 19 | derived from this software without specific prior written permission. 20 | 21 | THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, 22 | INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND 23 | FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL JCRAFT, 24 | INC. 
OR ANY CONTRIBUTORS TO THIS SOFTWARE BE LIABLE FOR ANY DIRECT, INDIRECT, 25 | INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 26 | LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, 27 | OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 28 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 29 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, 30 | EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 31 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | PROJECT = $(notdir $(PWD)) 2 | TARBALL = $(PROJECT) 3 | REBAR = rebar 4 | 5 | all: 6 | @$(REBAR) compile 7 | 8 | clean doc: 9 | @$(REBAR) $@ 10 | 11 | docs: doc 12 | 13 | clean-docs: 14 | rm -fr doc 15 | 16 | github-docs: 17 | rm -fr doc 18 | @if git branch | grep -q gh-pages ; then \ 19 | git checkout gh-pages; \ 20 | else \ 21 | git checkout -b gh-pages; \ 22 | fi 23 | rm -f rebar.lock 24 | git checkout master -- src include Makefile rebar.* 25 | make docs 26 | mv doc/*.* . 27 | make clean 28 | rm -fr src include Makefile erl_crash.dump priv etc rebar.* README* 29 | @FILES=`git st -uall --porcelain | sed -n '/^?? [A-Za-z0-9]/{s/?? 
//p}'`; \ 30 | for f in $$FILES ; do \ 31 | echo "Adding $$f"; git add $$f; \ 32 | done 33 | @sh -c "ret=0; set +e; \ 34 | if git commit -a --amend -m 'Documentation updated'; \ 35 | then git push origin +gh-pages; echo 'Pushed gh-pages to origin'; \ 36 | else ret=1; git reset --hard; \ 37 | fi; \ 38 | set -e; git checkout master && echo 'Switched to master'; exit $$ret" 39 | 40 | tar: 41 | rm $(PROJECT).tgz; \ 42 | tar zcf $(TARBALL).tgz --transform 's|^|$(TARBALL)/|' --exclude="core*" --exclude="erl_crash.dump" \ 43 | --exclude="*.tgz" --exclude="*.swp" \ 44 | --exclude="*.o" --exclude=".git" --exclude=".svn" * && \ 45 | echo "Created $(TARBALL).tgz" 46 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Network Monitor ## 2 | 3 | This application detects network splits and executes a custom function when 4 | remote nodes become accessible. 5 | 6 | See documentation at http://saleyn.github.io/netmon. 7 | 8 | ## Author ## 9 | 10 | Serge Aleynikov 11 | 12 | ## License ## 13 | 14 | Open-source BSD License 15 | -------------------------------------------------------------------------------- /bin/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | SCRIPT=`readlink -f $0` 4 | SCRIPT=${SCRIPT:-$0} 5 | cd ${SCRIPT%/*} 6 | 7 | erl -sname a -pa ../ebin -boot ../priv/netmon -config ../etc/sys.config 8 | -------------------------------------------------------------------------------- /etc/master.config: -------------------------------------------------------------------------------- 1 | [ 2 | {kernel, 3 | [{dist_auto_connect, once}, 4 | {inet_dist_listen_min, 4370}, 5 | {inet_dist_listen_max, 4380} 6 | ]}, 7 | 8 | {sasl, [{errlog_type, error}]}, 9 | 10 | {netmon, 11 | [ 12 | {behind_firewall, ['etc600','dev1','dev2']} 13 | ]} 14 | ]. 
15 | 16 | % netmon:add_monitor(test, netmon, any, ['b@dev-serge-lap'], {netmon_instance,test_notify}, 10). 17 | % netmon:add_monitor(test, netmon, any, ['a@dev-serge-lap'], {netmon_instance,test_notify}, 10). 18 | -------------------------------------------------------------------------------- /etc/slave.config: -------------------------------------------------------------------------------- 1 | [ 2 | {kernel, 3 | [ 4 | {dist_auto_connect, never}, 5 | {inet_dist_listen_min, 4370}, 6 | {inet_dist_listen_max, 4380} 7 | ]}, 8 | 9 | {sasl, [{errlog_type, error}]}, 10 | 11 | {netmon, 12 | []} 13 | ]. 14 | 15 | % netmon:add_monitor(test, netmon, any, ['b@dev-serge-lap'], {netmon_instance,test_notify}, 10). 16 | % netmon:add_monitor(test, netmon, any, ['a@dev-serge-lap'], {netmon_instance,test_notify}, 10). 17 | -------------------------------------------------------------------------------- /etc/sys.config: -------------------------------------------------------------------------------- 1 | [ 2 | {kernel, 3 | [%{dist_auto_connect, once}, 4 | {inet_dist_listen_min, 4370}, 5 | {inet_dist_listen_max, 4380} 6 | ]}, 7 | 8 | {sasl, [{errlog_type, error}]}, 9 | 10 | {netmon, 11 | [ 12 | {monitors, 13 | [ %{'snmp@dev1', [{test, netmon, any, ['snmp@dev4'], {netmon_instance, test_notify}, 10, net_adm}]} 14 | {'snmp@dev2', [{test, netmon, any, ['snmp@dev3'], {netmon_instance, test_notify}, 10, tcp}]} 15 | ,{'snmp@dev3', [{test, netmon, any, ['snmp@dev1'], {netmon_instance, test_notify}, 10, passive}]} 16 | ,{'snmp@dev4', [{test, netmon, any, ['snmp@dev2'], {netmon_instance, test_notify}, 10, passive}]} 17 | ]}, 18 | {echo_port, 64000}, 19 | %{multicast, {224,1,1,1}}, 20 | %{echo_addr, "192.168.0.2"}, 21 | {port_map, [{'a@dev-serge-lap', 63999}, {'c@dev-serge-lap', 63998}, {a@dev_serge2, 63999}, {b@dev_serge2, 63998}]} 22 | ]} 23 | 24 | ]. 25 | 26 | % netmon:add_monitor(test, netmon, any, ['b@dev-serge-lap'], {netmon_instance,test_notify}, 10). 
27 | % netmon:add_monitor(test, netmon, any, ['a@dev-serge-lap'], {netmon_instance,test_notify}, 10). 28 | -------------------------------------------------------------------------------- /include/netmon.hrl: -------------------------------------------------------------------------------- 1 | %%%------------------------------------------------------------------------ 2 | %%% Created: 2007-01-10 by Serge Aleynikov 3 | %%%------------------------------------------------------------------------ 4 | 5 | -record(ping_packet, { 6 | version % string() - netmon version. 7 | , name % atom() - name of the netmon_instance sending request 8 | , type % ping | pong - ping=request, pong=reply 9 | , node % FromNode::node() 10 | , app % FromApp::atom() 11 | , start_time % now() - start time of the pinging/replying node 12 | , sent_time % now() - sent time of the packet 13 | , id % Some unique transaction id 14 | }). 15 | 16 | -record(mon_notify, { 17 | name % Monitor's name 18 | , action % node_up | node_down | ping | pong 19 | , node % Node name 20 | , node_start_time % Start time of the `node' (if known from the last ping packet) 21 | , details % #ping_packet{} when action is 'ping' | 'pong' 22 | % or `Info' when action is node_up | node_down 23 | , up_nodes % Up nodes 24 | , down_nodes % Down nodes 25 | , watch_nodes % Nodes to watch for connectivity 26 | , start_time % Start time of current node 27 | }). 28 | 29 | -ifndef(FMT). 30 | -define(FMT(F, A), lists:flatten(io_lib:format(F, A))). 31 | -endif. 32 | -------------------------------------------------------------------------------- /rebar.config: -------------------------------------------------------------------------------- 1 | {erl_opts, [debug_info, fail_on_warning]}. 2 | 3 | {edoc_opts, [{overview, "src/overview.edoc"}, 4 | {title, "Network Connectivity Monitor"}, 5 | {stylesheet_file, "src/edoc.css"}, 6 | {app_default, "http://www.erlang.org/doc/man"}]}. 
7 | 8 | -------------------------------------------------------------------------------- /rebar.config.script: -------------------------------------------------------------------------------- 1 | %% Add version number for edoc 2 | Vsn = string:strip(os:cmd("git describe --always --tags | sed 's/^v//'"), right, $\n). 3 | {value, {_, EdocOpts}, Config2} = lists:keytake(edoc_opts, 1, CONFIG), 4 | [{edoc_opts, [{def, {vsn, Vsn}} | EdocOpts]} | Config2]. 5 | -------------------------------------------------------------------------------- /src/Makefile: -------------------------------------------------------------------------------- 1 | REL_DIR = ../priv 2 | 3 | ERL_ROOT = /opt/erlang 4 | ERLC = $(ERL_ROOT)/bin/erlc 5 | ERLC_FLAGS += -I ../include +debug_info 6 | 7 | include ../vsn.mk 8 | include ../erl_include.mk 9 | include ../erl_targets.mk 10 | -------------------------------------------------------------------------------- /src/edoc.css: -------------------------------------------------------------------------------- 1 | /* Baseline rhythm */ 2 | body { 3 | font-size: 16px; 4 | font-family: Helvetica, sans-serif; 5 | margin: 8px; 6 | } 7 | 8 | p { 9 | font-size: 1em; /* 16px */ 10 | line-height: 1.5em; /* 24px */ 11 | margin: 0 0 1.5em 0; 12 | } 13 | 14 | h1 { 15 | font-size: 1.5em; /* 24px */ 16 | line-height: 1em; /* 24px */ 17 | margin-top: 1em; 18 | margin-bottom: 0em; 19 | } 20 | 21 | h2 { 22 | font-size: 1.375em; /* 22px */ 23 | line-height: 1.0909em; /* 24px */ 24 | margin-top: 1.0909em; 25 | margin-bottom: 0em; 26 | } 27 | 28 | h3 { 29 | font-size: 1.25em; /* 20px */ 30 | line-height: 1.2em; /* 24px */ 31 | margin-top: 1.2em; 32 | margin-bottom: 0em; 33 | } 34 | 35 | h4 { 36 | font-size: 1.125em; /* 18px */ 37 | line-height: 1.3333em; /* 24px */ 38 | margin-top: 1.3333em; 39 | margin-bottom: 0em; 40 | } 41 | 42 | .class-for-16px { 43 | font-size: 1em; /* 16px */ 44 | line-height: 1.5em; /* 24px */ 45 | margin-top: 1.5em; 46 | margin-bottom: 0em; 47 | } 48 | 
49 | .class-for-14px { 50 | font-size: 0.875em; /* 14px */ 51 | line-height: 1.7143em; /* 24px */ 52 | margin-top: 1.7143em; 53 | margin-bottom: 0em; 54 | } 55 | 56 | ul { 57 | margin: 0 0 1.5em 0; 58 | list-style-position: outside; 59 | } 60 | 61 | /* Customizations */ 62 | body { 63 | color: #333; 64 | } 65 | 66 | tt, code, pre { font-size: 0.95em } 67 | 68 | pre { 69 | font-size: 0.875em; /* 14px */ 70 | margin: 0 1em 1.7143em; 71 | padding: 0 1em; 72 | background: #E0F0FF; 73 | } 74 | 75 | .navbar img, hr { 76 | display: none; 77 | } 78 | 79 | .navbar { 80 | background-color: #85C2FF; 81 | } 82 | 83 | table { 84 | border-collapse: collapse; 85 | } 86 | 87 | h1 { 88 | border-left: 0.5em solid #85C2FF; 89 | padding-left: 0.5em; 90 | background-color: #85C2FF; 91 | } 92 | 93 | h2.indextitle { 94 | font-size: 1.25em; /* 20px */ 95 | line-height: 1.2em; /* 24px */ 96 | margin: -8px -8px 0.6em; 97 | background-color: #85C2FF; 98 | padding: 0.3em; 99 | } 100 | 101 | ul.index { 102 | list-style: none; 103 | margin-left: 0em; 104 | padding-left: 0; 105 | } 106 | 107 | ul.index li { 108 | display: inline; 109 | padding-right: 0.75em 110 | } 111 | 112 | ul.definitions { 113 | background-color: #E0F0FF; 114 | } 115 | 116 | dt { 117 | font-weight: 600; 118 | } 119 | 120 | div.spec p { 121 | margin-bottom: 0; 122 | padding-left: 1.25em; 123 | background-color: #E0F0FF; 124 | } 125 | 126 | h3.function { 127 | border-left: 0.5em solid #85C2FF; 128 | padding-left: 0.5em; 129 | background-color: #BECFE0; 130 | } 131 | a, a:visited, a:hover, a:active { color: #06C } 132 | h2 a, h3 a { color: #333 } 133 | 134 | h3.typedecl { 135 | background-color: #BECFE0; 136 | } 137 | 138 | i { 139 | font-size: 0.875em; /* 14px */ 140 | line-height: 1.7143em; /* 24px */ 141 | margin-top: 1.7143em; 142 | margin-bottom: 0em; 143 | font-style: normal; 144 | } 145 | -------------------------------------------------------------------------------- /src/netmon.app.in: 
-------------------------------------------------------------------------------- 1 | {application, netmon, 2 | [ 3 | {description, "Network Connectivity Monitor"}, 4 | {vsn, "%VSN%"}, 5 | {id, "netmon"}, 6 | {modules, [%MODULES%]}, 7 | {registered, [ netmon_app, netmon ] }, 8 | %% NOTE: do not list applications which are load-only! 9 | {applications, [ kernel, stdlib, lama ] }, 10 | %{included_applications, [ mnesia ] }, 11 | %% 12 | %% mod: Specify the module name to start the application, plus args 13 | %% 14 | {mod, {netmon_app, []}}, 15 | {env, []} 16 | ] 17 | }. 18 | -------------------------------------------------------------------------------- /src/netmon.app.src: -------------------------------------------------------------------------------- 1 | {application, netmon, [ 2 | {description, "Network Connectivity Monitor"}, 3 | {vsn, git}, 4 | {id, "netmon"}, 5 | {registered, [netmon_app, netmon]}, 6 | %% NOTE: do not list applications which are load-only! 7 | {applications, [ kernel, stdlib, lama ] }, 8 | %% 9 | %% mod: Specify the module name to start the application, plus args 10 | %% 11 | {mod, {netmon_app, []}}, 12 | {env, []} 13 | ]}. 14 | -------------------------------------------------------------------------------- /src/netmon.erl: -------------------------------------------------------------------------------- 1 | %%%------------------------------------------------------------------------ 2 | %%% @doc Network Node Monitor 3 | %%% @author Serge Aleynikov 4 | %%% @version {@vsn} 5 | %%% @end 6 | %%%------------------------------------------------------------------------ 7 | %%% Created: 2007-01-10 by Serge Aleynikov 8 | %%%------------------------------------------------------------------------ 9 | -module(netmon). 10 | -author('saleyn@gmail.com'). 11 | 12 | -behaviour(gen_server). 13 | 14 | %% External exports 15 | -export([start_link/1, add_monitor/6, add_monitor/7, del_monitor/1, 16 | ip_to_str/1, ip_to_str/2, nodes_to_ips/1]). 
17 |
18 | %% Internal exports
19 | -export([send_ping/5, state/0]).
20 |
21 | %% gen_server callbacks
22 | -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
23 |          code_change/3, terminate/2]).
24 |
25 | -include("netmon.hrl").
26 | -include_lib("kernel/include/net_address.hrl").
27 | -include_lib("kernel/include/inet.hrl").
28 |
29 | -record(state, {
30 |     version,      % Module version extracted from -vsn() attribute.
31 |     socket,       % Echo UDP socket
32 |     tcp_listener, % TCP Listener socket for incoming ECHO connections.
33 |     echo_addr,    % ip_address() - netif to use when sending ECHO packets.
34 |     echo_port,    % integer() - port to use when sending/receiving ECHO packets.
35 |     tcp_timeout,  % TCP connect timeout for TCP pings.
36 |     accept_ref,   % Reference returned by the asynchronous accept call.
37 |     tcp_clients,  % List of connected TCP ECHO clients.
38 |     port_map,     % [ {node(), UdpPort::integer()} ] - exception list of UDP ports
39 |     start_time,   % Application's startup time in now() format
40 |     rev           % Revision of this module
41 | }).
42 |
43 | %%%------------------------------------------------------------------------
44 | %%% API
45 | %%%------------------------------------------------------------------------
46 |
47 | %%-------------------------------------------------------------------------
48 | %% @spec (Options) -> {ok, Pid} | {error, Reason}
49 | %%          Options = [ Option ]
50 | %%          Option = {echo_addr, IP::ip_address()} |
51 | %%                   {echo_port, Port::integer()} |
52 | %%                   {multicast, IP::ip_address()}
53 | %% @doc Start a network monitoring agent. It will listen for UDP and TCP pings
54 | %%      on a given interface/port. If multicast IP is given, the process
55 | %%      will join the multicast group.
56 | %% @end
57 | %%-------------------------------------------------------------------------
58 | start_link(Options) ->
59 |     gen_server:start_link({local, ?MODULE}, ?MODULE, [Options], []).
60 | 61 | %%------------------------------------------------------------------------- 62 | %% @equiv add_monitor(Name, App, Type, Nodes, Notify, Interval, udp) 63 | %% @doc This function starts a process monitoring `Nodes'. 64 | %% @see add_monitor/7 65 | %% @end 66 | %%------------------------------------------------------------------------- 67 | add_monitor(Name, App, Type, Nodes, Notify, Interval) -> 68 | add_monitor(Name, App, Type, Nodes, Notify, Interval, udp). 69 | 70 | %%------------------------------------------------------------------------- 71 | %% @equiv netmon_instance:start_link(Name, App, Type, Nodes, Notify, 72 | %% Interval, PingType) 73 | %% @doc This function starts a process monitoring `Nodes'. 74 | %% @see netmon_instance:start_link/7 75 | %% @end 76 | %%------------------------------------------------------------------------- 77 | add_monitor(Name, App, Type, Nodes, _Notify = {M, F}, Interval, PingType) -> 78 | netmon_app:start_monitor({Name, App, Type, Nodes, {M, F}, Interval, PingType}). 79 | 80 | %%------------------------------------------------------------------------- 81 | %% @spec (Name) -> ok | {error, not_running} 82 | %% @doc Shuts down a monitor `Name' process. 83 | %% @end 84 | %%------------------------------------------------------------------------- 85 | del_monitor(Name) -> 86 | case whereis(Name) of 87 | Pid when is_pid(Pid) -> 88 | exit(Pid, shutdown), 89 | ok; 90 | undefined -> 91 | {error, not_running} 92 | end. 
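
%% Usage sketch (illustrative only, not part of the original module). The node
%% name 'b@host1' is a placeholder; {netmon_instance, test_notify} is the
%% callback named in the sample configs under etc/.
%%
%%   netmon:add_monitor(test, netmon, any, ['b@host1'],
%%                      {netmon_instance, test_notify}, 10).
%%   netmon:del_monitor(test).
%%
%% add_monitor/6 defaults the ping type to udp; add_monitor/7 takes it as the
%% last argument (for example tcp). add_monitor returns whatever
%% supervisor:start_child/2 returns. del_monitor/1 looks the monitor up with
%% whereis(Name) and returns ok or {error, not_running}.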
93 | 94 | %%------------------------------------------------------------------------- 95 | %% @spec (Method, NodeIPs, MonitorName::atom(), App::atom(), 96 | %% StartTime::now()) -> ok 97 | %% Method = udp | tcp 98 | %% NodeIPs = [ {node(), ip_address()} ] 99 | %% @private 100 | %%------------------------------------------------------------------------- 101 | send_ping(Method, NodeIPs, MonitorName, App, StartTime) -> 102 | ID = random:uniform((1 bsl 27)-1), 103 | Packet = #ping_packet{ 104 | name = MonitorName, 105 | type = ping, 106 | node = node(), 107 | app = App, 108 | start_time = StartTime, 109 | id = ID 110 | }, 111 | ok = gen_server:call(?MODULE, {send_ping, Method, NodeIPs, Packet}), 112 | ID. 113 | 114 | %%------------------------------------------------------------------------- 115 | %% @spec () -> {ok, [ {Option::atom(), Value} ]} 116 | %% @doc Return internal server state 117 | %% @end 118 | %%------------------------------------------------------------------------- 119 | state() -> 120 | gen_server:call(?MODULE, state). 121 | 122 | %%------------------------------------------------------------------------- 123 | %% @spec (IP) -> string() 124 | %% IP = ip_address() | string() | socket() 125 | %% @doc Formats IP address as "XXX.YYY.ZZZ.NNN" or "XXX.YYY.ZZZ.NNN:MMMMM" 126 | %% if IP is a `socket()'. 127 | %% @end 128 | %%------------------------------------------------------------------------- 129 | ip_to_str({_,_,_,_}=IP) -> 130 | inet_parse:ntoa(IP); 131 | ip_to_str(IP) when is_list(IP) -> 132 | IP; 133 | ip_to_str(Sock) when is_port(Sock) -> 134 | {ok, {Addr, Port}} = inet:sockname(Sock), 135 | ip_to_str(Addr, Port). 
136 |
137 | %%-------------------------------------------------------------------------
138 | %% @spec (IP, Port::integer()) -> string()
139 | %%          IP = ip_address() | string()
140 | %% @doc Formats IP:Port address as "XXX.YYY.ZZZ.NNN:MMMMM"
141 | %% @end
142 | %%-------------------------------------------------------------------------
143 | ip_to_str({I1,I2,I3,I4}, Port) ->
144 |     ?FMT("~w.~w.~w.~w:~w", [I1,I2,I3,I4,Port]);
145 | ip_to_str(IP, Port) when is_list(IP) ->
146 |     ?FMT("~s:~w", [IP,Port]).
147 |
148 | %%----------------------------------------------------------------------
149 | %% @spec nodes_to_ips(Nodes) -> [ {node(), IP::ip_address()} ]
150 | %%          Nodes = [ node() ]
151 | %% @doc Converts a list of node names to a list of {Node, IP} pairs.
152 | %%      The calling process exits if a node's host cannot be resolved.
153 | %% @end
154 | %%----------------------------------------------------------------------
155 | nodes_to_ips(Nodes) ->
156 |     F = fun(N) ->
157 |         case net_kernel:node_info(N, address) of
158 |         {ok, #net_address{address={Addr, _}}} ->
159 |             Addr;
160 |         {error, _} ->
161 |             Host = string:sub_word(atom_to_list(N), 2, $@),
162 |             case inet:gethostbyname(Host, inet) of
163 |             {ok, #hostent{h_addr_list=[IP | _Rest]}} ->
164 |                 IP;
165 |             {error, Reason} ->
166 |                 exit(?FMT("Can't resolve host: ~s. Reason: ~p", [Host, Reason]))
167 |             end
168 |         end
169 |     end,
170 |     [{N, F(N)} || N <- Nodes].
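
%% Examples of the ip_to_str conversions above (a sketch; the address and
%% port are arbitrary sample values):
%%
%%   netmon:ip_to_str({192,168,0,2})          returns "192.168.0.2"
%%   netmon:ip_to_str({192,168,0,2}, 64000)   returns "192.168.0.2:64000"
%%   netmon:ip_to_str("192.168.0.2", 64000)   returns "192.168.0.2:64000"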
171 | 172 | %%%---------------------------------------------------------------------- 173 | %%% Callback functions from gen_server 174 | %%%---------------------------------------------------------------------- 175 | 176 | %%----------------------------------------------------------------------- 177 | %% Func: init/1 178 | %% Returns: {ok, State} | 179 | %% {ok, State, Timeout} | 180 | %% ignore | 181 | %% {stop, Reason} 182 | %% @private 183 | %%----------------------------------------------------------------------- 184 | init([Options]) -> 185 | try 186 | PMap = proplists:get_value(port_map, Options, []), 187 | MCast = proplists:get_value(multicast, Options), 188 | Addr = proplists:get_value(echo_addr, Options, {0,0,0,0}), 189 | DefPort = proplists:get_value(echo_port, Options), 190 | TcpTout = proplists:get_value(tcp_timeout, Options, 2000), 191 | Port = case lists:keysearch(node(), 1, PMap) of 192 | {value, {_, P}} -> P; 193 | false -> DefPort 194 | end, 195 | 196 | case Addr of 197 | Addr when is_list(Addr) -> 198 | {ok, IP} = inet_parse:ipv4_address(Addr); 199 | IP -> 200 | ok 201 | end, 202 | 203 | case MCast of 204 | undefined -> Multicast = []; 205 | MAddr -> Multicast = [{add_membership, {MAddr, IP}}] 206 | end, 207 | 208 | Rev = try re:replace( 209 | proplists:get_value(vsn, module_info(attributes), ["1.0"]), 210 | "[^0-9\.]+", [], [{return, list}]) of 211 | L when is_list(L), L =/= [] -> 212 | L 213 | catch _:_ -> 214 | "1.0" 215 | end, 216 | 217 | IPstr = ip_to_str(IP, Port), 218 | 219 | % Open a UDP socket used for ECHO pings 220 | case gen_udp:open(Port, [binary, {reuseaddr, true}, {ip, IP}, {active, once}] ++ Multicast) of 221 | {ok, Sock} -> 222 | error_logger:info_msg("Opened netmon UDP listener on ~s\n", [IPstr]); 223 | {error, Error} -> 224 | Sock = undefined, 225 | throw({error, ?FMT("Cannot open UDP port (~s): ~s\n", [IPstr, inet:format_error(Error)])}) 226 | end, 227 | 228 | % Open a TCP socket used for ECHO pings 229 | case 
gen_tcp:listen(Port, [binary, {reuseaddr, true}, {ip, IP}, {active, once}]) of
230 |         {ok, LSock} ->
231 |             error_logger:info_msg("Opened netmon TCP listener on ~s\n", [IPstr]),
232 |             {ok, LRef} = prim_inet:async_accept(LSock, -1),
233 |
234 |             {ok, #state{socket=Sock, echo_addr=IP, echo_port=DefPort,
235 |                         accept_ref=LRef, tcp_listener=LSock,
236 |                         port_map=PMap, start_time=now(), rev=Rev,
237 |                         tcp_timeout=TcpTout, tcp_clients=[]}};
238 |         {error, Why} ->
239 |             throw({error, ?FMT("Cannot start TCP listener (~s): ~s\n", [IPstr, inet:format_error(Why)])})
240 |         end
241 |
242 |     catch _:Err ->
243 |         {stop, Err}
244 |     end.
245 |
246 | %%----------------------------------------------------------------------
247 | %% Func: handle_call/3
248 | %% Returns: {reply, Reply, State} |
249 | %%          {reply, Reply, State, Timeout} |
250 | %%          {noreply, State} |
251 | %%          {noreply, State, Timeout} |
252 | %%          {stop, Reason, Reply, State} | (terminate/2 is called)
253 | %%          {stop, Reason, State} (terminate/2 is called)
254 | %% @private
255 | %%----------------------------------------------------------------------
256 | handle_call({send_ping, Method, NodeIPs, #ping_packet{} = Packet}, _From,
257 |             #state{echo_port=DefPort, port_map=PMap, rev=Rev} = State) ->
258 |     PacketB = term_to_binary(Packet#ping_packet{version=Rev, sent_time=now()}),
259 |     F = fun({N, IP}) ->
260 |         Port = proplists:get_value(N, PMap, DefPort),
261 |         do_send_ping(Method, State, IP, Port, PacketB)
262 |     end,
263 |     [ F(NodeIP) || NodeIP <- NodeIPs ],
264 |     {reply, ok, State};
265 |
266 | handle_call(state, _From, #state{echo_addr=Addr, echo_port=P, port_map=PMap, start_time=T,
267 |                                  socket=S, tcp_listener=L, accept_ref=A, tcp_clients=C} = State) ->
268 |     Reply = [{echo_addr, Addr}, {echo_port,P}, {port_map,PMap}, {start_time,T},
269 |              {udp_sock, inet:sockname(S)}, {tcp_sock, inet:sockname(L)},
270 |              {accept_ref, A}, {tcp_clients, C}],
271 |     {reply, {ok, Reply}, State};
272 |
273 | handle_call(Request, _From, State) ->
274 |     {stop, {not_implemented, Request}, State}.
275 |
276 | do_send_ping(udp, State, IP, Port, PacketB) ->
277 |     gen_udp:send(State#state.socket, IP, Port, PacketB);
278 | do_send_ping(tcp, State, IP, Port, PacketB) ->
279 |     NetIF = State#state.echo_addr,
280 |     Tout = State#state.tcp_timeout,
281 |     % Don't block the main server process
282 |     spawn(fun() ->
283 |         case gen_tcp:connect(IP, Port, [binary, {nodelay, true}, {packet, 2}, {ip, NetIF}, {active, false}], Tout) of
284 |         {ok, Sock} -> ok;
285 |         {error, Sock} -> exit(normal)
286 |         end,
287 |
288 |         case gen_tcp:send(Sock, PacketB) of
289 |         ok -> ok;
290 |         _ -> exit(normal)
291 |         end,
292 |
293 |         case gen_tcp:recv(Sock, 0, Tout*2) of
294 |         {ok, Reply} ->
295 |             case binary_to_term(Reply) of
296 |             #ping_packet{type=pong, name=Name, app=App} = Pong ->
297 |                 error_logger:info_msg("Netmon reply to ~w ~w.\n", [Name, App]),
298 |                 netmon_instance:notify(Pong);
299 |             Other ->
300 |                 error_logger:error_msg("Netmon - invalid TCP pong response: ~p\n", [Other])
301 |             end;
302 |         {error, _} ->
303 |             ok
304 |         end,
305 |
306 |         exit(normal)
307 |     end),
308 |     ok.
309 |
310 | %%----------------------------------------------------------------------
311 | %% Func: handle_cast/2
312 | %% Returns: {noreply, State} |
313 | %%          {noreply, State, Timeout} |
314 | %%          {stop, Reason, State} (terminate/2 is called)
315 | %% @private
316 | %%----------------------------------------------------------------------
317 | handle_cast(_Msg, State) ->
318 |     {noreply, State}.
319 | 320 | %%---------------------------------------------------------------------- 321 | %% Func: handle_info/2 322 | %% Returns: {noreply, State} | 323 | %% {noreply, State, Timeout} | 324 | %% {stop, Reason, State} (terminate/2 is called) 325 | %% @private 326 | %%---------------------------------------------------------------------- 327 | handle_info({udp, Socket, IP, Port, Packet}, #state{socket=Socket} = State) -> 328 | do_handle_echo(udp, Socket, IP, Port, Packet, State#state.start_time), 329 | {noreply, State}; 330 | 331 | handle_info({tcp, Socket, Packet}, #state{tcp_clients=Clients} = State) -> 332 | {ok, {IP, Port}} = inet:sockname(Socket), 333 | do_handle_echo(tcp, Socket, IP, Port, Packet, State#state.start_time), 334 | % We are only expecting one ECHO message from client. 335 | gen_tcp:close(Socket), 336 | {noreply, State#state{tcp_clients = (Clients -- [Socket])}}; 337 | 338 | handle_info({tcp_closed, Socket}, #state{tcp_clients=Clients} = State) -> 339 | {noreply, State#state{tcp_clients = (Clients -- [Socket])}}; 340 | 341 | handle_info({tcp_error, Socket, _Reason}, #state{tcp_clients=Clients} = State) -> 342 | {noreply, State#state{tcp_clients = (Clients -- [Socket])}}; 343 | 344 | %% New incoming TCP connection 345 | handle_info({inet_async, ListSock, Ref, {ok, CliSocket}}, 346 | #state{tcp_listener=ListSock, accept_ref=Ref, tcp_clients=CliSocks} = State) -> 347 | true = inet_db:register_socket(CliSocket, inet_tcp), 348 | {ok, NewRef} = prim_inet:async_accept(ListSock, -1), 349 | % Passively wait for one ECHO message. 
350 | inet:setopts(CliSocket, [{nodelay, true}, binary, {active, once}, {packet, 2}]), 351 | {noreply, State#state{accept_ref=NewRef, tcp_clients=[CliSocket | CliSocks]}}; 352 | 353 | handle_info({inet_async, ListSock, Ref, Error}, #state{tcp_listener=ListSock, accept_ref=Ref} = State) -> 354 | error_logger:error_msg("Netmon - error in socket acceptor: ~p.\n", [Error]), 355 | {stop, Error, State}; 356 | 357 | handle_info(_Info, State) -> 358 | {noreply, State}. 359 | 360 | %%---------------------------------------------------------------------- 361 | %% Func: code_change/3 362 | %% Purpose: Convert process state when code is changed 363 | %% Returns: {ok, NewState} 364 | %% @private 365 | %%---------------------------------------------------------------------- 366 | code_change(_OldVsn, State, _Extra) -> 367 | {ok, State}. 368 | 369 | %%---------------------------------------------------------------------- 370 | %% Func: terminate/2 371 | %% Purpose: Shutdown the server 372 | %% Returns: any (ignored by gen_server) 373 | %% @private 374 | %%---------------------------------------------------------------------- 375 | terminate(_Reason, #state{tcp_clients=Clients}) -> 376 | [gen_tcp:close(C) || C <- Clients], 377 | ok. 
378 | 379 | %%%---------------------------------------------------------------------- 380 | %%% Internal functions 381 | %%%---------------------------------------------------------------------- 382 | 383 | do_handle_echo(Proto, Socket, IP, Port, Packet, StartTime) -> 384 | inet:setopts(Socket, [{active, once}]), 385 | case catch binary_to_term(Packet) of 386 | #ping_packet{type=ping, app=App, node=Node} = Ping -> 387 | error_logger:info_msg("Netmon ~w ping from ~w (~w): ~s\n", 388 | [Proto, Node, App, ip_to_str(IP, Port)]), 389 | netmon_instance:notify(Ping), 390 | Reply = Ping#ping_packet{type=pong, node=node(), start_time=StartTime}, 391 | case Proto of 392 | udp -> ok = gen_udp:send(Socket, IP, Port, term_to_binary(Reply)); 393 | tcp -> ok = gen_tcp:send(Socket, term_to_binary(Reply)) 394 | end; 395 | #ping_packet{type=pong, app=App, node=Node} = Pong when Proto =:= udp -> 396 | % Only UDP pongs arrive here. TCP pongs are handled by do_send_ping() 397 | error_logger:info_msg("Netmon ~w echo from ~w (~w): ~s\n", 398 | [Proto, Node, App, ip_to_str(IP, Port)]), 399 | netmon_instance:notify(Pong); 400 | _ -> 401 | ok 402 | end. 403 | 404 | -------------------------------------------------------------------------------- /src/netmon.rel.src: -------------------------------------------------------------------------------- 1 | {release, {"netmon","1.0"}, {erts, ""}, 2 | [ {kernel, ""} 3 | ,{stdlib, ""} 4 | ,{sasl, ""} 5 | ,{lama, ""} 6 | ,{netmon, ""} 7 | ]}. 8 | -------------------------------------------------------------------------------- /src/netmon_app.erl: -------------------------------------------------------------------------------- 1 | %%%------------------------------------------------------------------------ 2 | %%% File: $Id$ 3 | %%%------------------------------------------------------------------------ 4 | %%% @doc Application for monitoring network of connected nodes. 
5 | %%% @author Serge Aleynikov
6 | %%% @version $Revision$
7 | %%% @end
8 | %%%----------------------------------------------------------------------
9 | %%% Created: 2007-01-10 by Serge Aleynikov
10 | %%% $URL$
11 | %%%----------------------------------------------------------------------
12 | -module(netmon_app).
13 | -author('saleyn@gmail.com').
14 | -id ("$Id$").
15 |
16 | -behaviour(application).
17 |
18 | %% application and supervisor callbacks
19 | -export([start/2, stop/1, init/1]).
20 |
21 | %% Internal exports
22 | -export([start_monitor/1, get_node_opts/1]).
23 |
24 | -include_lib("lama/include/logger.hrl").
25 |
26 | %%%----------------------------------------------------------------------
27 | %%% API
28 | %%%----------------------------------------------------------------------
29 |
30 | %%----------------------------------------------------------------------
31 | %% @doc A startup function for spawning a new monitor process.
32 | %%      Called by netmon:add_monitor/7.
33 | %% @see netmon:add_monitor/7
34 | %% @see netmon_instance:start_link/7
35 | %% @private
36 | %% @end
37 | %%----------------------------------------------------------------------
38 | start_monitor({Name, App, Type, Nodes, Notify, Interval}) ->
39 |     start_monitor({Name, App, Type, Nodes, Notify, Interval, udp});
40 | start_monitor({Name, _App, _Type, _Nodes, _Notify, _Interval, _PingType} = Tuple) ->
41 |     Args = tuple_to_list(Tuple),
42 |     % Netmon Instance
43 |     ChildSpec = { Name,                               % Id = internal id
44 |                   {netmon_instance,start_link,Args},  % StartFun = {M, F, A}
45 |                   transient,                          % Restart = permanent | transient | temporary
46 |                   2000,                               % Shutdown = brutal_kill | int() >= 0 | infinity
47 |                   worker,                             % Type = worker | supervisor
48 |                   []                                  % Modules = [Module] | dynamic
49 |                 },
50 |     supervisor:start_child(netmon_instance_sup, ChildSpec);
51 | start_monitor(Other) ->
52 |     throw({error, ?FMT("Invalid monitor spec: ~p", [Other])}).
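
%% Sketch of a `monitors' entry in the netmon application environment that
%% yields the tuples start_monitor/1 accepts. The host names are placeholders,
%% not taken from a real deployment:
%%
%%   {netmon, [
%%     {monitors, [
%%       {'a@host1', [{test, netmon, any, ['b@host1'],
%%                     {netmon_instance, test_notify}, 10, tcp}]}
%%     ]}
%%   ]}
%%
%% Each key is a node name or a list of node names (see get_node_opts/1), and
%% only the monitors listed for the current node are started. Leaving out the
%% seventh element of a monitor tuple selects udp.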
53 | 54 | %%---------------------------------------------------------------------- 55 | %% This is the entry module for your application. It contains the 56 | %% start function and some other stuff. You identify this module 57 | %% using the 'mod' attribute in the .app file. 58 | %% 59 | %% The start function is called by the application controller. 60 | %% It normally returns {ok,Pid}, i.e. the same as gen_server and 61 | %% supervisor. Here, we simply call the start function in our supervisor. 62 | %% One can also return {ok, Pid, State}, where State is reused in stop(State). 63 | %% 64 | %% Type can be 'normal' or {takeover,FromNode}. If the 'start_phases' 65 | %% attribute is present in the .app file, Type can also be {failover,FromNode}. 66 | %% This is an odd compatibility thing. 67 | %% @private 68 | %%---------------------------------------------------------------------- 69 | start(_Type, _Args) -> 70 | case supervisor:start_link({local, ?MODULE}, ?MODULE, []) of 71 | {ok, Pid} -> 72 | %% Start requested monitors 73 | [start_monitor(Args) || Args <- get_node_opts(lama:get_app_env(netmon, monitors, []))], 74 | {ok, Pid}; 75 | Error -> 76 | Error 77 | end. 78 | 79 | %%---------------------------------------------------------------------- 80 | %% stop(State) is called when the application has been terminated, and 81 | %% all processes are gone. The return value is ignored. 82 | %% @private 83 | %%---------------------------------------------------------------------- 84 | stop(_S) -> 85 | ok. 86 | 87 | %%---------------------------------------------------------------------- 88 | %% @spec (Options) -> Opts 89 | %% Options = [{Nodes, Opts}] 90 | %% Nodes = node() | [node()] | extra_db_nodes 91 | %% Opts = [tuple()] 92 | %% @doc Fetch configuration option from list for the current node. 
93 | %% @end 94 | %%---------------------------------------------------------------------- 95 | get_node_opts(Options) -> 96 | case [Opts || {Node, Opts} <- Options, node_has_monitors(Node)] of 97 | [Opt] -> Opt; 98 | [] -> [] 99 | end. 100 | 101 | node_has_monitors(Nodes) when Nodes =:= node() -> 102 | true; 103 | node_has_monitors(Nodes) when is_list(Nodes) -> 104 | lists:member(node(), Nodes); 105 | node_has_monitors(_) -> 106 | false. 107 | 108 | %%%--------------------------------------------------------------------- 109 | %%% Supervisor behaviour callbacks 110 | %%%--------------------------------------------------------------------- 111 | 112 | %% @private 113 | init([]) -> 114 | {MaxR, MaxT} = lama:get_app_env(netmon, max_restart_frequency, {2, 5}), 115 | EchoAddr = lama:get_app_env(netmon, echo_addr, {0,0,0,0}), 116 | EchoPort = lama:get_app_env(netmon, echo_port, 64000), 117 | PortMap = lama:get_app_env(netmon, port_map, []), 118 | MnesiaGuard = lama:get_app_env(netmon, mnesia_guard, []), 119 | MCast = case lama:get_app_env(netmon, multicast, undefined) of 120 | undefined -> []; 121 | MCastAddr -> [{multicast, MCastAddr}] 122 | end, 123 | MnesiaGuardSpec = 124 | case MnesiaGuard of 125 | [] -> []; 126 | _ -> [get_mnesia_guard_spec(MnesiaGuard)] 127 | end, 128 | 129 | try 130 | case get_node_opts(lama:get_app_env(netmon, monitors, [])) of 131 | [] -> throw(ignore); 132 | _ -> ok 133 | end, 134 | 135 | {ok, 136 | {_SupFlags = {one_for_one, MaxR, MaxT}, 137 | [ 138 | % Netmon UDP/TCP socket owner 139 | { netmon, % Id = internal id 140 | {netmon,start_link, % StartFun = {M, F, A} 141 | [ [{echo_addr, EchoAddr}, {echo_port, EchoPort}, {port_map, PortMap}] ++ MCast ]}, 142 | permanent, % Restart = permanent | transient | temporary 143 | 2000, % Shutdown = brutal_kill | int() >= 0 | infinity 144 | worker, % Type = worker | supervisor 145 | [netmon] % Modules = [Module] | dynamic 146 | } 147 | ] 148 | ++ MnesiaGuardSpec ++ 149 | [ 150 | % Supervisor of 
netmon_instance's 151 | { netmon_instance_sup, 152 | {supervisor,start_link, 153 | [{local, netmon_instance_sup}, ?MODULE, {netmon_instance_sup, MaxR, MaxT}]}, 154 | permanent, % Restart = permanent | transient | temporary 155 | infinity, % Shutdown = brutal_kill | int() >= 0 | infinity 156 | supervisor, % Type = worker | supervisor 157 | [netmon] % Modules = [Module] | dynamic 158 | } 159 | ] 160 | } 161 | } 162 | catch _:ignore -> 163 | ignore 164 | end; 165 | 166 | init({netmon_instance_sup, MaxR, MaxT}) -> 167 | %% Start an empty supervisor. Children are added dynamically ('transient' restart means that they'll 168 | %% only be auto-restarted if they terminate abnormally, i.e. with a reason other than 'normal'). 169 | {ok, {_SupFlags = {one_for_one, MaxR, MaxT}, []}}. 170 | 171 | get_mnesia_guard_spec(Options) -> 172 | { netmon_mnesia, % Id = internal id 173 | {netmon_mnesia, start_link, [Options]}, % StartFun = {M, F, A} 174 | permanent, % Restart = permanent | transient | temporary 175 | 2000, % Shutdown = brutal_kill | int() >= 0 | infinity 176 | worker, % Type = worker | supervisor 177 | [netmon_mnesia] % Modules = [Module] | dynamic 178 | }. 179 | 180 | -------------------------------------------------------------------------------- /src/netmon_instance.erl: -------------------------------------------------------------------------------- 1 | %%%------------------------------------------------------------------------ 2 | %%% @doc Network Node Monitor. This module monitors connectivity to a list 3 | %%% of nodes. When connectivity is lost it starts pinging the 4 | %%% disconnected nodes until the ping message is echoed back and 5 | %%% connectivity is restored. 6 | %%% @author Serge Aleynikov 7 | %%% @version {@vsn} 8 | %%% @end 9 | %%%------------------------------------------------------------------------ 10 | %%% Created: 2007-01-10 by Serge Aleynikov 11 | %%%------------------------------------------------------------------------ 12 | -module(netmon_instance).
13 | -author('saleyn@gmail.com'). 14 | 15 | -behaviour(gen_server). 16 | 17 | %% External exports 18 | -export([start_link/7]). 19 | 20 | %% Internal exports 21 | -export([notify/1, state/1, test_notify/1]). 22 | 23 | %% gen_server callbacks 24 | -export([init/1, handle_call/3, handle_cast/2, handle_info/2, 25 | code_change/3, terminate/2]). 26 | 27 | -include("netmon.hrl"). 28 | 29 | -record(state, { 30 | watch_nodes, % List of nodes to watch for connectivity 31 | dead_nodes, % List of disconnected nodes from the watch_nodes list 32 | app, % Application interested in monitoring watch_nodes 33 | type, % any | all 34 | notify_mf, % {M, F} callback to call on node_up | node_down | node_ping 35 | timer, 36 | ping_timeout, % in milliseconds 37 | start_time, % Process start time 38 | ping_type, % What ping type to use (udp | tcp | net_adm | passive) 39 | ips, % List of resolved {Node, IP} pairs 40 | lastid, % ID sent in the last Ping request packet 41 | start_times % dict() of Key=node(), Data=StartTime::now() 42 | }). 43 | 44 | -define(NOTIFY_ARITY, 1). 45 | 46 | %%------------------------------------------------------------------------- 47 | %% @spec (Name, App, Type, Nodes, Notify, Interval, PingType) -> ok | {error, Reason} 48 | %% Name = atom() 49 | %% App = atom() 50 | %% Type = any | all 51 | %% Nodes = [ node() ] 52 | %% Notify = {M, F} 53 | %% Interval = integer() 54 | %% PingType = udp | tcp | net_adm | passive 55 | %% @doc This function adds a monitor of `Nodes'. 56 | %%
57 | %%
Type
58 | %%
Controls whether loss of any/all nodes in the `Nodes' group will 59 | %% result in the call to `Notify'.
60 | %%
Nodes
61 | %%
List of nodes to watch for connectivity.
62 | %%
App
63 | %%
Application name starting the monitor instance.
64 | %%
Interval
65 | %%
Ping interval in seconds (ping is activated on detecting 66 | %% connection loss).
67 | %%
PingType
68 | %%
Controls whether to use UDP or TCP pings, net_adm:ping/1 calls 69 | %% or be passive (rely on the peer to establish communication). 70 | %% The latter is useful if connectivity to Nodes is protected by 71 | %% a firewall.
72 | %%
Notify
73 | %%
Is a `M:F/1' function that takes #mon_notify{} parameter: 74 | %% ``` 75 | %% #mon_notify{name=Monitor::atom(), action=Action, node=Node::node(), 76 | %% node_start_time=NodeST, Details, UpNodes::nodes(), 77 | %% DnNodes::nodes(), WatchNodes::nodes(), StartTime::now()} -> 78 | %% Result 79 | %% Result = {ignore, NewWatchNodes} | 80 | %% {connect, NewWatchNodes} | 81 | %% stop | 82 | %% shutdown | 83 | %% restart 84 | %% Action = node_up | node_down | ping | pong 85 | %% Details = #ping_packet{} (when Action = ping | pong) 86 | %% | InfoList (when Action = node_up | node_down) 87 | %% NodeST = now() | undefined 88 | %% ''' 89 | %% Arguments given in the `Notify' callback are: 90 | %%
91 | %%
Monitor
92 | %%
Name of the monitor instance (from the first argument of 93 | %% netmon:add_monitor/7).
94 | %%
Node
95 | %%
Node that caused node_up/node_down/pong event.
96 | %%
Action
97 | %%
`node_up' - a new connection from Node is detected.
98 | %%
`node_down' - a loss of connection from Node is detected.
99 | %%
`ping' - a ping request from a peer node
100 | %%
`pong' - an echo reply to a TCP/UDP ping is received 101 | %% from Node.
102 | %%
Details
103 | %%
`#ping_packet{}' - for `ping' and `pong' events.
104 | %%
`InfoList' - for `node_up' and `node_down' events. 105 | %% See //kernel/net_kernel:monitor_nodes/2.
106 | %%
UpNodes
107 | %%
List of connected nodes from the `WatchNodes' list.
108 | %%
DownNodes
109 | %%
List of disconnected nodes from the `WatchNodes' list.
110 | %%
WatchNodes
111 | %%
List of nodes to watch for connectivity.
112 | %%
NodeST
113 | %%
Time when this monitor was started on `Node'.
114 | %%
StartTime
115 | %%
Time when this monitor was started on current node.
116 | %%
117 | %%
118 | %%
119 | %% @end 120 | %%------------------------------------------------------------------------- 121 | start_link(Name, App, Type, Nodes, _Notify = {M, F}, Interval, PingType) when 122 | (Type =:= any orelse Type =:= all), 123 | is_list(Nodes), is_integer(Interval), 124 | (PingType =:= udp orelse PingType =:= tcp orelse 125 | PingType =:= net_adm orelse PingType =:= passive) 126 | -> 127 | gen_server:start_link( 128 | {local, Name}, ?MODULE, [App, Type, Nodes, {M, F}, Interval, PingType], []). 129 | 130 | %%------------------------------------------------------------------------- 131 | %% @spec (Name::pid()) -> [ {Option::atom(), Value} ] 132 | %% @doc Get internal server state. Useful for debugging. 133 | %% @end 134 | %%------------------------------------------------------------------------- 135 | state(Name) -> 136 | gen_server:call(Name, state). 137 | 138 | %%------------------------------------------------------------------------- 139 | %% @spec (Ping::#ping_packet{}) -> ok 140 | %% @doc Notify a monitor of a UDP ping/pong message sent/echoed by 141 | %% another node. Called from the `netmon' module. 142 | %% @end 143 | %% @private 144 | %%------------------------------------------------------------------------- 145 | notify(#ping_packet{name=Name} = Ping) -> 146 | gen_server:cast(Name, Ping). 
147 | 148 | %%%---------------------------------------------------------------------- 149 | %%% Callback functions from gen_server 150 | %%%---------------------------------------------------------------------- 151 | 152 | %%----------------------------------------------------------------------- 153 | %% Func: init/1 154 | %% Returns: {ok, State} | 155 | %% {ok, State, Timeout} | 156 | %% ignore | 157 | %% {stop, Reason} 158 | %% @private 159 | %%----------------------------------------------------------------------- 160 | init([App, Type, Nodes, {M,F} = Notify, Interval, PingType]) -> 161 | try 162 | {module, M} =:= code:ensure_loaded(M) orelse 163 | exit(?FMT("Module ~w not loaded!", [M])), 164 | 165 | erlang:function_exported(M, F, ?NOTIFY_ARITY) orelse 166 | exit(?FMT("Function ~w:~w/~w not exported!", [M, F, ?NOTIFY_ARITY])), 167 | 168 | {A1,A2,A3} = Now = now(), 169 | random:seed(A1, A2, A3), 170 | case PingType of 171 | passive -> ok; 172 | _ -> self() ! ping_timer 173 | end, 174 | 175 | net_kernel:monitor_nodes(true, [{node_type, visible}, nodedown_reason]), 176 | WNodes = Nodes -- [node()], 177 | DNodes = WNodes -- nodes(connected), 178 | State = #state{app=App, type=Type, watch_nodes=WNodes, dead_nodes=DNodes, 179 | lastid=0, ips=netmon:nodes_to_ips(WNodes), notify_mf=Notify, 180 | start_time=Now, ping_timeout=Interval*1000, ping_type=PingType, 181 | start_times=dict:new()}, 182 | {ok, State} 183 | catch exit:Why -> 184 | {stop, Why} 185 | end. 
186 | 187 | %%---------------------------------------------------------------------- 188 | %% Func: handle_call/3 189 | %% Returns: {reply, Reply, State} | 190 | %% {reply, Reply, State, Timeout} | 191 | %% {noreply, State} | 192 | %% {noreply, State, Timeout} | 193 | %% {stop, Reason, Reply, State} | (terminate/2 is called) 194 | %% {stop, Reason, State} (terminate/2 is called) 195 | %% @private 196 | %%---------------------------------------------------------------------- 197 | handle_call(state, _From, State) -> 198 | Reply = [{watch_nodes, State#state.watch_nodes}, 199 | {dead_nodes, State#state.dead_nodes}, 200 | {type, State#state.type}, 201 | {start_time, State#state.start_time}, 202 | {notify_mf, State#state.notify_mf}], 203 | {reply, Reply, State}; 204 | 205 | handle_call(Request, _From, State) -> 206 | {stop, {not_implemented, Request}, State}. 207 | 208 | %%---------------------------------------------------------------------- 209 | %% Func: handle_cast/2 210 | %% Returns: {noreply, State} | 211 | %% {noreply, State, Timeout} | 212 | %% {stop, Reason, State} (terminate/2 is called) 213 | %% @private 214 | %%---------------------------------------------------------------------- 215 | 216 | handle_cast(#ping_packet{type=PingOrPong, node=FromNode, app=App, id=ID} = PingPacket, 217 | #state{watch_nodes=WNodes, dead_nodes=DNodes, app=App, lastid=ID} = State) 218 | when PingOrPong =:= ping; PingOrPong =:= pong -> 219 | case {lists:member(FromNode, nodes(connected)), lists:member(FromNode, DNodes)} of 220 | {false, true} -> 221 | % UDP ping response from a disconnected node that we are supposed to monitor. 222 | % Maybe a case of partitioned network or a restart of FromNode.
223 | UpNodes = up_nodes(WNodes), 224 | MyName = element(2, process_info(self(), registered_name)), 225 | Args = {MyName, FromNode, PingOrPong, PingPacket, UpNodes, DNodes, WNodes}, 226 | do_check_notify(PingOrPong, length(UpNodes), Args, State); 227 | _ -> 228 | % We are not interested in this node 229 | {noreply, State} 230 | end; 231 | 232 | handle_cast(_Msg, State) -> 233 | {noreply, State}. 234 | 235 | %%---------------------------------------------------------------------- 236 | %% Func: handle_info/2 237 | %% Returns: {noreply, State} | 238 | %% {noreply, State, Timeout} | 239 | %% {stop, Reason, State} (terminate/2 is called) 240 | %% @private 241 | %%---------------------------------------------------------------------- 242 | handle_info(ping_timer, #state{dead_nodes=[]} = State) -> 243 | {noreply, State}; 244 | handle_info(ping_timer, #state{type=any, dead_nodes=DNodes} = State) -> 245 | {ID, Ref} = send_ping(DNodes, State), 246 | {noreply, State#state{lastid=ID, timer=Ref}}; 247 | handle_info(ping_timer, #state{type=all, watch_nodes=W, dead_nodes=D} = State) 248 | when length(D) =:= length(W) -> 249 | {ID, Ref} = send_ping(D, State), 250 | {noreply, State#state{lastid=ID, timer=Ref}}; 251 | handle_info(ping_timer, #state{type=all} = State) -> 252 | {noreply, State}; 253 | 254 | %% Established communication with Node 255 | handle_info({nodeup, Node, Info}, 256 | #state{watch_nodes=WNodes, dead_nodes=DNodes} = State) -> 257 | case {lists:member(Node, WNodes), lists:member(Node, DNodes)} of 258 | {false, _} -> 259 | {noreply, State}; 260 | {true, false} -> 261 | {noreply, State}; 262 | {true, true} -> 263 | UpNodes = up_nodes(WNodes), 264 | NewDNodes = (DNodes -- UpNodes) -- [Node], 265 | MyName = element(2, process_info(self(), registered_name)), 266 | Msg = {MyName, Node, node_up, Info, UpNodes, NewDNodes, WNodes}, 267 | do_check_notify(node_up, length(UpNodes), Msg, State#state{dead_nodes=NewDNodes}) 268 | end; 269 | 270 | %% Lost communication with Node 
271 | handle_info({nodedown, Node, Info}, 272 | #state{watch_nodes=WNodes, dead_nodes=DNodes} = State) -> 273 | case {lists:member(Node, WNodes), lists:member(Node, DNodes)} of 274 | {false, _} -> 275 | {noreply, State}; 276 | {true, true} -> 277 | {noreply, State}; 278 | {true, false} -> 279 | self() ! ping_timer, 280 | UpNodes = up_nodes(WNodes) -- [Node], 281 | NewDNodes = [Node | (DNodes -- UpNodes)], 282 | MyName = element(2, process_info(self(), registered_name)), 283 | Msg = {MyName, Node, node_down, Info, UpNodes, NewDNodes, WNodes}, 284 | do_check_notify(node_down, length(UpNodes), Msg, State#state{dead_nodes=NewDNodes}) 285 | end; 286 | 287 | handle_info(_Info, State) -> 288 | {noreply, State}. 289 | 290 | %%---------------------------------------------------------------------- 291 | %% Func: code_change/3 292 | %% Purpose: Convert process state when code is changed 293 | %% Returns: {ok, NewState} 294 | %% @private 295 | %%---------------------------------------------------------------------- 296 | code_change(_OldVsn, State, _Extra) -> 297 | {ok, State}. 298 | 299 | %%---------------------------------------------------------------------- 300 | %% Func: terminate/2 301 | %% Purpose: Shutdown the server 302 | %% Returns: any (ignored by gen_server) 303 | %% @private 304 | %%---------------------------------------------------------------------- 305 | terminate(_Reason, _State) -> 306 | ok. 
307 | 308 | %%%--------------------------------------------------------------------- 309 | %%% Internal functions 310 | %%%--------------------------------------------------------------------- 311 | 312 | do_check_notify(_Type, _UpNodesCount, Args, #state{type=any} = State) -> 313 | notify_callback(Args, State); 314 | do_check_notify(Type, 1, Args, #state{type=all} = State) 315 | when Type=:=node_up; Type=:=pong -> 316 | notify_callback(Args, State); 317 | do_check_notify(Type, 0, Args, #state{type=all} = State) 318 | when Type=:=node_down; Type=:=pong -> 319 | notify_callback(Args, State); 320 | do_check_notify(_Type, _UpNodesCount, _Args, State) -> 321 | {noreply, State}. 322 | 323 | %%---------------------------------------------------------------------- 324 | %% @spec (WatchNodes::list()) -> list() 325 | %% @doc Build an ordered set of connected nodes given an ordered set 326 | %% of nodes to filter. 327 | %%---------------------------------------------------------------------- 328 | up_nodes(WatchNodes) -> 329 | [ N || N <- nodes(connected), lists:member(N, WatchNodes) ]. 330 | 331 | %%---------------------------------------------------------------------- 332 | send_ping(_Nodes, #state{ping_type=passive}) -> 333 | {0, undefined}; 334 | send_ping(Nodes, #state{ping_type=net_adm, ping_timeout=Int}) -> 335 | [net_adm:ping(N) || N <- Nodes], 336 | {0, erlang:send_after(Int, self(), ping_timer)}; 337 | send_ping(Nodes, #state{ping_type=Method, start_time=Time, app=App, ping_timeout=Int} = State) -> 338 | {_, Name} = process_info(self(), registered_name), 339 | NodeIPs = [ {N, IP} || {N, IP} <- State#state.ips, lists:member(N, Nodes) ], 340 | ID = netmon:send_ping(Method, NodeIPs, Name, App, Time), 341 | {ID, erlang:send_after(Int, self(), ping_timer)}.
342 | 343 | %%---------------------------------------------------------------------- 344 | notify_callback({MonitorName, Node, Action, Details, 345 | UpNodes, DownNodes, WatchNodes}, #state{notify_mf={M,F}} = State) 346 | when Action =:= node_up; Action =:= node_down; Action =:= ping; Action =:= pong -> 347 | NewST = set_node_start_time(State#state.start_times, Node, Action, Details), 348 | Callback = #mon_notify{ 349 | name = MonitorName, 350 | action = Action, 351 | node = Node, 352 | node_start_time = get_node_start_time(NewST, Node), 353 | details = Details, 354 | up_nodes = UpNodes, 355 | down_nodes = DownNodes, 356 | watch_nodes = WatchNodes, 357 | start_time = State#state.start_time 358 | }, 359 | NewWNodes = 360 | try M:F(Callback) of 361 | {connect, WNodes} -> 362 | net_kernel:connect(Node), 363 | WNodes; 364 | {ignore, WNodes} -> 365 | WNodes; 366 | stop -> 367 | exit(normal); 368 | shutdown -> 369 | init:stop(); 370 | restart -> 371 | init:restart(); 372 | Other -> 373 | exit(?FMT("Bad return value ~w:~w/~w -> ~p", [M, F, ?NOTIFY_ARITY, Other])) 374 | catch _:Reason -> 375 | error_logger:error_msg("Netmon - execution of ~w:~w(~w)\n" 376 | " failed with reason: ~p\n", 377 | [M, F, Callback, Reason]), 378 | exit(Reason) 379 | end, 380 | DNodes = get_dead_nodes(Action, Node, DownNodes), 381 | {noreply, State#state{dead_nodes=DNodes, watch_nodes=NewWNodes, start_times=NewST}}. 382 | 383 | get_node_start_time(Dict, Node) -> 384 | case dict:find(Node, Dict) of 385 | {ok, Value} -> Value; 386 | error -> undefined 387 | end. 388 | 389 | set_node_start_time(Dict, Node, Action, #ping_packet{start_time=Time}) 390 | when Action=:=ping; Action=:=pong -> 391 | dict:store(Node, Time, Dict); 392 | set_node_start_time(Dict, _, _, _) -> 393 | Dict. 
394 | 395 | get_dead_nodes(node_up, Node, DownNodes) -> 396 | DownNodes -- [Node]; 397 | get_dead_nodes(node_down, Node, DownNodes) -> 398 | case lists:member(Node, DownNodes) of 399 | true -> DownNodes; 400 | false -> [Node | DownNodes] 401 | end; 402 | get_dead_nodes(_, _Node, DownNodes) -> 403 | DownNodes. 404 | 405 | %%---------------------------------------------------------------------- 406 | %% Test functions 407 | %%---------------------------------------------------------------------- 408 | 409 | %% @doc Sample test function that can be used in the netmon:add_monitor/7 call. 410 | test_notify(#mon_notify{name=Monitor, node=Node, action=Action, 411 | watch_nodes=WNodes, up_nodes=UpNodes, 412 | down_nodes=DownNodes, node_start_time=NST, start_time=ST}) 413 | when Action =:= node_up; Action =:= pong -> 414 | io:format("NETMON (~w): ~-10w -> ~w. Up: ~w, Down: ~w (~s : ~s)\n", 415 | [Monitor, node_up, Node, UpNodes, DownNodes, time2str(NST), time2str(ST)]), 416 | case NST of 417 | undefined -> 418 | {connect, WNodes}; 419 | _ when NST < ST -> 420 | %restart; 421 | {connect, WNodes}; 422 | _ -> 423 | {connect, WNodes} 424 | end; 425 | 426 | test_notify(#mon_notify{name=Monitor, node=Node, action=Action, 427 | watch_nodes=WNodes, up_nodes=UpNodes, 428 | down_nodes=DownNodes, node_start_time=NST, 429 | start_time=ST}) 430 | when Action=:=node_down -> 431 | io:format("NETMON (~w): ~-10w -> ~w. Up: ~w, Down: ~w (~s : ~s)\n", 432 | [Monitor, Action, Node, UpNodes, DownNodes, time2str(NST), time2str(ST)]), 433 | {ignore, WNodes}; 434 | test_notify(#mon_notify{action=ping, watch_nodes=WNodes}) -> 435 | {ignore, WNodes}; 436 | test_notify(#mon_notify{name=Monitor, node=Node, action=Action, up_nodes=UpNodes, 437 | down_nodes=DownNodes, watch_nodes=WNodes, 438 | node_start_time=NST, start_time=ST}) -> 439 | io:format("NETMON (~w): ~-10w -> ~w. Up: ~w, Down: ~w (~w)\n", 440 | [Monitor, Action, Node, UpNodes, DownNodes, NST > ST]), 441 | {ignore, WNodes}. 
442 | 443 | time2str({_,_,_} = Now) -> 444 | {{Y,Mo,D},{H,M,S}} = calendar:now_to_local_time(Now), 445 | io_lib:format("~w-~.2.0w-~.2.0w ~.2.0w:~.2.0w:~.2.0w", [Y,Mo,D,H,M,S]); 446 | time2str(undefined) -> 447 | "". 448 | -------------------------------------------------------------------------------- /src/overview.edoc: -------------------------------------------------------------------------------- 1 | Netmon Application 2 | 3 | @author Serge Aleynikov 4 | @version {@vsn} 5 | @title Netmon Application 6 | 7 | @doc Netmon Application can be used to monitor health of inter-node connections. 8 | 9 |

Contents

10 |
    11 |
  1. Overview
  2. Application Configuration
  3. Reacting to Changes in Connectivity
  4. Example Use Cases
16 | 17 |

Overview

18 | 19 | The application is composed of a `netmon' process that listens for datagrams on a UDP socket and 20 | subscribes to notification events from the transport layer that report node status 21 | changes (nodeup, nodedown). Custom `netmon_instance' monitor processes can be added using the 22 | `netmon:add_monitor/7' call to ensure that a loss of communication to any/all given nodes is followed 23 | by a custom recovery action chosen by a callback function. This callback function is documented 24 | as the `Notify' argument to {@link netmon_instance:start_link/7}. To detect recovery, a monitor uses one 25 | of four ping types: 26 | 27 |
    28 |
  • UDP ping
  • TCP ping
  • NET_ADM ping
  • Passive
33 | 34 | Upon detecting a connectivity failure, a `netmon_instance' monitor process starts pinging the remote 35 | nodes using one of the methods above. As soon as a UDP ping is echoed back to the caller, or a TCP/NET_ADM 36 | ping establishes a TCP transport connection with a remote node, the custom callback function given to 37 | {@link netmon:add_monitor/7} is called with the action `node_up', `node_down' or `pong' (a response to 38 | a UDP/TCP ping). The return value of this function determines the recovery action. 39 | The following return values are allowed: 40 | 41 |
42 |
{connect, WatchNodes}
43 |
Use `net_kernel:connect/1' to connect to a Node.
44 |
stop
45 |
Remove monitor instance.
46 |
shutdown
47 |
Shutdown current Erlang node using init:stop/0 call.
48 |
restart
49 |
Restart current node using init:restart/0 call.
50 |
{ignore, WatchNodes}
51 |
Ignore the event and assign a new list of nodes to watch for.
52 |
53 | 54 |
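The return values listed above can be illustrated with a minimal callback sketch. The module name `userapp' and function `link_status/1' are borrowed from the use cases later in this overview. The local record definition is only a stand-in: its field names come from `netmon_instance', but the field order is an assumption, so real code should include `netmon.hrl' instead. The restart policy (restart when the peer has been up longer) is one possible choice, not something netmon requires.

```erlang
-module(userapp).
-export([link_status/1]).

%% Stand-in for the #mon_notify{} record defined in netmon's netmon.hrl.
%% ASSUMPTION: field order is a guess; include the real header in production.
-record(mon_notify, {name, action, node, node_start_time, details,
                     up_nodes, down_nodes, watch_nodes, start_time}).

%% The peer echoed our ping and it has been running longer than we have:
%% treat the peer as primary and restart this node (hypothetical policy).
link_status(#mon_notify{action = pong,
                        node_start_time = PeerST,
                        start_time = MyST})
  when PeerST =/= undefined, PeerST < MyST ->
    restart;
%% The peer is reachable again: ask netmon to connect to it and keep
%% watching the same nodes.
link_status(#mon_notify{action = Action, watch_nodes = WatchNodes})
  when Action =:= node_up; Action =:= pong ->
    {connect, WatchNodes};
%% node_down and ping events: take no action, keep the watch list as is.
link_status(#mon_notify{watch_nodes = WatchNodes}) ->
    {ignore, WatchNodes}.
```

While a monitor is running, `netmon_instance:state(MonitorName)' returns a property list with `watch_nodes', `dead_nodes', `type', `start_time' and `notify_mf', which helps verify that a callback's returned watch list took effect.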

Application Configuration

55 | 56 | All configuration options are grouped by application and are documented here. 57 | 58 |

Example Use Cases

59 | 60 |
    61 | 62 |
  • Three nodes `a@host1', `b@host2' and `c@host3' need to monitor connectivity to 63 | each other by starting a `test' monitor instance and listening for UDP echoes on port 12000. 64 | Node `c@host3' should only trigger heartbeats if it loses access to both nodes `a@host1' and 65 | `b@host2'. Network access between the three nodes is not restricted by a firewall. 66 | ``` 67 | +--- a@host1 ---+ +--- b@host2 ---+ 68 | | Applications: | | Applications: | 69 | | - netmon +---------- TCP --+----------+ -netmon | 70 | | - userapp | | | -userapp | 71 | +---------------+ | +---------------+ 72 | | 73 | | +--- c@host3 ---+ 74 | | | Applications: | 75 | +----------+ -netmon | 76 | | -userapp | 77 | +---------------+ 78 | ''' 79 | When a connectivity issue is detected, all nodes should start UDP-pinging each other every 10s 80 | and, upon receiving a `pong' response, call the user-defined callback `userapp:link_status/1', which 81 | determines the recovery action. For documentation of the user-defined callback see 82 | {@link netmon_instance:start_link/7}. 83 | Sample configuration: 84 | ``` 85 | [{netmon, 86 | [{monitors, 87 | [{'a@host1', [{test, netmon, any, ['a@host1', 'b@host2', 'c@host3'], {userapp, link_status}, 10}]}, 88 | {'b@host2', [{test, netmon, any, ['a@host1', 'b@host2', 'c@host3'], {userapp, link_status}, 10}]}, 89 | {'c@host3', [{test, netmon, all, ['a@host1', 'b@host2', 'c@host3'], {userapp, link_status}, 10}]} 90 | ]}, 91 | {echo_port, 12000} 92 | ]} 93 | ''' 94 | Note that the current node can be included in the list of nodes to monitor (it will not be considered by netmon). 95 | 96 | Also identical node configurations can be combined. 
In the example above configuration of nodes `a@host1' and 97 | `b@host2' can be merged into one entry: 98 | ``` 99 | [{netmon, 100 | [{monitors, 101 | [{['a@host1','b@host2'], [{test, netmon, any, ['a@host1', 'b@host2', 'c@host3'], {userapp, link_status}, 10}]}, 102 | {'c@host3', [{test, netmon, all, ['a@host1', 'b@host2', 'c@host3'], {userapp, link_status}, 10}]} 103 | ]}, 104 | ... 105 | ]} 106 | ''' 107 |
  • 108 | 109 |
  • Three nodes `a@host1', `b@host2' and `c@host3' need to monitor connectivity to 110 | each other by starting a `test' monitor instance. Connectivity from `a@host1' node is not allowed 111 | by a firewall. Nodes `b@host2' and `c@host3' have `netmon' listening for UDP echoes on port 12000. 112 | In case of connectivity loss: 113 |
      114 |
    • Node `a@host1' should passively echo UDP pings coming from 115 | `b@host2' or `c@host3' but shouldn't attempt to ping them itself.
    • 116 |
    • Node `b@host2' (or `c@host3') should trigger NET_ADM recovery when `a@host1' is lost or trigger 117 | UDP pings when `c@host3' (or `b@host2') node is lost.
    • 118 |
    119 | ``` 120 | +--- a@host1 ---+ < +--- b@host2 ---+ 121 | | Applications: | < | Applications: | 122 | | - netmon +-------<-- TCP --+----------+ -netmon | 123 | | - userapp | < | | -userapp | 124 | +---------------+ < | +---------------+ 125 | < | 126 | < | +--- c@host3 ---+ 127 | Firewall restricts < | | Applications: | 128 | connectivity in < +----------+ -netmon | 129 | outbound direction < | -userapp | 130 | only (right-to-left) < +---------------+ 131 | ''' 132 | Upon receiving a `pong' response from a disconnected node or detecting a `node_up' / `node_down' condition, 133 | the user-defined callback `userapp:link_status/1' is called to determine the recovery action. 134 | Sample configuration: 135 | ``` 136 | [{netmon, 137 | [{monitors, 138 | [{'a@host1', [{test, netmon, any, ['a@host1', 'b@host2', 'c@host3'], {userapp, link_status}, 0, passive}]}, 139 | {'b@host2', [{test_c, netmon, any, ['c@host3'], {userapp, link_status}, 10, udp}, 140 | {test_a, netmon, any, ['a@host1'], {userapp, link_status}, 10, net_adm}]}, 141 | {'c@host3', [{test_b, netmon, any, ['b@host2'], {userapp, link_status}, 10, udp}, 142 | {test_a, netmon, any, ['a@host1'], {userapp, link_status}, 10, net_adm}]} 143 | ]}, 144 | {echo_port, 12000} 145 | ]} 146 | ''' 147 |
  • 148 |
149 | 150 | For testing purposes when you need to run multiple nodes on the same host you can use the `port_map' configuration 151 | option. 152 | 153 |
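How a node picks its own entry from the `monitors' lists in the sample configurations above can be sketched as follows. This mirrors the matching rules of `netmon_app:get_node_opts/1' (a key is either a single node name or a list of node names), but takes the node as an explicit argument so it can be tried in a shell. The module name `monitors_select' is hypothetical. Unlike this sketch, `get_node_opts/1' only handles zero or one matching entry, so each node should appear under at most one key.

```erlang
-module(monitors_select).
-export([select/2]).

%% Return the monitor specs configured for Node.
%% Monitors is a list of {Key, Specs}, where Key is a node name or a
%% list of node names, as in the netmon `monitors' option.
%% NOTE: this sketch concatenates all matches; netmon itself expects
%% at most one entry to match the current node.
select(Node, Monitors) ->
    lists:append([Specs || {Key, Specs} <- Monitors, matches(Node, Key)]).

%% A key matches when it is the node itself or a list containing it.
matches(Node, Node) -> true;
matches(Node, Keys) when is_list(Keys) -> lists:member(Node, Keys);
matches(_Node, _Key) -> false.
```

For example, with the merged configuration shown earlier, calling `select/2' for `b@host2' would return the `test' spec of type `any', while a node that is not listed under any key gets an empty list.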

Monitoring Mnesia

154 | 155 |
156 |
Master Nodes (nodes holding disk copies of tables)
157 |
158 | These nodes must be started with the kernel option {dist_auto_connect, Mode}, where Mode is `once' or `{callback, M, F}'. 159 | Upon detecting a loss of connection to another master they'll turn on a UDP heartbeat, and when a heartbeat is echoed by 160 | the disconnected master node, one of the nodes (whichever one was up longer) will be assumed to be the primary and the 161 | secondary one will be restarted. 162 |
163 |
Nodes behind a firewall
164 |
165 | These nodes cannot contact other nodes in front of a firewall and only allow incoming TCP connections to a range of 166 | ports. For these nodes TCP port 4369 must be open in the firewall to allow outbound connections to the epmd name service. 167 | These nodes should be started with the following kernel's options: 168 |
    169 |
  • `inet_dist_listen_min' and `inet_dist_listen_max' options.
  • `{dist_auto_connect, never}' option should be set to prevent attempts to connect to nodes in front of the firewall.
  • `{global_groups, [{FirewallGroup, Nodes}]}' should be set to ensure that `global' doesn't attempt to connect to nodes inside the firewall.
174 | At startup these nodes should not start the mnesia application; they should wait for master nodes to establish connections. 175 |
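The kernel options listed above for nodes behind a firewall could be combined in a `sys.config' fragment such as the one below. This is only a sketch: the port range, the group name `firewall_group' and the node name are hypothetical placeholders, and the range must match the ports the firewall actually opens.

```erlang
%% sys.config fragment for a node behind the firewall.
%% ASSUMPTION: ports 9100-9105, firewall_group and 'a@host1' are
%% illustrative values only.
[{kernel,
  [{inet_dist_listen_min, 9100},     % lowest distribution listen port
   {inet_dist_listen_max, 9105},     % highest distribution listen port
   {dist_auto_connect, never},       % never initiate outbound connections
   {global_groups, [{firewall_group, ['a@host1']}]}
  ]}
].
```

With `dist_auto_connect' set to `never', connections are only established when a master node connects in, which matches the `passive' ping type used for `a@host1' in the second use case.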
176 |
Slave Nodes (nodes)
177 | 178 |
179 | 180 | @end 181 | --------------------------------------------------------------------------------