├── LICENSE
├── Monitoring.md
└── README.md


/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016, Ruairi Carroll
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/Monitoring.md:
--------------------------------------------------------------------------------
# Monitoring Tools

Below is a list of tools that can be used to monitor your network.

## SNMP based

- [LibreNMS](http://www.librenms.org/)
- [Observium](http://observium.org/)
- [Cacti](http://www.cacti.net/)
- [OpenNMS](http://www.opennms.org/)
- [Icinga](https://icinga.com/)

## IPFIX/NetFlow/sFlow based

- [SiLK](https://tools.netsa.cert.org/silk/)
- [pmacct](http://www.pmacct.net/)
- [NFDump](http://nfdump.sourceforge.net/)
- [ntop](http://www.ntop.org/)

## SPAN/Mirror/pcap based

- [tstat](http://tstat.polito.it)

## Time Series based

- [Collectd](https://collectd.org/)
- Graphite
  - [Graphite-web - Frontend](https://github.com/graphite-project/graphite-web)
  - [Carbon - Metric processing](https://github.com/graphite-project/carbon)
  - [Whisper - Time Series DB](https://github.com/graphite-project/whisper)
- [Prometheus](https://prometheus.io/)

## Time Series Prediction/Aberrant Behavior Detection

- [Banshee](https://github.com/eleme/banshee)
- [RRDTool](http://cricket.sourceforge.net/aberrant/rrd_hw.htm)

## Dataplane monitoring

- [todd](https://github.com/mierdin/todd)

## Log collection

- [Graylog](https://www.graylog.org)
- [ELK Stack](https://www.elastic.co/downloads)

## External Monitoring

- [RIPE-Atlas](https://atlas.ripe.net)
- [StatusCake](https://statuscake.com)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Manifesto

Network Engineers Manifesto

### Key motivating factors:

- Data driven decisions.
- Excellence in all things.
- Technical depth and no technology religion
- Clarity of vision, clarity of execution
- Lead the business in decisions related to transportation of packets
- Dedication to all our customers
- "Good enough" is too low a bar


### [Monitoring](Monitoring.md)

- Monitor, from outside:
  - Implement end-to-end tests (e.g. server to server, end-user connections to DC)
  - Make use of external monitoring services mentioned in [Monitoring](Monitoring.md)
- Monitor, at least:
  - Per switch:
    - Interface pps, ups, mps, bitrate, drops, errors, buffer depth
    - CPU, Mem, ICMP messages generated
    - STP states
  - Per router:
    - All routing protocol states
    - Interface pps, ups, mps, bitrate, drops, errors, buffer depth
    - CPU, Mem, ICMP messages generated
  - Per firewall:
    - Interface pps, ups, mps, bitrate, drops, errors, buffer depth
    - CPU, Mem, ICMP messages generated
    - CPS, Throughput
    - Dropped connections
    - ASIC drops
  - Per LB:
    - Interface pps, ups, mps, bitrate, drops, errors, buffer depth
    - CPU, Mem, ICMP messages generated
    - CPS, Throughput per VIP
    - Dropped connections
    - ASIC drops
  - Per AP:
    - Interface pps, ups, mps, bitrate, drops, errors, buffer depth
    - CPU, Mem, ICMP messages generated
    - Logged in users, failed login attempts
  - Per service:
    - p99, p95 metrics for service latency:
      - For end to end transaction
      - For TCP re-transmissions
    - Latency to/drop server from all DCs
- All monitoring to be a single pane of glass for our users, API driven to allow them to extract their own data


### Documentation

- Everything required to understand the network should be documented
- Documentation must never be out of date.
  Automation can help with this.
- Use documentation to explain why choices have been made
- Use documentation to explain what other options were rejected


### Deployment

- Static routing to be avoided wherever possible
- Zero touch deployment for new gear
- Entirely templated configlets:
  - Base system configuration, including: AAA, Logging
  - OSPF
  - STP
  - BGP configlets
  - IPSec tunneling
- Absolutely no manual configuration pushes to production
- Design and build a working lab for prototyping configuration
- Goal: provide an API that lets our end users deploy their infrastructure as they see fit

### Planning for failure

- You need redundancy and failovers
- Your `[storage|servers|routers|switches|uplinks|etc.]` are going to fail, sometimes in an isolated manner, sometimes in spectacular simultaneous blowouts. Plan for automatic alternatives.
- Having 3 independent fail-safe systems is just fluff if you don't test failover periodically.

### Remote offices

- Regularly poll a random sample of remote users about the office internet and their general feeling about the office network
- Manage this data over time to ensure we have total inclusion of our users
- Dynamic monitoring and failover of IPSec tunneling
- Monthly SLA reporting of WAN performance based on full-mesh pinging of remote offices


### Reporting

- Every single SNMP trap has to be actionable
- Every single packet drop in our network has to be actionable
- Every single TCP re-transmission inside the borders of our administrative control has to be actionable
- Apply predictive algorithms to our graphing to alert on trends before they become issues.
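
As a rough illustration of alerting on trends before they become issues, the sketch below fits a straight line to recent samples of a metric and estimates how many days remain before it crosses a threshold. This is a deliberately simple stand-in, not the method used by any tool listed in [Monitoring](Monitoring.md). The function names, the sample data, the 80% threshold, and the 14-day alert horizon are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two samples, with xs and ys of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all x values are identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def days_until_threshold(
    days: Sequence[float], values: Sequence[float], threshold: float
) -> Optional[float]:
    """Days after the last sample at which the fitted trend reaches `threshold`.

    Returns 0.0 if the fitted value at the last sample is already at or above
    the threshold, and None if the trend is flat or falling.
    """
    slope, intercept = linear_fit(days, values)
    current = slope * days[-1] + intercept
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - days[-1]


if __name__ == "__main__":
    # Hypothetical daily peak utilisation (percent) of a single uplink.
    days = [0, 1, 2, 3, 4]
    peak_util = [50.0, 52.0, 54.0, 56.0, 58.0]
    remaining = days_until_threshold(days, peak_util, threshold=80.0)
    # Assumed policy: alert when the threshold is less than two weeks away.
    if remaining is not None and remaining <= 14:
        print(f"ALERT: uplink projected to reach 80% in {remaining:.1f} days")
```

A straight-line fit ignores daily and weekly seasonality, so in practice it would likely be fed pre-aggregated values (such as daily peaks) or replaced with a seasonal model.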


### Personal Development

- Everyone must commit to self-improvement
- Certification track - optional but highly recommended
- Regular hardware deep dives based on freely available vendor documentation, talks, and presentations

--------------------------------------------------------------------------------