├── docs ├── requirements.txt ├── .gitignore ├── images │ ├── cloud1.png │ ├── dashboard.png │ ├── entities.png │ ├── tv-mode.png │ ├── dashboard1.png │ ├── switch-tv-mode.png │ ├── grafana-example1.png │ └── tv-mode-logout-dialog.png ├── user │ ├── check-ref │ │ ├── mongodb_wrapper.rst │ │ ├── tcp_wrapper.rst │ │ ├── ping_wrapper.rst │ │ ├── jmx_wrapper.rst │ │ ├── dns_wrapper.rst │ │ ├── counter_wrapper.rst │ │ ├── eventlog_wrapper.rst │ │ ├── cassandra_wrapper.rst │ │ ├── ebs_wrapper.rst │ │ ├── ldap_wrapper.rst │ │ ├── zomcat_wrapper.rst │ │ ├── entities_wrapper.rst │ │ ├── datapipeline_wrapper.rst │ │ ├── memcached_wrapper.rst │ │ ├── history_wrapper.rst │ │ ├── s3_wrapper.rst │ │ ├── scalyr_wrapper.rst │ │ ├── kairosdb_wrapper.rst │ │ ├── snmp_wrapper.rst │ │ ├── elastic_search_wrapper.rst │ │ ├── http_wrapper.rst │ │ ├── appdynamics_wrapper.rst │ │ ├── redis_wrapper.rst │ │ ├── sql_wrappers.rst │ │ ├── cloudwatch_wrapper.rst │ │ └── kubernetes_wrapper.rst │ ├── notifications │ │ ├── hubot.rst │ │ ├── push.rst │ │ ├── slack.rst │ │ ├── twilio.rst │ │ ├── google_hangouts_chat.rst │ │ ├── mail.rst │ │ ├── pagerduty.rst │ │ ├── opsgenie.rst │ │ ├── hipchat.rst │ │ └── http.rst │ ├── notifications.rst │ ├── downtimes.rst │ ├── comments.rst │ ├── check-definitions.rst │ ├── alert-definition-inheritance.rst │ ├── entities.rst │ ├── tv-login.rst │ ├── alert-definition-parameters.rst │ ├── grafana.rst │ ├── monitoringonaws.rst │ └── alert-definitions.rst ├── developer │ ├── tests.rst │ ├── zmon-python-client.rst │ ├── redis.rst │ ├── zmon-cli.rst │ └── python-tutorial.rst ├── index.rst ├── installation │ ├── requirements.rst │ ├── components.rst │ └── configuration.rst ├── apendix │ └── glossary.rst ├── getting-started.rst ├── Makefile ├── conf.py └── intro.rst ├── .zappr.yaml └── README.rst /docs/requirements.txt: -------------------------------------------------------------------------------- 1 | zmon-cli>=1.0.59 2 | 
-------------------------------------------------------------------------------- /docs/.gitignore: -------------------------------------------------------------------------------- 1 | .*.un~ 2 | _build 3 | *.swp 4 | -------------------------------------------------------------------------------- /.zappr.yaml: -------------------------------------------------------------------------------- 1 | X-Zalando-Team: zmon 2 | X-Zalando-Type: doc 3 | -------------------------------------------------------------------------------- /docs/images/cloud1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/cloud1.png -------------------------------------------------------------------------------- /docs/images/dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/dashboard.png -------------------------------------------------------------------------------- /docs/images/entities.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/entities.png -------------------------------------------------------------------------------- /docs/images/tv-mode.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/tv-mode.png -------------------------------------------------------------------------------- /docs/images/dashboard1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/dashboard1.png -------------------------------------------------------------------------------- /docs/images/switch-tv-mode.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/switch-tv-mode.png -------------------------------------------------------------------------------- /docs/images/grafana-example1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/grafana-example1.png -------------------------------------------------------------------------------- /docs/images/tv-mode-logout-dialog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/tv-mode-logout-dialog.png -------------------------------------------------------------------------------- /docs/user/check-ref/mongodb_wrapper.rst: -------------------------------------------------------------------------------- 1 | MongoDB 2 | ------- 3 | 4 | Provides access to a MongoDB cluster 5 | 6 | .. py:function:: mongodb(host, port=27017) 7 | 8 | Methods of MongoDB 9 | ^^^^^^^^^^^^^^^^^^ 10 | 11 | .. py:function:: find(database, collection, query) 12 | -------------------------------------------------------------------------------- /docs/user/check-ref/tcp_wrapper.rst: -------------------------------------------------------------------------------- 1 | TCP 2 | --- 3 | 4 | This function opens a TCP connection to a host on a given port. If the 5 | connection succeeds, it returns ‘OK’. The host can be provided directly for global checks or resolved from 6 | entities filter. 
Assuming that we have an entity filter type=host, the 7 | example below will try to connect to every host on port 22:: 8 | 9 | tcp().open(22) 10 | -------------------------------------------------------------------------------- /docs/user/notifications/hubot.rst: -------------------------------------------------------------------------------- 1 | Hubot 2 | ----- 3 | 4 | Send Hubot notification. 5 | 6 | .. py:function:: notify_hubot(queue, hubot_url, message=None) 7 | 8 | Send Hubot notification. 9 | 10 | :param queue: Hubot queue. 11 | :type queue: str 12 | 13 | :param hubot_url: Hubot url. 14 | :type hubot_url: str 15 | 16 | :param message: Notification message. 17 | :type message: str 18 | 19 | -------------------------------------------------------------------------------- /docs/user/check-ref/ping_wrapper.rst: -------------------------------------------------------------------------------- 1 | Ping 2 | ---- 3 | 4 | Simple ICMP ping function which returns ``True`` if the ping command returned without error and ``False`` otherwise. 5 | 6 | .. py:function:: ping(timeout=1) 7 | 8 | :: 9 | 10 | ping() 11 | 12 | The ``timeout`` argument specifies the timeout in seconds. 13 | Internally it just runs the following system command:: 14 | 15 | ping -c 1 -w 16 | -------------------------------------------------------------------------------- /docs/user/check-ref/jmx_wrapper.rst: -------------------------------------------------------------------------------- 1 | JMX 2 | --- 3 | 4 | To use JMXQuery, run "jmxquery" (this is not yet released) 5 | 6 | Queries beans’ attributes on hosts specified in entities filter:: 7 | 8 | jmx().query('java.lang:type=Memory', 'HeapMemoryUsage', 'NonHeapMemoryUsage').results() 9 | 10 | Another example:: 11 | 12 | jmx().query('java.lang:type=Threading', 'ThreadCount', 'DaemonThreadCount', 'PeakThreadCount').results() 13 | 14 | This would return a dict like: 15 | 16 | .. 
code-block:: json 17 | 18 | { 19 | "DaemonThreadCount": 524, 20 | "PeakThreadCount": 583, 21 | "ThreadCount": 575 22 | } 23 | -------------------------------------------------------------------------------- /docs/user/notifications/push.rst: -------------------------------------------------------------------------------- 1 | Push 2 | ----- 3 | 4 | Send push notification via ZMON `notification service `_. 5 | 6 | .. py:function:: send_push(url=None, key=None, message=None) 7 | 8 | Send Push notification to mobile devices. 9 | 10 | :param url: Notification service base URL. 11 | :type url: str 12 | 13 | :param key: Notification service API key. 14 | :type key: str 15 | 16 | :param message: Message to be sent in notification. 17 | :type message: str 18 | 19 | .. note:: 20 | 21 | If ``message`` is ``None``, it will be generated from the alert status. 22 | 23 | -------------------------------------------------------------------------------- /docs/user/notifications/slack.rst: -------------------------------------------------------------------------------- 1 | Slack 2 | ----- 3 | 4 | Notify a Slack channel with the alert status. A ``webhook`` is required for notifications. 5 | 6 | .. py:function:: notify_slack(webhook=None, channel='#general', message=None) 7 | 8 | Send Slack notification to specified channel. 9 | 10 | :param webhook: Slack webhook. If not set, then webhook set in configuration will be used. 11 | :type webhook: str 12 | 13 | :param channel: Channel to be notified. Default is ``#general``. 14 | :type channel: str 15 | 16 | :param message: Message to be sent. If ``None``, then a message constructed from the alert will be sent. 17 | :type message: str 18 | -------------------------------------------------------------------------------- /docs/user/check-ref/dns_wrapper.rst: -------------------------------------------------------------------------------- 1 | DNS 2 | --- 3 | 4 | The ``dns()`` function provides a way to resolve hosts. 5 | 6 | .. 
py:function:: dns(host=None) 7 | 8 | 9 | Methods of DNS 10 | ^^^^^^^^^^^^^^ 11 | 12 | .. py:method:: resolve(host=None) 13 | 14 | Return the IP address of the host. If ``host`` is ``None``, the host given at initialization is resolved. If both are ``None``, an exception is raised. 15 | 16 | :return: IP address 17 | :rtype: str 18 | 19 | Example query: 20 | 21 | .. code-block:: python 22 | 23 | dns('google.de').resolve() 24 | # '173.194.65.94' 25 | 26 | dns().resolve('google.de') 27 | # '173.194.65.94' 28 | -------------------------------------------------------------------------------- /docs/user/notifications.rst: -------------------------------------------------------------------------------- 1 | .. _notifications: 2 | 3 | *********************** 4 | Notifications Reference 5 | *********************** 6 | 7 | ZMON provides several means of notification in case of alerts. Notifications are triggered when the alert status changes. Please refer to 8 | :ref:`Notification options ` for different worker configuration options. 9 | 10 | .. include:: notifications/google_hangouts_chat.rst 11 | .. include:: notifications/hipchat.rst 12 | .. include:: notifications/http.rst 13 | .. include:: notifications/hubot.rst 14 | .. include:: notifications/mail.rst 15 | .. include:: notifications/opsgenie.rst 16 | .. include:: notifications/pagerduty.rst 17 | .. include:: notifications/push.rst 18 | .. include:: notifications/slack.rst 19 | .. include:: notifications/twilio.rst 20 | -------------------------------------------------------------------------------- /docs/user/check-ref/counter_wrapper.rst: -------------------------------------------------------------------------------- 1 | Counter 2 | ------- 3 | 4 | The ``counter()`` function allows you to get increment rates of increasing counter values. 5 | The main use case for ``counter()`` is to get rates per second of JMX counter beans (e.g. "Tomcat Requests"). 
 6 | The counter function requires one parameter ``key`` to identify the counter. 7 | 8 | 9 | .. py:method:: per_second(value) 10 | 11 | :: 12 | 13 | counter('requests').per_second(get_total_requests()) 14 | 15 | Returns the value's increment rate per second. Value must be a float or integer. 16 | 17 | .. py:method:: per_minute(value) 18 | 19 | :: 20 | 21 | counter('requests').per_minute(get_total_requests()) 22 | 23 | Convenience method to return the value's increment rate per minute (same as the result of ``per_second()`` multiplied by 60). 24 | 25 | Internally counter values and timestamps are stored in Redis. 26 | -------------------------------------------------------------------------------- /docs/user/notifications/twilio.rst: -------------------------------------------------------------------------------- 1 | Twilio 2 | ------ 3 | 4 | Use Twilio to receive phone calls if alerts pop up. This includes basic ACK and escalation. Requires an account at Twilio and the notification service deployed. Low investment to get going though. WORK IN PROGRESS. 5 | 6 | .. py:function:: notifiy_twilio(numbers=[], message="ZMON Alert Up: Some Alert") 7 | 8 | Make a phone call to the supplied numbers. The first number is called immediately. After two minutes, another call is made to that number if there is no ACK. The other numbers follow at 5-minute intervals as long as there is no ACK. 9 | 10 | :param message: Message to be sent. If ``None``, then a message constructed from the alert will be sent. 11 | :type message: str 12 | 13 | :param numbers: Numbers to call 14 | :type numbers: list 15 | 16 | 17 | .. note:: 18 | 19 | Remember to configure your worker for this. 20 | 21 | .. 
code-block:: bash 22 | 23 | NOTIFICATION_SERVICE_URL 24 | NOTIFICATION_SERVICE_KEY 25 | -------------------------------------------------------------------------------- /docs/developer/tests.rst: -------------------------------------------------------------------------------- 1 | ***** 2 | Tests 3 | ***** 4 | 5 | Acceptance and Unit Tests 6 | ------------------------- 7 | 8 | These tests must be run from inside the Vagrant box:: 9 | 10 | $ vagrant ssh 11 | vagrant@zmon:~$ cd /vagrant/vagrant/ 12 | vagrant@zmon:/vagrant/vagrant$ sudo ./test.sh 13 | 14 | An example output of the previous command can look similar to this:: 15 | 16 | Starting Xvfb... 17 | [13:36:12] Using gulpfile /vagrant/zmon-controller/src/main/webapp/gulpfile.js 18 | [13:36:12] Starting 'test'... 19 | Starting selenium standalone server... 20 | Selenium standalone server started at http://10.0.2.15:47833/wd/hub 21 | Testing dashboard features 22 | should display the search form - pass 23 | 24 | Finished in 3.24 seconds 25 | 1 test, 1 assertion, 0 failures 26 | 27 | Shutting down selenium standalone server. 28 | [13:36:22] Finished 'test' after 10 s 29 | 30 | Only a single acceptance test and no unit tests are provided so far. This is still a work in progress. 31 | -------------------------------------------------------------------------------- /docs/user/downtimes.rst: -------------------------------------------------------------------------------- 1 | .. _downtimes: 2 | 3 | 4 | Downtimes 5 | --------- 6 | 7 | This functionality allows the user to acknowledge an existing alert or create a downtime schedule for an anticipated service 8 | interruption. When acknowledging an existing alert, the user has to provide the predicted duration, and when creating 9 | a scheduled downtime, the start and end date. If the downtime is currently active, meaning an alert occurred within the 10 | downtime period, the alert notification won't be shown in the dashboard and it'll be greyed out on the alert details page. 
11 | Please note that the downtime will not be evaluated immediately after creation, meaning that the alert might appear 12 | as active until it's evaluated again by the worker. E.g. if the user defined a downtime for an alert which is evaluated 13 | every minute and the last evaluation was 5 seconds ago, it would take approximately one more minute for the alert to 14 | appear in "downtime state". 15 | 16 | To acknowledge an alert or to schedule a new downtime, the user has to go to the specific alert details page and click 17 | on a downtime button next to the desired alert. 18 | -------------------------------------------------------------------------------- /docs/user/check-ref/eventlog_wrapper.rst: -------------------------------------------------------------------------------- 1 | EventLog 2 | -------- 3 | 4 | The ``eventlog()`` function allows you to conveniently count EventLog_ events by type and time. 5 | 6 | 7 | .. py:method:: count(event_type_ids, time_from, [time_to=None], [group_by=None]) 8 | 9 | Return event counts for given parameters. 10 | 11 | *event_type_ids* is either a single integer (use hex notation, e.g. ``0x96001``) or a list of integers. 12 | 13 | *time_from* is a string time specification (``'-5m'`` means 5 minutes ago, ``'-1h'`` means 1 hour ago). 14 | 15 | *time_to* is a string time specification and defaults to *now* if not given. 16 | 17 | *group_by* can specify an EventLog field name to group counts by 18 | 19 | :: 20 | 21 | eventlog().count(0x96001, time_from='-1m') # returns a single number 22 | eventlog().count([0x96001, 0x63005], time_from='-1m') # returns dict {'96001': 123, '63005': 456} 23 | eventlog().count(0x96001, time_from='-1m', group_by='appDomainId') # returns dict {'1': 123, '5': 456, ..} 24 | 25 | The ``count()`` method internally requests the EventLog Viewer's "count" JSON endpoint. 
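Because ``count()`` returns either a single number or a dict of counts depending on its arguments, a check command may want to normalize both shapes before alerting on them. The following is a minimal sketch and not part of ZMON: ``total_events`` is a hypothetical helper, and the sample values are copied from the comments in the example above.

```python
def total_events(result):
    """Collapse an eventlog().count() result into one total.

    Hypothetical helper (not a ZMON API): accepts the single number
    returned for one event type, or the dict returned for a list of
    event types or for a group_by field.
    """
    if isinstance(result, dict):
        return sum(result.values())
    return result


# Sample values taken from the comments in the example above.
print(total_events(123))                           # single event type
print(total_events({'96001': 123, '63005': 456}))  # several event types
```

Inside a real check command, the argument would be the value returned by ``eventlog().count(...)`` rather than a literal.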
26 | -------------------------------------------------------------------------------- /docs/user/comments.rst: -------------------------------------------------------------------------------- 1 | .. _comments: 2 | 3 | 4 | Alert Comments 5 | -------------- 6 | 7 | Comments are useful in providing additional information to other members of your team (or other teams) about your 8 | alerts. Those with ADMIN and USER roles can add comments to an alert, but VIEWERS cannot. ADMINs can delete 9 | either their own or other people's comments. USERs can delete only their own comments. 10 | 11 | Adding Comments 12 | ^^^^^^^^^^^^^^^ 13 | 14 | Follow these steps: 15 | 16 | * Open the alert definition where you want to add your comment. 17 | * Either click on the top-right link `Comments` to add a **general** comment (for all entities), or click on the balloon on the left side of the entity name to add a comment on a **specific** entity. 18 | * In the comments window, type your comment. Use as many lines as you need. 19 | * Click the `Post comment` button to save your comment. Done! 20 | 21 | Seeing Existing Comments 22 | ^^^^^^^^^^^^^^^^^^^^^^^^ 23 | 24 | It's easy: Just open the alert definition, then click on `Comments` (top-right link). 25 | 26 | Deleting Comments 27 | ^^^^^^^^^^^^^^^^^ 28 | 29 | Deleting is also easy: Open the alert definition, click on the top-right link `Comments`, click on the cross above the comment, and delete. 30 | -------------------------------------------------------------------------------- /docs/user/notifications/google_hangouts_chat.rst: -------------------------------------------------------------------------------- 1 | Google Hangouts Chat 2 | -------------------- 3 | 4 | Notify a Google Hangouts Chat room with the alert status. 5 | 6 | .. py:function:: send_google_hangouts_chat(webhook_link=None, message=None, color='red', threading='alert') 7 | 8 | Send Google Hangouts Chat notification. 
9 | 10 | :param webhook_link: Webhook Link in Google Hangouts Chat Room. Create a `Google Hangouts Chat Webhook`_ and copy the link here. 11 | :type webhook_link: str 12 | 13 | :param multiline: Should the Text in the notification span multiple lines or not? Default is ``True``. 14 | :type multiline: bool 15 | 16 | :param message: Message to be sent. If ``None``, then a message constructed from the alert will be sent. 17 | :type message: str 18 | 19 | :param color: Message color. Default is ``red`` if alert is raised. 20 | :type color: str 21 | 22 | :param threading: Message threading behaviour. Allowed values are ``alert`` (thread per alert entity), ``date`` (thread per day), ``alert-date`` (thread per alert entity per day) or ``none`` (unique thread per notification). Default is ``alert``. 23 | :type threading: str 24 | 25 | .. note:: 26 | 27 | Message color will be determined based on alert status. If alert has ended, then ``color`` will be ``green``, otherwise ``color`` argument will be used. 28 | 29 | .. _Google Hangouts Chat Webhook: https://developers.google.com/hangouts/chat/how-tos/webhooks 30 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | ZMON Docs 2 | ========= 3 | 4 | .. toctree:: 5 | :hidden: 6 | :maxdepth: 1 7 | 8 | intro 9 | getting-started 10 | 11 | .. _user-docs: 12 | 13 | .. toctree:: 14 | :hidden: 15 | :maxdepth: 2 16 | :caption: User Documentation 17 | 18 | user/entities 19 | user/check-definitions 20 | user/alert-definitions 21 | 22 | user/dashboards 23 | user/grafana 24 | user/tv-login 25 | 26 | user/check-commands 27 | user/alert-ref/alert_reference_functions 28 | user/notifications 29 | 30 | .. toctree:: 31 | :hidden: 32 | :maxdepth: 2 33 | :caption: Guides 34 | 35 | user/monitoringonaws 36 | 37 | 38 | .. _installation: 39 | 40 | .. 
toctree:: 41 | :maxdepth: 2 42 | :caption: Deploying ZMON 43 | :hidden: 44 | 45 | installation/requirements 46 | installation/components 47 | installation/configuration 48 | 49 | .. _developer-docs: 50 | 51 | .. toctree:: 52 | :hidden: 53 | :maxdepth: 2 54 | :caption: Developer Documentation 55 | 56 | developer/rest-api 57 | developer/zmon-cli 58 | developer/zmon-python-client 59 | developer/python-tutorial 60 | developer/tests 61 | developer/redis 62 | 63 | 64 | .. toctree:: 65 | :hidden: 66 | :maxdepth: 1 67 | :caption: Appendix 68 | 69 | apendix/glossary 70 | 71 | .. include:: intro.rst 72 | 73 | 74 | Indices and Tables 75 | ================== 76 | 77 | * :ref:`genindex` 78 | * :ref:`modindex` 79 | * :ref:`search` 80 | -------------------------------------------------------------------------------- /docs/user/check-ref/cassandra_wrapper.rst: -------------------------------------------------------------------------------- 1 | Cassandra 2 | --------- 3 | 4 | Provides access to a Cassandra cluster via ``cassandra()`` wrapper object. 5 | 6 | .. py:function:: cassandra(node, keyspace, username=None, password=None, port=9042, connect_timeout=1, protocol_version=3) 7 | 8 | 9 | Initialize cassandra wrapper. 10 | 11 | :param node: Cassandra host. 12 | :type node: str 13 | 14 | :param keyspace: Cassandra keyspace used during the session. 15 | :type keyspace: str 16 | 17 | :param username: Username used in connection. It is recommended to use unprivileged user for cassandra checks. 18 | :type username: str 19 | 20 | :param password: Password used in connection. 21 | :type password: str 22 | 23 | :param port: Cassandra host port. Default is 9042. 24 | :type port: int 25 | 26 | :param connect_timeout: Connection timeout. 27 | :type connect_timeout: int 28 | 29 | :param protocol_version: Protocol version used in connection. Default is 3. 30 | :type protocol_version: str 31 | 32 | .. note:: 33 | 34 | You should always use an unprivileged user to access your databases. 
Use ``plugin.cassandra.user`` and ``plugin.cassandra.pass`` to configure credentials for the zmon-worker. 35 | 36 | .. py:function:: execute(stmt) 37 | 38 | Execute a CQL statement against the specified keyspace. 39 | 40 | :param stmt: CQL statement 41 | :type stmt: str 42 | 43 | :return: CQL result 44 | :rtype: list 45 | -------------------------------------------------------------------------------- /docs/user/check-definitions.rst: -------------------------------------------------------------------------------- 1 | .. _check-definitions: 2 | 3 | ***************** 4 | Check Definitions 5 | ***************** 6 | 7 | Checks are ZMON's way of gathering data from arbitrary entities, e.g. databases, micro services, hosts and more. 8 | Create them as described below using either the UI or the CLI. 9 | 10 | Key properties 11 | ============== 12 | 13 | Command 14 | ------- 15 | 16 | The command is executed by the worker and is the data-gathering part of a check. 17 | It is executed once per selected entity and its result is made available to all attached alerts. 18 | You have different wrappers at hand and the ``entity`` variable is also available for access. 19 | 20 | Entity Filter 21 | ------------- 22 | 23 | Select the entities you want the check to execute against. Often only a type filter is applied, sometimes a more specific one. 24 | The alert allows you to do more fine-grained filtering. 25 | This makes checks easy to reuse. 26 | 27 | Interval 28 | -------- 29 | 30 | Specify the interval in seconds at which you want the check to be executed. 31 | 32 | Owning team 33 | ----------- 34 | 35 | This is the team that originally created the check. Right now this has little effect. 36 | 37 | Creating new checks 38 | =================== 39 | 40 | Using trial run 41 | --------------- 42 | 43 | Using the CLI 44 | ------------- 45 | 46 | .. 
code-block:: bash 47 | 48 | $ zmon check init new-check.yaml 49 | $ zmon check update new-check.yaml 50 | -------------------------------------------------------------------------------- /docs/installation/requirements.rst: -------------------------------------------------------------------------------- 1 | .. _requirements: 2 | 3 | ************ 4 | Requirements 5 | ************ 6 | 7 | The requirements below are all open source technologies that need to be available for ZMON to run with all its features. 8 | 9 | Redis 10 | ===== 11 | 12 | The Redis service is one of the core dependencies: ZMON uses Redis for its task queue and to store its current state. 13 | 14 | PostgreSQL 15 | ========== 16 | 17 | PostgreSQL is ZMON's data store for entities, checks, alerts, dashboards and Grafana dashboards. 18 | The entities service relies on PostgreSQL's jsonb data type, so you need PostgreSQL 9.4 or later. 19 | 20 | Cassandra 21 | ========= 22 | 23 | Cassandra needs to be available for KairosDB if you want to keep historic data and make use of Grafana; this is highly recommended. 24 | We strongly recommend running Cassandra 3.7+ and using the TimeWindow compaction strategy for KairosDB. 25 | This will nicely split your SSTables into a single file per day (depending on your config). 26 | 27 | KairosDB 28 | ======== 29 | 30 | KairosDB is our time series database of choice; however, by now we are running our own fork_. We believe the fork is not required for standard volume scenarios. 31 | ZMON will store every metric gathered in KairosDB so that you can use it directly or via Grafana to access historic data. 32 | ZMON itself allows you to plot charts from KairosDB in Dashboard widgets or go to check/alert specific charts directly. 33 | 34 | .. 
_fork: https://github.com/zalando-zmon/kairosdb 35 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ZMON source code on GitHub is no longer in active development. Zalando will no longer actively review issues or merge pull-requests. 2 | 3 | ZMON is still being used at Zalando and serves us well for many purposes. We are now deeper into our observability journey and understand better that we need other telemetry sources and tools to elevate our understanding of the systems we operate. We support the `OpenTelemetry `_ initiative and recommend that others starting their journey begin there. 4 | 5 | If members of the community are interested in continuing to develop ZMON, consider forking it. Please review the licence before you do. 6 | 7 | ================== 8 | ZMON Documentation 9 | ================== 10 | 11 | .. image:: https://readthedocs.org/projects/zmon/badge/?version=latest 12 | :target: https://readthedocs.org/projects/zmon/?badge=latest 13 | :alt: Documentation Status 14 | 15 | First install the Sphinx_ documentation generator. 16 | 17 | .. code-block:: bash 18 | 19 | $ sudo pip install sphinx sphinx_rtd_theme 20 | 21 | 22 | Generate the ZMON HTML documentation locally: 23 | 24 | .. code-block:: bash 25 | 26 | $ cd docs; make html 27 | 28 | .. _Sphinx: http://sphinx-doc.org/ 29 | 30 | Serve the docs locally (run from the ``docs`` directory) with Python 2: 31 | 32 | .. code-block:: bash 33 | 34 | $ python -m SimpleHTTPServer 8888 35 | 36 | If you are using Python 3: 37 | 38 | .. code-block:: bash 39 | 40 | $ python3 -m http.server 8888 41 | 42 | The docs are then available at: 43 | `http://localhost:8888/_build/html/` 44 | -------------------------------------------------------------------------------- /docs/user/notifications/mail.rst: -------------------------------------------------------------------------------- 1 | Mail 2 | ---- 3 | 4 | Send email notifications. 5 | 6 | .. 
py:function:: send_mail(subject=None, cc=None, html=False, hide_recipients=True, include_value=True, include_definition=True, include_captures=True, include_entity=True, per_entity=True) 7 | 8 | Send email notification. 9 | 10 | :param subject: Email subject. 11 | You must use a unicode string (e.g. `u'äöüß'`) if you have non-ASCII 12 | characters in there. 13 | If None, the alert name will be used. 14 | :type subject: str or unicode or None 15 | 16 | :param cc: List of CC recipients. 17 | :type cc: list 18 | 19 | :param html: HTML email. 20 | :type html: bool 21 | 22 | :param hide_recipients: Hide recipients. Will be sent as BCC. 23 | :type hide_recipients: bool 24 | 25 | :param include_value: Include alert value in notification message. 26 | :type include_value: bool 27 | 28 | :param include_definition: Include alert definition details in notification message. 29 | :type include_definition: bool 30 | 31 | :param include_captures: Include alert captures in message. 32 | :type include_captures: bool 33 | 34 | :param include_entity: Include affected entities in notification message. 35 | :type include_entity: bool 36 | 37 | :param per_entity: Send new email notification per entity. Default is ``True``. 38 | :type per_entity: bool 39 | 40 | 41 | .. note:: 42 | 43 | ``send_email`` is an alias for this notification function. 44 | -------------------------------------------------------------------------------- /docs/user/check-ref/ebs_wrapper.rst: -------------------------------------------------------------------------------- 1 | EBS 2 | --- 3 | 4 | Allows describing EBS objects (currently, only Snapshots are supported). 5 | 6 | 7 | .. py:function:: ebs() 8 | 9 | 10 | Methods of EBS 11 | ^^^^^^^^^^^^^^ 12 | 13 | .. py:function:: list_snapshots(account_id, max_items) 14 | 15 | List the EBS Snapshots owned by the given account_id. 16 | By default, listing is possible for up to 1000 items, so we use pagination internally to overcome this. 
17 | 18 | :param account_id: AWS account id number (as a string). Defaults to the AWS account id where the check is running. 19 | :param max_items: the maximum number of snapshots to list. Defaults to 100. 20 | :return: an ``EBSSnapshotsList`` object 21 | 22 | .. py:class:: EBSSnapshotsList 23 | 24 | .. py:method:: items() 25 | 26 | Returns a list of dicts like 27 | 28 | .. code-block:: json 29 | 30 | { 31 | "id": "snap-12345", 32 | "description": "Snapshot description...", 33 | "size": 123, 34 | "start_time": "2017-07-16T01:01:21Z", 35 | "state": "completed" 36 | } 37 | 38 | Example usage: 39 | 40 | .. code-block:: python 41 | 42 | ebs().list_snapshots().items() 43 | 44 | snapshots = ebs().list_snapshots(max_items=1000).items() # for listing more than the default of 100 snapshots 45 | start_time = snapshots[0]["start_time"].isoformat() # returns a string that can be passed to time() 46 | age = time() - time(start_time) 47 | -------------------------------------------------------------------------------- /docs/user/check-ref/ldap_wrapper.rst: -------------------------------------------------------------------------------- 1 | LDAP 2 | ---- 3 | 4 | Retrieve OpenLDAP statistics (needs "cn=Monitor" database installed in LDAP server). :: 5 | 6 | ldap().statistics() 7 | 8 | This would return a dict like: 9 | 10 | .. code-block:: json 11 | 12 | { 13 | "connections_current": 77, 14 | "connections_per_sec": 27.86, 15 | "entries": 359369, 16 | "max_file_descriptors": 65536, 17 | "operations_add_per_sec": 0.0, 18 | "operations_bind_per_sec": 27.99, 19 | "operations_delete_per_sec": 0.0, 20 | "operations_extended_per_sec": 0.23, 21 | "operations_modify_per_sec": 0.09, 22 | "operations_search_per_sec": 24.09, 23 | "operations_unbind_per_sec": 27.82, 24 | "waiters_read": 76, 25 | "waiters_write": 0 26 | } 27 | 28 | All information is based on the cn=Monitor OpenLDAP tree. You can get more information in the `OpenLDAP Administrator's Guide`_. 
29 | The meaning of the different fields is as follows: 30 | 31 | ``connections_current`` 32 | Number of currently established TCP connections. 33 | 34 | ``connections_per_sec`` 35 | Increase of connections per second. 36 | 37 | ``entries`` 38 | Number of LDAP records. 39 | 40 | ``operations_*_per_sec`` 41 | Number of operations per second per operation type (add, bind, search, ..). 42 | 43 | ``waiters_read`` 44 | Number of waiters for read (whatever that means, OpenLDAP documentation does not say anything). 45 | 46 | .. _OpenLDAP Administrator's Guide: http://www.openldap.org/doc/admin24/monitoringslapd.html#Monitor%20Information 47 | -------------------------------------------------------------------------------- /docs/user/notifications/pagerduty.rst: -------------------------------------------------------------------------------- 1 | Pagerduty 2 | --------- 3 | 4 | Notify `Pagerduty `_ of a new alert status. If alert is **active**, then a new pagerduty incident with type ``trigger`` will be sent. If alert is **inactive** then incident type will be updated to ``resolve``. 5 | 6 | .. note:: 7 | 8 | Pagerduty notification plugin uses API v2. 9 | 10 | 11 | .. py:function:: notify_pagerduty(message='', per_entity=False, include_alert=True, routing_key=None, alert_class=None, alert_group=None, **kwargs) 12 | 13 | Send notifications to Pagerduty. 14 | 15 | :param message: Incident message. If empty, then a message will be generated from the alert data. 16 | :type message: str 17 | 18 | :param per_entity: Send new alert per entity. This affects the ``dedup_key`` value and impacts how de-duplication is handled in Pagerduty. Default is ``False``. 19 | :type per_entity: bool 20 | 21 | :param include_alert: Include alert data in incident payload ``custom_details``. Default is ``True``. 22 | :type include_alert: bool 23 | 24 | :param routing_key: Pagerduty service ``routing_key``. If not specified, then the :ref:`service key configured ` for the worker will be used. 
25 | :type routing_key: str 26 | 27 | :param alert_class: Set the Pagerduty incident class. 28 | :type alert_class: str 29 | 30 | :param alert_group: Set the Pagerduty incident group. 31 | :type alert_group: str 32 | 33 | Example: 34 | 35 | .. code-block:: python 36 | 37 | notify_pagerduty(message='Number of failed requests is too high!', include_alert=True, alert_class='API health', alert_group='production') 38 | -------------------------------------------------------------------------------- /docs/user/alert-definition-inheritance.rst: -------------------------------------------------------------------------------- 1 | .. _alert-definition-inheritance: 2 | 3 | Alert Definition Inheritance 4 | ---------------------------- 5 | 6 | Alert definition *inheritance* allows one to create an alert definition based on another alert, whereby the child reuses attributes from the parent. 7 | Each alert definition can only inherit from a single alert definition (``single inheritance``). 8 | 9 | Template 10 | ^^^^^^^^ 11 | 12 | A Template is basically an alert definition with a subset of attributes that **is not evaluated and can only be used for extension**. 13 | 14 | To create a template: 15 | 16 | #. Select the check definition 17 | #. Click **Add New Alert Definition** 18 | #. Set the attributes to reuse and activate the ``template`` checkbox 19 | 20 | Extending 21 | ^^^^^^^^^ 22 | 23 | In general one can inherit from any alert definition/template. Open the alert definition details and click ``inherit`` in the top right corner. 24 | To override a field, just type in a new value. An icon should appear on the left side, meaning that the field will be overridden. 25 | To roll back the change and keep the value defined on the parent, click the ``override`` icon.
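The inherit-and-override behaviour described above can be pictured as a dictionary merge. The following is a minimal sketch, not ZMON's implementation: the helper name ``resolve_alert_definition`` and the field names are hypothetical and chosen only for illustration. It assumes that a child stores only the fields it overrides and that every other field falls back to the parent's value. Attributes that a child must always set itself are left out of the sketch.

```python
def resolve_alert_definition(parent, child_overrides):
    """Return the effective child definition (hypothetical helper)."""
    effective = dict(parent)            # start from a copy of the parent's attributes
    effective.update(child_overrides)   # overridden fields replace the parent's values
    return effective


# Hypothetical parent (template) and a child that overrides one field.
parent = {"name": "High error rate", "condition": ">10", "priority": 2}
child = resolve_alert_definition(parent, {"condition": ">50"})
print(child)
```

Removing a key from the overrides corresponds to rolling back an override: the field falls back to the parent's value again.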
26 | 27 | Overriding 28 | ^^^^^^^^^^ 29 | 30 | By default the child alert retains all attributes of the parent alert with the exception of the following mandatory attributes: 31 | - team 32 | - responsible team 33 | - status 34 | 35 | These attributes are used for ``authorization`` (see :ref:`permissions` for details); therefore, they cannot be reused. If one changes these attributes on the parent alert definition, child alerts are not affected and you don't lose access rights. 36 | All the remaining attributes can be overridden, replacing the parent alert definition's values with the child's own. 37 | -------------------------------------------------------------------------------- /docs/user/check-ref/zomcat_wrapper.rst: -------------------------------------------------------------------------------- 1 | Zomcat 2 | ------ 3 | 4 | Retrieve zomcat instance status (memory, CPU, threads). :: 5 | 6 | zomcat().health() 7 | 8 | This would return a dict like: 9 | 10 | .. code-block:: json 11 | 12 | { 13 | "cpu_percentage": 5.44, 14 | "gc_percentage": 0.11, 15 | "gcs_per_sec": 0.25, 16 | "heap_memory_percentage": 6.52, 17 | "heartbeat_enabled": true, 18 | "http_errors_per_sec": 0.0, 19 | "jobs_enabled": true, 20 | "nonheap_memory_percentage": 20.01, 21 | "requests_per_sec": 1.09, 22 | "threads": 128, 23 | "time_per_request": 42.58 24 | } 25 | 26 | Most of the values are retrieved via JMX: 27 | 28 | ``cpu_percentage`` 29 | CPU usage in percent (retrieved from JMX). 30 | 31 | ``gc_percentage`` 32 | Percentage of time spent in garbage collection runs. 33 | 34 | ``gcs_per_sec`` 35 | Garbage collections per second. 36 | 37 | ``heap_memory_percentage`` 38 | Percentage of heap memory used. 39 | 40 | ``nonheap_memory_percentage`` 41 | Percentage of non-heap memory (e.g. permanent generation) used. 42 | 43 | ``heartbeat_enabled`` 44 | Boolean indicating whether heartbeat.jsp is enabled (``true``) or not (``false``). If ``/heartbeat.jsp`` cannot be retrieved, the value is ``null``.
45 | 46 | ``http_errors_per_sec`` 47 | Number of Tomcat HTTP errors per second (all 4xx and 5xx HTTP status codes). 48 | 49 | ``jobs_enabled`` 50 | Boolean indicating whether jobs are enabled (``true``) or not (``false``). If ``/jobs.monitor`` cannot be retrieved, the value is ``null``. 51 | 52 | ``requests_per_sec`` 53 | Number of HTTP/AJP requests per second. 54 | 55 | ``threads`` 56 | Total number of threads. 57 | 58 | ``time_per_request`` 59 | Average time in milliseconds per HTTP/AJP request. 60 | -------------------------------------------------------------------------------- /docs/user/notifications/opsgenie.rst: -------------------------------------------------------------------------------- 1 | Opsgenie 2 | -------- 3 | 4 | Notify `Opsgenie `_ of a new alert status. If alert is **active**, then a new opsgenie alert will be created. If alert is **inactive** then the alert will be closed. 5 | 6 | 7 | .. py:function:: notify_opsgenie(message='', teams=None, per_entity=False, priority=None, include_alert=True, description='', custom_fields=None, **kwargs) 8 | 9 | Send notifications to Opsgenie. 10 | 11 | :param message: Alert message. If empty, then a message will be generated from the alert data. 12 | :type message: str 13 | 14 | :param teams: Opsgenie teams to be notified. Value can be a single team or a list of teams. 15 | :type teams: str | list 16 | 17 | :param per_entity: Send new alert per entity. This affects the ``alias`` value and impacts how de-duplication is handled in Opsgenie. Default is ``False``. 18 | :type per_entity: bool 19 | 20 | :param priority: Set Opsgenie priority for this notification. Valid values are ``P1``, ``P2``, ``P3``, ``P4`` or ``P5``. 21 | :type priority: str 22 | 23 | :param include_alert: Include alert data in alert body ``details``. Default is ``True``. 24 | :type include_alert: bool 25 | 26 | :param include_captures: Include captures data in alert body ``details``. Default is ``False``. 
27 | :type include_captures: bool 28 | 29 | :param description: An optional description. If present, this is inserted into the Opsgenie alert description field. 30 | :type description: str 31 | 32 | :param custom_fields: If present, the given fields are added to the Opsgenie ``details`` field. 33 | :type custom_fields: dict 34 | 35 | 36 | Example: 37 | 38 | .. code-block:: python 39 | 40 | notify_opsgenie(teams=['zmon', 'ops'], message='Number of failed requests is too high!', include_alert=True) 41 | 42 | 43 | .. note:: 44 | 45 | If ``priority`` is not set, then ZMON will set the priority according to the alert priority. 46 | -------------------------------------------------------------------------------- /docs/user/notifications/hipchat.rst: -------------------------------------------------------------------------------- 1 | Hipchat 2 | ------- 3 | 4 | Notify Hipchat room with alert status. 5 | 6 | .. py:function:: send_hipchat(room=None, message=None, token=None, message_format='html', notify=False, color='red', link=False, link_text='go to alert') 7 | 8 | Send Hipchat notification to specified room. 9 | 10 | :param room: Room to be notified. 11 | :type room: str 12 | 13 | :param message: Message to be sent. If ``None``, then a message constructed from the alert will be sent. 14 | :type message: str 15 | 16 | :param token: Hipchat API token. 17 | :type token: str 18 | 19 | :param message_format: Message format: ``html`` (default) or ``text`` (which correctly treats @mentions). 20 | :type message_format: str 21 | 22 | :param notify: Hipchat notify flag. Default is ``False``. 23 | :type notify: bool 24 | 25 | :param color: Message color. Default is ``red`` if alert is raised. 26 | :type color: str 27 | 28 | :param link: Add link to Hipchat message. Default is ``False``. 29 | :type link: bool 30 | 31 | :param link_text: If the ``link`` param is ``True``, this will be displayed as a link in the Hipchat message. Default is ``go to alert``.
32 | :type link_text: str 33 | 34 | .. note:: 35 | 36 | Message color will be determined based on alert status. If the alert has ended, ``color`` will be ``green``; otherwise the ``color`` argument will be used. 37 | 38 | Example message - using html format (default): 39 | 40 | .. code-block:: json 41 | 42 | { 43 | "message": "NEW ALERT: Requests failing with status 500 on host-production-1-entity", 44 | "color": "red", 45 | "notify": true 46 | } 47 | 48 | Example message - using text format with @mention: 49 | 50 | .. code-block:: json 51 | 52 | { 53 | "message": "@here NEW ALERT: Requests failing with status 500 on host-production-1-entity", 54 | "color": "red", 55 | "notify": true, 56 | "message_format": "text" 57 | } 58 | 59 | -------------------------------------------------------------------------------- /docs/developer/zmon-python-client.rst: -------------------------------------------------------------------------------- 1 | .. _zmon-python-client: 2 | 3 | ************* 4 | Python Client 5 | ************* 6 | 7 | ZMON provides a Python client library that can be imported and used in your own software. 8 | 9 | Installation 10 | ------------ 11 | 12 | The ZMON Python client library is part of :ref:`ZMON CLI `. 13 | 14 | .. code-block:: bash 15 | 16 | pip3 install --upgrade zmon-cli 17 | 18 | Usage 19 | ----- 20 | 21 | Using the ZMON client is pretty straightforward. 22 | 23 | .. 
code-block:: python 24 | 25 | >>> from zmon_cli.client import Zmon 26 | 27 | >>> zmon = Zmon('https://zmon.example.org', token='123') 28 | 29 | >>> entity = zmon.get_entity('entity-1') 30 | { 31 | 'id': 'entity-1', 32 | 'team': 'ZMON', 33 | 'type': 'instance', 34 | 'data': {'host': '192.168.20.16', 'port': 8080, 'name': 'entity-1-instance'} 35 | } 36 | 37 | >>> zmon.delete_entity('entity-102') 38 | True 39 | 40 | >>> check = zmon.get_check_definition(123) 41 | 42 | >>> check['command'] 43 | http('http://www.custom-service.example.org/health').code() 44 | 45 | >>> check['command'] = "http('http://localhost:9090/health').code()" 46 | 47 | >>> zmon.update_check_definition(check) 48 | { 49 | 'command': "http('http://localhost:9090/health').code()", 50 | 'description': 'Check service health', 51 | 'entities': [{'application_id': 'custom-service', 'type': 'instance'}], 52 | 'id': 123, 53 | 'interval': 60, 54 | 'last_modified_by': 'admin', 55 | 'name': 'Check service health', 56 | 'owning_team': 'ZMON', 57 | 'potential_analysis': None, 58 | 'potential_impact': None, 59 | 'potential_solution': None, 60 | 'source_url': None, 61 | 'status': 'ACTIVE', 62 | 'technical_details': None 63 | } 64 | 65 | Client 66 | ------ 67 | 68 | Exceptions 69 | ========== 70 | 71 | .. autoclass:: zmon_cli.client.ZmonError 72 | :members: 73 | 74 | .. autoclass:: zmon_cli.client.ZmonArgumentError 75 | :members: 76 | 77 | Zmon 78 | ==== 79 | 80 | .. autoclass:: zmon_cli.client.Zmon 81 | :members: 82 | -------------------------------------------------------------------------------- /docs/developer/redis.rst: -------------------------------------------------------------------------------- 1 | ==================== 2 | Redis Data Structure 3 | ==================== 4 | 5 | ZMON stores its primary working data in Redis. This page describes the used Redis keys and data structures. 6 | 7 | Queues are Redis keys like ``zmon:queue:`` of type "list", e.g. ``zmon:queue:default``. 
8 | 9 | New queue items are added by the ZMON Scheduler via the `Redis "rpush" command`_. 10 | 11 | Important Redis key patterns are: 12 | 13 | ``zmon:queue:`` 14 | List of worker tasks for given queue. 15 | ``zmon:checks`` 16 | Set of all executed check IDs. 17 | ``zmon:checks:`` 18 | Set of entity IDs having check results. 19 | ``zmon:checks::`` 20 | List of last N check results. The first list item contains the most recent check result. 21 | Each check result is a JSON object with the keys ``ts`` (result timestamp), ``td`` (check duration), ``value`` (actual result value) and ``worker`` (ID of worker having produced the check result). 22 | ``zmon:alerts`` 23 | Set of all active alert IDs. 24 | ``zmon:alerts:`` 25 | Set of entity IDs in alert state. 26 | ``zmon:alerts::entities`` 27 | Hash of entity IDs to alert captures. This hash contains *all* entity IDs matched by the alert, i.e. not only entities in alert state. 28 | ``zmon:alerts::`` 29 | Alert detail JSON containing alert start time, captures, worker, etc. 30 | ``zmon:downtimes`` 31 | Set of all alert IDs having downtimes. 32 | ``zmon:downtimes:`` 33 | Set of all entity IDs having a downtime for this alert. 34 | ``zmon:downtimes::`` 35 | Hash of downtimes for this entity/alert. Each hash value is a JSON object with keys ``start_time``, ``end_time`` and ``comment``. 36 | ``zmon:active_downtimes`` 37 | Set of currently active downtimes. Each set item has the form ``::``. 38 | ``zmon:metrics`` 39 | Set of worker and scheduler IDs with metrics. 40 | ``zmon:metrics::ts`` 41 | Timestamp of last worker or scheduler metrics update. 42 | ``zmon:metrics::check.count`` 43 | Increasing counter of executed (or scheduled) checks. 44 | 45 | .. 
_Redis "rpush" command: http://redis.io/commands/rpush 46 | -------------------------------------------------------------------------------- /docs/user/check-ref/entities_wrapper.rst: -------------------------------------------------------------------------------- 1 | Entities 2 | -------- 3 | 4 | Provides access to ZMON entities. 5 | 6 | .. py:function:: entities(service_url, infrastructure_account, verify=True, oauth2=False) 7 | 8 | Initialize entities wrapper. 9 | 10 | :param service_url: Entities service URL. 11 | :type service_url: str 12 | 13 | :param infrastructure_account: Infrastructure account used to filter entities. 14 | :type infrastructure_account: str 15 | 16 | :param verify: Verify SSL connection. Default is ``True``. 17 | :type verify: bool 18 | 19 | :param oauth2: Use OAuth2 for authentication. Default is ``False``. 20 | :type oauth2: bool 21 | 22 | .. note:: 23 | 24 | If `service_url` or `infrastructure_account` are not supplied, the corresponding values from the worker plugin config will be used. 25 | 26 | 27 | Methods of Entities 28 | ^^^^^^^^^^^^^^^^^^^ 29 | 30 | .. py:function:: search_local(**kwargs) 31 | 32 | Search entities in local infrastructure account. If `infrastructure_account` is not supplied in kwargs, the wrapper's own `infrastructure_account` is used as the default filter, so the search covers entities "local" to your filtered entities. 33 | 34 | :param kwargs: Filtering kwargs 35 | :type kwargs: str 36 | 37 | :return: Entities 38 | :rtype: list 39 | 40 | Example searching all ``instance`` entities in local account: 41 | 42 | .. code-block:: python 43 | 44 | entities().search_local(type='instance') 45 | 46 | 47 | .. py:function:: search_all(**kwargs) 48 | 49 | Search all entities. 50 | 51 | :param kwargs: Filtering kwargs 52 | :type kwargs: str 53 | 54 | :return: Entities 55 | :rtype: list 56 | 57 | 58 | .. py:function:: alert_coverage(**kwargs) 59 | 60 | Return alert coverage for infrastructure_account.
61 | 62 | :param kwargs: Filtering kwargs 63 | :type kwargs: str 64 | 65 | :return: Alert coverage result. 66 | :rtype: list 67 | 68 | 69 | .. code-block:: python 70 | 71 | entities().alert_coverage(type='instance', infrastructure_account='1052643') 72 | 73 | [ 74 | { 75 | 'alerts': [], 76 | 'entities': [ 77 | {'id': 'app-1-instance', 'type': 'instance'} 78 | ] 79 | } 80 | ] 81 | -------------------------------------------------------------------------------- /docs/user/notifications/http.rst: -------------------------------------------------------------------------------- 1 | HTTP 2 | ---- 3 | 4 | Provides notification by invoking an HTTP call to a given endpoint. The HTTP notification uses the ``POST`` method. 5 | 6 | 7 | .. py:function:: notify_http(url=None, body=None, params=None, headers=None, timeout=5, oauth2=False, include_alert=True) 8 | 9 | Send HTTP notification to specified endpoint. 10 | 11 | :param url: HTTP endpoint URL. If not passed, the default URL from the worker configuration will be used. 12 | :type url: str 13 | 14 | :param body: Request body. 15 | :type body: dict 16 | 17 | :param params: Request URL params. 18 | :type params: dict 19 | 20 | :param headers: HTTP headers. 21 | :type headers: dict 22 | 23 | :param timeout: Request timeout. Default is 5 seconds. 24 | :type timeout: int 25 | 26 | :param oauth2: Add OAuth2 authentication headers. Default is ``False``. 27 | :type oauth2: bool 28 | 29 | :param include_alert: Include alert data in request body. Default is ``True``. 30 | :type include_alert: bool 31 | 32 | Example: 33 | 34 | .. code-block:: python 35 | 36 | notify_http('https://some-notification-service/alert', body={'zmon': True}, headers={'X-TOKEN': 1234}) 37 | 38 | 39 | .. note:: 40 | 41 | If ``include_alert`` is ``True``, then the request body will include alert data. This is usually useful, since it provides valuable info like ``is_alert`` and ``changed``, which can indicate whether the alert has **started** or **ended**.
42 | 43 | .. code-block:: python 44 | 45 | { 46 | "body": null, 47 | "alert": { 48 | "is_alert": true, 49 | "changed": true, 50 | "duration": 2.33, 51 | "captures": {}, 52 | "entity": {"type": "GLOBAL", "id": "GLOBAL"}, 53 | "worker": "plocal.zmon", 54 | "value": {"td": 0.00037, "worker": "plocal.zmon", "ts": 1472032348.665247, "value": 51.67797677979191}, 55 | "alert_def": { 56 | "name": "Random Example Alert", "parameters": null, "check_id": 4, "entities_map": [], "responsible_team": "ZMON", "period": "", "priority": 1, 57 | "notifications": ["notify_http()"], "team": "ZMON", "id": 3, "condition": ">40" 58 | } 59 | } 60 | } 61 | -------------------------------------------------------------------------------- /docs/installation/components.rst: -------------------------------------------------------------------------------- 1 | ************************ 2 | Essential ZMON Components 3 | ************************ 4 | 5 | To use ZMON requires these four components: zmon-controller_, zmon-scheduler_, zmon-worker_, and zmon-eventlog-service_. 6 | 7 | .. image:: ../images/components.svg 8 | 9 | Controller 10 | ========== 11 | 12 | zmon-controller_ runs ZMON's AngularJS frontend and serves as an endpoint for retrieving data and managing your ZMON deployment via REST API (with help from the command line client). It needs a connection configured to: 13 | 14 | * PostgreSQL to store/retrieve all kind of data: entities, checks, dashboards, alerts 15 | * Redis, to keep the state of ZMON's alerts 16 | * KairosDB, if you want charts/Grafana 17 | 18 | To provide a means of authentication and authorization, you can choose between the following options: 19 | 20 | * A basic credential file 21 | * An OAuth2 identity provider, e.g., GitHub 22 | 23 | Scheduler 24 | ========= 25 | 26 | zmon-scheduler_ is responsible for keeping track of all existing entities, checks and alerts and scheduling checks in time for applicable entities, which are then executed by the worker. 
27 | 28 | Needs connections to: 29 | 30 | * Redis, which serves ZMON as a task queue 31 | * Controller, to get check/alerts/entities 32 | * Custom adapters might need connections for entity discovery in your platform 33 | 34 | Worker 35 | ====== 36 | 37 | zmon-worker_ does the heavy lifting — executing tasks against entities and evaluating all alerts assigned to this check. Tasks are picked up from Redis and the resulting check value plus alert state changes are written back to Redis. 38 | 39 | Needs connection to: 40 | * Redis to retrieve tasks and update current state 41 | * KairosDB if you want to have metrics 42 | * EventLog service to store history events for alert state changes 43 | 44 | EventLog Service 45 | ================ 46 | 47 | zmon-eventlog-service_ is our slim implementation of an event store, keeping track of Events related to alert state changes as well as events like alert and check modification by the user. 48 | 49 | Needs connection to: 50 | * PostgreSQL to store events using jsonb 51 | 52 | .. _zmon-controller: https://github.com/zalando-zmon/zmon-controller 53 | .. _zmon-scheduler: https://github.com/zalando-zmon/zmon-scheduler 54 | .. _zmon-worker: https://github.com/zalando-zmon/zmon-worker 55 | .. _zmon-eventlog-service: https://github.com/zalando-zmon/zmon-eventlog-service 56 | -------------------------------------------------------------------------------- /docs/apendix/glossary.rst: -------------------------------------------------------------------------------- 1 | .. _glossary: 2 | 3 | ******** 4 | Glossary 5 | ******** 6 | 7 | .. KEEP IN ALPHABETCAL ORDER! 8 | 9 | .. glossary:: 10 | 11 | alert definition 12 | Alert definitions define when to trigger an alert and for which entity. 13 | See :ref:`alert-definitions` 14 | 15 | alert condition 16 | Python expression defining the "threshold" when to trigger an alert. See :ref:`alert-condition`. 17 | 18 | check command 19 | Python expression defining the value of a check. 
See :ref:`check-commands`. 20 | 21 | check definition 22 | A check definition provides a source of data for alerts to monitor. See :ref:`check-definitions` 23 | 24 | dashboard 25 | A dashboard is the main monitoring page of ZMON and consists of widgets and the list of active alerts. 26 | See :ref:`dashboards` 27 | 28 | downtime 29 | In ZMON, downtime refers to a period of time where certain alerts/entities should not be triggered. 30 | One use case for downtimes is scheduled maintenance work. See :ref:`downtimes` 31 | 32 | entity 33 | Entities are "objects" to be monitored. Entities can be hosts, Zomcat instances, but they can also be more abstract things like app domains. 34 | See :ref:`entities` 35 | 36 | JSON 37 | JavaScript Object Notation. A minimal data interchange format. You probably already know it. If you don't, there's good documentation on its `official page `_. 38 | 39 | Markdown 40 | A simple markup language that can mostly pass for plain text. There's an `introduction `_ and a `syntax reference `_ on its official page. 41 | 42 | time period 43 | An alert definition's time period can restrict its active alerting to certain time frames. This allows for alerts to be active e.g. only during work hours. 44 | See :ref:`time-periods` 45 | 46 | YAML 47 | Not actually Yet Another Markup Language. A powerful but succinct data interchange format. This document should be sufficient to learn how to use YAML in ZMON. In case it isn't, the `Wikipedia entry on YAML `_ is actually slightly more useful than the `official documentation `_. 48 | 49 | Note that YAML is a strict superset of :term:`JSON`. That is, wherever YAML is required, JSON can be used instead. 50 | 51 | .. KEEP IN ALPHABETICAL ORDER! 52 | -------------------------------------------------------------------------------- /docs/user/check-ref/datapipeline_wrapper.rst: -------------------------------------------------------------------------------- 1 | .. 
_datapipeline: 2 | 3 | Data Pipeline 4 | ------------- 5 | 6 | If running on AWS you can use ``datapipeline()`` to access AWS Data Pipelines' health easily. 7 | 8 | .. py:function:: datapipeline(region=None) 9 | 10 | Initialize Data Pipeline wrapper. 11 | 12 | :param region: AWS region for Data Pipeline queries, e.g. "eu-west-1". Defaults to the region in which the check is being executed. Note that Data Pipeline is not available in "eu-central-1" at the time of writing. 13 | :type region: str 14 | 15 | 16 | Methods of Data Pipeline 17 | ^^^^^^^^^^^^^^^^^^^^^^^^ 18 | .. py:method:: get_details(pipeline_ids) 19 | 20 | Query AWS Data Pipeline IDs supplied as a String (single) or list of Strings (multiple). 21 | Return a dict of ID(s) and status dicts as described in `describe_pipelines boto documentation`_. 22 | 23 | :param pipeline_ids: Data Pipeline IDs. Example ``df-0123456789ABCDEFGHI`` 24 | :type pipeline_ids: Union[str, list] 25 | :rtype: dict 26 | 27 | Example query with single Data Pipeline ID supplied in a list: 28 | 29 | .. 
code-block:: python 30 | 31 | datapipeline().get_details(pipeline_ids=['df-exampleA']) 32 | { 33 | "df-exampleA": { 34 | "@lastActivationTime": "2018-01-30T14:23:52", 35 | "pipelineCreator": "ABCDEF:auser", 36 | "@scheduledPeriod": "24 hours", 37 | "@accountId": "0123456789", 38 | "name": "exampleA", 39 | "@latestRunTime": "2018-01-04T03:00:00", 40 | "@id": "df-0441325MB6VYFI6MUU1", 41 | "@healthStatusUpdatedTime": "2018-01-01T10:00:00", 42 | "@creationTime": "2018-01-01T10:00:00", 43 | "@userId": "0123456789", 44 | "@sphere": "PIPELINE", 45 | "@nextRunTime": "2018-01-05T03:00:00", 46 | "@scheduledStartTime": "2018-01-02T03:00:00", 47 | "@healthStatus": "HEALTHY", 48 | "uniqueId": "exampleA", 49 | "*tags": "[{\"key\":\"DataPipelineName\",\"value\":\"exampleA\"},{\"key\":\"DataPipelineId\",\"value\":\"df-exampleA\"}]", 50 | "@version": "2", 51 | "@firstActivationTime": "2018-01-01T10:00:00", 52 | "@pipelineState": "SCHEDULED" 53 | } 54 | } 55 | 56 | .. _describe_pipelines boto documentation: http://boto3.readthedocs.io/en/latest/reference/services/datapipeline.html#DataPipeline.Client.describe_pipelines 57 | -------------------------------------------------------------------------------- /docs/user/check-ref/memcached_wrapper.rst: -------------------------------------------------------------------------------- 1 | Memcached 2 | --------- 3 | 4 | Read-only access to memcached servers is provided by the :py:func:`memcached` function. 5 | 6 | 7 | .. py:function:: memcached([host=some.host], [port=11211]) 8 | 9 | Returns a connection to the Memcached server at :samp:`{}:{}`, where :samp:`{}` is the value 10 | of the current entity's ``host`` attribute, and :samp:`{}` is the given port (default ``11211``). See 11 | below for a list of methods provided by the returned connection object. 
12 | 13 | 14 | Methods of the Memcached Connection 15 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 16 | 17 | The object returned by the :py:func:`memcached` function provides the following methods: 18 | 19 | .. py:method:: get(key) 20 | 21 | Returns the string stored at `key`. If `key` does not exist, an error is raised. 22 | 23 | :: 24 | 25 | memcached().get("example_memcached_key") 26 | 27 | 28 | .. py:method:: json(key) 29 | 30 | Returns the data of the key as deserialized JSON data, i.e. you can store a JSON object as 31 | the value of the key and get a dict back. 32 | 33 | :: 34 | 35 | memcached().json("example_memcached_key") 36 | 37 | 38 | 39 | .. py:method:: stats([extra_keys=[STR,STR]]) 40 | 41 | Returns a ``dict`` with general Memcached statistics such as memory usage and operations/s. 42 | All values are extracted using the `Memcached STATS command`_. 43 | 44 | Any `extra_keys` given are additionally retrieved from the memcached server's `stats` 45 | command and returned as well, e.g. `version` or `uptime`. 46 | 47 | Example result: 48 | 49 | .. code-block:: json 50 | 51 | { 52 | "incr_hits_per_sec": 0, 53 | "incr_misses_per_sec": 0, 54 | "touch_misses_per_sec": 0, 55 | "decr_misses_per_sec": 0, 56 | "touch_hits_per_sec": 0, 57 | "get_expired_per_sec": 0, 58 | "get_hits_per_sec": 100.01, 59 | "cmd_get_per_sec": 119.98, 60 | "cas_hits_per_sec": 0, 61 | "cas_badval_per_sec": 0, 62 | "delete_misses_per_sec": 0, 63 | "bytes_read_per_sec": 6571.76, 64 | "auth_errors_per_sec": 0, 65 | "cmd_set_per_sec": 19.97, 66 | "bytes_written_per_sec": 6309.17, 67 | "get_flushed_per_sec": 0, 68 | "delete_hits_per_sec": 0, 69 | "cmd_flush_per_sec": 0, 70 | "curr_items": 37217768, 71 | "decr_hits_per_sec": 0, 72 | "connections_per_sec": 0.02, 73 | "cas_misses_per_sec": 0, 74 | "cmd_touch_per_sec": 0, 75 | "bytes": 3902170728, 76 | "evictions_per_sec": 0, 77 | "auth_cmds_per_sec": 0, 78 | "get_misses_per_sec": 19.97 79 | } 80 | 81 | 82 | .. 
_Memcached documentation: https://lzone.de/cheat-sheet/memcached 83 | .. _Memcached STATS command: https://lzone.de/cheat-sheet/memcached#stats 84 | -------------------------------------------------------------------------------- /docs/user/check-ref/history_wrapper.rst: -------------------------------------------------------------------------------- 1 | History 2 | -------- 3 | 4 | Wrapper for KairosDB to access history data about checks. 5 | 6 | 7 | .. py:function:: history(url=None, check_id='', entities=None, oauth2=False) 8 | 9 | 10 | Methods of History 11 | ^^^^^^^^^^^^^^^^^^^ 12 | 13 | .. py:function:: result(time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK) 14 | 15 | Return query result. 16 | 17 | :param time_from: Relative time from in seconds. Default is ``ONE_WEEK_AND_5MIN``. 18 | :type time_from: int 19 | 20 | :param time_to: Relative time to in seconds. Default is ``ONE_WEEK``. 21 | :type time_to: int 22 | 23 | :return: JSON result 24 | :rtype: dict 25 | 26 | .. py:function:: get_one(time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK) 27 | 28 | Return first result values. 29 | 30 | :param time_from: Relative time from in seconds. Default is ``ONE_WEEK_AND_5MIN``. 31 | :type time_from: int 32 | 33 | :param time_to: Relative time to in seconds. Default is ``ONE_WEEK``. 34 | :type time_to: int 35 | 36 | :return: List of values 37 | :rtype: list 38 | 39 | .. py:function:: get_aggregated(key, aggregator, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK) 40 | 41 | Return first result values. If no ``key`` filtering matches, an empty list is returned. 42 | 43 | :param key: Tag key used in filtering the results. 44 | :type key: str 45 | 46 | :param aggregator: Aggregator used in query (e.g. 'avg'). 47 | :type aggregator: str 48 | 49 | :param time_from: Relative time from in seconds. Default is ``ONE_WEEK_AND_5MIN``. 50 | :type time_from: int 51 | 52 | :param time_to: Relative time to in seconds. Default is ``ONE_WEEK``.
53 | :type time_to: int 54 | 55 | :return: List of values 56 | :rtype: list 57 | 58 | .. py:function:: get_avg(key, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK) 59 | 60 | Return aggregated average. 61 | 62 | :param key: Tag key used in filtering the results. 63 | :type key: str 64 | 65 | :param time_from: Relative time from in seconds. Default is ``ONE_WEEK_AND_5MIN``. 66 | :type time_from: int 67 | 68 | :param time_to: Relative time to in seconds. Default is ``ONE_WEEK``. 69 | :type time_to: int 70 | 71 | :return: List of values 72 | :rtype: list 73 | 74 | .. py:function:: get_std_dev(key, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK) 75 | 76 | Return aggregated standard deviation. 77 | 78 | :param key: Tag key used in filtering the results. 79 | :type key: str 80 | 81 | :param time_from: Relative time from in seconds. Default is ``ONE_WEEK_AND_5MIN``. 82 | :type time_from: int 83 | 84 | :param time_to: Relative time to in seconds. Default is ``ONE_WEEK``. 85 | :type time_to: int 86 | 87 | :return: List of values 88 | :rtype: list 89 | 90 | .. py:function:: distance(self, weeks=4, snap_to_bin=True, bin_size='1h', dict_extractor_path='') 91 | 92 | For detailed docs on distance function please see :ref:`History distance functionality ` . 93 | -------------------------------------------------------------------------------- /docs/user/entities.rst: -------------------------------------------------------------------------------- 1 | .. _entities: 2 | 3 | ******** 4 | Entities 5 | ******** 6 | 7 | Entities describe what you want to monitor in your infrastructure. 8 | This can be as basic as a host, with its attributes hostname and IP; or something more complex, like a PostgreSQL sharded cluster with its identifier and set of connection strings. 9 | 10 | ZMON gives you two options for automation in/integration with your platform: storing entities via zmon-controller_'s entity service, or discovering them via the adapters in zmon-scheduler_.
11 | At Zalando we use both, connecting ZMON to tools like our CMDB but also pushing entities via the REST API. 12 | 13 | ZMON's entity service describes entities with a single JSON document. 14 | 15 | - Any entity must contain an ID that is unique within your ZMON deployment. We often use a pattern like ``(:)`` to create uniqueness at the host and application levels, but this is up to you. 16 | - Any entity must contain a type which describes the kind of entity, like an object class. 17 | 18 | At check execution we bind entity properties as default values to the functions executed, e.g. the IP gets used for relative ``http()`` requests. 19 | 20 | Format 21 | ------ 22 | 23 | Generally, a ZMON entity is a set of properties that can be represented as a multi-level dictionary. For example: 24 | 25 | .. code-block:: json 26 | 27 | { 28 | "id":"arbitrary_entity_id", 29 | "type":"some_type", 30 | "oneMoreProperty":"foo", 31 | "nestedProperty": { 32 | "subProperty1": "foo", 33 | "subProperty2": "bar" 34 | } 35 | } 36 | Two notes to keep in mind: 37 | 38 | 1. ``id`` and ``type`` properties are **mandatory**. 39 | 2. ZMON filtering (e.g. in ZMON UI) **does not support nested properties**. 40 | 41 | 42 | Examples 43 | -------- 44 | 45 | When working with the Vagrant Box, you can use the scheduler instance entity like this: 46 | 47 | .. code-block:: json 48 | 49 | { 50 | "id":"localhost:3421", 51 | "type":"instance", 52 | "host":"localhost", 53 | "project":"zmon-scheduler-ng", 54 | "ports": {"3421":3421} 55 | } 56 | 57 | Here, you can use the "ports" dictionary to also describe additional open ports. 58 | As with Spring Boot, a second port is usually added, exposing management features. 59 | 60 | Now let's look at an example of a PostgreSQL instance: 61 | 62 | ..
code-block:: json 63 | 64 | { 65 | "id":"localhost:5432", 66 | "type":"database", 67 | "name":"zmon-cluster", 68 | "shards": {"zmon":"localhost:5432/local_zmon_db"} 69 | } 70 | 71 | The use of the "shards" property follows from how ZMON's worker exposes PostgreSQL clusters to the sql() function. 72 | 73 | View more examples here_. 74 | 75 | If you'd like to create an entity yourself, check the `ZMON CLI tool`_. 76 | 77 | .. _zmon-controller: https://github.com/zalando-zmon/zmon-controller 78 | .. _zmon-scheduler: https://github.com/zalando-zmon/zmon-scheduler 79 | .. _here: https://github.com/zalando-zmon/zmon-demo/tree/master/bootstrap/entities 80 | .. _ZMON CLI tool: https://docs.zmon.io/en/latest/developer/zmon-cli.html#entities 81 | -------------------------------------------------------------------------------- /docs/user/tv-login.rst: -------------------------------------------------------------------------------- 1 | .. _tv-login: 2 | 3 | ************************* 4 | "Read Only" Display Login 5 | ************************* 6 | 7 | The ZMON front end requires users to log in. 8 | However, a very common way of deploying dashboards is on TV screens running across office spaces to e.g. render Grafana or ZMON dashboards. 9 | For this, ZMON provides you with a way to log in a read-only authenticated user via one-time tokens. 10 | 11 | Those tokens can be created by any real user by logging in first and switching to TV mode, or via the ZMON CLI. 12 | 13 | How does it work 14 | ================ 15 | 16 | The first time a valid one-time token is used to log in, we associate a random UUID with it and the device IP. 17 | Both are registered within ZMON to create a persisted session, thus this will continue to work after the frontend gets deployed. 18 | 19 | Tokens can't be reused. Once a token has been used, you need to create a new one. You'll need a different token per additional 20 | device or location. One-time token sessions will last up to 365 days.
21 | 22 | 23 | Using the menu option 24 | +++++++++++++++++++++ 25 | 26 | First you need to log in using your own personal credentials or Single Sign-On mechanism. After logging in you can use the top right 27 | drop-down menu with your username to reveal the "Switch to TV mode" option. 28 | 29 | .. image:: /images/switch-tv-mode.png 30 | 31 | Clicking this option will replace your login session with a new session using a newly created one-time token, but your personal session 32 | will still be valid! You must log out before leaving the device unattended. 33 | 34 | A pop-up dialog will ask you to take action. If you decide to log out, a new tab will open to log you out. You can safely 35 | close this tab after successful logout and return to ZMON, which will now be in TV mode. 36 | 37 | For more information on the Logout URL, please check :doc:`/installation/configuration`. 38 | 39 | .. image:: /images/tv-mode-logout-dialog.png 40 | 41 | You'll be able to confirm by checking the username in the drop-down menu where your username used to be present. There will be a new username with 42 | the pattern "ZMON_TV_123abc". 43 | 44 | .. image:: /images/tv-mode.png 45 | 46 | After this you can leave the device safely unattended. TV mode allows only read access to ZMON. 47 | 48 | Using the ZMON CLI 49 | ++++++++++++++++++ 50 | 51 | You can also generate one-time tokens using the command line tool. The tool also allows you to list which tokens you already generated. 52 | 53 | Getting a token 54 | =============== 55 | 56 | .. code-block:: bash 57 | 58 | zmon onetime-token get 59 | 60 | Retrieving new one-time token ... 61 | https://zmon.example.org/tv/AocciOWf/ 62 | OK 63 | 64 | 65 | Login with token 66 | ================ 67 | 68 | Use the URL in the target browser to log in directly. This will create a read-only session. 69 | 70 | .. code-block:: bash 71 | 72 | https:///tv/ 73 | 74 | .. note:: 75 | 76 | Please make sure you access the generated URL in order to log in.
Appending the to any other ZMON device or location won't work. 77 | 78 | Listing existing tokens 79 | ======================= 80 | 81 | .. code-block:: bash 82 | 83 | zmon onetime-token list 84 | 85 | - bound_at: 2008-05-08 12:16:21.696000 86 | bound_expires: 1234567800000 87 | bound_ip: '' 88 | created: 2008-05-08 12:16:20.533000 89 | token: 1234abCD 90 | -------------------------------------------------------------------------------- /docs/user/alert-definition-parameters.rst: -------------------------------------------------------------------------------- 1 | .. _alert-definition-parameters: 2 | 3 | 4 | Alert Definition Parameters 5 | --------------------------- 6 | 7 | Alert definition *parameters* allow you to decouple the alert condition from the constants that are used inside it. 8 | 9 | Use Case: Technical alert condition 10 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 11 | 12 | If your alert condition is highly technical with a lot of Python code in it, it often makes sense to separate the actual calculation from the threshold values and move such constant values into parameters. 13 | 14 | The same may apply in certain cases to alert definitions created by technical staff, which later need to be adjusted by non-technical people. If you separate the calculation from the variable definitions, non-technical people can change values without touching the calculation logic. 15 | 16 | Use Case: Same alert, different priorities 17 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 18 | 19 | Another use case where we recommend using parameters is when you need to have the same alert come up with a different priority depending on threshold values. 20 | 21 | In such a case, refer to :ref:`alert inheritance ` for configuring inherited alerts.
22 | 23 | The proposed structure would look like: 24 | 25 | * Base alert "A" with alert condition and parameters, check *template* box 26 | * Alert "B1" inherits from "A" specifying *priority* RED and associated parameter values 27 | * Alert "B2" inherits from "A" specifying *priority* YELLOW and associated parameter values 28 | 29 | An example: Setting a simple parameter in trial run 30 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 31 | 32 | In the zmon2 web interface click on the trial run button. 33 | 34 | 1. In the **Check Command** text box enter:: 35 | 36 | normalvariate(50, 20) 37 | 38 | This draws from a normal distribution with mean 50 and standard deviation 20, so it produces a float above 50.0 about half of the time, which makes it handy for testing. 39 | 40 | 2. In the **Alert Condition** enter:: 41 | 42 | value>capture(threshold=threshold) + len(capture(params=params)) 43 | 44 | 3. In the **Parameters** selector enter two values (by clicking the plus sign): 45 | 46 | +------------+------------+-----------+ 47 | | Name | Value | Type | 48 | +============+============+===========+ 49 | | threshold | 50.0 | Float | 50 | +------------+------------+-----------+ 51 | | anything | Kartoffel | String | 52 | +------------+------------+-----------+ 53 | 54 | 4. In the **Entity Filter** text box enter: 55 | 56 | .. code-block:: json 57 | 58 | [ 59 | { 60 | "type": "GLOBAL" 61 | } 62 | ] 63 | 64 | 5. In the **Interval** enter: 10 65 | 66 | If you run this Trial you can get an Alert or an 'OK', but the interesting thing will be in the **Captures** column. 67 | See how the parameters that you entered are evaluated in the alert condition with the value that you provided. 68 | Notice also that there is a special parameter called **params** that holds a dict with all the parameters that you entered. This lets you iterate over all the parameters and make conditional decisions, providing a kind of introspection capability, but it is only for advanced users.
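As a rough illustration of the behaviour described above, the following Python sketch mimics how parameters could be made visible to an alert condition. The helper ``evaluate_condition`` is hypothetical and is not part of ZMON; the sketch only assumes that each parameter becomes a variable of the same name and that all parameters are also collected in the ``params`` dict.

```python
# Hypothetical sketch, not ZMON's actual implementation: it assumes each
# parameter is exposed as a variable and that all of them are also
# collected in a dict named ``params``.

def evaluate_condition(condition, value, parameters):
    """Evaluate an alert condition string against a check value."""
    namespace = dict(parameters)            # e.g. threshold -> 50.0
    namespace["params"] = dict(parameters)  # special dict with every parameter
    namespace["value"] = value              # the check result
    namespace["len"] = len                  # the only builtin this sketch exposes
    return bool(eval(condition, {"__builtins__": {}}, namespace))


parameters = {"threshold": 50.0, "anything": "Kartoffel"}

# threshold (50.0) + len(params) (2) gives an effective limit of 52.0
alert_raised = evaluate_condition("value > threshold + len(params)", 60.0, parameters)
alert_ok = evaluate_condition("value > threshold + len(params)", 51.0, parameters)
print(alert_raised, alert_ok)
```

With the two parameters from the table, the effective limit is 52.0 rather than 50.0, because ``len(params)`` contributes 2.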
69 | 70 | Last but not least: *Most of the time you don't need to capture the parameter values*. We did it like this so you can visually see that the parameters are evaluated. This means that you can run exactly the same check with this **Alert Condition**:: 71 | 72 | value>threshold + len(params) 73 | -------------------------------------------------------------------------------- /docs/user/grafana.rst: -------------------------------------------------------------------------------- 1 | .. _grafana: 2 | 3 | ********************* 4 | Grafana3 and KairosDB 5 | ********************* 6 | 7 | Grafana is a powerful open-source tool for creating dashboards to visualize metric data. 8 | ZMON deploys Grafana 3.x along with the new KairosDB plugin to read metric data from KairosDB. 9 | Grafana is served directly from the ZMON controller. 10 | Read requests are proxied through the controller so as not to expose the write/delete API from KairosDB. 11 | Dashboards are also saved via the controller, so there's no need for any additional data store. 12 | 13 | http://grafana.org 14 | 15 | Example of latency and requests charted via Grafana: 16 | 17 | .. image:: /images/grafana-example1.png 18 | 19 | Check data 20 | ========== 21 | 22 | Workers will send all their data to KairosDB. Depending on the KairosDB setting, data is stored forever or you may set a TTL in KairosDB. ZMON will not clean up or roll up any data. 23 | 24 | Serialization 25 | ------------- 26 | 27 | In the simplest case you would have a check producing a single numeric value. 28 | In Zalando's experience this is very rare. 29 | 30 | ZMON also supports arbitrarily nested dictionaries of numeric values. 31 | Anything that is not a dictionary or a number will be silently dropped. 32 | The value is flattened into a single-level dictionary such that the elements can be stored in KairosDB (key-value storage). 33 | 34 | ..
code-block:: json 35 | 36 | { 37 | "load": {"1min":1,"5min":3,"15min":2}, 38 | "memory_free": 16000 39 | } 40 | 41 | This will be flattened to the equivalent of: 42 | 43 | .. code-block:: json 44 | 45 | { 46 | "load.1min": 1, 47 | "load.5min": 3, 48 | "load.15min": 2, 49 | "memory_free": 16000 50 | } 51 | 52 | You might also want to output a list. The simple workaround is to generate a dictionary whose 53 | keys are some identifier extracted from the elements. 54 | 55 | For example, transform this list: 56 | 57 | .. code-block:: json 58 | 59 | { 60 | "partitions": [ 61 | { 62 | "count": 2254839, 63 | "partition": "0", 64 | "stream_id": "55491eb8-3ccc-40c5-b7c6-69bf38df3e16" 65 | }, 66 | { 67 | "count": 2029956, 68 | "partition": "1", 69 | "stream_id": "aa938451-d115-4e90-a5da-1ac4b435a4e9" 70 | }, 71 | 72 | into the following dictionary: 73 | 74 | .. code-block:: json 75 | 76 | { 77 | "partitions": { 78 | "0": { 79 | "count": 2254839, 80 | "partition": "0", 81 | "stream_id": "55491eb8-3ccc-40c5-b7c6-69bf38df3e16" 82 | }, 83 | "1": { 84 | "count": 2029956, 85 | "partition": "1", 86 | "stream_id": "aa938451-d115-4e90-a5da-1ac4b435a4e9" 87 | }, 88 | 89 | This will be stored the same way as the following value (remember that strings are dropped): 90 | 91 | .. code-block:: json 92 | 93 | { 94 | "partitions.0.count": 2254839, 95 | "partitions.1.count": 2029956 96 | } 97 | 98 | Tagging 99 | ------- 100 | 101 | KairosDB creates time series with a name and allows us to tag data points with additional (tagname, tagvalue) pairs. 102 | 103 | ZMON stores all data for a single check in a time series named: "zmon.check.". 104 | 105 | Single data points are then tagged as follows to describe their contents: 106 | 107 | * entity: entity instance id (some character replace rules are applied) 108 | * key: containing the dict key after serialization of check value (see above) 109 | * metric: contains the last segment of "key" split by "."
(making selection easier in tooling) 110 | * application: the application label attribute of the entity 111 | -------------------------------------------------------------------------------- /docs/developer/zmon-cli.rst: -------------------------------------------------------------------------------- 1 | .. _zmon-cli: 2 | 3 | ******************* 4 | Command Line Client 5 | ******************* 6 | 7 | The command line client makes your life easier when interacting with the REST API. The ZMON scheduler will refresh modified data (checks, alerts, entities) every 60 seconds. 8 | 9 | Installation 10 | ------------ 11 | 12 | .. code-block:: bash 13 | 14 | pip3 install --upgrade zmon-cli 15 | 16 | Configuration 17 | ^^^^^^^^^^^^^ 18 | 19 | Configure your ZMON CLI by running ``configure``: 20 | 21 | .. code-block:: bash 22 | 23 | zmon configure 24 | 25 | Authentication 26 | ^^^^^^^^^^^^^^ 27 | 28 | The ZMON CLI tool must authenticate against ZMON. Internally it uses zign to obtain an access token, but you can override that behaviour by exporting the variable ``ZMON_TOKEN``. 29 | 30 | .. code-block:: bash 31 | 32 | export ZMON_TOKEN=myfancytoken 33 | 34 | If you are using GitHub for authentication, have an unprivileged personal access token ready. 35 | 36 | Entities 37 | -------- 38 | .. _cli-entities: 39 | 40 | Create or update 41 | ^^^^^^^^^^^^^^^^ 42 | Pushing entities with the ZMON CLI is as easy as: 43 | 44 | .. code-block:: bash 45 | 46 | zmon entities push \ 47 | '{"id":"localhost:3421","type":"instance","name":"zmon-scheduler-ng","host":"localhost","ports":{"3421":3421}}' 48 | 49 | Existing entities with the same ID will be updated. 50 | 51 | The client also supports loading data from .json and .yaml files; both may contain a list for creating/updating many entities at once. 52 | 53 | .. code-block:: bash 54 | 55 | zmon entities push your-entities.yaml 56 | 57 | 58 | .. Note:: 59 | Creating an entity of type GLOBAL is not allowed.
GLOBAL as an entity type is reserved for ZMON's internal use. 60 | 61 | 62 | .. Tip:: 63 | 64 | All commands and subcommands can be abbreviated, i.e. the following lines are equivalent: 65 | 66 | .. code-block:: bash 67 | 68 | $ zmon entities push my-data.yaml 69 | $ zmon ent pu my-data.yaml 70 | 71 | Search and filter 72 | ^^^^^^^^^^^^^^^^^ 73 | 74 | Show all entities: 75 | 76 | .. code-block:: bash 77 | 78 | zmon entities 79 | 80 | Filter by type "instance": 81 | 82 | .. code-block:: bash 83 | 84 | zmon entities filter type instance 85 | 86 | 87 | Check Definitions 88 | ----------------- 89 | .. _cli-cd: 90 | 91 | Initializing 92 | ^^^^^^^^^^^^ 93 | 94 | When starting from scratch use: 95 | 96 | .. code-block:: bash 97 | 98 | zmon check-definition init your-new-check.yaml 99 | 100 | 101 | Get 102 | ^^^ 103 | 104 | Retrieve an existing check definition as YAML. 105 | 106 | .. code-block:: bash 107 | 108 | zmon check-definition get 1234 109 | 110 | Create and Update 111 | ^^^^^^^^^^^^^^^^^ 112 | 113 | Create or update from a file. An existing check with the same "owning_team" and "name" will be updated. 114 | 115 | .. code-block:: bash 116 | 117 | zmon check-definition update your-check.yaml 118 | 119 | Alert Definitions 120 | ----------------- 121 | 122 | Similar to check definitions, you can also manage your alert definitions via the ZMON CLI. 123 | 124 | Keep in mind that for alerts the same constraints apply as in the UI. For creating/modifying an alert you need to be a member of the team selected for "team" (unlike the responsible team). 125 | 126 | Init 127 | ^^^^ 128 | 129 | .. code-block:: bash 130 | 131 | zmon alert-definition init your-new-alert.yaml 132 | 133 | Create 134 | ^^^^^^ 135 | 136 | .. code-block:: bash 137 | 138 | zmon alert-definition create your-new-alert.yaml 139 | 140 | Get 141 | ^^^ 142 | 143 | .. code-block:: bash 144 | 145 | zmon alert-definition get 1999 146 | 147 | Update 148 | ^^^^^^ 149 | 150 | ..
code-block:: bash 151 | 152 | zmon alert-definition update host-load-5.yaml 153 | -------------------------------------------------------------------------------- /docs/user/check-ref/s3_wrapper.rst: -------------------------------------------------------------------------------- 1 | S3 2 | --- 3 | 4 | Allows data to be pulled from S3 Objects. 5 | 6 | 7 | .. py:function:: s3() 8 | 9 | 10 | Methods of S3 11 | ^^^^^^^^^^^^^^ 12 | 13 | .. py:function:: get_object_metadata(bucket_name, key) 14 | 15 | Get the metadata associated with the given ``bucket_name`` and ``key``. The metadata allows you to check for the 16 | existence of the key within the bucket and to check how large the object is without reading the whole object into 17 | memory. 18 | 19 | :param bucket_name: the name of the S3 Bucket 20 | :param key: the key that identifies the S3 Object within the S3 Bucket 21 | :return: an ``S3ObjectMetadata`` object 22 | 23 | .. py:class:: S3ObjectMetadata 24 | 25 | .. py:method:: exists() 26 | 27 | Will return True if the object exists. 28 | 29 | .. py:method:: size() 30 | 31 | Returns the size in bytes for the object. Will return -1 for objects that do not exist. 32 | 33 | Example usage: 34 | 35 | .. code-block:: python 36 | 37 | s3().get_object_metadata('my bucket', 'mykeypart1/mykeypart2').exists() 38 | s3().get_object_metadata('my bucket', 'mykeypart1/mykeypart2').size() 39 | 40 | 41 | .. py:function:: get_object(bucket_name, key) 42 | 43 | Get the S3 Object associated with the given ``bucket_name`` and ``key``. This method will cause the object to be 44 | read into memory. 45 | 46 | :param bucket_name: the name of the S3 Bucket 47 | :param key: the key that identifies the S3 Object within the S3 Bucket 48 | :return: an ``S3Object`` object 49 | 50 | .. py:class:: S3Object 51 | 52 | .. py:method:: text() 53 | 54 | Get the S3 Object data. 55 | 56 | .. py:method:: json() 57 | 58 | If the object exists, parse the object as JSON.
59 | 60 | :return: a dict containing the parsed JSON or None if the object does not exist. 61 | 62 | .. py:method:: exists() 63 | 64 | Will return True if the object exists. 65 | 66 | .. py:method:: size() 67 | 68 | Returns the size in bytes for the object. Will return -1 for objects that do not exist. 69 | 70 | Example usage: 71 | 72 | .. code-block:: python 73 | 74 | s3().get_object('my bucket', 'mykeypart1/my_text_doc.txt').text() 75 | 76 | s3().get_object('my bucket', 'mykeypart1/my_json_doc.json').json() 77 | 78 | 79 | .. py:function:: list_bucket(bucket_name, prefix, max_items=100, recursive=True) 80 | 81 | List the S3 Object associated with the given ``bucket_name``, matching ``prefix``. 82 | By default, listing is possible for up to 1000 keys, so we use pagination internally to overcome this. 83 | 84 | :param bucket_name: the name of the S3 Bucket 85 | :param prefix: the prefix to search under 86 | :param max_items: the maximum number of objects to list. Defaults to 100. 87 | :param recursive: if the listing should contain deeply nested keys. Defaults to True. 88 | :return: an ``S3FileList`` object 89 | 90 | .. py:class:: S3FileList 91 | 92 | .. py:method:: files() 93 | 94 | Returns a list of dicts like 95 | 96 | .. code-block:: json 97 | 98 | { 99 | "file_name": "foo", 100 | "size": 12345, 101 | "last_modified": "2017-07-17T01:01:21Z" 102 | } 103 | 104 | Example usage: 105 | 106 | .. 
code-block:: python 107 | 108 | s3().list_bucket('my bucket', 'some_prefix').files() 109 | 110 | files = s3().list_bucket('my bucket', 'some_prefix', 10000).files() # for listing a lot of keys 111 | last_modified = files[0]["last_modified"].isoformat() # returns a string that can be passed to time() 112 | age = time() - time(last_modified) 113 | -------------------------------------------------------------------------------- /docs/user/check-ref/scalyr_wrapper.rst: -------------------------------------------------------------------------------- 1 | Scalyr 2 | ------ 3 | 4 | Wrapper 5 | ^^^^^^^ 6 | 7 | The ``scalyr()`` wrapper enables querying Scalyr from your AWS worker if the credentials have been specified for the worker instance(s). 8 | For more description of each type of query, please refer to https://www.scalyr.com/help/api . 9 | 10 | Default parameters: 11 | 12 | * ``minutes`` specifies the start time of the query. I.e. "5" will mean 5 minutes ago. 13 | * ``end`` specifies the end time of the query. I.e. "2" will mean until 2 minutes ago. If set to ``None``, then the end is set to 24h after ``minutes``. The default "0" means `now`. 14 | For ``minutes`` and ``end`` you can also specify absolute times like "2017-10-11T10:45:00+0800". 15 | 16 | .. py:method:: count(query, minutes=5, end=0) 17 | 18 | Run a count query against Scalyr, depending on number of queries you may run into rate limit. 19 | 20 | 21 | :: 22 | 23 | scalyr().count(' ERROR ') 24 | 25 | 26 | .. py:method:: timeseries(query, minutes=30, end=0) 27 | 28 | Runs a timeseries query against Scalyr with more generous rate limits. (New time series are created on the fly by Scalyr) 29 | 30 | 31 | .. py:method:: facets(filter, field, max_count=5, minutes=30, end=0) 32 | 33 | This method is used to retrieve the most common values for a field. 34 | 35 | 36 | .. 
py:method:: logs(query, max_count=100, minutes=5, continuation_token=None, columns=None, end=0) 37 | 38 | Runs a query against Scalyr and returns logs that match the query. At most ``max_count`` log lines will be returned. 39 | More can be fetched with the same query by passing back the continuation_token from the last response into the 40 | logs method. 41 | 42 | Specific columns can be returned (as defined in the Scalyr parser) using the columns array, e.g. ``columns=['severity','threadName','timestamp']``. 43 | If this is unspecified, only the message column will be returned. 44 | 45 | An example logs result as JSON: 46 | 47 | .. code-block:: json 48 | 49 | { 50 | "messages": [ 51 | "message line 1", 52 | "message line 2" 53 | ], 54 | "continuation_token": "a token" 55 | } 56 | 57 | 58 | .. py:method:: power_query(query, minutes=5, end=0) 59 | 60 | Runs a power query against Scalyr and returns the results as the response. You can also create and test power queries via the `UI <https://eu.scalyr.com/query>`_. More information on power queries can be found `here <https://eu.scalyr.com/help/power-queries>`_. 61 | 62 | An example response as JSON: 63 | 64 | .. code-block:: json 65 | 66 | { 67 | "columns": [ 68 | { 69 | "name": "cluster" 70 | }, 71 | { 72 | "name": "application" 73 | }, 74 | { 75 | "name": "volume" 76 | } 77 | ], 78 | "warnings": [], 79 | "values": [ 80 | [ 81 | "cluster-1-eu-central-1:kube-1", 82 | "application-2", 83 | 9481810.0 84 | ], 85 | [ 86 | "cluster-2-eu-central-1:kube-1", 87 | "application-1", 88 | 8109726.0 89 | ] 90 | ], 91 | "matchingEvents": 8123.0, 92 | "status": "success", 93 | "omittedEvents": 0.0 94 | } 95 | 96 | 97 | Custom Scalyr Region 98 | ^^^^^^^^^^^^^^^^^^^^ 99 | 100 | By default the Scalyr wrapper uses https://www.scalyr.com/ as its region. Overriding is possible using ``scalyr(scalyr_region='eu')`` if you want to use their Europe environment https://eu.scalyr.com/.
101 | 102 | 103 | :: 104 | 105 | scalyr(scalyr_region='eu').count(' ERROR ') 106 | -------------------------------------------------------------------------------- /docs/user/check-ref/kairosdb_wrapper.rst: -------------------------------------------------------------------------------- 1 | .. _check-kairosdb: 2 | 3 | KairosDB 4 | -------- 5 | 6 | Provides read access to the target KairosDB. 7 | 8 | 9 | .. py:function:: kairosdb(url, oauth2=False) 10 | 11 | 12 | Methods of KairosDB 13 | ^^^^^^^^^^^^^^^^^^^ 14 | 15 | .. py:function:: query(name, group_by = None, tags = None, start = -5, end = 0, time_unit='seconds', aggregators = None, start_absolute = None, end_absolute = None) 16 | 17 | Query KairosDB. 18 | 19 | :param name: Metric name. 20 | :type name: str 21 | 22 | :param group_by: List of fields to group by. 23 | :type group_by: list 24 | 25 | :param tags: Filtering tags. Example of `tags` object: 26 | 27 | .. code-block:: python 28 | 29 | { 30 | "key": ["max"] 31 | } 32 | 33 | :type tags: dict 34 | 35 | :param start: Relative start time. Default is 5. Should be greater than or equal to 1. 36 | :type start: int 37 | 38 | :param end: End time. Default is 0. If not 0, then it should be greater than or equal to 1. 39 | :type end: int 40 | 41 | :param time_unit: Time unit ('seconds', 'minutes', 'hours'). Default is 'minutes'. 42 | :type time_unit: str 43 | 44 | :param aggregators: List of aggregators. Aggregator is an object that looks like 45 | 46 | .. code-block:: python 47 | 48 | { 49 | "name": "max", 50 | "sampling": { 51 | "value": "1", 52 | "unit": "minutes" 53 | }, 54 | "align_sampling": True 55 | } 56 | 57 | :type aggregators: list 58 | 59 | :param start_absolute: Absolute start time in milliseconds, overrides the start parameter which is relative 60 | :type start_absolute: long 61 | 62 | :param end_absolute: Absolute end time in milliseconds, overrides the end parameter which is relative 63 | :type end_absolute: long 64 | 65 | :return: Result queries.
66 | :rtype: dict 67 | 68 | 69 | .. py:function:: query_batch(self, metrics, start=5, end=0, time_unit='minutes', start_absolute=None, end_absolute=None) 70 | 71 | Query KairosDB for several checks at once. 72 | 73 | :param metrics: list of KairosDB metric queries, one query per metric name, e.g. 74 | 75 | .. code-block:: python 76 | 77 | [ 78 | { 79 | 'name': 'metric_name', # name of the metric 80 | 'group_by': ['foo'], # list of fields to group by 81 | 'aggregators': [ # list of aggregator objects 82 | { # structure of a single aggregator 83 | 'name': 'max', 84 | 'sampling': { 85 | 'value': '1', 86 | 'unit': 'minutes' 87 | }, 88 | 'align_sampling': True 89 | } 90 | ], 91 | 'tags': { # dict with filtering tags 92 | 'key': ['max'] # a key is a tag name, list of values is used to filter 93 | # all the records with given tag and given values 94 | } 95 | } 96 | ] 97 | 98 | :type metrics: list 99 | 100 | :param start: Relative start time. Default is 5. 101 | :type start: int 102 | 103 | :param end: End time. Default is 0. 104 | :type end: int 105 | 106 | :param time_unit: Time unit ('seconds', 'minutes', 'hours'). Default is 'minutes'. 107 | :type time_unit: str 108 | 109 | :param start_absolute: Absolute start time in milliseconds, overrides the start parameter which is relative 110 | :type start_absolute: long 111 | 112 | :return: Array of results for each queried metric 113 | :rtype: list 114 | -------------------------------------------------------------------------------- /docs/user/check-ref/snmp_wrapper.rst: -------------------------------------------------------------------------------- 1 | SNMP 2 | ---- 3 | 4 | Provides a wrapper for SNMP functions listed below. SNMP checks require 5 | specifying hosts in the entities filter. The partial object `snmp()` accepts a 6 | `timeout=seconds` parameter; the default timeout is 5 seconds. **NOTE**: this timeout 7 | is per answer, so multiple answers will add up and may block the whole check. 8 | 9 | ..
py:method:: memory() 10 | 11 | :: 12 | 13 | snmp().memory() 14 | 15 | Returns host's memory usage statistics. All values are in KiB (1024 Bytes). 16 | 17 | Example check result as JSON: 18 | 19 | .. code-block:: json 20 | 21 | { 22 | "ram_buffer": 359404, 23 | "ram_cache": 6478944, 24 | "ram_free": 20963524, 25 | "ram_shared": 0, 26 | "ram_total": 37066332, 27 | "ram_total_free": 22963392, 28 | "swap_free": 1999868, 29 | "swap_min": 16000, 30 | "swap_total": 1999868 31 | } 32 | 33 | .. py:method:: load() 34 | 35 | :: 36 | 37 | snmp().load() 38 | 39 | Returns host's CPU load average (1 minute, 5 minute and 15 minute averages). 40 | 41 | Example check result as JSON: 42 | 43 | .. code-block:: json 44 | 45 | {"load1": 0.95, "load5": 0.69, "load15": 0.72} 46 | 47 | .. py:method:: cpu() 48 | 49 | :: 50 | 51 | snmp().cpu() 52 | 53 | Returns host's CPU usage in percent. 54 | 55 | Example check result as JSON: 56 | 57 | .. code-block:: json 58 | 59 | {"cpu_system": 0, "cpu_user": 17, "cpu_idle": 81} 60 | 61 | 62 | .. py:method:: df() 63 | 64 | :: 65 | 66 | snmp().df() 67 | 68 | Example check result as JSON: 69 | 70 | .. code-block:: json 71 | 72 | { 73 | "/data/postgres-wal-nfs-example": { 74 | "available_space": 524287840, 75 | "device": "example0-2-stp-123:/vol/example_pgwal", 76 | "percentage_inodes_used": 0, 77 | "percentage_space_used": 0, 78 | "total_size": 524288000, 79 | "used_space": 160 80 | } 81 | } 82 | 83 | .. py:method:: logmatch() 84 | 85 | :: 86 | 87 | snmp().logmatch() 88 | 89 | .. py:method:: interfaces() 90 | 91 | :: 92 | 93 | snmp().interfaces() 94 | 95 | Example check result as JSON: 96 | 97 | ..
code-block:: json 98 | 99 | { 100 | "lo": { 101 | "in_octets": 63481918397415, 102 | "in_discards": 11, 103 | "adStatus": 1, 104 | "out_octets": 63481918397415, 105 | "opStatus": 1, 106 | "out_discards": 0, 107 | "speed": "10", 108 | "in_error": 0, 109 | "out_error": 0 110 | }, 111 | "eth1": { 112 | "in_octets": 55238870608924, 113 | "in_discards": 8344, 114 | "adStatus": 1, 115 | "out_octets": 6801703429894, 116 | "opStatus": 1, 117 | "out_discards": 0, 118 | "speed": "10000", 119 | "in_error": 0, 120 | "out_error": 0 121 | }, 122 | "eth0": { 123 | "in_octets": 3538944286327, 124 | "in_discards": 1130, 125 | "adStatus": 1, 126 | "out_octets": 16706789573119, 127 | "opStatus": 1, 128 | "out_discards": 0, 129 | "speed": "10000", 130 | "in_error": 0, 131 | "out_error": 0 132 | } 133 | } 134 | 135 | .. py:method:: get() 136 | 137 | :: 138 | 139 | snmp().get('iso.3.6.1.4.1.42253.1.2.3.1.4.7.47.98.105.110.47.115.104', 'stunnel', int) 140 | 141 | Example check result as JSON: 142 | 143 | .. code-block:: json 144 | 145 | { 146 | "stunnel": 0 147 | } 148 | -------------------------------------------------------------------------------- /docs/user/check-ref/elastic_search_wrapper.rst: -------------------------------------------------------------------------------- 1 | Elasticsearch 2 | ------------- 3 | 4 | Provides search queries and health check against an Elasticsearch cluster. 5 | 6 | 7 | .. py:function:: elasticsearch(url=None, timeout=10, oauth2=False) 8 | 9 | .. note:: 10 | 11 | If ``url`` is **None**, then the plugin will use the default Elasticsearch cluster set in worker configuration. 12 | 13 | Methods of Elasticsearch 14 | ^^^^^^^^^^^^^^^^^^^^^^^^ 15 | 16 | .. py:function:: search(indices=None, q='', body=None, source=True, size=DEFAULT_SIZE) 17 | 18 | Search ES cluster using URI or Request body search. If ``body`` is None then GET request will be used. 19 | 20 | :param indices: List of indices to search. Limited to only 10 indices. 
['_all'] will search all available 21 | indices, which effectively leads to same results as `None`. Indices can accept wildcard form. 22 | :type indices: list 23 | 24 | :param q: Search query string. Will be ignored if ``body`` is not None. 25 | :type q: str 26 | 27 | :param body: Dict holding an ES query DSL. 28 | :type body: dict 29 | 30 | :param source: Whether to include `_source` field in query response. 31 | :type source: bool 32 | 33 | :param size: Number of hits to return. Maximum value is 1000. Set to 0 if interested in hits count only. 34 | :type size: int 35 | 36 | :return: ES query result. 37 | :rtype: dict 38 | 39 | Example query: 40 | 41 | .. code-block:: python 42 | 43 | elasticsearch('http://es-cluster').search(indices=['logstash-*'], q='client:192.168.20.* AND http_status:500', size=0, source=False) 44 | 45 | { 46 | "_shards": { 47 | "failed": 0, 48 | "successful": 5, 49 | "total": 5 50 | }, 51 | "hits": { 52 | "hits": [], 53 | "max_score": 0.0, 54 | "total": 1 55 | }, 56 | "timed_out": false, 57 | "took": 2 58 | } 59 | 60 | .. py:function:: count(indices=None, q='', body=None) 61 | 62 | Return ES count of matching query. 63 | 64 | :param indices: List of indices to search. Limited to only 10 indices. ['_all'] will search all available 65 | indices, which effectively leads to same results as `None`. Indices can accept wildcard form. 66 | :type indices: list 67 | 68 | :param q: Search query string. Will be ignored if ``body`` is not None. 69 | :type q: str 70 | 71 | :param body: Dict holding an ES query DSL. 72 | :type body: dict 73 | 74 | :return: ES query result. 75 | :rtype: dict 76 | 77 | Example query: 78 | 79 | .. code-block:: python 80 | 81 | elasticsearch('http://es-cluster').count(indices=['logstash-*'], q='client:192.168.20.* AND http_status:500') 82 | 83 | { 84 | "_shards": { 85 | "failed": 0, 86 | "successful": 16, 87 | "total": 16 88 | }, 89 | "count": 12 90 | } 91 | 92 | .. py:method:: health() 93 | 94 | Return ES cluster health. 
95 | 96 | :return: Cluster health result. 97 | :rtype: dict 98 | 99 | .. code-block:: python 100 | 101 | elasticsearch('http://es-cluster').health() 102 | 103 | { 104 | "active_primary_shards": 11, 105 | "active_shards": 11, 106 | "active_shards_percent_as_number": 50.0, 107 | "cluster_name": "big-logs-cluster", 108 | "delayed_unassigned_shards": 0, 109 | "initializing_shards": 0, 110 | "number_of_data_nodes": 1, 111 | "number_of_in_flight_fetch": 0, 112 | "number_of_nodes": 1, 113 | "number_of_pending_tasks": 0, 114 | "relocating_shards": 0, 115 | "status": "yellow", 116 | "task_max_waiting_in_queue_millis": 0, 117 | "timed_out": false, 118 | "unassigned_shards": 11 119 | } 120 | -------------------------------------------------------------------------------- /docs/user/check-ref/http_wrapper.rst: -------------------------------------------------------------------------------- 1 | HTTP 2 | ---- 3 | 4 | Access to HTTP (and HTTPS) endpoints is provided by the :py:func:`http` function. 5 | 6 | .. py:function:: http(url, [method='GET'], [timeout=10], [max_retries=0], [verify=True], [oauth2=False], [allow_redirects=None], [headers=None]) 7 | 8 | :param str url: The URL that is to be queried. See below for details. 9 | :param str method: The HTTP request method. Allowed values are ``GET`` or ``HEAD``. 10 | :param float timeout: The timeout for the HTTP request, in seconds. Defaults to :py:obj:`10`. 11 | :param int max_retries: The number of times the HTTP request should be retried if it fails. Defaults to :py:obj:`0`. 12 | :param bool verify: Can be set to :py:obj:`False` to disable SSL certificate verification. 13 | :param bool oauth2: Can be set to :py:obj:`True` to inject a OAuth 2 ``Bearer`` access token in the outgoing request 14 | :param str oauth2_token_name: The name of the OAuth 2 token. Default is ``uid``. 15 | :param bool allow_redirects: Follow request redirects. 
If ``None`` then it will be set to :py:obj:`True` in case of ``GET`` and :py:obj:`False` in case of ``HEAD`` request. 16 | :param dict headers: The headers to be used in the HTTP request. 17 | :return: An object encapsulating the response from the server. See below. 18 | 19 | For checks on entities that define the attributes :py:attr:`url` or :py:attr:`host`, the given URL may be relative. In that case, the URL :samp:`http://<{value}><{url}>` is queried, where :samp:`<{value}>` is the value of that attribute, and :samp:`<{url}>` is the URL passed to this function. If an entity defines both :py:attr:`url` and :py:attr:`host`, the former is used. 20 | 21 | This function cannot query URLs using a scheme other than HTTP or HTTPS; URLs that do not start with :samp:`http://` or :samp:`https://` are considered to be relative. 22 | 23 | Example: 24 | 25 | .. code-block:: python 26 | 27 | http('http://www.example.org/data?fetch=json').json() 28 | 29 | # avoid raising error in case the response error status (e.g. 500 or 503) 30 | # but you are interested in the response json 31 | http('http://www.example.org/data?fetch=json').json(raise_error=False) 32 | 33 | 34 | HTTP Responses 35 | ^^^^^^^^^^^^^^ 36 | 37 | The object returned by the :py:func:`http` function provides methods: :py:meth:`json`, :py:meth:`text`, :py:meth:`headers`, :py:meth:`cookies`, :py:meth:`content_size`, :py:meth:`time` and :py:meth:`code`. 38 | 39 | .. py:method:: json(raise_error=True) 40 | 41 | This method returns an object representing the content of the JSON response from the queried endpoint. Usually, this will be a map (represented by a Python :py:obj:`dict`), but, depending on the endpoint, it may also be a list, string, set, integer, floating-point number, or Boolean. 42 | 43 | .. 
py:method:: text(raise_error=True) 44 | 45 | Returns the text response from the queried endpoint:: 46 | 47 | http("/heartbeat.jsp", timeout=5).text().strip()=='OK: JVM is running' 48 | 49 | Since we’re using a relative url, this check has to be defined for 50 | specific entities (e.g. type=zomcat will run it on all zomcat 51 | instances). The strip function removes all leading and trailing 52 | whitespace. 53 | 54 | .. py:method:: headers(raise_error=True) 55 | 56 | Returns the response headers in a case-insensitive dict-like object:: 57 | 58 | http("/api/json", timeout=5).headers()['content-type']=='application/json' 59 | 60 | .. py:method:: cookies(raise_error=True) 61 | 62 | Returns the response cookies in a dict-like object:: 63 | 64 | http("/heartbeat.jsp", timeout=5).cookies()['my_custom_cookie'] == 'custom_cookie_value' 65 | 66 | .. py:method:: content_size(raise_error=True) 67 | 68 | Returns the length of the response content:: 69 | 70 | http("/heartbeat.jsp", timeout=5).content_size() > 1024 71 | 72 | .. py:method:: time(raise_error=True) 73 | 74 | Returns the elapsed time in seconds until response was received:: 75 | 76 | http("/heartbeat.jsp", timeout=5).time() > 1.5 77 | 78 | .. py:method:: code() 79 | 80 | Returns the HTTP status code from the queried endpoint:: 81 | 82 | http("/heartbeat.jsp", timeout=5).code() 83 | 84 | .. _http-actuator: 85 | 86 | .. py:method:: actuator_metrics(prefix='zmon.response.', raise_error=True) 87 | 88 | Parses the JSON result of a metrics endpoint into a map ep->method->status->metric:: 89 | 90 | http("/metrics", timeout=5).actuator_metrics() 91 | 92 | .. _http-prometheus: 93 | 94 | .. py:method:: prometheus() 95 | 96 | Parses the text response according to the Prometheus specs using their prometheus_client:: 97 | 98 | http("/metrics", timeout=5).prometheus() 99 | 100 | .. _http-prometheus_flat: 101 | 102 | .. 
py:method:: prometheus_flat() 103 | 104 | Parses the text response according to the Prometheus specs using their prometheus_client 105 | and flattens the outcome:: 106 | 107 | http("/metrics", timeout=5).prometheus_flat() 108 | 109 | .. _http-jolokia: 110 | 111 | .. py:method:: jolokia(read_requests, raise_error=False) 112 | 113 | Does a POST request to the endpoint given in the wrapper, validating the endpoint and setting 114 | the request to be read-only. 115 | 116 | :param read_requests: see https://jolokia.org/reference/html/protocol.html#post-request 117 | :type read_requests: list 118 | :param raise_error: bool 119 | :return: Jolokia response 120 | 121 | Example: 122 | 123 | .. code-block:: python 124 | 125 | requests = [ 126 | {'mbean': 'org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency'}, 127 | {'mbean': 'org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency'}, 128 | ] 129 | results = http('http://{}:8778/jolokia/'.format(entity['ip']), timeout=15).jolokia(requests) 130 | -------------------------------------------------------------------------------- /docs/user/check-ref/appdynamics_wrapper.rst: -------------------------------------------------------------------------------- 1 | AppDynamics 2 | ------------- 3 | 4 | Enables AppDynamics Healthrule violation checks and *optionally* queries the raw logs of the underlying Elasticsearch cluster. 5 | 6 | .. py:function:: appdynamics(url=None, username=None, password=None, es_url=None, index_prefix='') 7 | 8 | Initialize AppDynamics wrapper. 9 | 10 | :param url: AppDynamics url. 11 | :type url: str 12 | 13 | :param username: AppDynamics username. 14 | :type username: str 15 | 16 | :param password: AppDynamics password. 17 | :type password: str 18 | 19 | :param es_url: AppDynamics Elasticsearch cluster url. 20 | :type es_url: str 21 | 22 | :param index_prefix: AppDynamics Elasticsearch cluster logs index prefix. 23 | :type index_prefix: str 24 | 25 | .. 
note:: 26 | 27 | If ``username`` and ``password`` are not supplied, then OAUTH2 will be used. 28 | 29 | If ``appdynamics()`` is initialized with no args, then plugin configuration values will be used. 30 | 31 | Methods of AppDynamics 32 | ^^^^^^^^^^^^^^^^^^^^^^ 33 | 34 | .. py:function:: healthrule_violations(application, time_range_type=BEFORE_NOW, duration_in_mins=5, start_time=None, end_time=None, severity=None) 35 | 36 | Return Healthrule violations for AppDynamics application. 37 | 38 | :param application: Application name or ID 39 | :type application: str 40 | 41 | :param time_range_type: Valid time range type. Valid range types are BEFORE_NOW, BEFORE_TIME, AFTER_TIME and BETWEEN_TIMES. Default is BEFORE_NOW. 42 | :type time_range_type: str 43 | 44 | :param duration_in_mins: Time duration in mins. Required for BEFORE_NOW, AFTER_TIME, BEFORE_TIME range types. Default is 5 mins. 45 | :type duration_in_mins: int 46 | 47 | :param start_time: Start time (in milliseconds) from which the metric data is returned. Default is 5 mins ago. 48 | :type start_time: int 49 | 50 | :param end_time: End time (in milliseconds) until which the metric data is returned. Default is now. 51 | :type end_time: int 52 | 53 | :param severity: Filter results based on severity. Valid values are CRITICAL or WARNING. 54 | :type severity: str 55 | 56 | :return: List of healthrule violations 57 | :rtype: list 58 | 59 | Example query: 60 | 61 | .. code-block:: python 62 | 63 | appdynamics('https://appdynamics/controller/rest').healthrule_violations('49', time_range_type='BEFORE_NOW', duration_in_mins=5) 64 | 65 | [ 66 | { 67 | affectedEntityDefinition: { 68 | entityId: 408, 69 | entityType: "BUSINESS_TRANSACTION", 70 | name: "/error" 71 | }, 72 | detectedTimeInMillis: 0, 73 | endTimeInMillis: 0, 74 | id: 39637, 75 | incidentStatus: "OPEN", 76 | name: "Backend errrors (percentage)", 77 | severity: "CRITICAL", 78 | startTimeInMillis: 1462244635000, 79 | } 80 | ] 81 | 82 | .. 
py:function:: metric_data(application, metric_path, time_range_type=BEFORE_NOW, duration_in_mins=5, start_time=None, end_time=None, rollup=True) 83 | 84 | AppDynamics's metric-data API 85 | 86 | :param application: Application name or ID 87 | :type application: str 88 | 89 | :param metric_path: The path to the metric in the metric hierarchy 90 | :type metric_path: str 91 | 92 | :param time_range_type: Valid time range type. Valid range types are BEFORE_NOW, BEFORE_TIME, AFTER_TIME and 93 | BETWEEN_TIMES. Default is BEFORE_NOW. 94 | :type time_range_type: str 95 | 96 | :param duration_in_mins: Time duration in mins. Required for BEFORE_NOW, AFTER_TIME, BEFORE_TIME range types. 97 | :type duration_in_mins: int 98 | 99 | :param start_time: Start time (in milliseconds) from which the metric data is returned. Default is 5 mins ago. 100 | :type start_time: int 101 | 102 | :param end_time: End time (in milliseconds) until which the metric data is returned. Default is now. 103 | :type end_time: int 104 | 105 | :param rollup: By default, the values of the returned metrics are rolled up into a single data point 106 | (rollup=True). To get separate results for all values within the time range, set the 107 | ``rollup`` parameter to ``False``. 108 | :type rollup: bool 109 | 110 | :return: metric values for a metric 111 | :rtype: list 112 | 113 | .. py:function:: query_logs(q='', body=None, size=100, source_type=SOURCE_TYPE_APPLICATION_LOG, duration_in_mins=5) 114 | 115 | Perform search query on AppDynamics ES logs. 116 | 117 | :param q: Query string used in search. 118 | :type q: str 119 | 120 | :param body: (dict) holding an ES query DSL. 121 | :type body: dict 122 | 123 | :param size: Number of hits to return. Default is 100. 124 | :type size: int 125 | 126 | :param source_type: ``sourceType`` field filtering. Default to ``application-log``, and will be part of ``q``. 127 | :type source_type: str 128 | 129 | :param duration_in_mins: Duration in mins before current time. 
Default is 5 mins. 130 | :type duration_in_mins: int 131 | 132 | :return: ES query result ``hits``. 133 | :rtype: list 134 | 135 | .. py:function:: count_logs(q='', body=None, source_type=SOURCE_TYPE_APPLICATION_LOG, duration_in_mins=5) 136 | 137 | Perform count query on AppDynamics ES logs. 138 | 139 | :param q: Query string used in search. Will be ignored if ``body`` is not None. 140 | :type q: str 141 | 142 | :param body: (dict) holding an ES query DSL. 143 | :type body: dict 144 | 145 | :param source_type: ``sourceType`` field filtering. Defaults to ``application-log``, and will be part of ``q``. 146 | :type source_type: str 147 | 148 | :param duration_in_mins: Duration in mins before current time. Default is 5 mins. Will be ignored if ``body`` is not None. 149 | :type duration_in_mins: int 150 | 151 | :return: Query match count. 152 | :rtype: int 153 | 154 | .. note:: 155 | 156 | When passing an ES query DSL in ``body``, all filter parameters should be explicitly added to the query body (e.g. ``eventTimestamp``, ``application_id``, ``sourceType``). 157 | -------------------------------------------------------------------------------- /docs/user/check-ref/redis_wrapper.rst: -------------------------------------------------------------------------------- 1 | Redis 2 | ----- 3 | 4 | Read-only access to Redis servers is provided by the :py:func:`redis` function. 5 | 6 | 7 | .. py:function:: redis([port=6379], [db=0], [socket_connect_timeout=1], [socket_timeout=5], [ssl=False], 8 | [ssl_cert_reqs='required']) 9 | 10 | Returns a connection to the Redis server at :samp:`{host}:{port}`, where :samp:`{host}` is the value 11 | of the current entity's ``host`` attribute, and :samp:`{port}` is the given port (default ``6379``). See 12 | below for a list of methods provided by the returned connection object. 13 | 14 | :param host: Redis host. 
15 | :type host: str 16 | 17 | :param password: If set, enables authentication to the destination Redis server with the provided password. Default is None. 18 | :type password: str 19 | 20 | .. note:: 21 | 22 | If ``password`` param is not supplied, then plugin configuration values will be used. 23 | You can use ``plugin.redis.password`` to configure redis password authentication for zmon-worker. 24 | 25 | Please also have a look at the `Redis documentation`_. 26 | 27 | 28 | Methods of the Redis Connection 29 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 30 | 31 | The object returned by the :py:func:`redis` function provides the following methods: 32 | 33 | 34 | .. py:method:: llen(key) 35 | 36 | Returns the length of the list stored at `key`. If `key` does not exist, its value is treated as if it were 37 | an empty list, and 0 is returned. If `key` exists but is not a list, an error is raised. 38 | 39 | :: 40 | 41 | redis().llen("prod_eventlog_queue") 42 | 43 | 44 | .. py:method:: lrange(key, start, stop) 45 | 46 | Returns the elements of the list stored at `key` in the range [`start`, `stop`]. If `key` does not 47 | exist, its value is treated as if it were an empty list. If `key` exists but is not a list, an 48 | error is raised. 49 | 50 | The parameters `start` and `stop` are zero-based indexes. Negative numbers are converted to indexes 51 | by adding the length of the list, so that ``-1`` is the last element of the list, ``-2`` the 52 | second-to-last element of the list, and so on. 53 | 54 | Indexes outside the range of the list are not an error: If both `start` and `stop` are less than 0 or 55 | greater than or equal to the length of the list, an empty list is returned. Otherwise, if `start` is 56 | less than 0, it is treated as if it were 0, and if `stop` is greater than or equal to the length 57 | of the list, it is treated as if it were equal to the length of the list minus 1. If `start` is 58 | greater than `stop`, an empty list is returned.
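
The index handling described above can be sketched in plain Python. This is only an illustration of the inclusive [`start`, `stop`] semantics applied to an ordinary list; the helper name ``lrange_like`` is hypothetical and is not part of the ZMON Redis wrapper.

```python
def lrange_like(items, start, stop):
    """Illustrative helper (not a ZMON API): mimic the inclusive
    [start, stop] range rules described above on a plain Python list."""
    n = len(items)
    # Negative indexes are converted by adding the length of the list.
    if start < 0:
        start += n
    if stop < 0:
        stop += n
    # Out-of-range indexes are clamped rather than treated as errors.
    start = max(start, 0)
    stop = min(stop, n - 1)
    # An empty range (including "both indexes outside the list") yields [].
    if start > stop:
        return []
    # Python slices exclude the end index, so add 1 to make it inclusive.
    return items[start:stop + 1]
```

Called with ``(0, 9)`` on a 20-element list, this sketch yields ten elements, and with ``(0, -1)`` it yields the whole list.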
59 | 60 | Note that this method is subtly different from Python's list slicing syntax, where ``list[start:stop]`` 61 | returns elements in the range [`start`, `stop`). 62 | 63 | :: 64 | 65 | redis().lrange("prod_eventlog_queue", 0, 9) # Returns *ten* elements! 66 | redis().lrange("prod_eventlog_queue", 0, -1) # Returns the entire list. 67 | 68 | 69 | .. py:method:: get(key) 70 | 71 | Returns the string stored at `key`. If `key` does not exist, returns ``None``. If `key` exists 72 | but is not a string, an error is raised. 73 | 74 | :: 75 | 76 | redis().get("example_redis_key") 77 | 78 | 79 | .. py:method:: keys(pattern) 80 | 81 | Returns list of keys from Redis matching pattern. 82 | 83 | :: 84 | 85 | redis().keys("*downtime*") 86 | 87 | 88 | .. py:method:: hget(key, field) 89 | 90 | Returns the value of the field `field` of the hash `key`. If `key` does not exist or does not have 91 | a field named `field`, returns ``None``. If `key` exists but is not a hash, an error is raised. 92 | 93 | :: 94 | 95 | redis().hget("example_hash_key", "example_field_name") 96 | 97 | 98 | .. py:method:: hgetall(key) 99 | 100 | Returns a ``dict`` of all fields of the hash `key`. If `key` does not exist, returns an empty ``dict``. 101 | If `key` exists but is not a hash, an error is raised. 102 | 103 | :: 104 | 105 | redis().hgetall("example_hash_key") 106 | 107 | 108 | .. py:method:: hlen(key) 109 | 110 | Returns number of keys in hash ``key``. 111 | 112 | :: 113 | 114 | redis().hlen("example_hash_key") 115 | 116 | 117 | .. py:method:: scan(cursor, [match=None], [count=None]) 118 | 119 | Returns a ``set`` with the next cursor and the results from this scan. 120 | Please see the Redis documentation on how to use this function exactly: http://redis.io/commands/scan 121 | 122 | :: 123 | 124 | redis().scan(0, 'prefix*', 10) 125 | 126 | 127 | .. py:method:: smembers(key) 128 | 129 | Returns members of set ``key`` in Redis. 
130 | 131 | :: 132 | 133 | redis().smembers("zmon:alert:1") 134 | 135 | 136 | .. py:method:: ttl(key) 137 | 138 | Return the time to live of an expiring key. 139 | 140 | :: 141 | 142 | redis().ttl('lock') 143 | 144 | .. py:method:: scard(key) 145 | 146 | Return the number of elements in set ``key`` 147 | 148 | :: 149 | 150 | redis().scard("example_hash_key") 151 | 152 | 153 | .. py:method:: zcard(key) 154 | 155 | Return the number of elements in the sorted set ``key`` 156 | 157 | :: 158 | 159 | redis().zcard("example_sorted_set_key") 160 | 161 | 162 | .. py:method:: info([section]) 163 | 164 | Returns a ``dict`` containing all information exposed by the `Redis INFO command`_. 165 | 166 | .. py:method:: statistics() 167 | 168 | Returns a ``dict`` with general Redis statistics such as memory usage and operations/s. 169 | All values are extracted using the `Redis INFO command`_. 170 | 171 | Example result: 172 | 173 | .. code-block:: json 174 | 175 | { 176 | "blocked_clients": 2, 177 | "commands_processed_per_sec": 15946.48, 178 | "connected_clients": 162, 179 | "connected_slaves": 0, 180 | "connections_received_per_sec": 0.5, 181 | "dbsize": 27351, 182 | "evicted_keys_per_sec": 0.0, 183 | "expired_keys_per_sec": 0.0, 184 | "instantaneous_ops_per_sec": 29626, 185 | "keyspace_hits_per_sec": 1195.43, 186 | "keyspace_misses_per_sec": 1237.99, 187 | "used_memory": 50781216, 188 | "used_memory_rss": 63475712 189 | } 190 | 191 | Please note that the values for both `used_memory` and `used_memory_rss` are in Bytes. 192 | 193 | .. _Redis documentation: http://redis.io/ 194 | .. 
_Redis INFO command: http://redis.io/commands/info 195 | -------------------------------------------------------------------------------- /docs/getting-started.rst: -------------------------------------------------------------------------------- 1 | *************** 2 | Getting Started 3 | *************** 4 | 5 | To quickly get started with ZMON, use the preconfigured Vagrant box featured on the `main ZMON repository`_. 6 | Make sure you've installed Vagrant *(at least 1.7.4)* and a Vagrant provider like VirtualBox on your machine. 7 | Clone the repository with Git: 8 | 9 | .. code-block:: bash 10 | 11 | $ git clone https://github.com/zalando/zmon.git 12 | $ cd zmon/ 13 | 14 | From within the cloned repository, run: 15 | 16 | .. code-block:: bash 17 | 18 | $ vagrant up 19 | 20 | Bootstrapping the image for the first time will take a bit of time. 21 | You might want to grab some coffee while you wait. :) 22 | 23 | When it's finally up, Vagrant will report on how to reach the ZMON web interface: 24 | 25 | .. code-block:: bash 26 | 27 | ==> default: ZMON installation is done! 28 | ==> default: Goto: https://localhost:8443 29 | ==> default: Login with your GitHub credentials 30 | 31 | Creating Your First Alert 32 | ========================= 33 | 34 | Log In 35 | ------ 36 | 37 | Open your web browser and navigate to the URL reported by Vagrant: e.g. https://localhost:8443/. 38 | Click on *Sign In*. This will redirect you to Github where you sign in and authorize the ZMON app. 39 | Then it takes you back and you are logged in. 40 | 41 | .. note:: 42 | 43 | For your own deployment create your own app in Github with your redirect URL. 44 | In ZMON you can then limit users allowed access to your Github organization. 
45 | 46 | Checks and Alerts 47 | ----------------- 48 | 49 | An alert shown on ZMON's dashboard typically consists of two parts: the *check-definition*, which is responsible for 50 | fetching the underlying data; and the *alert-definition*, which defines the condition under which the alert will trigger. 51 | Multiple alerts with different alert conditions can operate on the same check, fetching data only once. 52 | 53 | Let's explore this concept now by creating a simple check and defining some alerts on it. 54 | 55 | Create a new Check 56 | ------------------ 57 | 58 | One way to create a new check from scratch is via the :ref:`cli-usage`. 59 | A more convenient way, however, is to use the "Trial Run" feature. 60 | It enables you to develop checks and alerts, execute them immediately, and inspect the result. 61 | Once you are happy with your check command and filter, you can save it from the Trial Run directly. 62 | Some users prefer to download the YAML definition from there to store and maintain it in Git. 63 | 64 | Create an Alert 65 | --------------- 66 | 67 | In the top navigation of ZMON's web interface, select `Check defs `_ from the list and click on *Website HTTP status*. 68 | Then click *"Add New Alert Definition"* to create a new alert for this particular check. 69 | Fill out the form (see example values below), and hit *"Save"*: 70 | 71 | ==================== ========================== 72 | **Name** Oops ... website is gone! 73 | -------------------- -------------------------- 74 | **Description** Website was not reachable. 
75 | -------------------- -------------------------- 76 | **Priority** Priority 1 (red) 77 | -------------------- -------------------------- 78 | **Alert Condition** value != 200 79 | -------------------- -------------------------- 80 | **Team** Team 1 81 | -------------------- -------------------------- 82 | **Responsible Team** Team 1 83 | -------------------- -------------------------- 84 | **Status** ACTIVE 85 | ==================== ========================== 86 | 87 | After you hit *"Save"*, it will take a few seconds until the alert is picked up and executed. 88 | 89 | View Dashboard 90 | -------------- 91 | 92 | If the alert's condition evaluates to anything but ``False``, the alert will appear on the dashboard. 93 | This covers not only ``True`` but also exceptions, e.g. those triggered by timeouts or a failure to connect. 94 | Currently there's only one dashboard, and it is configured to show all present alerts. 95 | To view the dashboard, select `Dashboards `_ from the main menu and click on *Example Dashboard*. 96 | 97 | To see the alert, you must simulate the error condition; try modifying its condition or the check-definition to return an error code. 98 | To do this, set the URL in the check command to http://httpstat.us/500. 99 | (The number in the URL represents the HTTP error code you will get.) 100 | 101 | To see the actual error code in the alert, you might want to create/modify it like this: 102 | 103 | ==================== ================================ 104 | **Name** Website gone with status {code} 105 | -------------------- -------------------------------- 106 | **Description** Website was not reachable.
107 | -------------------- -------------------------------- 108 | **Priority** Priority 1 (red) 109 | -------------------- -------------------------------- 110 | **Alert Condition** capture(code=value)!=200 111 | -------------------- -------------------------------- 112 | **Team** Team 1 113 | -------------------- -------------------------------- 114 | **Responsible Team** Team 1 115 | -------------------- -------------------------------- 116 | **Status** ACTIVE 117 | ==================== ================================ 118 | 119 | .. _cli-usage: 120 | 121 | Using the CLI 122 | ============= 123 | 124 | The ZMON Vagrant box comes preinstalled with *zmon-cli*. 125 | To use the CLI, log in to the running Vagrant box with: 126 | 127 | .. code-block:: bash 128 | 129 | $ vagrant ssh 130 | 131 | The Vagrant box also contains some sample yaml files for creating entities, checks and alerts. 132 | You can find these in */vagrant/examples*. 133 | 134 | As an example of using ZMON's CLI, let's create a check to verify that google.com is reachable. 135 | *cd* to */vagrant/examples/check-definitions* and, using zmon-cli, create a new check-definition: 136 | 137 | .. code-block:: bash 138 | 139 | $ cd /vagrant/examples/check-definitions 140 | $ zmon check-definitions init website-availability.yaml 141 | $ vim website-availability.yaml 142 | 143 | Edit the newly created *website-availability.yaml* to contain the following code. (type :kbd:`i` for insert-mode) 144 | 145 | .. code-block:: yaml 146 | 147 | name: "Website HTTP status" 148 | owning_team: "Team 1" 149 | command: http("http://httpstat.us/200", timeout=5).code() 150 | description: "Returns current http status code for Website" 151 | interval: 60 152 | entities: 153 | - type: GLOBAL 154 | status: ACTIVE 155 | 156 | Type :kbd:`ESC :wq RETURN` to save the file. 157 | 158 | To push the updated check definition to ZMON, run: 159 | 160 | .. 
code-block:: bash 161 | 162 | $ zmon check-definitions update website-availability.yaml 163 | Updating check definition... http://localhost:8080/#/check-definitions/view/2 164 | 165 | Find more detailed information here: :ref:`zmon-cli`. 166 | 167 | .. _main ZMON repository: https://github.com/zalando/zmon 168 | -------------------------------------------------------------------------------- /docs/user/check-ref/sql_wrappers.rst: -------------------------------------------------------------------------------- 1 | .. _sql-function: 2 | 3 | SQL 4 | --- 5 | 6 | .. py:function:: sql([shard]) 7 | 8 | Provides a wrapper for connecting to a PostgreSQL database and allows 9 | executing queries. All queries are executed in read-only transactions. 10 | The connection wrapper requires one parameter: a list of shard connections. 11 | The shard connections must come from the entity definition (see :ref:`database-entities`). 12 | Example query for a log database which returns a primitive long value: 13 | 14 | .. 
code-block:: python 15 | 16 | sql().execute("SELECT count(*) FROM zl_data.log WHERE log_created > now() - '1 hour'::interval").result() 17 | 18 | Example query which will return a single dict with keys ``a`` and ``b``:: 19 | 20 | sql().execute('SELECT 1 AS a, 2 AS b').result() 21 | 22 | The SQL wrapper will automatically sum up values over all shards:: 23 | 24 | sql().execute('SELECT count(1) FROM zc_data.customer').result() # will return a single integer value (sum over all shards) 25 | 26 | It's also possible to query a single shard by providing its name:: 27 | 28 | sql(shard='customer1').execute('SELECT COUNT(1) AS c FROM zc_data.customer').results() # returns list of values from a single shard 29 | 30 | It's also possible to query another database on the same server overwriting the shards information:: 31 | 32 | sql(shards={'customer_db' : entity['host'] + ':' + str(entity['port']) + '/another_db'}).execute('SELECT COUNT(1) AS c FROM my_table').results() 33 | 34 | To execute a SQL statement on all LIVE customer shards, for example, use the following entity filter: 35 | 36 | .. code-block:: json 37 | 38 | [ 39 | { 40 | "type": "database", 41 | "name": "customer", 42 | "environment": "live", 43 | "role": "master" 44 | } 45 | ] 46 | 47 | The check command will have the form 48 | 49 | .. 
code-block:: python 50 | 51 | >>> sql().execute('SELECT 1 AS a').result() 52 | 8 53 | # Returns a single value: the sum over the result from all shards 54 | 55 | >>> sql().execute('SELECT 1 AS a').results() 56 | [{'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}] 57 | # Returns a list of the results from all shards 58 | 59 | >>> sql(shard='customer1').execute('SELECT 1 AS a').results() 60 | [{'a': 1}] 61 | # Returns the result from the specified shard in a list of length one 62 | 63 | >>> sql().execute('SELECT 1 AS a, 2 AS b').result() 64 | {'a': 8, 'b': 16} 65 | # Returns a dict of the two values, which are each the sum over the result from all shards 66 | 67 | The results() function has several additional parameters: :: 68 | 69 | sql().execute('SELECT 1 AS ONE, 2 AS TWO FROM dual').results([max_results=100], [raise_if_limit_exceeded=True]) 70 | 71 | ``max_results`` 72 | The maximum number of rows you expect to get from the call. If not specified, defaults to 100. You cannot have an 73 | unlimited number of rows. There is an absolute maximum of 1,000,000 results that cannot be overridden. 74 | Note: If you require processing of larger dataset, it 75 | is recommended to revisit architecture of your monitoring subsystem and possibly move logic that does calculation 76 | into external web service callable by ZMON 2.0. 77 | 78 | ``raise_if_limit_exceeded`` 79 | Raises an exception if the limit of rows would have been exceeded by the issued query. 80 | 81 | .. py:function:: orasql() 82 | 83 | Provides a wrapper for connection to Oracle database and allows 84 | executing queries. All queries are executed in read-only transactions. 85 | The connection wrapper requires three parameters: host, port and sid, 86 | that must come from the entity definition (see :ref:`database-entities`). 
87 | One idiosyncratic behaviour to be aware of is that when your query produces 88 | more than one value, and you get a dict with keys being the column names 89 | or aliases you used in your query, you will always get back those keys 90 | *in uppercase*. For clarity, we recommend that you write all aliases 91 | and column names in uppercase, to avoid confusion due to case changes. 92 | Example of the simplest query, which returns a single value: 93 | 94 | .. code-block:: python 95 | 96 | orasql().execute("SELECT 'OK' from dual").result() 97 | 98 | Example query which will return a single dict with keys ``ONE`` and ``TWO``:: 99 | 100 | orasql().execute('SELECT 1 AS ONE, 2 AS TWO from dual').result() 101 | 102 | To execute a SQL statement on a LIVE server, tagged with the name business_intelligence, for example, 103 | use the following entity filter: 104 | 105 | .. code-block:: json 106 | 107 | [ 108 | { 109 | "type": "oracledb", 110 | "name": "business_intelligence", 111 | "environment": "live", 112 | "role": "master" 113 | } 114 | ] 115 | 116 | 117 | .. py:function:: exacrm() 118 | 119 | Provides a wrapper for connecting to the CRM Exasol database and executing 120 | queries. 121 | The connection wrapper requires one parameter: the query. 122 | 123 | Example query: 124 | 125 | .. code-block:: python 126 | 127 | exacrm().execute("SELECT 'OK';").result() 128 | 129 | To execute a SQL statement on the itr-crmexa* servers, use the following 130 | entity filter: 131 | 132 | .. code-block:: json 133 | 134 | [ 135 | { 136 | "type": "host", 137 | "host_role_id": "117" 138 | } 139 | ] 140 | 141 | .. py:function:: mysql([shard]) 142 | 143 | Provides a wrapper for connecting to a MySQL database and allows 144 | executing queries. 145 | The connection wrapper requires one parameter: a list of shard connections. 146 | The shard connections must come from the entity definition (see :ref:`database-entities`).
147 | Example query of the simplest query, which returns a single value: 148 | 149 | .. code-block:: python 150 | 151 | mysql().execute("SELECT count(*) FROM mysql.user").result() 152 | 153 | Example query which will return a single dict with keys ``h`` and ``u``:: 154 | 155 | mysql().execute('SELECT host AS h, user AS u FROM mysql.user').result() 156 | 157 | The SQL wrapper will automatically sum up values over all shards:: 158 | 159 | mysql().execute('SELECT count(1) FROM zc_data.customer').result() # will return a single integer value (sum over all shards) 160 | 161 | It's also possible to query a single shard by providing its name:: 162 | 163 | mysql(shard='customer1').execute('SELECT COUNT(1) AS c FROM zc_data.customer').results() # returns list of values from a single shard 164 | 165 | To execute a SQL statement on all LIVE customer shards, for example, use the following entity filter: 166 | 167 | .. code-block:: json 168 | 169 | [ 170 | { 171 | "type": "mysqldb", 172 | "name": "lounge", 173 | "environment": "live", 174 | "role": "master" 175 | } 176 | ] 177 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # User-friendly check for sphinx-build 11 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $?), 1) 12 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) 13 | endif 14 | 15 | # Internal variables. 
16 | PAPEROPT_a4 = -D latex_paper_size=a4 17 | PAPEROPT_letter = -D latex_paper_size=letter 18 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 19 | # the i18n builder cannot share the environment and doctrees with the others 20 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 21 | 22 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 23 | 24 | help: 25 | @echo "Please use \`make ' where is one of" 26 | @echo " html to make standalone HTML files" 27 | @echo " dirhtml to make HTML files named index.html in directories" 28 | @echo " singlehtml to make a single large HTML file" 29 | @echo " pickle to make pickle files" 30 | @echo " json to make JSON files" 31 | @echo " htmlhelp to make HTML files and a HTML help project" 32 | @echo " qthelp to make HTML files and a qthelp project" 33 | @echo " devhelp to make HTML files and a Devhelp project" 34 | @echo " epub to make an epub" 35 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 36 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 37 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" 38 | @echo " text to make text files" 39 | @echo " man to make manual pages" 40 | @echo " texinfo to make Texinfo files" 41 | @echo " info to make Texinfo files and run them through makeinfo" 42 | @echo " gettext to make PO message catalogs" 43 | @echo " changes to make an overview of all changed/added/deprecated items" 44 | @echo " xml to make Docutils-native XML files" 45 | @echo " pseudoxml to make pseudoxml-XML files for display purposes" 46 | @echo " linkcheck to check all external links for integrity" 47 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 48 | 49 | clean: 50 | rm -rf $(BUILDDIR)/* 51 | 52 | html: 53 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 54 | @echo 55 | @echo "Build 
finished. The HTML pages are in $(BUILDDIR)/html." 56 | 57 | dirhtml: 58 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 59 | @echo 60 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 61 | 62 | singlehtml: 63 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 64 | @echo 65 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 66 | 67 | pickle: 68 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 69 | @echo 70 | @echo "Build finished; now you can process the pickle files." 71 | 72 | json: 73 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 74 | @echo 75 | @echo "Build finished; now you can process the JSON files." 76 | 77 | htmlhelp: 78 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 79 | @echo 80 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 81 | ".hhp project file in $(BUILDDIR)/htmlhelp." 82 | 83 | qthelp: 84 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 85 | @echo 86 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 87 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 88 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/PyScaffold.qhcp" 89 | @echo "To view the help file:" 90 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/PyScaffold.qhc" 91 | 92 | devhelp: 93 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 94 | @echo 95 | @echo "Build finished." 96 | @echo "To view the help file:" 97 | @echo "# mkdir -p $HOME/.local/share/devhelp/PyScaffold" 98 | @echo "# ln -s $(BUILDDIR)/devhelp $HOME/.local/share/devhelp/PyScaffold" 99 | @echo "# devhelp" 100 | 101 | epub: 102 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 103 | @echo 104 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 
105 | 106 | latex: 107 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 108 | @echo 109 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 110 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 111 | "(use \`make latexpdf' here to do that automatically)." 112 | 113 | latexpdf: 114 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 115 | @echo "Running LaTeX files through pdflatex..." 116 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 117 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 118 | 119 | latexpdfja: 120 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 121 | @echo "Running LaTeX files through platex and dvipdfmx..." 122 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja 123 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 124 | 125 | text: 126 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 127 | @echo 128 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 129 | 130 | man: 131 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 132 | @echo 133 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 134 | 135 | texinfo: 136 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 137 | @echo 138 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 139 | @echo "Run \`make' in that directory to run these through makeinfo" \ 140 | "(use \`make info' here to do that automatically)." 141 | 142 | info: 143 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 144 | @echo "Running Texinfo files through makeinfo..." 145 | make -C $(BUILDDIR)/texinfo info 146 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 147 | 148 | gettext: 149 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 150 | @echo 151 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 
152 | 153 | changes: 154 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 155 | @echo 156 | @echo "The overview file is in $(BUILDDIR)/changes." 157 | 158 | linkcheck: 159 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 160 | @echo 161 | @echo "Link check complete; look for any errors in the above output " \ 162 | "or in $(BUILDDIR)/linkcheck/output.txt." 163 | 164 | doctest: 165 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 166 | @echo "Testing of doctests in the sources finished, look at the " \ 167 | "results in $(BUILDDIR)/doctest/output.txt." 168 | 169 | xml: 170 | $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml 171 | @echo 172 | @echo "Build finished. The XML files are in $(BUILDDIR)/xml." 173 | 174 | pseudoxml: 175 | $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml 176 | @echo 177 | @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 178 | -------------------------------------------------------------------------------- /docs/developer/python-tutorial.rst: -------------------------------------------------------------------------------- 1 | .. _python-tutorial: 2 | 3 | *********************** 4 | A Short Python Tutorial 5 | *********************** 6 | 7 | This tutorial explains by example how to process a :py:obj:`dict` using Python's list comprehension facilities. 8 | 9 | Suppose we're interested in the total number of order failures. 10 | 11 | #. First, we need to query the appropriate endpoint to get the data, and call the :py:meth:`json` method. :: 12 | 13 | http('http://www.example.com/foo/bar/data.json').json() 14 | 15 | This endpoint returns JSON data that is structured as follows (with much of the data omitted):: 16 | 17 | { 18 | ... 19 | "itr-http04_orderfails": [1, 0], 20 | "itr-http05_addtocart": [0.05, 0.0875], 21 | "http17_addtocart": [0.075, 0.066667], 22 | "http27_requests": [14.666667, 12.195833], 23 | "http13_orderfails": [null, 2], 24 | ...
25 | } 26 | 27 | The parsed object will therefore be a :py:obj:`dict` mapping strings to lists of numbers, which may contain :py:obj:`None` values. 28 | 29 | #. We need to find all entries ending in :samp:`_orderfails`. In Python, we can transform a :py:obj:`dict` into a list of tuples :samp:`({key}, {value})` using the :py:meth:`items` method:: 30 | 31 | http(...).json().items() 32 | 33 | We now need to filter this list to include only order failure information. Using a loop and an if statement, this could be accomplished like this:: 34 | 35 | result = [] 36 | for key, value in http(...).json().items(): 37 | if key.endswith('_orderfails'): 38 | result.append(value) 39 | 40 | (Note how the tuples in the list returned by :py:meth:`items` are automatically "unpacked", their elements being assigned to :py:obj:`key` and :py:obj:`value`, respectively.) 41 | 42 | Since the check command needs to be a single expression, not a series of statements, this is unfortunately not an option. Fortunately, Python provides a feature called list comprehension, which allows us to express the code above as follows:: 43 | 44 | [value for key, value in http(...).json().items() if key.endswith('_orderfails')] 45 | 46 | That is, code of the form :: 47 | 48 | result = [] 49 | for ELEMENT in LIST: 50 | if CONDITION: 51 | result.append(RESULT_ELEMENT) 52 | 53 | becomes :: 54 | 55 | [RESULT_ELEMENT for ELEMENT in LIST if CONDITION] 56 | 57 | (The ``if CONDITION`` part is optional.) 58 | 59 | We now have a list of lists ``[[1, 0], [None, 2]]``. 60 | 61 | #. In order to sum the list, we'd need to flatten it first, so that it has the form ``[1, 0, None, 2]``. This can be accomplished with the :py:func:`chain` function. Given one or more iterable objects (such as lists), :py:func:`chain` returns a new iterable object produced by concatenating the given objects.
That is :: 62 | 63 | chain([1, 0], [None, 2]) 64 | 65 | would yield the elements :: 66 | 67 | [1, 0, None, 2] 68 | 69 | Unfortunately, the lists we want to chain are themselves elements of a list, and calling ``chain([[1, 0], [None, 2]])`` would just concatenate the list with nothing and return it unchanged. We therefore need to tell Python to unpack the list, so that each of its elements becomes a new argument for the invocation of :py:func:`chain`. 70 | 71 | This can be accomplished by the ``*`` operator:: 72 | 73 | chain(*[[1, 0], [None, 2]]) 74 | 75 | That is, our expression is now :: 76 | 77 | chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')]) 78 | 79 | #. Now we need to remove that pesky :py:obj:`None` from the list. This could be accomplished with another list comprehension:: 80 | 81 | [value for value in chain(...) if value is not None] 82 | 83 | For didactic reasons, we shall use the :py:func:`filter` function instead. :py:func:`filter` takes two arguments: a function that is called for each element in the filtered list and indicates whether that element should be in the resulting list, and the list that is to be filtered itself. We can create an anonymous function for this purpose using a lambda expression:: 84 | 85 | filter(lambda element: element is not None, chain(...)) 86 | 87 | In this case, we can use a somewhat obscure shortcut, though. If the function given to :py:func:`filter` is :py:obj:`None`, the identity function is used. Therefore, objects will be included in the resulting list if and only if they are "truthy", which :py:obj:`None` isn't. The integer :py:obj:`0` isn't truthy either, but this isn't a problem in this case since the presence or absence of zeros does not affect the sum. Therefore, we can use the expression :: 88 | 89 | filter(None, chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')])) 90 | 91 | #. Finally, we need to sum the elements of the list.
For that, we can just use the :py:func:`sum` function, so that the expression is now :: 92 | 93 | sum(filter(None, chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')]))) 94 | 95 | 96 | Python Recipes 97 | ============== 98 | 99 | .. describe:: Merging Data Into One Result 100 | 101 | You can merge heterogeneous data into a single result object: 102 | 103 | .. code-block:: python 104 | 105 | { 106 | 'http_data': http(...).json()[...], 107 | 'jmx_data': jmx().query(...).results()[...], 108 | 'sql_data': sql().execute(...)[...], 109 | } 110 | 111 | 112 | .. describe:: Mapping SQL Results by ID 113 | 114 | The SQL ``results()`` method returns a list of maps (``[{'id': 1, 'data': 1000}, {'id': 2, 'data': 2000}]``). You can convert this to a single map (``{1: 1000, 2: 2000}``) like this: 115 | 116 | .. code-block:: python 117 | 118 | { row['id']: row['data'] for row in sql().execute(...).results() } 119 | 120 | 121 | .. describe:: Using Multiple Captures 122 | 123 | If you have an alert condition such as 124 | 125 | .. code-block:: python 126 | 127 | FOO > 10 or BAR > 10 128 | 129 | adding captures is a bit tricky. If you use 130 | 131 | .. code-block:: python 132 | 133 | capture(foo=FOO) > 10 or capture(bar=BAR) > 10 134 | 135 | and both ``FOO`` and ``BAR`` are greater than 10, only ``foo`` will be captured because the ``or`` uses short-circuit evaluation (``True or X`` is true for all ``X``, so ``X`` doesn't need to be evaluated). Instead, you can use 136 | 137 | .. code-block:: python 138 | 139 | any([capture(foo=FOO) > 10, capture(bar=BAR) > 10]) 140 | 141 | which will always evaluate both comparisons and thus capture both values. 142 | 143 | 144 | .. describe:: Defining Temporary Variables 145 | 146 | You aren't supposed to be able to define variables, but you can work around this restriction as follows: 147 | 148 | ..
code-block:: python 149 | 150 | (lambda x: 151 | # Some complex operation using x multiple times 152 | )( 153 | x = sql().execute(...) # Some complex or expensive query 154 | ) 155 | 156 | 157 | .. describe:: Defining Functions 158 | 159 | Since you can define variables with the trick above, you can also define functions: 160 | 161 | .. code-block:: python 162 | 163 | (lambda f: 164 | # Some complex operation calling f multiple times 165 | )( 166 | f = lambda a, b, c: sql().execute(...) # Some code using the arguments a, b, and c 167 | ) 168 | 169 | -------------------------------------------------------------------------------- /docs/user/check-ref/cloudwatch_wrapper.rst: -------------------------------------------------------------------------------- 1 | .. _cloudwatch: 2 | 3 | CloudWatch 4 | ---------- 5 | 6 | If running on AWS you can use ``cloudwatch()`` to access AWS metrics easily. 7 | 8 | .. py:function:: cloudwatch(region=None, assume_role_arn=None) 9 | 10 | Initialize CloudWatch wrapper. 11 | 12 | :param region: AWS region for CloudWatch queries. Will be auto-detected if not supplied. 13 | :type region: str 14 | 15 | :param assume_role_arn: AWS IAM role ARN to be assumed. This can be useful in cross-account CloudWatch queries. 16 | :type assume_role_arn: str 17 | 18 | 19 | Methods of Cloudwatch 20 | ^^^^^^^^^^^^^^^^^^^^^ 21 | 22 | .. py:method:: query_one(dimensions, metric_name, statistics, namespace, period=60, minutes=5, start=None, end=None, extended_statistics=None) 23 | 24 | Query a single AWS CloudWatch metric and return a single scalar value (float). 25 | Metric will be aggregated over the last five minutes using the provided aggregation type. 26 | 27 | This method is a more low-level variant of the ``query`` method: all parameters, including all dimensions need to be known. 28 | 29 | :param dimensions: Cloudwatch dimensions. 
Example ``{'LoadBalancerName': 'my-elb-name'}`` 30 | :type dimensions: dict 31 | 32 | :param metric_name: Cloudwatch metric. Example ``'Latency'``. 33 | :type metric_name: str 34 | 35 | :param statistics: Cloudwatch metric statistics. Example ``'Sum'`` 36 | :type statistics: list 37 | 38 | :param namespace: Cloudwatch namespace. Example ``'AWS/ELB'`` 39 | :type namespace: str 40 | 41 | :param period: Cloudwatch statistics granularity in seconds. Default is 60. 42 | :type period: int 43 | 44 | :param minutes: Used to determine ``start`` time of the Cloudwatch query. Default is 5. Ignored if ``start`` is supplied. 45 | :type minutes: int 46 | 47 | :param start: Cloudwatch start timestamp. Default is ``None``. 48 | :type start: int 49 | 50 | :param end: Cloudwatch end timestamp. Default is ``None``. If not supplied, then end time is now. 51 | :type end: int 52 | 53 | :param extended_statistics: Cloudwatch ExtendedStatistics for percentiles query. Example ``['p95', 'p99']``. 54 | :type extended_statistics: list 55 | 56 | :return: Return a float if single value, dict otherwise. 57 | :rtype: float, dict 58 | 59 | 60 | Example query with percentiles for AWS ALB: 61 | 62 | .. code-block:: python 63 | 64 | cloudwatch().query_one({'LoadBalancer': 'app/my-alb/1234'}, 'TargetResponseTime', 'Average', 'AWS/ApplicationELB', extended_statistics=['p95', 'p99', 'p99.45']) 65 | { 66 | 'Average': 0.224, 67 | 'p95': 0.245, 68 | 'p99': 0.300, 69 | 'p99.45': 0.500 70 | } 71 | 72 | .. note:: 73 | 74 | In very rare cases, e.g. for ELB metrics, you may see only 1/2 or 1-2/3 of the value in ZMON due to a race condition of what data is already present in cloud watch. 75 | To fix this click "evaluate" on the alert, this will trigger the check and move its execution time to a new start time. 76 | 77 | .. py:method:: query(dimensions, metric_name, statistics='Sum', namespace=None, period=60, minutes=5) 78 | 79 | Query AWS CloudWatch for metrics. 
Metrics will be aggregated over the last five minutes using the provided aggregation type (default "Sum"). 80 | 81 | *dimensions* is a dictionary to filter the metrics to query. See the `list_metrics boto documentation`_. 82 | You can provide the special value "NOT_SET" for a dimension to only query metrics where the given key is not set. 83 | This makes sense e.g. for ELB metrics as they are available both per AZ ("AvailabilityZone" has a value) and aggregated over all AZs ("AvailabilityZone" not set). 84 | Additionally you can include the special "*" character in a dimension value to do fuzzy (shell globbing) matching. 85 | 86 | *metric_name* is the name of the metric to filter against (e.g. "RequestCount"). 87 | 88 | *namespace* is an optional namespace filter (e.g. "AWS/EC2"). 89 | 90 | To query an ELB for requests per second: 91 | 92 | .. code-block:: python 93 | 94 | # both using special "NOT_SET" and "*" in dimensions here: 95 | val = cloudwatch().query({'AvailabilityZone': 'NOT_SET', 'LoadBalancerName': 'pierone-*'}, 'RequestCount', 'Sum')['RequestCount'] 96 | requests_per_second = val / 60 97 | 98 | You can find existing metrics with the AWS CLI tools: 99 | 100 | .. code-block:: bash 101 | 102 | $ aws cloudwatch list-metrics --namespace "AWS/EC2" 103 | 104 | Use the "dimensions" argument to select which dimension(s) to aggregate over: 105 | 106 | .. code-block:: bash 107 | 108 | $ aws cloudwatch list-metrics --namespace "AWS/EC2" --dimensions Name=AutoScalingGroupName,Value=my-asg-FEYBCZF 109 | 110 | The desired metric can now be queried in ZMON: 111 | 112 | .. code-block:: python 113 | 114 | cloudwatch().query({'AutoScalingGroupName': 'my-asg-*'}, 'DiskReadBytes', 'Sum') 115 | 116 | 117 | .. _list_metrics boto documentation: http://boto.readthedocs.org/en/latest/ref/cloudwatch.html#boto.ec2.cloudwatch.CloudWatchConnection.list_metrics 118 | 119 | 120 | ..
py:method:: alarms(alarm_names=None, alarm_name_prefix=None, state_value=STATE_ALARM, action_prefix=None, max_records=50) 121 | 122 | Retrieve cloudwatch alarms filtered by state value. 123 | 124 | See `describe_alarms boto documentation`_ for more details. 125 | 126 | :param alarm_names: List of alarm names. 127 | :type alarm_names: list 128 | 129 | :param alarm_name_prefix: Prefix of alarms. Cannot be specified if ``alarm_names`` is specified. 130 | :type alarm_name_prefix: str 131 | 132 | :param state_value: State value used in alarm filtering. Available values are ``OK``, ``ALARM`` (default) and ``INSUFFICIENT_DATA``. 133 | :type state_value: str 134 | 135 | :param action_prefix: Action name prefix. Example ``arn:aws:autoscaling:`` to filter results for all autoscaling related alarms. 136 | :type action_prefix: str 137 | 138 | :param max_records: Maximum records to be returned. Default is 50. 139 | :type max_records: int 140 | 141 | :return: List of MetricAlarms. 142 | :rtype: list 143 | 144 | 145 | .. _describe_alarms boto documentation: http://boto3.readthedocs.io/en/latest/reference/services/cloudwatch.html#CloudWatch.Client.describe_alarms 146 | 147 | .. 
code-block:: python 148 | 149 | cloudwatch().alarms(state_value='ALARM')[0] 150 | { 151 | 'ActionsEnabled': True, 152 | 'AlarmActions': ['arn:aws:autoscaling:...'], 153 | 'AlarmArn': 'arn:aws:cloudwatch:...', 154 | 'AlarmConfigurationUpdatedTimestamp': datetime.datetime(2016, 5, 12, 10, 44, 15, 707000, tzinfo=tzutc()), 155 | 'AlarmDescription': 'Scale-down if CPU < 50% for 10.0 minutes (Average)', 156 | 'AlarmName': 'metric-alarm-for-service-x', 157 | 'ComparisonOperator': 'LessThanThreshold', 158 | 'Dimensions': [ 159 | { 160 | 'Name': 'AutoScalingGroupName', 161 | 'Value': 'service-x-asg' 162 | } 163 | ], 164 | 'EvaluationPeriods': 2, 165 | 'InsufficientDataActions': [], 166 | 'MetricName': 'CPUUtilization', 167 | 'Namespace': 'AWS/EC2', 168 | 'OKActions': [], 169 | 'Period': 300, 170 | 'StateReason': 'Threshold Crossed: 1 datapoint (36.1) was less than the threshold (50.0).', 171 | 'StateReasonData': '{...}', 172 | 'StateUpdatedTimestamp': datetime.datetime(2016, 5, 12, 10, 44, 16, 294000, tzinfo=tzutc()), 173 | 'StateValue': 'ALARM', 174 | 'Statistic': 'Average', 175 | 'Threshold': 50.0 176 | } 177 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | # 5 | # ZMON documentation build configuration file, created by 6 | # sphinx-quickstart on Fri Jan 24 19:24:14 2014. 7 | # 8 | # This file is execfile()d with the current directory set to its containing dir. 9 | # 10 | # Note that not all possible configuration values are present in this 11 | # autogenerated file. 12 | # 13 | # All configuration values have a default; values that are commented out 14 | # serve to show the default. 15 | 16 | import sys 17 | import os 18 | import sphinx_rtd_theme 19 | 20 | # If extensions (or modules to document with autodoc) are in another directory, 21 | # add these directories to sys.path here. 
If the directory is relative to the 22 | # documentation root, use os.path.abspath to make it absolute, like shown here. 23 | #sys.path.insert(0, os.path.abspath('../../zmon-worker/src')) 24 | 25 | # -- General configuration ----------------------------------------------------- 26 | 27 | # If your documentation needs a minimal Sphinx version, state it here. 28 | # needs_sphinx = '1.0' 29 | 30 | # Add any Sphinx extension module names here, as strings. They can be extensions 31 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 32 | extensions = ['sphinx.ext.viewcode', 'sphinx.ext.autodoc', 'sphinx.ext.intersphinx'] 33 | 34 | # Add any paths that contain templates here, relative to this directory. 35 | templates_path = ['_templates'] 36 | 37 | # The suffix of source filenames. 38 | source_suffix = '.rst' 39 | 40 | # The encoding of source files. 41 | # source_encoding = 'utf-8-sig' 42 | 43 | # The master toctree document. 44 | master_doc = 'index' 45 | 46 | # General information about the project. 47 | project = u'ZMON' 48 | copyright = u'2014, Zalando SE' 49 | 50 | # The version info for the project you're documenting, acts as replacement for 51 | # |version| and |release|, also used in various other places throughout the 52 | # built documents. 53 | # 54 | # The short X.Y version. 55 | version = '2.0' 56 | # The full version, including alpha/beta/rc tags. 57 | release = '2.0' 58 | 59 | # The language for content autogenerated by Sphinx. Refer to documentation 60 | # for a list of supported languages. 61 | # language = None 62 | 63 | # There are two options for replacing |today|: either, you set today to some 64 | # non-false value, then it is used: 65 | # today = '' 66 | # Else, today_fmt is used as the format for a strftime call. 67 | # today_fmt = '%B %d, %Y' 68 | 69 | # List of patterns, relative to source directory, that match files and 70 | # directories to ignore when looking for source files. 
71 | exclude_patterns = ['_build'] 72 | 73 | # The reST default role (used for this markup: `text`) to use for all documents. 74 | # default_role = None 75 | 76 | # If true, '()' will be appended to :func: etc. cross-reference text. 77 | # add_function_parentheses = True 78 | 79 | # If true, the current module name will be prepended to all description 80 | # unit titles (such as .. function::). 81 | # add_module_names = True 82 | 83 | # If true, sectionauthor and moduleauthor directives will be shown in the 84 | # output. They are ignored by default. 85 | # show_authors = False 86 | 87 | # The name of the Pygments (syntax highlighting) style to use. 88 | pygments_style = 'sphinx' 89 | 90 | # A list of ignored prefixes for module index sorting. 91 | # modindex_common_prefix = [] 92 | 93 | # -- Options for HTML output --------------------------------------------------- 94 | 95 | # The theme to use for HTML and HTML Help pages. See the documentation for 96 | # a list of builtin themes. 97 | html_theme = 'sphinx_rtd_theme' 98 | #html_style = 'zmon.css' 99 | 100 | # Theme options are theme-specific and customize the look and feel of a theme 101 | # further. For a list of options available for each theme, see the 102 | # documentation. 103 | 104 | # Add any paths that contain custom themes here, relative to this directory. 105 | # html_theme_path = [] 106 | 107 | # The name for this set of Sphinx documents. If None, it defaults to 108 | # " v documentation". 109 | # html_title = None 110 | 111 | # A shorter title for the navigation bar. Default is the same as html_title. 112 | # html_short_title = None 113 | 114 | # The name of an image file (relative to this directory) to place at the top 115 | # of the sidebar. 116 | # html_logo = None 117 | 118 | # The name of an image file (within the static path) to use as favicon of the 119 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 120 | # pixels large. 
121 | # html_favicon = None 122 | 123 | # Add any paths that contain custom static files (such as style sheets) here, 124 | # relative to this directory. They are copied after the builtin static files, 125 | # so a file named "default.css" will overwrite the builtin "default.css". 126 | html_static_path = ['_static'] 127 | 128 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 129 | # using the given strftime format. 130 | # html_last_updated_fmt = '%b %d, %Y' 131 | 132 | # If true, SmartyPants will be used to convert quotes and dashes to 133 | # typographically correct entities. 134 | # html_use_smartypants = True 135 | 136 | # Custom sidebar templates, maps document names to template names. 137 | # html_sidebars = {} 138 | 139 | # Additional templates that should be rendered to pages, maps page names to 140 | # template names. 141 | # html_additional_pages = {} 142 | 143 | # If false, no module index is generated. 144 | # html_domain_indices = True 145 | 146 | # If false, no index is generated. 147 | # html_use_index = True 148 | 149 | # If true, the index is split into individual pages for each letter. 150 | # html_split_index = False 151 | 152 | # If true, links to the reST sources are added to the pages. 153 | # html_show_sourcelink = True 154 | 155 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 156 | # html_show_sphinx = True 157 | 158 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 159 | # html_show_copyright = True 160 | 161 | # If true, an OpenSearch description file will be output, and all pages will 162 | # contain a tag referring to it. The value of this option must be the 163 | # base URL from which the finished HTML is served. 164 | # html_use_opensearch = '' 165 | 166 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 167 | # html_file_suffix = None 168 | 169 | # Output file base name for HTML help builder. 
170 | htmlhelp_basename = 'ZMONdoc' 171 | 172 | # -- Options for LaTeX output -------------------------------------------------- 173 | 174 | latex_elements = {} 175 | # The paper size ('letterpaper' or 'a4paper'). 176 | # 'papersize': 'letterpaper', 177 | 178 | # The font size ('10pt', '11pt' or '12pt'). 179 | # 'pointsize': '10pt', 180 | 181 | # Additional stuff for the LaTeX preamble. 182 | # 'preamble': '', 183 | 184 | # Grouping the document tree into LaTeX files. List of tuples 185 | # (source start file, target name, title, author, documentclass [howto/manual]). 186 | latex_documents = [( 187 | 'index', 188 | 'ZMON.tex', 189 | u'ZMON Documentation', 190 | u'Henning Jacobs', 191 | 'manual', 192 | )] 193 | 194 | # The name of an image file (relative to this directory) to place at the top of 195 | # the title page. 196 | # latex_logo = None 197 | 198 | # For "manual" documents, if this is true, then toplevel headings are parts, 199 | # not chapters. 200 | # latex_use_parts = False 201 | 202 | # If true, show page references after internal links. 203 | # latex_show_pagerefs = False 204 | 205 | # If true, show URL addresses after external links. 206 | # latex_show_urls = False 207 | 208 | # Documents to append as an appendix to all manuals. 209 | # latex_appendices = [] 210 | 211 | # If false, no module index is generated. 212 | # latex_domain_indices = True 213 | 214 | # -- Options for manual page output -------------------------------------------- 215 | 216 | # One entry per manual page. List of tuples 217 | # (source start file, name, description, authors, manual section). 218 | man_pages = [( 219 | 'index', 220 | 'zmon', 221 | u'ZMON Documentation', 222 | [u'Henning Jacobs'], 223 | 1, 224 | )] 225 | 226 | # If true, show URL addresses after external links. 227 | # man_show_urls = False 228 | 229 | # -- Options for Texinfo output ------------------------------------------------ 230 | 231 | # Grouping the document tree into Texinfo files. 
List of tuples 232 | # (source start file, target name, title, author, 233 | # dir menu entry, description, category) 234 | texinfo_documents = [( 235 | 'index', 236 | 'ZMON', 237 | u'ZMON Documentation', 238 | u'Henning Jacobs', 239 | 'ZMON', 240 | 'One line description of project.', 241 | 'Miscellaneous', 242 | )] 243 | 244 | # Documents to append as an appendix to all manuals. 245 | # texinfo_appendices = [] 246 | 247 | # If false, no module index is generated. 248 | # texinfo_domain_indices = True 249 | 250 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 251 | # texinfo_show_urls = 'footnote' 252 | 253 | intersphinx_mapping = {'python': ('http://docs.python.org/2', None)} 254 | -------------------------------------------------------------------------------- /docs/intro.rst: -------------------------------------------------------------------------------- 1 | ************ 2 | Introduction 3 | ************ 4 | 5 | ZMON is a flexible and extensible open-source platform monitoring tool developed at Zalando_ and has been in production use since early 2014. 6 | It offers proven scaling with its distributed nature and fast storage with KairosDB on top of Cassandra. 7 | ZMON splits checking (data acquisition) from the alerting responsibilities and uses abstract entities to describe what's being monitored. 8 | Its checks and alerts rely on Python expressions, giving the user a lot of power and connectivity. 9 | Besides the UI it provides RESTful APIs to manage and configure most properties automatically. 10 | 11 | Anyone can use ZMON, but it offers particular advantages for technical organizations with many autonomous teams. 12 | Its front end (see Demo_ / Bootstrap_ / Kubernetes_ / Vagrant_) comes with Grafana3 "built-in," enabling teams to create and manage their own data-driven dashboards alongside ZMON's own team/personal dashboards for alerts and custom widgets. 13 | Being able to inherit and clone alerts makes it easier for teams to reuse and share code.
14 | Alerts can trigger HipChat, Slack, and E-Mail notifications. 15 | iOS and Android clients are works in progress, but push notifications are already implemented. 16 | 17 | ZMON also enables painless integration with CMDBs and deployment tools. 18 | It also supports service discovery via custom adapters or its built-in entity service's REST API. 19 | For an example, see zmon-aws-agent_ to learn how we connect AWS service discovery with our monitoring in the cloud. 20 | 21 | Feel free to contact us via `slack.zmon.io`_. 22 | 23 | ZMON Components 24 | =============== 25 | 26 | .. image:: images/components.svg 27 | 28 | A minimum ZMON setup requires these four components: 29 | 30 | - zmon-controller_: UI/Grafana/Oauth2 Login/Github Login 31 | - zmon-scheduler_: Scheduling check/alert evaluation 32 | - zmon-worker_: Doing the heavy lifting 33 | - zmon-eventlog-service_: History for state changes and modifications 34 | 35 | Plus the storage covered in the :ref:`requirements` section. 36 | 37 | The following components are optional: 38 | 39 | - zmon-cli_: A command line client for managing entities/checks/alerts if needed 40 | - zmon-aws-agent_: Works with the AWS API to retrieve "known" applications 41 | - zmon-data-service_: API for multi DC federation: receiver for remote workers primarily 42 | - zmon-metric-cache_: Small scale special purpose metric store for API metrics in ZMON's cloud UI 43 | - zmon-notification-service_: Provides mobile API and push notification support for GCM to Android/iOS app 44 | - zmon-android_: An Android client for ZMON monitoring 45 | - zmon-ios_: An iOS client for ZMON monitoring 46 | 47 | ZMON Origins 48 | ============ 49 | 50 | ZMON was born in late 2013 during Zalando's annual `Hack Week`_, when a group of Zalando engineers aimed to develop a replacement for ICINGA. 51 | Scalability, manageability and flexibility were all critical, as Zalando's small teams needed to be able to monitor their services independent of each other. 
52 | In early 2014, Zalando teams began migrating all checks to ZMON, which continues to serve Zalando Tech. 53 | 54 | Entities 55 | ======== 56 | 57 | ZMON uses entities to describe your infrastructure or platform, and to bind check variables to fixed values. 58 | 59 | .. code-block:: json 60 | 61 | { 62 | "type":"host", 63 | "id":"cassandra01", 64 | "host":"cassandra01", 65 | "role":"cassandra-host", 66 | "ip":"192.168.1.17", 67 | "dc":"data-center-1" 68 | } 69 | 70 | Or more abstract objects: 71 | 72 | .. code-block:: json 73 | 74 | { 75 | "type":"postgresql-cluster", 76 | "id":"article-cluster", 77 | "name":"article-cluster", 78 | "shards": { 79 | "shard1":"articledb01:5432/shard1", 80 | "shard2":"articledb02:5432/shard2" 81 | } 82 | } 83 | 84 | Entity properties are not defined in any schema, so you can add properties as you see fit. This enables finer-grained filtering or selection of entities later on. As an example, host entities can include a physical model to later select the proper hardware checks. 85 | 86 | Below you see an example of the entity view with alerts per entity. 87 | 88 | .. image:: images/entities.png 89 | 90 | Checks 91 | ====== 92 | 93 | A check describes how data is acquired. Its key properties are: a command to execute and an entity filter. The filter selects a subset of entities by requiring an overlap on specified properties. An example: 94 | 95 | .. code-block:: json 96 | 97 | { 98 | "type":"postgresql-cluster", "name":"article-cluster" 99 | } 100 | 101 | The check command itself is an executable Python_ expression. ZMON provides many custom wrappers that bind to the selected entity. The following example uses a PostgreSQL wrapper to execute a query on every shard defined above: 102 | 103 | .. code-block:: python 104 | 105 | # sql() in this context is aware of the "shards" property 106 | 107 | sql().execute('SELECT count(1) FROM articles "total"').result() 108 | 109 | A check command always returns a value to the alert.
This can be of any Python type. 110 | 111 | Not familiar with Python's functional expressions? No worries: ZMON allows you to define a top-level function and define your command in an easier, less functional way: 112 | 113 | .. code-block:: python 114 | 115 | def check(): 116 | # sql() binds to the entity used and thus knows the connection URLs 117 | return sql().execute('SELECT count(1) FROM articles "total"').result() 118 | 119 | Alerts 120 | ====== 121 | 122 | A basic alert consists of an alert condition, an entity filter, and a team. 123 | An alert has only two states: up or down. 124 | An alert is up if it yields anything but False; this also includes exceptions thrown during evaluation of the check or alert, e.g. in the event of connection problems. 125 | ZMON does not support levels of criticality, or something like "unknown", but you have a color option to customize sort and style on your dashboard (red, orange, yellow). 126 | 127 | Let's revisit the PostgreSQL check above. The alert below would pop up either if no articles are found or if we get an exception connecting to the PostgreSQL database. 128 | 129 | .. code-block:: yaml 130 | 131 | team: database 132 | entities: 133 | - type: postgresql-cluster 134 | alert_condition: | 135 | value <= 0 136 | 137 | Alerts raised by exceptions are marked in the dashboard with a "!". 138 | 139 | Via ZMON's UI, alerts support parameters to the alert condition. 140 | This makes it easy for teams/users to implement different thresholds, and, with the priority field defining the dashboard color, render their dashboards to reflect their priorities. 141 | 142 | Dashboards 143 | ========== 144 | 145 | Dashboards include a widget area where you can render important data with charts, gauges, or plain text. 146 | Another section features rendering of all active alerts for the team filter, defined at the dashboard level. 147 | Using the team filter, select the alerts you want your dashboard to include.
148 | Specify multiple teams, if necessary. TAGs are supported to subselect topics. 149 | 150 | .. image:: images/dashboard.png 151 | 152 | REST API and CLI 153 | ================ 154 | 155 | To make your life easier, ZMON's REST API manages all the essential moving parts to support your daily work — creating and updating entities to allow for sync-up with your existing infrastructure. 156 | When you create and modify checks and alerts, the scheduler will quickly pick up these changes so you won't have to restart or deploy anything. 157 | 158 | And ZMON's command line client - a slim wrapper around the REST API - also adds usability by making it simpler to work with YAML files or push collections of entities. 159 | 160 | Development Status 161 | ================== 162 | The team behind ZMON continues to improve performance and functionality. Please let us know via GitHub's issues tracker if you find any bugs or issues. 163 | 164 | .. _Python: http://www.python.org 165 | .. _Zalando: https://tech.zalando.de/ 166 | .. _zmon-controller: https://github.com/zalando-zmon/zmon-controller 167 | .. _Demo: https://demo.zmon.io 168 | .. _Bootstrap: https://github.com/zalando-zmon/zmon-demo 169 | .. _Vagrant: https://github.com/zalando/zmon 170 | .. _zmon-scheduler: https://github.com/zalando-zmon/zmon-scheduler 171 | .. _zmon-worker: https://github.com/zalando-zmon/zmon-worker 172 | .. _zmon-eventlog-service: https://github.com/zalando-zmon/zmon-eventlog-service 173 | .. _zmon-android: https://github.com/zalando-zmon/zmon-android 174 | .. _zmon-ios: https://github.com/zalando-zmon/zmon-ios 175 | .. _zmon-cli: https://github.com/zalando-zmon/zmon-cli 176 | .. _zmon-actuator: https://github.com/zalando-zmon/zmon-actuator 177 | .. _zmon-aws-agent: https://github.com/zalando-zmon/zmon-aws-agent 178 | .. _zmon-data-service: https://github.com/zalando-zmon/zmon-data-service 179 | .. _zmon-notification-service: https://github.com/zalando-zmon/zmon-notification-service 180 | .. 
_zmon-metric-cache: https://github.com/zalando-zmon/zmon-metric-cache 181 | .. _Hack Week: https://tech.zalando.de/blog/?tags=Hack%20Week 182 | .. _slack.zmon.io: https://slack.zmon.io 183 | .. _Kubernetes: https://github.com/zalando-zmon/zmon-kubernetes 184 | -------------------------------------------------------------------------------- /docs/user/check-ref/kubernetes_wrapper.rst: -------------------------------------------------------------------------------- 1 | Kubernetes 2 | ---------- 3 | 4 | Provides a wrapper for querying Kubernetes cluster resources. 5 | 6 | 7 | .. py:function:: kubernetes(namespace='default') 8 | 9 | If ``namespace`` is ``None`` then **all** namespaces will be queried. This however will increase the number of calls to Kubernetes API server. 10 | 11 | .. note:: 12 | 13 | - Kubernetes wrapper will authenticate using service account, which assumes the worker is running in a Kubernetes cluster. 14 | - All Kubernetes wrapper calls are scoped to the Kubernetes cluster hosting the worker. It is not intended to be used in querying multiple clusters. 15 | 16 | .. _labelSelectors: 17 | 18 | Label Selectors 19 | ^^^^^^^^^^^^^^^ 20 | 21 | Kubernetes API provides a way to filter resources using `labelSelector `_. Kubernetes wrapper provides a friendly syntax for filtering. 22 | 23 | The following examples show different usage of the Kubernetes wrapper utilizing label filtering: 24 | 25 | .. 
code-block:: python 26 | 27 | # Get all pods with label ``application`` equal to ``zmon-worker`` 28 | kubernetes().pods(application='zmon-worker') 29 | kubernetes().pods(application__eq='zmon-worker') 30 | 31 | 32 | # Get all pods with label ``application`` **not equal to** ``zmon-worker`` 33 | kubernetes().pods(application__neq='zmon-worker') 34 | 35 | 36 | # Get all pods with label ``application`` **any of** ``zmon-worker`` or ``zmon-agent`` 37 | kubernetes().pods(application__in=['zmon-worker', 'zmon-agent']) 38 | 39 | # Get all pods with label ``application`` **not any of** ``zmon-worker`` or ``zmon-agent`` 40 | kubernetes().pods(application__notin=['zmon-worker', 'zmon-agent']) 41 | 42 | 43 | Methods of Kubernetes 44 | ^^^^^^^^^^^^^^^^^^^^^ 45 | 46 | .. py:function:: pods(name=None, phase=None, ready=None, **kwargs) 47 | 48 | Return list of `Pods `_. 49 | 50 | :param name: Pod name. 51 | :type name: str 52 | 53 | :param phase: Pod status phase. Valid values are: Pending, Running, Failed, Succeeded or Unknown. 54 | :type phase: str 55 | 56 | :param ready: Pod readiness status. If ``None`` then all pods are returned. 57 | :type ready: bool 58 | 59 | :param kwargs: Pod :ref:`labelSelectors` filters. 60 | :type kwargs: dict 61 | 62 | :return: List of pods. Typical pod has "metadata", "status" and "spec" fields. 63 | :rtype: list 64 | 65 | .. py:function:: nodes(name=None, **kwargs) 66 | 67 | Return list of `Nodes `_. Namespace does not apply. 68 | 69 | :param name: Node name. 70 | :type name: str 71 | 72 | :param kwargs: Node :ref:`labelSelectors` filters. 73 | :type kwargs: dict 74 | 75 | :return: List of nodes. Typical node has "metadata", "status" and "spec" fields. 76 | :rtype: list 77 | 78 | .. py:function:: services(name=None, **kwargs) 79 | 80 | Return list of `Services `_. 81 | 82 | :param name: Service name. 83 | :type name: str 84 | 85 | :param kwargs: Service :ref:`labelSelectors` filters. 86 | :type kwargs: dict 87 | 88 | :return: List of services.
Typical service has "metadata", "status" and "spec" fields. 89 | :rtype: list 90 | 91 | .. py:function:: endpoints(name=None, **kwargs) 92 | 93 | Return list of Endpoints. 94 | 95 | :param name: Endpoint name. 96 | :type name: str 97 | 98 | :param kwargs: Endpoint :ref:`labelSelectors` filters. 99 | :type kwargs: dict 100 | 101 | :return: List of Endpoints. Typical Endpoint has "metadata", and "subsets" fields. 102 | :rtype: list 103 | 104 | .. py:function:: ingresses(name=None, **kwargs) 105 | 106 | Return list of `Ingresses `_. 107 | 108 | :param name: Ingress name. 109 | :type name: str 110 | 111 | :param kwargs: Ingress :ref:`labelSelectors` filters. 112 | :type kwargs: dict 113 | 114 | :return: List of Ingresses. Typical Ingress has "metadata", "spec" and "status" fields. 115 | :rtype: list 116 | 117 | .. py:function:: statefulsets(name=None, replicas=None, **kwargs) 118 | 119 | Return list of `Statefulsets `_. 120 | 121 | :param name: Statefulset name. 122 | :type name: str 123 | 124 | :param replicas: Statefulset replicas. 125 | :type replicas: int 126 | 127 | :param kwargs: Statefulset :ref:`labelSelectors` filters. 128 | :type kwargs: dict 129 | 130 | :return: List of Statefulsets. Typical Statefulset has "metadata", "status" and "spec" fields. 131 | :rtype: list 132 | 133 | .. py:function:: daemonsets(name=None, **kwargs) 134 | 135 | Return list of `Daemonsets `_. 136 | 137 | :param name: Daemonset name. 138 | :type name: str 139 | 140 | :param kwargs: Daemonset :ref:`labelSelectors` filters. 141 | :type kwargs: dict 142 | 143 | :return: List of Daemonsets. Typical Daemonset has "metadata", "status" and "spec" fields. 144 | :rtype: list 145 | 146 | .. py:function:: replicasets(name=None, replicas=None, **kwargs) 147 | 148 | Return list of `ReplicaSets `_. 149 | 150 | :param name: ReplicaSet name. 151 | :type name: str 152 | 153 | :param replicas: ReplicaSet replicas. 
154 | :type replicas: int 155 | 156 | :param kwargs: ReplicaSet :ref:`labelSelectors` filters. 157 | :type kwargs: dict 158 | 159 | :return: List of ReplicaSets. Typical ReplicaSet has "metadata", "status" and "spec" fields. 160 | :rtype: list 161 | 162 | .. py:function:: deployments(name=None, replicas=None, ready=None, **kwargs) 163 | 164 | Return list of `Deployments `_. 165 | 166 | :param name: Deployment name. 167 | :type name: str 168 | 169 | :param replicas: Deployment replicas. 170 | :type replicas: int 171 | 172 | :param ready: Deployment readiness status. 173 | :type ready: bool 174 | 175 | :param kwargs: Deployment :ref:`labelSelectors` filters. 176 | :type kwargs: dict 177 | 178 | :return: List of Deployments. Typical Deployment has "metadata", "status" and "spec" fields. 179 | :rtype: list 180 | 181 | .. py:function:: configmaps(name=None, **kwargs) 182 | 183 | Return list of `ConfigMaps `_. 184 | 185 | :param name: ConfigMap name. 186 | :type name: str 187 | 188 | :param kwargs: ConfigMap :ref:`labelSelectors` filters. 189 | :type kwargs: dict 190 | 191 | :return: List of ConfigMaps. Typical ConfigMap has "metadata" and "data". 192 | :rtype: list 193 | 194 | .. py:function:: persistentvolumeclaims(name=None, phase=None, **kwargs) 195 | 196 | Return list of `PersistentVolumeClaims `_. 197 | 198 | :param name: PersistentVolumeClaim name. 199 | :type name: str 200 | 201 | :param phase: Volume phase. 202 | :type phase: str 203 | 204 | :param kwargs: PersistentVolumeClaim :ref:`labelSelectors` filters. 205 | :type kwargs: dict 206 | 207 | :return: List of PersistentVolumeClaims. Typical PersistentVolumeClaim has "metadata", "status" and "spec" fields. 208 | :rtype: list 209 | 210 | .. py:function:: persistentvolumes(name=None, phase=None, **kwargs) 211 | 212 | Return list of `PersistentVolumes `_. 213 | 214 | :param name: PersistentVolume name. 215 | :type name: str 216 | 217 | :param phase: Volume phase. 
218 | :type phase: str 219 | 220 | :param kwargs: PersistentVolume :ref:`labelSelectors` filters. 221 | :type kwargs: dict 222 | 223 | :return: List of PersistentVolumes. Typical PersistentVolume has "metadata", "status" and "spec" fields. 224 | :rtype: list 225 | 226 | .. py:function:: jobs(name=None, **kwargs) 227 | 228 | Return list of `Jobs `_. 229 | 230 | :param name: Job name. 231 | :type name: str 232 | 233 | :param **kwargs: Job labelSelector filters. 234 | :type **kwargs: dict 235 | 236 | :return: List of Jobs. Typical Job has "metadata", "status" and "spec". 237 | :rtype: list 238 | 239 | .. py:function:: cronjobs(name=None, **kwargs) 240 | 241 | Return list of `CronJobs `_. 242 | 243 | :param name: CronJob name. 244 | :type name: str 245 | 246 | :param **kwargs: CronJob labelSelector filters. 247 | :type **kwargs: dict 248 | 249 | :return: List of CronJobs. Typical CronJob has "metadata", "status" and "spec". 250 | :rtype: list 251 | 252 | .. py:function:: metrics() 253 | 254 | Return API server metrics in prometheus format. 255 | 256 | :return: Cluster metrics. 257 | :rtype: dict 258 | -------------------------------------------------------------------------------- /docs/user/monitoringonaws.rst: -------------------------------------------------------------------------------- 1 | .. _monitoringonaws: 2 | 3 | ***************** 4 | Monitoring on AWS 5 | ***************** 6 | 7 | This section assumes that you're running zmon-aws-agent_, which automatically discovers your EC2 instances, auto-scaling groups, ELBs, and more. 8 | 9 | ZMON AWS agent syncs the following entities from AWS infrastructure: 10 | 11 | - EC2 instances 12 | - Auto-Scaling groups 13 | - ELBs (classic and ELBv2) 14 | - Elasticaches 15 | - RDS instances 16 | - DynamoDB tables 17 | - IAM/ACM certificates 18 | 19 | .. note:: 20 | 21 | ZMON AWS Agent can also be deployed via a single `appliance`_, which runs AWS Agent, `ZMON worker`_ and `ZMON scheduler`_.
22 | 23 | CloudWatch Metrics 24 | ------------------ 25 | You can achieve most basic monitoring with AWS CloudWatch_. CloudWatch EC2 metrics contain the following information: 26 | 27 | - CPU Utilization 28 | - Network traffic 29 | - Disk throughput/operations per second (only for ephemeral storage; EBS volumes are not included) 30 | 31 | ZMON allows querying arbitrary CloudWatch metrics using the :ref:`cloudwatch() ` wrapper. 32 | 33 | Security Groups 34 | --------------- 35 | 36 | Depending on your AWS setup, you'll probably have to open particular ports/instances to access from ZMON. Using a limited set of ports to expose management APIs and the Prometheus node exporter will make your life easier. ZMON allows parsing of Prometheus metrics via the :ref:`http().prometheus() `. 37 | 38 | You can deploy ZMON into each of your AWS accounts to allow cross-team monitoring and dashboards. Make sure that your security groups allow ZMON to connect to port 9100 of your monitored instances. 39 | 40 | Misconfigured security groups mainly show up as checks returning no results at all, because the EC2 instance drops the packets rather than, e.g., refusing the connection. 41 | 42 | Low-Level or Basic Properties 43 | ----------------------------- 44 | 45 | EC2 Instances 46 | ============= 47 | 48 | Having enough **diskspace** on your instance is important; `here's a sample check`_. By default, you can only get space used from CloudWatch_. Using Amazon's own script, you can push free space to CloudWatch and pull this data via ZMON. Alternatively, you can run the `Prometheus Node exporter`_ to pull disk space data from the EC2 node itself via HTTP. 49 | 50 | Similarly, you can pull CPU-related metrics from CloudWatch. The Prometheus Node exporter also exposes these metrics. 51 | 52 | You also need enough available **INodes**.
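Inode usage can be derived from the same Prometheus Node exporter data. The following is only a minimal sketch: it relies on the exporter's ``node_filesystem_files`` and ``node_filesystem_files_free`` metrics, and it assumes, purely for illustration, that the parsed metrics arrive as a dict mapping each metric name to a list of ``(labels, value)`` pairs. ``SAMPLE_METRICS`` and ``inode_usage_percent`` are made-up names and not part of ZMON.

.. code-block:: python

    # Hypothetical sample data: metric name -> list of (labels, value) pairs.
    # This shape is an assumption for illustration, not ZMON's documented output.
    SAMPLE_METRICS = {
        'node_filesystem_files': [
            ({'mountpoint': '/'}, 1000000.0),
            ({'mountpoint': '/data'}, 500000.0),
        ],
        'node_filesystem_files_free': [
            ({'mountpoint': '/'}, 250000.0),
            ({'mountpoint': '/data'}, 450000.0),
        ],
    }


    def inode_usage_percent(metrics):
        """Return {mountpoint: percentage of inodes in use}."""
        total = {labels['mountpoint']: value
                 for labels, value in metrics.get('node_filesystem_files', [])}
        free = {labels['mountpoint']: value
                for labels, value in metrics.get('node_filesystem_files_free', [])}
        usage = {}
        for mountpoint, count in total.items():
            # Skip pseudo filesystems that report zero inodes.
            if count > 0 and mountpoint in free:
                used = count - free[mountpoint]
                usage[mountpoint] = round(100.0 * used / count, 1)
        return usage


    usage = inode_usage_percent(SAMPLE_METRICS)

An alert condition could then compare each percentage against a threshold of your choice.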
53 | 54 | Regarding **memory**, you can either query via CloudWatch, use Prometheus Node exporter to feed ZMON, or go with low-level ``snmp()`` [not recommended]. 55 | 56 | The following block shows *part* of EC2 instance entity properties: 57 | 58 | .. code-block:: yaml 59 | 60 | id: a-app-1-2QBrR1[aws:123456789:eu-west-1] 61 | type: instance 62 | aws_id: i-87654321 63 | created_by: agent 64 | host: 172.33.173.201 65 | infrastructure_account: aws:123456789 66 | instance_type: t2.medium 67 | ip: 172.33.173.201 68 | ports: 69 | '5432': 5432 70 | '8008': 8008 71 | region: eu-west-1 72 | 73 | An example check using :ref:`cloudwatch wrapper ` and entity properties would look like the following: 74 | 75 | .. code-block:: python 76 | 77 | cloudwatch().query_one({'InstanceId': entity['aws_id']}, 'CPUUtilization', 'Average', 'AWS/EC2', period=120) 78 | 79 | 80 | Elastic Load Balancers 81 | ====================== 82 | 83 | You can query AWS CloudWatch to get ELB-specific metrics. The ZMON agent will put data into the ELB entity, allowing you to monitor instance and healthy instance count. 84 | 85 | .. code-block:: yaml 86 | 87 | id: elb-a-app-1[aws:123456789:eu-west-1] 88 | type: elb 89 | elb_type: classic 90 | active_members: 1 91 | created_by: agent 92 | dns_name: internal-a-app-1.eu-west-1.elb.amazonaws.com 93 | host: internal-a-app-1.eu-west-1.elb.amazonaws.com 94 | infrastructure_account: aws:123456789 95 | members: 3 96 | region: eu-west-1 97 | scheme: internal 98 | 99 | ZMON AWS agent will detect both kinds of ELB: classic and application load balancers. Entities for both kinds will be created in ZMON with ``type:elb``. To distinguish between them in your checks, use the ``elb_type`` property, which holds either ``classic`` or ``application``. 100 | 101 | Since CloudWatch metrics differ for each ELB type, please check `CloudWatch ELB metrics`_ for detailed reference.
An example check using :ref:`Cloudwatch wrapper ` and entity properties would look like the following: 102 | 103 | .. code-block:: python 104 | 105 | # Classic ELB 106 | lb_name = entity['name'] 107 | key = 'LoadBalancerName' 108 | namespace = 'AWS/ELB' 109 | 110 | # Check if Application ELBv2 entity 111 | if entity.get('elb_type') == 'application': 112 | lb_name = entity['cloudwatch_name'] 113 | key = 'LoadBalancer' 114 | namespace = 'AWS/ApplicationELB' 115 | 116 | cloudwatch().query_one({key: lb_name}, 'RequestCount', 'Sum', namespace) 117 | 118 | .. note:: 119 | 120 | ELB entities contain a special flag ``dns_traffic`` which indicates whether the load balancer is actively serving traffic. 121 | 122 | Auto-Scaling Groups 123 | =================== 124 | 125 | ZMON's agent creates an auto-scaling group entity that provides you with the number of desired instances and the number of instances in a healthy state. This enables you to monitor whether the ASG actually works and hosts spawn into a productive state. 126 | 127 | .. code-block:: yaml 128 | 129 | id: asg-proxy-1[aws:123456789:eu-central-1] 130 | type: asg 131 | name: proxy-1 132 | created_by: agent 133 | desired_capacity: 2 134 | dns_traffic: 'true' 135 | dns_weight: 200 136 | infrastructure_account: aws:123456789 137 | instances: 138 | - aws_id: i-123456 139 | ip: 172.33.109.201 140 | - aws_id: i-654321 141 | ip: 172.33.109.202 142 | max_size: 4 143 | min_size: 2 144 | region: eu-central-1 145 | 146 | RDS Instances 147 | ============= 148 | 149 | ZMON AWS agent will detect RDS instances and store them as entities with type ``database``. 150 | 151 | .. code-block:: yaml 152 | 153 | id: rds-db-1[aws:123456789] 154 | type: database 155 | name: db-1 156 | created_by: agent 157 | engine: postgres 158 | host: db-1.rds.amazonaws.com 159 | infrastructure_account: aws:123456789 160 | port: 5432 161 | region: eu-west-1 162 | 163 | ..
code-block:: python 164 | 165 | cloudwatch().query_one({'DBInstanceIdentifier': entity['name']}, 'DatabaseConnections', 'Sum', 'AWS/RDS') 166 | 167 | ElastiCache Redis 168 | ================= 169 | 170 | Elasticache instances are stored as entities with type ``elc``. 171 | 172 | .. code-block:: yaml 173 | 174 | id: elc-redis-1[aws:123456789:eu-central-1] 175 | type: elc 176 | cluster_id: all-redis-001 177 | cluster_num_nodes: 1 178 | created_by: agent 179 | engine: redis 180 | host: redis-1.cache.amazonaws.com 181 | infrastructure_account: aws:123456789 182 | port: 6379 183 | region: eu-central-1 184 | 185 | IAM/ACM Certificates 186 | ==================== 187 | 188 | ZMON AWS agent will also sync IAM/ACM SSL certificates, with type ``certificate``. Certificate entities could be used to create an alert in case a certificate is about to expire for instance. 189 | 190 | .. code-block:: yaml 191 | 192 | id: cert-acm-example.org[aws:123456789:eu-central-1] 193 | type: certificate 194 | name: '*.example.org' 195 | status: ISSUED 196 | arn: arn:aws:acm:eu-central-1:123456789:certificate/123456-123456-123456-123456 197 | certificate_type: acm 198 | created_by: agent 199 | expiration: '2017-07-28T12:00:00+00:00' 200 | infrastructure_account: aws:123456789 201 | region: eu-central-1 202 | 203 | 204 | Application API Monitoring 205 | -------------------------- 206 | 207 | When monitoring an application, you'll usually want to check the number of received requests, latency patterns, and the number of returned status codes. 208 | These data points form a pretty clear picture of what is going on with the application. 209 | 210 | Additional metrics will help you find problems as well as opportunities for improvement. 211 | Assuming that your applications provide HTTP APIs hidden behind ELBs, you can use ZMON to gather this data from CloudWatch. 212 | 213 | For more detailed data, ZMON offers options for different languages and frameworks. 
214 | One is zmon-actuator_ for Spring Boot. 215 | ZMON gathers the data by querying a JSON endpoint ``/metrics`` adhering to the DropWizard metrics layout with some convention on the naming of timers. 216 | Basically, one timer per API path and status code. 217 | 218 | We also recommend checking out Friboo_ for working with Clojure, the Python/Flask framework Connexion_ or Markscheider_ for Play/Scala development. 219 | 220 | The :ref:`http(url=...).actuator_metrics() ` will parse the data into a Python dict that allows you to easily monitor and alert on changes in API behavior. 221 | 222 | This also drives ZMON's cloud UI. 223 | 224 | .. image:: /images/cloud1.png 225 | 226 | .. _appliance: https://github.com/zalando-zmon/zmon-appliance 227 | .. _CloudWatch ELB metrics: http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/elb-metricscollected.html 228 | .. _CloudWatch: https://aws.amazon.com/cloudwatch/ 229 | .. _Connexion: https://github.com/zalando/connexion 230 | .. _Friboo: https://github.com/zalando-stups/friboo 231 | .. _here's a sample check: https://github.com/zalando/zmon/tree/master/examples/check-definitions/11-ec2-diskspace.yaml 232 | .. _Prometheus Node exporter: https://github.com/prometheus/node_exporter 233 | .. _ZMON scheduler: https://github.com/zalando-zmon/zmon-scheduler 234 | .. _ZMON worker: https://github.com/zalando-zmon/zmon-worker 235 | .. _zmon-actuator: https://github.com/zalando-zmon/zmon-actuator 236 | .. _zmon-aws-agent: https://github.com/zalando-zmon/zmon-aws-agent 237 | .. _markscheider: https://github.com/zalando-incubator/markscheider 238 | -------------------------------------------------------------------------------- /docs/user/alert-definitions.rst: -------------------------------------------------------------------------------- 1 | ..
_alert-definitions: 2 | 3 | ***************** 4 | Alert Definitions 5 | ***************** 6 | 7 | Alert definitions specify when (condition, time period) and who (team) to notify for a desired monitoring event. 8 | Alert definitions can be defined in the ZMON web frontend and via the :ref:`ZMON CLI `. 9 | 10 | The following fields exist for alert definitions: 11 | 12 | name 13 | The alert's display name on the dashboard. 14 | This field can contain curly-brace variables like ``{mycapture}`` that are replaced by the capture's value when the alert is triggered. 15 | It's also possible to format decimal precision (e.g. "My alert ``{mycapture:.2f}``" would show as "My alert ``123.45``" if mycapture is ``123.456789``). 16 | To include a comma separated list of entities as part of the alert's name, just use the special placeholder ``{entities}``. 17 | 18 | description 19 | Meaningful text for people trying to handle the alert, e.g. incident support. 20 | 21 | priority 22 | The alert's dashboard priority. This defines color and sort order on the dashboard. 23 | 24 | condition 25 | Valid Python expression to return true when alert should be triggered. 26 | 27 | parameters 28 | You may apply parameters to your alert condition using variables. More details :ref:`here ` 29 | 30 | entities filter 31 | Additional filter to apply the alert definition only to a subset of entities. 32 | 33 | notifications 34 | List of :ref:`notification ` commands, e.g. to send out emails. 35 | 36 | time_period 37 | Notification time period. 38 | 39 | team 40 | Team dashboard to show alert on. 41 | 42 | responsible_team 43 | Additional team field to allow delegating alert monitoring to other teams. 44 | The responsible team's name will be shown on the dashboard. 45 | 46 | status 47 | Alerts will only be triggered if status is "ACTIVE". 48 | 49 | template 50 | A template is an alert definition that is not evaluated and can only be used for extension. More details :ref:`here ` 51 | 52 | ..
_alert-condition: 53 | 54 | Condition 55 | --------- 56 | Simple expressions can start directly with an operator. To trigger an alert if the check result value is larger than zero: 57 | 58 | .. code-block:: python 59 | 60 | > 0 61 | 62 | You can use the ``value`` variable to create more complex conditions: 63 | 64 | .. code-block:: python 65 | 66 | value >= 10 and value <= 100 67 | 68 | Some more examples of valid conditions: 69 | 70 | .. code-block:: python 71 | 72 | == 'OK' 73 | != False 74 | value in ('banana', 'apple') 75 | 76 | If the value already is a dictionary (hash map), we can apply all the Python magic to it: 77 | 78 | .. code-block:: python 79 | 80 | ['mykey'] > 100 # check a specific dict value 81 | 'error-message' in value # trigger alert if key is present 82 | not empty([ k for k, v in value.items() if v > 100 ]) # trigger alert if some dict value is > 100 83 | 84 | .. _captures: 85 | 86 | Captures 87 | -------- 88 | 89 | You can capture intermediate results in alert conditions by using the 90 | ``capture`` function. This allows easier debugging of complex alert 91 | conditions. 92 | 93 | .. code-block:: python 94 | 95 | capture(value["a"]/value["b"]) > 0 96 | capture(myval=value["a"]/value["b"]) > 0 97 | any([capture(foo=FOO) > 10, capture(bar=BAR) > 10]) 98 | 99 | Please refer to Recipes section in :ref:`Python Tutorial ` for some Python tricks you may use. 100 | 101 | Named captures can be used to customize the alert display on the :term:`dashboard` by using template substitution in the alert name. 102 | 103 | If you call your capture *dashboard*, it will be used on dashboard next to entity name instead of entity value. 
104 | For example, if you have a host-based alert that fails on z-host1 and z-host2, you would normally see something like this: 105 | 106 | ALERT TITLE (N) 107 | z-host1 (value1), z-host2 (value2) 108 | 109 | Once you introduce a capture called *dashboard*, you will get something like 110 | 111 | ALERT TITLE (N) 112 | z-host1 (capturevalue1), z-host2 (capturevalue2) 113 | 114 | where capturevalue1 is the value of the "dashboard" capture evaluated against z-host1. 115 | 116 | Example alert condition (based on PF/System check for diskspace) 117 | 118 | .. code-block:: python 119 | 120 | "ERROR" not in value 121 | and 122 | capture(dashboard=(lambda d: '{}:{}'.format(d.keys()[0], d[d.keys()[0]]['percentage_space_used']) if d else d)(dict((k, v) for k,v in value.iteritems() if v.get('percentage_space_used', 0) >= 90))) 123 | 124 | Entity (Exclude) Filter 125 | ----------------------- 126 | 127 | The :ref:`check definition ` already defines on what entities the checks should run. 128 | Usually the check definition's ``entities`` are broader than you want. 129 | A diskspace check might be defined for all hosts, but you want to trigger alerts only for hosts you are interested in. 130 | The alert definition's ``entities`` field lets you filter entities by their attributes. 131 | 132 | See :ref:`entities` for details on supported entities and their attributes. 133 | 134 | Note: The entity name can be included in the alert message by using the special placeholder ``{entities}`` in the alert name. 135 | 136 | Notifications 137 | ------------- 138 | 139 | ZMON notifications let you know when you have a new alert without checking the web UI. 140 | This section will explain how to use the different options available to notify about changes in alert states. 141 | We support E-Mail, HipChat, Slack and one SMS provider that we have been using.
142 | 143 | The notifications field is a list of function calls (see below for examples), calling one of the following methods of notification: 144 | 145 | .. py:function:: send_email(email*, [subject, message, repeat]) 146 | .. py:function:: send_sms(number*, [message, repeat]) 147 | .. py:function:: send_push([message, repeat, url, key]) 148 | .. py:function:: send_slack([channel, message, repeat, token]) 149 | .. py:function:: send_hipchat([room, message, color='red', repeat, token, message_format='html', notify=False]) 150 | 151 | If the alert has the top priority and should be handled immediately, you can specify the repeat interval for each notification. 152 | In this case, you will be notified periodically, according to the specified interval, while the alert persists. 153 | The interval is specified in seconds. 154 | 155 | To receive push notifications you need one of the ZMON mobile apps (configured for your deployment) and subscribe to alert ids, before you can receive notifications. 156 | 157 | In addition, you may use :ref:`notification-groups` to configure groups of people with associated **emails** and/or **phone numbers** and use these groups in notifications like this: 158 | 159 | Example JSON email and SMS configuration using groups: 160 | 161 | .. code-block:: yaml 162 | 163 | [ 164 | "send_sms('active:2nd-database')", 165 | "send_email('group:2nd-database')" 166 | ] 167 | 168 | In the above example you send SMS to **active** member of **2nd-database** group and send email to **all members** of the group. 169 | 170 | Example JSON email configuration: 171 | 172 | .. code-block:: yaml 173 | 174 | [ 175 | "send_mail('a@example.org', 'b@example.org')", 176 | "send_mail('a@example.com', 'b@example.com', subject='Critical Alert please do something!')", 177 | "send_mail('c@example.com', repeat=60)" 178 | ] 179 | 180 | Example JSON Slack configuration: 181 | 182 | .. 
code-block:: yaml 183 | 184 | [ 185 | "send_slack()", 186 | "send_slack(channel='#incidents')", 187 | "send_slack(channel='#incidents', token='your-token')" 188 | ] 189 | 190 | Example JSON HipChat configuration: 191 | 192 | .. code-block:: yaml 193 | 194 | [ 195 | "send_hipchat()", 196 | "send_hipchat(room='#incidents', color='red')", 197 | "send_hipchat(room='#incidents', token='your-token')", 198 | "send_hipchat(room='#incidents', token='your-token', notify=True)", 199 | "send_hipchat(room='#incidents', token='your-token', notify=True, message='@here Plz check it', message_format='text')" 200 | ] 201 | 202 | Example JSON Push configuration: 203 | 204 | .. code-block:: yaml 205 | 206 | [ 207 | "send_push()" 208 | ] 209 | 210 | Example JSON SMS configuration: 211 | 212 | .. code-block:: yaml 213 | 214 | [ 215 | "send_sms('0049123555555', '0123111111')", 216 | "send_sms('0049123555555', '0123111111', message='Critical Alert please do something!')", 217 | "send_sms('0029123555556', repeat=300)" 218 | ] 219 | 220 | Example email: 221 | 222 | :: 223 | 224 | From: ZMON 225 | Date: 2014-05-28 18:37 GMT+01:00 226 | Subject: NEW ALERT: Low Orders/m: 84.9% of last weeks on GLOBAL 227 | To: Undisclosed Recipients 228 | 229 | New alert on GLOBAL: Low Orders/m: {percentage_wow:.1f}% of last weeks 230 | 231 | 232 | Current value: {'2w_ago': 188.8, 'now': 180.8, '1w_ago': 186.6, '3w_ago': 196.4, '4w_ago': 208.8} 233 | 234 | 235 | Captures: 236 | 237 | percentage_wow: 184.9185496584 238 | 239 | last_weeks_avg: 195.15 240 | 241 | 242 | 243 | Alert Definition 244 | Name (ID): Low Orders/m: {percentage_wow:.1f}% of last weeks (ID: 190) 245 | Priority: 1 246 | Check ID: 203 247 | Condition capture(percentage_wow=100. * value['now']/capture(last_weeks_avg=(value['1w_ago'] + value['2w_ago'] + value['3w_ago'] + value['4w_ago'])/4. )) < 85 248 | Team: Platform/Software 249 | Resp. 
Team: Platform/Software 250 | Notifications: [u"send_mail('example@example.com')"] 251 | 252 | Entity 253 | 254 | id: GLOBAL 255 | 256 | type: GLOBAL 257 | 258 | percentage_wow: 184.9185496584 259 | 260 | last_weeks_avg: 195.15 261 | 262 | Example SMS: 263 | 264 | :: 265 | 266 | Message details: 267 | Type: Text Message 268 | From: zmon2 269 | Message text: 270 | NEW ALERT: DB instances test alert on all shards on customer-integration-master 271 | 272 | 273 | .. _time-periods: 274 | 275 | Time periods 276 | ------------ 277 | 278 | ZMON 2.0 allows specifying time periods (in UTC) in alert definitions. 279 | When specified, user will be notified about the alert only when it occurs during given period. 280 | Examples below cover most common use cases of time periods’ definitions. 281 | 282 | To specify a time period from Monday through Friday, 9:00 to 17:00, use a 283 | period such as 284 | 285 | wd {Mon-Fri} hr {9-16} 286 | 287 | When specifying a range by using -, it is best to think of - as meaning through. 288 | It is 9:00 through 16:00, which is just before 17:00 (16:59:59). 
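The inclusive meaning of ``-`` can be illustrated with a small Python sketch. This is not how the timeperiod module is implemented; ``in_business_hours`` is a hypothetical helper that only mimics ``wd {Mon-Fri} hr {9-16}`` for a single ``datetime`` value, which is assumed to be in UTC already.

```python
from datetime import datetime


def in_business_hours(moment):
    """Mimic ``wd {Mon-Fri} hr {9-16}`` for one datetime (hypothetical helper)."""
    is_weekday = moment.weekday() <= 4        # Monday is 0, Friday is 4
    is_working_hour = 9 <= moment.hour <= 16  # both ends are inclusive
    return is_weekday and is_working_hour


# 2018-01-08 is a Monday. Hour 16 still matches, hour 17 does not.
print(in_business_hours(datetime(2018, 1, 8, 16, 59, 59)))  # True
print(in_business_hours(datetime(2018, 1, 8, 17, 0, 0)))    # False
```

Because the comparison only looks at the hour component, every second from 16:00:00 to 16:59:59 matches, which is why ``hr {9-16}`` ends just before 17:00.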
289 | 290 | To specify a time period from Monday through Friday, 9:00 to 17:00 on Monday, Wednesday, and Friday, and 9:00 to 15:00 on Tuesday and Thursday, use a period such as 291 | 292 | wd {Mon Wed Fri} hr {9-16}, wd{Tue Thu} hr {9-14} 293 | 294 | To specify a time period that extends Mon-Fri 9-16, but alternates weeks in a month, use a period such as 295 | 296 | wk {1 3 5} wd {Mon Wed Fri} hr {9-16} 297 | 298 | A period that specifies winter in the northern hemisphere: 299 | 300 | mo {Nov-Feb} 301 | 302 | This is equivalent to the previous example: 303 | 304 | mo {Jan-Feb Nov-Dec} 305 | 306 | As is 307 | 308 | mo {jan feb nov dec} 309 | 310 | And this is too: 311 | 312 | mo {Jan Feb}, mo {Nov Dec} 313 | 314 | To specify a period that describes every other half-hour, use something like: 315 | 316 | minute { 0-29 } 317 | 318 | To specify the morning, use 319 | 320 | hour { 0-11 } 321 | 322 | Remember, 11 is not 11:00:00, but rather 11:00:00 - 11:59:59. 323 | 324 | 5 second blocks: 325 | 326 | sec {0-4 10-14 20-24 30-34 40-44 50-54} 327 | 328 | To specify every first half-hour on alternating week days, and the 329 | second half-hour the rest of the week, use the period 330 | 331 | wd {1 3 5 7} min {0-29}, wd {2 4 6} min {30-59} 332 | 333 | For more examples and syntax reference, please refer to this `documentation `_. 334 | Note that suffixes like `am` or `pm` for hours are **not** supported, only 335 | integers between 0 and 23. If in doubt, try calling `in_period` from Python with your period definition, 336 | like 337 | 338 | .. code-block:: python 339 | 340 | from timeperiod import in_period 341 | in_period('hr { 0 - 23 }') 342 | 343 | This should not throw an exception. The 344 | timeperiod module in use is `timeperiod2 `_. 345 | The `in_period` function accepts a second parameter, which is a 346 | `datetime `_, like 347 | 348 | .. 
code-block:: python 349 | 350 | from datetime import datetime 351 | from timeperiod import in_period 352 | in_period('hr { 7 - 23 }', datetime(2018, 1, 8, 2, 15)) # check 2018-01-08 02:15:00 353 | 354 | 355 | .. include:: alert-definition-inheritance.rst 356 | .. include:: alert-definition-parameters.rst 357 | .. include:: downtimes.rst 358 | .. include:: comments.rst 359 | -------------------------------------------------------------------------------- /docs/installation/configuration.rst: -------------------------------------------------------------------------------- 1 | *********************** 2 | Component Configuration 3 | *********************** 4 | 5 | In this section we assume that you want to use Docker as the means of deployment. 6 | The ZMON Docker images in Zalando's Open Source registry are exactly the ones we use ourselves, injecting all configuration via environment variables. 7 | 8 | If this does not fit your needs you can run the artifacts directly and decide to use environment variables or modify the example config files. 9 | 10 | At this point we also assume the requirements in terms of PostgreSQL, Redis and KairosDB are available and you have the credentials at hand. 11 | If not, see :ref:`requirements`. The minimal configuration options below are taken from the Demo's Bootstrap_ script! 12 | 13 | Authentication 14 | ============== 15 | 16 | For the ZMON controller we assume that it is publicly accessible. 17 | Thus the UI always requires users to log in, and the REST API requires authentication, too. 18 | The REST API relies on tokens via the ``Authorization: Bearer `` header to allow access. 19 | For environments where you have no OAuth2 setup you can configure pre-shared keys for API access. 20 | 21 | .. note:: 22 | 23 | Feel free to look at Zalando's `Plan-B `_, which is a freely available OAuth2 provider we use for our platform to secure service-to-service communication.
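As a rough illustration of the bearer-token header described above, the following Python sketch builds a request object with the standard library. The URL and the token value are placeholders chosen for this example, not values taken from a real deployment, and ``build_authorized_request`` is a hypothetical helper. Nothing is sent over the network.

```python
import urllib.request


def build_authorized_request(url, token):
    """Create a request object that carries the token as a Bearer credential."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Bearer {}".format(token))
    return request


# Placeholder URL and token, for illustration only.
request = build_authorized_request("https://zmon.example.org/api/v1/entities", "MYSECRETTOKEN")
print(request.get_header("Authorization"))  # Bearer MYSECRETTOKEN
```

The same header works for OAuth2 access tokens and for the pre-shared keys mentioned above; only the token value differs.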
24 | 25 | A preshared token can be created like this and then added to the Controller configuration. 26 | 27 | .. code-block:: bash 28 | 29 | SCHEDULER_TOKEN=$(makepasswd --string=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --chars 32) 30 | 31 | .. warning:: 32 | 33 | Due to magic in matching env vars, the token must be ALL UPPERCASE. 34 | 35 | The scheduler and worker both call the controller's REST API at times, so you need to configure tokens for them. 36 | For the scheduler, KairosDB, eventlog-service and metric-cache (if deployed), we assume for now that they are private. 37 | These services are accessed only by the worker and controller and do not need to be public. 38 | The same is true for Redis, PostgreSQL and Cassandra. 39 | However, in general we advise you to set up proper credentials and roles where possible. 40 | 41 | Running Docker 42 | ============== 43 | 44 | First we need to figure out what tags to run. 45 | The bash snippet below helps you retrieve and set the latest available tags. 46 | 47 | .. code-block:: bash 48 | 49 | function get_latest () { 50 | name=$1 51 | # REST API returns tags sorted by time 52 | tag=$(curl --silent https://registry.opensource.zalan.do/teams/stups/artifacts/$name/tags | jq -r '.[].name' | tail -n 1) 53 | echo "$name:$tag" 54 | } 55 | 56 | echo "Retrieving latest versions.." 57 | REPO=registry.opensource.zalan.do/stups 58 | POSTGRES_IMAGE=$REPO/postgres:9.4.5-1 59 | REDIS_IMAGE=$REPO/redis:3.2.0-alpine 60 | CASSANDRA_IMAGE=$REPO/cassandra:2.1.5-1 61 | ZMON_KAIROSDB_IMAGE=$REPO/$(get_latest kairosdb) 62 | ZMON_EVENTLOG_SERVICE_IMAGE=$REPO/$(get_latest zmon-eventlog-service) 63 | ZMON_CONTROLLER_IMAGE=$REPO/$(get_latest zmon-controller) 64 | ZMON_SCHEDULER_IMAGE=$REPO/$(get_latest zmon-scheduler) 65 | ZMON_WORKER_IMAGE=$REPO/$(get_latest zmon-worker) 66 | ZMON_METRIC_CACHE=$REPO/$(get_latest zmon-metric-cache) 67 | 68 | To run the selected images use Docker's run command together with the options explained below.
69 | We use the following wrapper for this: 70 | 71 | .. code-block:: bash 72 | 73 | function run_docker () { 74 | name=$1 75 | shift 1 76 | echo "Starting Docker container ${name}.." 77 | # ignore non-existing containers 78 | docker kill $name &> /dev/null || true 79 | docker rm -f $name &> /dev/null || true 80 | docker run --restart "on-failure:10" --net zmon-demo -d --name $name $@ 81 | } 82 | 83 | run_docker zmon-controller \ 84 | # -e ......... \ 85 | # -e ......... \ 86 | $ZMON_CONTROLLER_IMAGE 87 | 88 | Controller 89 | ========== 90 | 91 | Authentication 92 | ^^^^^^^^^^^^^^ 93 | 94 | Configure your Github application 95 | 96 | .. code-block:: bash 97 | 98 | -e SPRING_PROFILES_ACTIVE=github \ 99 | -e ZMON_OAUTH2_SSO_CLIENT_ID=64210244ddd8378699d6 \ 100 | -e ZMON_OAUTH2_SSO_CLIENT_SECRET=48794a58705d1ba66ec9b0f06a3a44ecb273c048 \ 101 | 102 | Make everyone admin for now: 103 | 104 | .. code-block:: bash 105 | 106 | -e ZMON_AUTHORITIES_SIMPLE_ADMINS=* \ 107 | 108 | 109 | Logout URL 110 | ^^^^^^^^^^ 111 | 112 | When switching to TV Mode, you can use this to enable the Pop-up dialog described in 113 | :doc:`/user/tv-login` which opens the Logout URL in a new Tab to terminate the user's session. 114 | 115 | .. code-block:: bash 116 | 117 | -e ZMON_LOGOUT_URL="https://example.com/logout" 118 | 119 | Dependencies 120 | ^^^^^^^^^^^^ 121 | 122 | Configure PostgreSQL access: 123 | 124 | .. code-block:: bash 125 | 126 | -e POSTGRES_URL=jdbc:postgresql://$PGHOST:5432/local_zmon_db \ 127 | -e POSTGRES_PASSWORD=$PGPASSWORD \ 128 | 129 | Setup Redis connection: 130 | 131 | .. code-block:: bash 132 | 133 | -e REDIS_HOST=zmon-redis \ 134 | -e REDIS_PORT=6379 \ 135 | 136 | Set CORS allowed origins: 137 | 138 | .. code-block:: bash 139 | 140 | -e ENDPOINTS_CORS_ALLOWED_ORIGINS=https://demo.zmon.io \ 141 | 142 | Setup URLs for other services: 143 | 144 | .. 
code-block:: bash 145 | 146 | -e ZMON_EVENTLOG_URL=http://zmon-eventlog-service:8081/ \ 147 | -e ZMON_KAIROSDB_URL=http://zmon-kairosdb:8083/ \ 148 | -e ZMON_METRICCACHE_URL=http://zmon-metric-cache:8086/ \ 149 | -e ZMON_SCHEDULER_URL=http://zmon-scheduler:8085/ \ 150 | 151 | And last but not least, configure a preshared token to allow the scheduler and worker to access the REST API. Remember that tokens need to be all uppercase here. 152 | 153 | .. code-block:: bash 154 | 155 | -e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_UID=zmon-scheduler \ 156 | -e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_EXPIRES_AT=1758021422 \ 157 | -e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_AUTHORITY=user 158 | 159 | Firebase and Webpush 160 | ^^^^^^^^^^^^^^^^^^^^ 161 | 162 | Enable desktop push notification UI with the following options: 163 | 164 | .. code-block:: bash 165 | 166 | -e ZMON_ENABLE_FIREBASE=true \ 167 | -e ZMON_NOTIFICATIONSERVICE_URL=http://zmon-notification-service:8087/ \ 168 | -e ZMON_FIREBASE_API_KEY="AIzaSyBM1ktKS5u_d2jxWPHVU7Xk39s-PG5gy7c" \ 169 | -e ZMON_FIREBASE_AUTH_DOMAIN="zmon-demo.firebaseapp.com" \ 170 | -e ZMON_FIREBASE_DATABASE_URL="https://zmon-demo.firebaseio.com" \ 171 | -e ZMON_FIREBASE_STORAGE_BUCKET="zmon-demo.appspot.com" \ 172 | -e ZMON_FIREBASE_MESSAGING_SENDER_ID="280881042812" \ 173 | 174 | This feature requires additional configuration for the worker and a running notification-service. 175 | 176 | Scheduler 177 | ========= 178 | 179 | Specify the Redis server you want to use: 180 | 181 | .. code-block:: bash 182 | 183 | -e SCHEDULER_REDIS_HOST=zmon-redis \ 184 | -e SCHEDULER_REDIS_PORT=6379 \ 185 | 186 | Set up access to the controller and entity service (both provided by the controller): 187 | Note the reuse of the pre-shared key defined above! 188 | 189 | .. 
code-block:: bash 190 | 191 | -e SCHEDULER_OAUTH2_STATIC_TOKEN=$SCHEDULER_TOKEN \ 192 | -e SCHEDULER_URLS_WITHOUT_REST=true \ 193 | -e SCHEDULER_ENTITY_SERVICE_URL=http://zmon-controller:8080/ \ 194 | -e SCHEDULER_CONTROLLER_URL=http://zmon-controller:8080/ \ 195 | 196 | If you need different queues or different levels of parallelism, e.g. to limit the number of queries run against MySQL/PostgreSQL databases, use the following as an example: 197 | 198 | .. code-block:: bash 199 | 200 | -e SPRING_APPLICATION_JSON='{"scheduler":{"queue_property_mapping":{"zmon:queue:mysql":[{"type":"mysql"}]}}}' 201 | 202 | This will route checks against entities of type "mysql" to another queue. 203 | 204 | Worker 205 | ====== 206 | 207 | The worker configuration is split into essential configuration options, like Redis and KairosDB, and the plugin configuration, e.g. PostgreSQL credentials. 208 | 209 | Essential Options 210 | ^^^^^^^^^^^^^^^^^ 211 | 212 | Configure Redis access: 213 | 214 | .. code-block:: bash 215 | 216 | -e WORKER_REDIS_SERVERS=zmon-redis:6379 \ 217 | 218 | Configure parallelism and throughput: 219 | 220 | .. code-block:: bash 221 | 222 | -e WORKER_ZMON_QUEUES=zmon:queue:default/25,zmon:queue:mysql/3 223 | 224 | Specify the number of worker processes that poll the queues and execute tasks. 225 | You can specify multiple queues to listen to here. 226 | 227 | Configure KairosDB: 228 | 229 | .. code-block:: bash 230 | 231 | -e WORKER_KAIROSDB_HOST=zmon-kairosdb \ 232 | 233 | Configure EventLog service: 234 | 235 | .. code-block:: bash 236 | 237 | -e WORKER_EVENTLOG_HOST=zmon-eventlog-service \ 238 | -e WORKER_EVENTLOG_PORT=8081 \ 239 | 240 | Configure the worker token to access the controller API (relying on the Python tokens library here): 241 | 242 | .. code-block:: bash 243 | 244 | -e OAUTH2_ACCESS_TOKENS=uid=$WORKER_TOKEN \ 245 | 246 | Configure worker named tokens to access external APIs: 247 | 248 | .. 
code-block:: bash 249 | 250 | -e WORKER_PLUGIN_HTTP_OAUTH2_TOKENS=token_name1=scope1,scope2,scope3:token_name2=scope1,scope2 251 | 252 | Configure Metric Cache (optional): 253 | 254 | .. code-block:: bash 255 | 256 | -e WORKER_METRICCACHE_URL=http://zmon-metric-cache:8086/api/v1/rest-api-metrics/ \ 257 | -e WORKER_METRICCACHE_CHECK_ID=9 \ 258 | 259 | .. _notification-options-label: 260 | 261 | Notification Options 262 | ^^^^^^^^^^^^^^^^^^^^ 263 | 264 | Firebase and Webpush 265 | -------------------- 266 | To trigger notifications for desktop web and mobile apps set the following params to point to notification service. 267 | 268 | ``WORKER_NOTIFICATION_SERVICE_URL`` 269 | Notification service base url 270 | 271 | ``WORKER_NOTIFICATION_SERVICE_KEY`` 272 | (optional, if not using oauth2) A shared key configured in the notification service 273 | 274 | 275 | Hipchat 276 | ------- 277 | ``WORKER_NOTIFICATIONS_HIPCHAT_TOKEN`` 278 | Access token for HipChat notifications. 279 | ``WORKER_NOTIFICATIONS_HIPCHAT_URL`` 280 | URL of HipChat server. 281 | 282 | HTTP 283 | ---- 284 | 285 | This allows to trigger HTTP Post calls to arbitrary services. 286 | 287 | ``WORKER_NOTIFICATIONS_HTTP_DEFAULT_URL`` 288 | HTTP endpoint default URL. 289 | ``WORKER_NOTIFICATIONS_HTTP_WHITELIST_URLS`` 290 | List of whitelist URL endpoints. If URL is not in this list, then exception will be raised. 291 | ``WORKER_NOTIFICATIONS_HTTP_ALLOW_ALL`` 292 | Allow any URL to be used in HTTP notification. 293 | ``WORKER_NOTIFICATIONS_HTTP_HEADERS`` 294 | Default headers to be used in HTTP requests. 295 | 296 | Mail 297 | ---- 298 | ``WORKER_NOTIFICATIONS_MAIL_HOST`` 299 | SMTP host for email notifications. 300 | ``WORKER_NOTIFICATIONS_MAIL_PORT`` 301 | SMTP port for email notifications. 302 | ``WORKER_NOTIFICATIONS_MAIL_SENDER`` 303 | Sender address for email notifications. 304 | ``WORKER_NOTIFICATIONS_MAIL_USER`` 305 | SMTP user for email notifications. 
306 | ``WORKER_NOTIFICATIONS_MAIL_PASSWORD`` 307 | SMTP password for email notifications. 308 | 309 | Slack 310 | ----- 311 | ``WORKER_NOTIFICATIONS_SLACK_WEBHOOK`` 312 | Slack webhook for channel notifications. 313 | 314 | Twilio 315 | ------ 316 | ``WORKER_NOTIFICATION_SERVICE_URL`` 317 | URL of notification service (needs to be publicly accessible) 318 | ``WORKER_NOTIFICATION_SERVICE_KEY`` 319 | (optional, if not using oauth2) Preshared key to call notification service 320 | 321 | Pagerduty 322 | --------- 323 | ``WORKER_NOTIFICATIONS_PAGERDUTY_SERVICEKEY`` 324 | Routing key for a Pagerduty service 325 | 326 | 327 | 328 | Plug-In Options 329 | --------------- 330 | 331 | All plug-in options have the prefix ``WORKER_PLUGIN__``, i.e. if you want to set option "bar" of the plugin "foo" to "123" via environment variable: 332 | 333 | .. code-block:: bash 334 | 335 | WORKER_PLUGIN_FOO_BAR=123 336 | 337 | If you plan to access your PostgreSQL cluster specify the credentials below. We suggest to use a distinct user for ZMON with limited read only privileges. 338 | 339 | .. code-block:: bash 340 | 341 | WORKER_PLUGIN_SQL_USER 342 | WORKER_PLUGIN_SQL_PASS 343 | 344 | If you need to access MySQL specify the user credentials below, again we suggest to use a user with limited privileges only. 345 | 346 | .. code-block:: bash 347 | 348 | WORKER_PLUGIN_MYSQL_USER 349 | WORKER_PLUGIN_MYSQL_PASS 350 | 351 | 352 | .. _Bootstrap: https://github.com/zalando-zmon/zmon-demo 353 | 354 | 355 | Notification Service 356 | ==================== 357 | 358 | Optional component to service mobile API, push notifications and Twilio notifications. 359 | 360 | Authentication 361 | ^^^^^^^^^^^^^^ 362 | 363 | ``SPRING_APPLICATION_JSON`` 364 | Use this to define pre-shared keys if not using OAuth2. Specify key and max validity. 365 | 366 | .. 
code-block:: json 367 | 368 | {"notifications":{"shared_keys":{"": 1504981053654}}} 369 | 370 | 371 | Firebase and Web Push 372 | ^^^^^^^^^^^^^^^^^^^^^ 373 | 374 | ``NOTIFICATIONS_GOOGLE_PUSH_SERVICE_API_KEY`` 375 | Private Firebase messaging server key 376 | 377 | ``NOTIFICATIONS_ZMON_URL`` 378 | ZMON's base URL 379 | 380 | 381 | Twilio options 382 | ^^^^^^^^^^^^^^ 383 | 384 | ``NOTIFICATIONS_TWILIO_API_KEY`` 385 | Private API Key 386 | ``NOTIFICATIONS_TWILIO_USER`` 387 | User 388 | ``NOTIFICATIONS_TWILIO_PHONE_NUMBER`` 389 | Phone number to use 390 | ``NOTIFICATIONS_DOMAIN`` 391 | Domain under which notification service is reachable 392 | --------------------------------------------------------------------------------