├── docs
    ├── requirements.txt
    ├── .gitignore
    ├── images
    │   ├── cloud1.png
    │   ├── dashboard.png
    │   ├── entities.png
    │   ├── tv-mode.png
    │   ├── dashboard1.png
    │   ├── switch-tv-mode.png
    │   ├── grafana-example1.png
    │   └── tv-mode-logout-dialog.png
    ├── user
    │   ├── check-ref
    │   │   ├── mongodb_wrapper.rst
    │   │   ├── tcp_wrapper.rst
    │   │   ├── ping_wrapper.rst
    │   │   ├── jmx_wrapper.rst
    │   │   ├── dns_wrapper.rst
    │   │   ├── counter_wrapper.rst
    │   │   ├── eventlog_wrapper.rst
    │   │   ├── cassandra_wrapper.rst
    │   │   ├── ebs_wrapper.rst
    │   │   ├── ldap_wrapper.rst
    │   │   ├── zomcat_wrapper.rst
    │   │   ├── entities_wrapper.rst
    │   │   ├── datapipeline_wrapper.rst
    │   │   ├── memcached_wrapper.rst
    │   │   ├── history_wrapper.rst
    │   │   ├── s3_wrapper.rst
    │   │   ├── scalyr_wrapper.rst
    │   │   ├── kairosdb_wrapper.rst
    │   │   ├── snmp_wrapper.rst
    │   │   ├── elastic_search_wrapper.rst
    │   │   ├── http_wrapper.rst
    │   │   ├── appdynamics_wrapper.rst
    │   │   ├── redis_wrapper.rst
    │   │   ├── sql_wrappers.rst
    │   │   ├── cloudwatch_wrapper.rst
    │   │   └── kubernetes_wrapper.rst
    │   ├── notifications
    │   │   ├── hubot.rst
    │   │   ├── push.rst
    │   │   ├── slack.rst
    │   │   ├── twilio.rst
    │   │   ├── google_hangouts_chat.rst
    │   │   ├── mail.rst
    │   │   ├── pagerduty.rst
    │   │   ├── opsgenie.rst
    │   │   ├── hipchat.rst
    │   │   └── http.rst
    │   ├── notifications.rst
    │   ├── downtimes.rst
    │   ├── comments.rst
    │   ├── check-definitions.rst
    │   ├── alert-definition-inheritance.rst
    │   ├── entities.rst
    │   ├── tv-login.rst
    │   ├── alert-definition-parameters.rst
    │   ├── grafana.rst
    │   ├── monitoringonaws.rst
    │   └── alert-definitions.rst
    ├── developer
    │   ├── tests.rst
    │   ├── zmon-python-client.rst
    │   ├── redis.rst
    │   ├── zmon-cli.rst
    │   └── python-tutorial.rst
    ├── index.rst
    ├── installation
    │   ├── requirements.rst
    │   ├── components.rst
    │   └── configuration.rst
    ├── apendix
    │   └── glossary.rst
    ├── getting-started.rst
    ├── Makefile
    ├── conf.py
    └── intro.rst
├── .zappr.yaml
└── README.rst


/docs/requirements.txt:
--------------------------------------------------------------------------------
1 | zmon-cli>=1.0.59
2 | 


--------------------------------------------------------------------------------
/docs/.gitignore:
--------------------------------------------------------------------------------
1 | .*.un~
2 | _build
3 | *.swp
4 | 


--------------------------------------------------------------------------------
/.zappr.yaml:
--------------------------------------------------------------------------------
1 | X-Zalando-Team: zmon
2 | X-Zalando-Type: doc
3 | 


--------------------------------------------------------------------------------
/docs/images/cloud1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/cloud1.png


--------------------------------------------------------------------------------
/docs/images/dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/dashboard.png


--------------------------------------------------------------------------------
/docs/images/entities.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/entities.png


--------------------------------------------------------------------------------
/docs/images/tv-mode.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/tv-mode.png


--------------------------------------------------------------------------------
/docs/images/dashboard1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/dashboard1.png


--------------------------------------------------------------------------------
/docs/images/switch-tv-mode.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/switch-tv-mode.png


--------------------------------------------------------------------------------
/docs/images/grafana-example1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/grafana-example1.png


--------------------------------------------------------------------------------
/docs/images/tv-mode-logout-dialog.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zalando-zmon/zmon-docs/HEAD/docs/images/tv-mode-logout-dialog.png


--------------------------------------------------------------------------------
/docs/user/check-ref/mongodb_wrapper.rst:
--------------------------------------------------------------------------------
 1 | MongoDB
 2 | -------
 3 | 
 4 | Provides access to a MongoDB cluster
 5 | 
 6 | .. py:function:: mongodb(host, port=27017)
 7 | 
 8 | Methods of MongoDB
 9 | ^^^^^^^^^^^^^^^^^^
10 | 
11 | .. py:function:: find(database, collection, query)
12 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/tcp_wrapper.rst:
--------------------------------------------------------------------------------
 1 | TCP
 2 | ---
 3 | 
 4 | This function opens a TCP connection to a host on a given port. If the
 5 | connection succeeds, it returns ‘OK’. The host can be provided directly for global checks or resolved from
 6 | the entity filter. Assuming that we have an entity filter type=host, the
 7 | example below will try to connect to every host on port 22::
 8 | 
 9 |     tcp().open(22)
10 | 
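As a rough illustration of what such a check does for a single host, a connection attempt can be sketched with Python's standard ``socket`` module. The helper name ``tcp_open``, its ``timeout`` parameter, and the error string are assumptions made for this sketch; they are not the worker's actual implementation.

```python
import socket


def tcp_open(host, port, timeout=1.0):
    """Return 'OK' if a TCP connection to host:port succeeds, else an error string."""
    try:
        # create_connection performs the TCP handshake; for a probe we close it right away.
        with socket.create_connection((host, port), timeout=timeout):
            return 'OK'
    except OSError as exc:
        # Hypothetical failure value; the real wrapper may report failures differently.
        return 'ERROR: {}'.format(exc)
```

Calling ``tcp_open('127.0.0.1', 22)`` would report whether something accepts connections on the local SSH port.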


--------------------------------------------------------------------------------
/docs/user/notifications/hubot.rst:
--------------------------------------------------------------------------------
 1 | Hubot
 2 | -----
 3 | 
 4 | Send Hubot notification.
 5 | 
 6 | .. py:function:: notify_hubot(queue, hubot_url, message=None)
 7 | 
 8 |     Send Hubot notification.
 9 | 
10 |     :param queue: Hubot queue.
11 |     :type queue: str
12 | 
13 |     :param hubot_url: Hubot url.
14 |     :type hubot_url: str
15 | 
16 |     :param message: Notification message.
17 |     :type message: str
18 | 
19 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/ping_wrapper.rst:
--------------------------------------------------------------------------------
 1 | Ping
 2 | ----
 3 | 
 4 | Simple ICMP ping function which returns ``True`` if the ping command returned without error and ``False`` otherwise.
 5 | 
 6 | .. py:function:: ping(timeout=1)
 7 | 
 8 |     ::
 9 | 
10 |         ping()
11 | 
12 |     The ``timeout`` argument specifies the timeout in seconds.
13 |     Internally it just runs the following system command::
14 | 
15 |         ping -c 1 -w <TIMEOUT> <HOST>
16 | 
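The behaviour described above can be approximated in a few lines of Python. This is only a sketch: the helper names ``build_ping_command`` and ``ping_host`` are hypothetical, and the worker's own code may differ in how it invokes the command and handles errors.

```python
import subprocess


def build_ping_command(host, timeout=1):
    # Mirrors the documented command line: ping -c 1 -w <TIMEOUT> <HOST>
    return ['ping', '-c', '1', '-w', str(timeout), host]


def ping_host(host, timeout=1):
    """Return True if the ping command exits with status 0, False otherwise."""
    try:
        result = subprocess.run(build_ping_command(host, timeout),
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except OSError:
        # The ping binary is missing or cannot be executed.
        return False
    return result.returncode == 0
```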


--------------------------------------------------------------------------------
/docs/user/check-ref/jmx_wrapper.rst:
--------------------------------------------------------------------------------
 1 | JMX
 2 | ---
 3 | 
 4 | To use JMXQuery, run "jmxquery" (this is not yet released)
 5 | 
 6 | Queries bean attributes on the hosts selected by the entity filter::
 7 | 
 8 |     jmx().query('java.lang:type=Memory', 'HeapMemoryUsage', 'NonHeapMemoryUsage').results()
 9 | 
10 | Another example::
11 | 
12 |     jmx().query('java.lang:type=Threading', 'ThreadCount', 'DaemonThreadCount', 'PeakThreadCount').results()
13 | 
14 | This would return a dict like:
15 | 
16 | .. code-block:: json
17 | 
18 |     {
19 |         "DaemonThreadCount": 524,
20 |         "PeakThreadCount": 583,
21 |         "ThreadCount": 575
22 |     }
23 | 


--------------------------------------------------------------------------------
/docs/user/notifications/push.rst:
--------------------------------------------------------------------------------
 1 | Push
 2 | -----
 3 | 
 4 | Send push notification via ZMON `notification service <https://github.com/zalando-zmon/zmon-notification-service>`_.
 5 | 
 6 | .. py:function:: send_push(url=None, key=None, message=None)
 7 | 
 8 |     Send Push notification to mobile devices.
 9 | 
10 |     :param url: Notification service base URL.
11 |     :type url: str
12 | 
13 |     :param key: Notification service API key.
14 |     :type key: str
15 | 
16 |     :param message: Message to be sent in notification.
17 |     :type message: str
18 | 
19 | .. note::
20 | 
21 |     If ``message`` is ``None``, it is generated from the alert status.
22 | 
23 | 


--------------------------------------------------------------------------------
/docs/user/notifications/slack.rst:
--------------------------------------------------------------------------------
 1 | Slack
 2 | -----
 3 | 
 4 | Notify Slack channel with alert status. A ``webhook`` is required for notifications.
 5 | 
 6 | .. py:function:: notify_slack(webhook=None, channel='#general', message=None)
 7 | 
 8 |     Send Slack notification to specified channel.
 9 | 
10 |     :param webhook: Slack webhook. If not set, then webhook set in configuration will be used.
11 |     :type webhook: str
12 | 
13 |     :param channel: Channel to be notified. Default is ``#general``.
14 |     :type channel: str
15 | 
16 |     :param message: Message to be sent. If ``None``, then a message constructed from the alert will be sent.
17 |     :type message: str
18 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/dns_wrapper.rst:
--------------------------------------------------------------------------------
 1 | DNS
 2 | ---
 3 | 
 4 | The ``dns()`` function provides a way to resolve hosts.
 5 | 
 6 | .. py:function:: dns(host=None)
 7 | 
 8 | 
 9 | Methods of DNS
10 | ^^^^^^^^^^^^^^
11 | 
12 | .. py:method:: resolve(host=None)
13 | 
14 |     Return the IP address of ``host``. If ``host`` is ``None``, the host given at initialization is resolved. If both are ``None``, an exception is raised.
15 | 
16 |     :return: IP address
17 |     :rtype: str
18 | 
19 |     Example query:
20 | 
21 |     .. code-block:: python
22 | 
23 |         dns('google.de').resolve()
24 |         '173.194.65.94'
25 | 
26 |         dns().resolve('google.de')
27 |         '173.194.65.94'
28 | 
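The fallback between the two ``host`` arguments can be sketched as follows. The class name ``DnsSketch``, the injectable ``resolver`` argument, and the use of ``ValueError`` are assumptions for illustration; the text above only states that an exception is raised.

```python
import socket


class DnsSketch(object):
    def __init__(self, host=None, resolver=socket.gethostbyname):
        self.host = host
        # The resolver is injectable so the fallback logic can be tried offline.
        self._resolver = resolver

    def resolve(self, host=None):
        # Prefer the argument; fall back to the host given at initialization.
        target = host if host is not None else self.host
        if target is None:
            # Assumed exception type; the documentation does not name one.
            raise ValueError('no host given at initialization or in resolve()')
        return self._resolver(target)
```

With the default resolver, ``DnsSketch('google.de').resolve()`` and ``DnsSketch().resolve('google.de')`` perform the same lookup.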


--------------------------------------------------------------------------------
/docs/user/notifications.rst:
--------------------------------------------------------------------------------
 1 | .. _notifications:
 2 | 
 3 | ***********************
 4 | Notifications Reference
 5 | ***********************
 6 | 
 7 | ZMON provides several means of notification in case of alerts. Notifications are triggered when the alert status changes. Please refer to
 8 | :ref:`Notification options <notification-options-label>` for different worker configuration options.
 9 | 
10 | .. include:: notifications/google_hangouts_chat.rst
11 | .. include:: notifications/hipchat.rst
12 | .. include:: notifications/http.rst
13 | .. include:: notifications/hubot.rst
14 | .. include:: notifications/mail.rst
15 | .. include:: notifications/opsgenie.rst
16 | .. include:: notifications/pagerduty.rst
17 | .. include:: notifications/push.rst
18 | .. include:: notifications/slack.rst
19 | .. include:: notifications/twilio.rst
20 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/counter_wrapper.rst:
--------------------------------------------------------------------------------
 1 | Counter
 2 | -------
 3 | 
 4 | The ``counter()`` function allows you to get increment rates of increasing counter values.
 5 | The main use case for ``counter()`` is to get per-second rates of JMX counter beans (e.g. "Tomcat Requests").
 6 | The counter function requires one parameter ``key`` to identify the counter.
 7 | 
 8 | 
 9 | .. py:method:: per_second(value)
10 | 
11 |     ::
12 | 
13 |         counter('requests').per_second(get_total_requests())
14 | 
15 |     Returns the value's increment rate per second. Value must be a float or integer.
16 | 
17 | .. py:method:: per_minute(value)
18 | 
19 |     ::
20 | 
21 |         counter('requests').per_minute(get_total_requests())
22 | 
23 |     Convenience method to return the value's increment rate per minute (same as the result of ``per_second()`` multiplied by 60).
24 | 
25 | Internally counter values and timestamps are stored in Redis.
26 | 
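To make the rate calculation concrete, here is a minimal sketch of how an increment rate can be derived from the previously stored value and timestamp. A plain dict stands in for Redis, and the class name ``CounterSketch``, the ``now`` parameter, and the handling of the first sample and of counter resets are assumptions; the actual worker code may behave differently in those cases.

```python
import time


class CounterSketch(object):
    def __init__(self, key, store):
        self.key = key
        self.store = store  # a plain dict standing in for Redis

    def per_second(self, value, now=None):
        now = time.time() if now is None else now
        previous = self.store.get(self.key)
        # Remember the current sample for the next call.
        self.store[self.key] = (value, now)
        if previous is None:
            return 0.0  # assumption: no rate is available for the first sample
        last_value, last_time = previous
        elapsed = now - last_time
        if elapsed <= 0 or value < last_value:
            return 0.0  # assumption: treat counter resets and clock jumps as zero
        return (value - last_value) / float(elapsed)

    def per_minute(self, value, now=None):
        # A per-minute rate is the per-second rate scaled by 60 seconds.
        return self.per_second(value, now) * 60.0
```

For example, a counter that grows from 100 to 160 within 30 seconds yields a rate of 2.0 per second.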


--------------------------------------------------------------------------------
/docs/user/notifications/twilio.rst:
--------------------------------------------------------------------------------
 1 | Twilio
 2 | ------
 3 | 
 4 | Use Twilio to receive phone calls when alerts are raised. This includes basic ACK and escalation. It requires a Twilio account and a deployed notification service, but otherwise takes little effort to set up. WORK IN PROGRESS.
 5 | 
 6 | .. py:function:: notify_twilio(numbers=[], message="ZMON Alert Up: Some Alert")
 7 | 
 8 |     Make a phone call to the supplied numbers. The first number is called immediately. If there is no ACK after two minutes, another call is made to that number. The other numbers follow at 5-minute intervals as long as there is no ACK.
 9 | 
10 |     :param message: Message to be sent. If ``None``, then a message constructed from the alert will be sent.
11 |     :type message: str
12 | 
13 |     :param numbers: Numbers to call
14 |     :type numbers: list
15 | 
16 | 
17 | .. note::
18 | 
19 |     Remember to configure your worker for this.
20 | 
21 |     .. code-block:: bash
22 | 
23 |         NOTIFICATION_SERVICE_URL
24 |         NOTIFICATION_SERVICE_KEY
25 | 


--------------------------------------------------------------------------------
/docs/developer/tests.rst:
--------------------------------------------------------------------------------
 1 | *****
 2 | Tests
 3 | *****
 4 | 
 5 | Acceptance and Unit Tests
 6 | -------------------------
 7 | 
 8 | These tests must be run from inside the Vagrant box::
 9 | 
10 |        $ vagrant ssh
11 |        vagrant@zmon:~$ cd /vagrant/vagrant/
12 |        vagrant@zmon:/vagrant/vagrant$ sudo ./test.sh
13 | 
14 | An example output of the previous command can look similar to this::
15 | 
16 |        Starting Xvfb...
17 |        [13:36:12] Using gulpfile /vagrant/zmon-controller/src/main/webapp/gulpfile.js
18 |        [13:36:12] Starting 'test'...
19 |        Starting selenium standalone server...
20 |        Selenium standalone server started at http://10.0.2.15:47833/wd/hub
21 |        Testing dashboard features
22 |          should display the search form - pass
23 | 
24 |        Finished in 3.24 seconds
25 |        1 test, 1 assertion, 0 failures
26 | 
27 |        Shutting down selenium standalone server.
28 |        [13:36:22] Finished 'test' after 10 s
29 | 
30 | Only one single acceptance test and no unit tests are provided so far. This is still a work in progress.
31 | 


--------------------------------------------------------------------------------
/docs/user/downtimes.rst:
--------------------------------------------------------------------------------
 1 | .. _downtimes:
 2 | 
 3 | 
 4 | Downtimes
 5 | ---------
 6 | 
 7 | This functionality allows the user to acknowledge an existing alert or create a downtime schedule for an anticipated service
 8 | interruption. When acknowledging an existing alert, the user has to provide the predicted duration; when creating
 9 | a scheduled downtime, the user provides a start and end date. If the downtime is currently active, meaning an alert occurred within the
10 | downtime period, the alert notification won't be shown in the dashboard and it will be greyed out on the alert details page.
11 | Please note that the downtime will not be evaluated immediately after creation, meaning that the alert might appear
12 | as active until it's evaluated again by the worker. E.g. if the user defined a downtime for an alert which is evaluated
13 | every minute and the last evaluation was 5 seconds ago, it would take approximately one more minute for the alert to
14 | appear in "downtime state".
15 | 
16 | To acknowledge an alert or to schedule a new downtime, the user has to go to the specific alert details page and click
17 | on a downtime button next to the desired alert.
18 | 
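The delay described above is simple arithmetic on the check interval. The following sketch, using the hypothetical helper ``seconds_until_downtime_visible``, estimates how long it takes until the next evaluation picks up a freshly created downtime; it ignores scheduling and processing delays in the worker.

```python
def seconds_until_downtime_visible(check_interval, seconds_since_last_evaluation):
    """Estimate the wait until the next evaluation applies a new downtime."""
    remaining = check_interval - seconds_since_last_evaluation
    # If the evaluation is already overdue, it is expected to run right away.
    return max(remaining, 0)
```

For an alert evaluated every 60 seconds whose last evaluation was 5 seconds ago, the estimate is 55 seconds, which matches the "approximately one more minute" in the example.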


--------------------------------------------------------------------------------
/docs/user/check-ref/eventlog_wrapper.rst:
--------------------------------------------------------------------------------
 1 | EventLog
 2 | --------
 3 | 
 4 | The ``eventlog()`` function allows you to conveniently count EventLog_ events by type and time.
 5 | 
 6 | 
 7 | .. py:method:: count(event_type_ids, time_from, [time_to=None], [group_by=None])
 8 | 
 9 |     Return event counts for given parameters.
10 | 
11 |     *event_type_ids* is either a single integer (use hex notation, e.g. ``0x96001``) or a list of integers.
12 | 
13 |     *time_from* is a string time specification (``'-5m'`` means 5 minutes ago, ``'-1h'`` means 1 hour ago).
14 | 
15 |     *time_to* is a string time specification and defaults to *now* if not given.
16 | 
17 |     *group_by* can specify an EventLog field name to group counts by.
18 | 
19 |     ::
20 | 
21 |         eventlog().count(0x96001, time_from='-1m')                         # returns a single number
22 |         eventlog().count([0x96001, 0x63005], time_from='-1m')              # returns dict {'96001': 123, '63005': 456}
23 |         eventlog().count(0x96001, time_from='-1m', group_by='appDomainId') # returns dict {'1': 123, '5': 456, ..}
24 | 
25 |     The ``count()`` method internally requests the EventLog Viewer's "count" JSON endpoint.
26 | 
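To illustrate how relative time specifications such as ``'-5m'`` or ``'-1h'`` map to points in time, here is a small parser sketch. The function name ``parse_time_spec`` is hypothetical, and only the two units shown in the text (minutes and hours) are handled; the real wrapper may accept more formats.

```python
import re
from datetime import datetime, timedelta

# Only the units that appear in the documentation above.
_UNITS = {'m': 'minutes', 'h': 'hours'}


def parse_time_spec(spec, now=None):
    """Turn a spec like '-5m' into a datetime relative to ``now``."""
    now = datetime.now() if now is None else now
    match = re.match(r'^-(\d+)([mh])$', spec)
    if match is None:
        raise ValueError('unsupported time specification: {!r}'.format(spec))
    amount = int(match.group(1))
    unit = _UNITS[match.group(2)]
    return now - timedelta(**{unit: amount})
```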


--------------------------------------------------------------------------------
/docs/user/comments.rst:
--------------------------------------------------------------------------------
 1 | .. _comments:
 2 | 
 3 | 
 4 | Alert Comments
 5 | --------------
 6 | 
 7 | Comments are useful in providing additional information to other members of your team (or other teams) about your
 8 | alerts. Those with ADMIN and USER roles can add comments to an alert, but VIEWERs cannot. ADMINs can delete
 9 | either their own or other people's comments. USERs can delete only their own comments.
10 | 
11 | Adding Comments
12 | ^^^^^^^^^^^^^^^
13 | 
14 | Follow these steps:
15 | 
16 | * Open the alert definition where you want to add your comment.
17 | * Either click on the top-right link `Comments` to add a **general** comment (for all entities), or click on the balloon on the left side of the entity name to add a comment on a **specific** entity.
18 | * In the comments window, type your comment. Use as many lines as you need.
19 | * Click the `Post comment` button and save your comment. Done!
20 | 
21 | Seeing Existing Comments
22 | ^^^^^^^^^^^^^^^^^^^^^^^^
23 | 
24 | It's easy: Just open the alert definition, then click on `Comments` (top-right link).
25 | 
26 | Deleting Comments
27 | ^^^^^^^^^^^^^^^^^
28 | 
29 | Deleting is also easy: Open the alert definition, click on the top-right link `Comments`, click on the cross above the comment, and delete.
30 | 


--------------------------------------------------------------------------------
/docs/user/notifications/google_hangouts_chat.rst:
--------------------------------------------------------------------------------
 1 | Google Hangouts Chat
 2 | --------------------
 3 | 
 4 | Notify Google Hangouts Chat room with alert status.
 5 | 
 6 | .. py:function:: send_google_hangouts_chat(webhook_link=None, message=None, color='red', threading='alert')
 7 | 
 8 |     Send Google Hangouts Chat notification.
 9 |     
10 |     :param webhook_link: Webhook Link in Google Hangouts Chat Room. Create a `Google Hangouts Chat Webhook`_ and copy the link here.
11 |     :type webhook_link: str
12 |     
13 |     :param multiline: Should the Text in the notification span multiple lines or not? Default is ``True``.
14 |     :type multiline: bool
15 | 
16 |     :param message: Message to be sent. If ``None``, then a message constructed from the alert will be sent.
17 |     :type message: str
18 | 
19 |     :param color: Message color. Default is ``red`` if alert is raised.
20 |     :type color: str
21 | 
22 |     :param threading: Message threading behaviour. Allowed values are ``alert`` (thread per alert entity), ``date`` (thread per day), ``alert-date`` (thread per alert entity per day) or ``none`` (unique thread per notification). Default is ``alert``.
23 |     :type threading: str
24 | 
25 | .. note::
26 | 
27 |     Message color will be determined based on alert status. If alert has ended, then ``color`` will be ``green``, otherwise ``color`` argument will be used.
28 | 
29 | .. _Google Hangouts Chat Webhook: https://developers.google.com/hangouts/chat/how-tos/webhooks
30 | 


--------------------------------------------------------------------------------
/docs/index.rst:
--------------------------------------------------------------------------------
 1 | ZMON Docs
 2 | =========
 3 | 
 4 | .. toctree::
 5 |    :hidden:
 6 |    :maxdepth: 1
 7 | 
 8 |    intro
 9 |    getting-started
10 | 
11 | .. _user-docs:
12 | 
13 | .. toctree::
14 |    :hidden:
15 |    :maxdepth: 2
16 |    :caption: User Documentation
17 | 
18 |    user/entities
19 |    user/check-definitions
20 |    user/alert-definitions
21 | 
22 |    user/dashboards
23 |    user/grafana
24 |    user/tv-login
25 | 
26 |    user/check-commands
27 |    user/alert-ref/alert_reference_functions
28 |    user/notifications
29 | 
30 | .. toctree::
31 |    :hidden:
32 |    :maxdepth: 2
33 |    :caption: Guides
34 | 
35 |    user/monitoringonaws
36 | 
37 | 
38 | .. _installation:
39 | 
40 | .. toctree::
41 |    :maxdepth: 2
42 |    :caption: Deploying ZMON
43 |    :hidden:
44 | 
45 |    installation/requirements
46 |    installation/components
47 |    installation/configuration
48 | 
49 | .. _developer-docs:
50 | 
51 | .. toctree::
52 |    :hidden:
53 |    :maxdepth: 2
54 |    :caption: Developer Documentation
55 | 
56 |    developer/rest-api
57 |    developer/zmon-cli
58 |    developer/zmon-python-client
59 |    developer/python-tutorial
60 |    developer/tests
61 |    developer/redis
62 | 
63 | 
64 | .. toctree::
65 |    :hidden:
66 |    :maxdepth: 1
67 |    :caption: Appendix
68 | 
69 |    apendix/glossary
70 | 
71 | .. include:: intro.rst
72 | 
73 | 
74 | Indices and Tables
75 | ==================
76 | 
77 | * :ref:`genindex`
78 | * :ref:`modindex`
79 | * :ref:`search`
80 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/cassandra_wrapper.rst:
--------------------------------------------------------------------------------
 1 | Cassandra
 2 | ---------
 3 | 
 4 | Provides access to a Cassandra cluster via ``cassandra()`` wrapper object.
 5 | 
 6 | .. py:function:: cassandra(node, keyspace, username=None, password=None, port=9042, connect_timeout=1, protocol_version=3)
 7 | 
 8 | 
 9 |     Initialize cassandra wrapper.
10 | 
11 |     :param node: Cassandra host.
12 |     :type node: str
13 | 
14 |     :param keyspace: Cassandra keyspace used during the session.
15 |     :type keyspace: str
16 | 
17 |     :param username: Username used in connection. It is recommended to use an unprivileged user for Cassandra checks.
18 |     :type username: str
19 | 
20 |     :param password: Password used in connection.
21 |     :type password: str
22 | 
23 |     :param port: Cassandra host port. Default is 9042.
24 |     :type port: int
25 | 
26 |     :param connect_timeout: Connection timeout.
27 |     :type connect_timeout: int
28 | 
29 |     :param protocol_version: Protocol version used in connection. Default is 3.
30 |     :type protocol_version: int
31 | 
32 | .. note::
33 | 
34 |     You should always use an unprivileged user to access your databases. Use ``plugin.cassandra.user`` and ``plugin.cassandra.pass`` to configure credentials for the zmon-worker.
35 | 
36 | .. py:function:: execute(stmt)
37 | 
38 |     Execute a CQL statement against the specified keyspace.
39 | 
40 |     :param stmt: CQL statement
41 |     :type stmt: str
42 | 
43 |     :return: CQL result
44 |     :rtype: list
45 | 


--------------------------------------------------------------------------------
/docs/user/check-definitions.rst:
--------------------------------------------------------------------------------
 1 | .. _check-definitions:
 2 | 
 3 | *****************
 4 | Check Definitions
 5 | *****************
 6 | 
 7 | Checks are ZMON's way of gathering data from arbitrary entities, e.g. databases, micro services, hosts and more.
 8 | Create them as described below using either the UI or the CLI.
 9 | 
10 | Key properties
11 | ==============
12 | 
13 | Command
14 | -------
15 | 
16 | The command is executed by the worker and is the data-gathering part of the check.
17 | It is executed once per selected entity and its result is made available to all attached alerts.
18 | Several wrappers are available, and the ``entity`` variable is also accessible.
19 | 
20 | Entity Filter
21 | -------------
22 | 
23 | Select the entities you want the check to execute against. Often only a type filter is applied, but the filter can be more specific.
24 | Alerts allow you to do more fine-grained filtering,
25 | which makes it easy to reuse checks.
26 | 
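The idea of an entity filter can be sketched in plain Python. This is an illustration only: it assumes that entities are dicts of properties, that a filter is a dict of required property values, and that an entity is selected when it matches any filter in a list. The helper names ``matches`` and ``select_entities`` and the sample entities are hypothetical.

```python
def matches(entity, entity_filter):
    """True if the entity has every property value required by the filter."""
    return all(entity.get(key) == value for key, value in entity_filter.items())


def select_entities(entities, filters):
    """Return the entities that match at least one of the filters."""
    return [entity for entity in entities
            if any(matches(entity, f) for f in filters)]


# Hypothetical sample data
entities = [
    {'id': 'web01', 'type': 'host', 'team': 'shop'},
    {'id': 'db01', 'type': 'database', 'team': 'shop'},
]
hosts = select_entities(entities, [{'type': 'host'}])
```

A check would use a broad filter such as ``{'type': 'host'}``, while an alert could narrow it further, for example by also requiring a team.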
27 | Interval
28 | --------
29 | 
30 | Specify the interval in seconds at which you want the check to be executed.
31 | 
32 | Owning team
33 | -----------
34 | 
35 | This is the team that originally created the check. Right now this has little effect.
36 | 
37 | Creating new checks
38 | ===================
39 | 
40 | Using trial run
41 | ---------------
42 | 
43 | Using the CLI
44 | -------------
45 | 
46 | .. code-block:: bash
47 | 
48 |     $ zmon check init new-check.yaml
49 |     $ zmon check update new-check.yaml
50 | 


--------------------------------------------------------------------------------
/docs/installation/requirements.rst:
--------------------------------------------------------------------------------
 1 | .. _requirements:
 2 | 
 3 | ************
 4 | Requirements
 5 | ************
 6 | 
 7 | The requirements below are all open source technologies that need to be available for ZMON to run with all its features.
 8 | 
 9 | Redis
10 | =====
11 | 
12 | The Redis service is one of the core dependencies. ZMON uses Redis for its task queue and to store its current state.
13 | 
14 | PostgreSQL
15 | ==========
16 | 
17 | PostgreSQL is ZMON's data store for entities, checks, alerts, dashboards and Grafana dashboards.
18 | The entities service relies on PostgreSQL's jsonb data type, so you need PostgreSQL 9.4 or later.
19 | 
20 | Cassandra
21 | =========
22 | 
23 | Cassandra needs to be available for KairosDB if you want to have historic data and make use of Grafana. This is highly recommended.
24 | We strongly recommend running Cassandra 3.7+ and using the TimeWindow compaction strategy for KairosDB.
25 | This will nicely split your SSTables into a single file per day (depending on your config).
26 | 
27 | KairosDB
28 | ========
29 | 
30 | KairosDB is our time series database of choice. By now we are running our own fork_, but we believe this is not required for standard volume scenarios.
31 | ZMON will store every metric gathered in KairosDB so that you can use it directly or via Grafana to access historic data.
32 | ZMON itself allows you to plot charts from KairosDB in Dashboard widgets or go to check/alert specific charts directly.
33 | 
34 | .. _fork: https://github.com/zalando-zmon/kairosdb
35 | 


--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
 1 | ZMON source code on GitHub is no longer in active development. Zalando will no longer actively review issues or merge pull-requests.
 2 | 
 3 | ZMON is still being used at Zalando and serves us well for many purposes. We are now deeper into our observability journey and understand better that we need other telemetry sources and tools to elevate our understanding of the systems we operate. We support the `OpenTelemetry <https://opentelemetry.io>`_ initiative and recommend that others starting their journey begin there.
 4 | 
 5 | If members of the community are interested in continuing to develop ZMON, consider forking it. Please review the licence before you do.
 6 | 
 7 | ==================
 8 | ZMON Documentation
 9 | ==================
10 | 
11 | .. image:: https://readthedocs.org/projects/zmon/badge/?version=latest
12 |    :target: https://readthedocs.org/projects/zmon/?badge=latest
13 |    :alt: Documentation Status
14 | 
15 | First install the Sphinx_ documentation generator.
16 | 
17 | .. code-block:: bash
18 | 
19 |     $ sudo pip install sphinx sphinx_rtd_theme
20 | 
21 | 
22 | Generate the ZMON HTML documentation locally:
23 | 
24 | .. code-block:: bash
25 | 
26 |     $ cd docs; make html
27 | 
28 | .. _Sphinx: http://sphinx-doc.org/
29 | 
30 | Run docs locally:
31 | 
32 | .. code-block:: bash
33 | 
34 |     $ python -m SimpleHTTPServer 8888
35 | 
36 | If you are using Python3:
37 | 
38 | .. code-block:: bash
39 | 
40 |     $ python3 -m http.server 8888
41 | 
42 | Docs at:
43 |     `http://localhost:8888/_build/html/`
44 | 


--------------------------------------------------------------------------------
/docs/user/notifications/mail.rst:
--------------------------------------------------------------------------------
 1 | Mail
 2 | ----
 3 | 
 4 | Send email notifications.
 5 | 
 6 | .. py:function:: send_mail(subject=None, cc=None, html=False, hide_recipients=True, include_value=True, include_definition=True, include_captures=True, include_entity=True, per_entity=True)
 7 | 
 8 |     Send email notification.
 9 | 
10 |     :param subject: Email subject.
11 |                     You must use a unicode string (e.g. `u'äöüß'`) if you have non-ASCII
12 |                     characters in there.
13 |                     If None, the alert name will be used.
14 |     :type subject: str or unicode or None
15 | 
16 |     :param cc: List of CC recipients.
17 |     :type cc: list
18 | 
19 |     :param html: HTML email.
20 |     :type html: bool
21 | 
22 |     :param hide_recipients: Hide recipients. Will be sent as BCC.
23 |     :type hide_recipients: bool
24 | 
25 |     :param include_value: Include alert value in notification message.
26 |     :type include_value: bool
27 | 
28 |     :param include_definition: Include alert definition details in notification message.
29 |     :type include_definition: bool
30 | 
31 |     :param include_captures: Include alert captures in message.
32 |     :type include_captures: bool
33 | 
34 |     :param include_entity: Include affected entities in notification message.
35 |     :type include_entity: bool
36 | 
37 |     :param per_entity: Send new email notification per entity. Default is ``True``.
38 |     :type per_entity: bool
39 | 
40 | 
41 | .. note::
42 | 
43 |     ``send_email`` is an alias for this notification function.
44 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/ebs_wrapper.rst:
--------------------------------------------------------------------------------
 1 | EBS
 2 | ---
 3 | 
 4 | Allows describing EBS objects (currently, only Snapshots are supported).
 5 | 
 6 | 
 7 | .. py:function:: ebs()
 8 | 
 9 | 
10 | Methods of EBS
11 | ^^^^^^^^^^^^^^
12 | 
13 | .. py:function:: list_snapshots(account_id, max_items)
14 | 
15 |     List the EBS Snapshots owned by the given account_id.
16 |     By default, listing is possible for up to 1000 items, so we use pagination internally to overcome this.
17 | 
18 |     :param account_id: AWS account id number (as a string).  Defaults to the AWS account id where the check is running.
19 |     :param max_items: the maximum number of snapshots to list.  Defaults to 100.
20 |     :return: an ``EBSSnapshotsList`` object
21 | 
22 |     .. py:class:: EBSSnapshotsList
23 | 
24 |         .. py:method:: items()
25 | 
26 |             Returns a list of dicts like
27 | 
28 |             .. code-block:: json
29 | 
30 |                {
31 |                    "id": "snap-12345",
32 |                    "description": "Snapshot description...",
33 |                    "size": 123,
34 |                    "start_time": "2017-07-16T01:01:21Z",
35 |                    "state": "completed"
36 |                }
37 | 
38 |     Example usage:
39 | 
40 |     .. code-block:: python
41 | 
42 |        ebs().list_snapshots().items()
43 | 
44 |        snapshots = ebs().list_snapshots(max_items=1000).items()  # for listing more than the default of 100 snapshots
45 |        start_time = snapshots[0]["start_time"].isoformat()  # returns a string that can be passed to time()
46 |        age = time() - time(start_time)
47 | 
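The age calculation in the example above can also be reproduced outside ZMON with only the Python standard library. The following is a minimal sketch, assuming ``start_time`` is available as an ISO 8601 UTC string like the one in the JSON sample; ``snapshot_age_seconds`` is a hypothetical helper for illustration and is not part of the EBS wrapper.

```python
from datetime import datetime, timezone


def snapshot_age_seconds(start_time, now):
    """Return the age in seconds of a snapshot started at ``start_time``.

    ``start_time`` is assumed to be a UTC string such as "2017-07-16T01:01:21Z".
    (Hypothetical helper, not provided by ZMON.)
    """
    started = datetime.strptime(start_time, "%Y-%m-%dT%H:%M:%SZ")
    started = started.replace(tzinfo=timezone.utc)
    return (now - started).total_seconds()


# A fixed reference time keeps the result reproducible.
now = datetime(2017, 7, 17, 1, 1, 21, tzinfo=timezone.utc)
age = snapshot_age_seconds("2017-07-16T01:01:21Z", now)
```

Inside a check, the built-in ``time()`` function shown above serves the same purpose; the sketch only makes the arithmetic explicit.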


--------------------------------------------------------------------------------
/docs/user/check-ref/ldap_wrapper.rst:
--------------------------------------------------------------------------------
 1 | LDAP
 2 | ----
 3 | 
 4 | Retrieve OpenLDAP statistics (needs "cn=Monitor" database installed in LDAP server). ::
 5 | 
 6 |     ldap().statistics()
 7 | 
 8 | This would return a dict like:
 9 | 
10 | .. code-block:: json
11 | 
12 |     {
13 |         "connections_current": 77,
14 |         "connections_per_sec": 27.86,
15 |         "entries": 359369,
16 |         "max_file_descriptors": 65536,
17 |         "operations_add_per_sec": 0.0,
18 |         "operations_bind_per_sec": 27.99,
19 |         "operations_delete_per_sec": 0.0,
20 |         "operations_extended_per_sec": 0.23,
21 |         "operations_modify_per_sec": 0.09,
22 |         "operations_search_per_sec": 24.09,
23 |         "operations_unbind_per_sec": 27.82,
24 |         "waiters_read": 76,
25 |         "waiters_write": 0
26 |     }
27 | 
28 | All information is based on the cn=Monitor OpenLDAP tree. You can get more information in the `OpenLDAP Administrator's Guide`_.
29 | The meaning of the different fields is as follows:
30 | 
31 | ``connections_current``
32 |     Number of currently established TCP connections.
33 | 
34 | ``connections_per_sec``
35 |     Increase of connections per second.
36 | 
37 | ``entries``
38 |     Number of LDAP records.
39 | 
40 | ``operations_*_per_sec``
41 |     Number of operations per second per operation type (add, bind, search, ...).
42 | 
43 | ``waiters_read``
44 |     Number of waiters for read (the OpenLDAP documentation does not explain this value).
45 | 
46 | .. _OpenLDAP Administrator's Guide: http://www.openldap.org/doc/admin24/monitoringslapd.html#Monitor%20Information
47 | 


--------------------------------------------------------------------------------
/docs/user/notifications/pagerduty.rst:
--------------------------------------------------------------------------------
 1 | Pagerduty
 2 | ---------
 3 | 
 4 | Notify `Pagerduty <https://www.pagerduty.com/>`_ of a new alert status. If alert is **active**, then a new pagerduty incident with type ``trigger`` will be sent. If alert is **inactive** then incident type will be updated to ``resolve``.
 5 | 
 6 | .. note::
 7 | 
 8 |     Pagerduty notification plugin uses API v2.
 9 | 
10 | 
11 | .. py:function:: notify_pagerduty(message='', per_entity=False, include_alert=True, routing_key=None, alert_class=None, alert_group=None, **kwargs)
12 | 
13 |     Send notifications to Pagerduty.
14 | 
15 |     :param message: Incident message. If empty, then a message will be generated from the alert data.
16 |     :type message: str
17 | 
18 |     :param per_entity: Send new alert per entity. This affects the ``dedup_key`` value and impacts how de-duplication is handled in Pagerduty. Default is ``False``.
19 |     :type per_entity: bool
20 | 
21 |     :param include_alert: Include alert data in incident payload ``custom_details``. Default is ``True``.
22 |     :type include_alert: bool
23 | 
24 |     :param routing_key: Pagerduty service ``routing_key``. If not specified, then the :ref:`service key configured <notification-options-label>` for the worker will be used.
25 |     :type routing_key: str
26 | 
27 |     :param alert_class: Set the Pagerduty incident class.
28 |     :type alert_class: str
29 | 
30 |     :param alert_group: Set the Pagerduty incident group.
31 |     :type alert_group: str
32 | 
33 |     Example:
34 | 
35 |     .. code-block:: python
36 | 
37 |         notify_pagerduty(message='Number of failed requests is too high!', include_alert=True, alert_class='API health', alert_group='production')
38 | 


--------------------------------------------------------------------------------
/docs/user/alert-definition-inheritance.rst:
--------------------------------------------------------------------------------
 1 | .. _alert-definition-inheritance:
 2 | 
 3 | Alert Definition Inheritance
 4 | ----------------------------
 5 | 
 6 | Alert definition *inheritance* allows one to create an alert definition based on another alert whereby a child reuses attributes from the parent.
 7 | Each alert definition can only inherit from a single alert definition (``single inheritance``).
 8 | 
 9 | Template
10 | ^^^^^^^^
11 | 
12 | A Template is basically an alert definition with a subset of attributes that **is not evaluated and can only be used for extension**.
13 | 
14 | To create a template:
15 | 
16 | #. Select the check definition
17 | #. Click **Add New Alert Definition**
18 | #. Set the attributes to reuse and activate the ``template`` checkbox
19 | 
20 | Extending
21 | ^^^^^^^^^
22 | 
23 | In general, one can inherit from any alert definition or template. Open the alert definition details and click ``inherit`` in the top right corner.
24 | To override a field, just type in a new value. An icon should appear on the left side, meaning that the field will be overridden.
25 | To roll back the change and keep the value defined on the parent, click the ``override`` icon.
26 | 
27 | Overriding
28 | ^^^^^^^^^^
29 | 
30 | By default the child alert retains all attributes of the parent alert with the exception of the following mandatory attributes:
31 |  - team
32 |  - responsible team
33 |  - status
34 | 
35 | These attributes are used for ``authorization`` (see :ref:`permissions` for details); therefore, they cannot be reused. If these attributes change on the parent alert definition, child alerts are not affected and you don't lose access rights.
36 | All the remaining attributes can be overridden, replacing the parent alert definition's values with the child's own.
37 | 
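The override rules above can be sketched as a simple dictionary merge. This is an illustration only, under the assumption that alert definitions are represented as plain dicts; the helper ``resolve_alert`` and the attribute keys used here are hypothetical and do not reflect how ZMON stores or resolves alert definitions internally.

```python
# Attributes that are never taken from the parent (see the list above).
NOT_INHERITED = ("team", "responsible_team", "status")


def resolve_alert(parent, child):
    """Return the effective attributes of a child alert (hypothetical helper).

    The child starts with every inheritable parent attribute, and any value
    set on the child overrides the inherited one.
    """
    resolved = {k: v for k, v in parent.items() if k not in NOT_INHERITED}
    resolved.update(child)
    return resolved


parent = {
    "name": "High error rate",
    "condition": ">10",
    "priority": 2,
    "team": "Platform",
    "responsible_team": "Platform",
    "status": "ACTIVE",
}
# The child must define the mandatory attributes itself and overrides the condition.
child = {
    "condition": ">50",
    "team": "Shop",
    "responsible_team": "Shop",
    "status": "ACTIVE",
}
resolved = resolve_alert(parent, child)
```

In this sketch the child keeps the parent's ``name`` and ``priority``, uses its own ``condition``, and never picks up the parent's team or status.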


--------------------------------------------------------------------------------
/docs/user/check-ref/zomcat_wrapper.rst:
--------------------------------------------------------------------------------
 1 | Zomcat
 2 | ------
 3 | 
 4 | Retrieve zomcat instance status (memory, CPU, threads). ::
 5 | 
 6 |     zomcat().health()
 7 | 
 8 | This would return a dict like:
 9 | 
10 | .. code-block:: json
11 | 
12 |     {
13 |         "cpu_percentage": 5.44,
14 |         "gc_percentage": 0.11,
15 |         "gcs_per_sec": 0.25,
16 |         "heap_memory_percentage": 6.52,
17 |         "heartbeat_enabled": true,
18 |         "http_errors_per_sec": 0.0,
19 |         "jobs_enabled": true,
20 |         "nonheap_memory_percentage": 20.01,
21 |         "requests_per_sec": 1.09,
22 |         "threads": 128,
23 |         "time_per_request": 42.58
24 |     }
25 | 
26 | Most of the values are retrieved via JMX:
27 | 
28 | ``cpu_percentage``
29 |     CPU usage in percent (retrieved from JMX).
30 | 
31 | ``gc_percentage``
32 |     Percentage of time spent in garbage collection runs.
33 | 
34 | ``gcs_per_sec``
35 |     Garbage collections per second.
36 | 
37 | ``heap_memory_percentage``
38 |     Percentage of heap memory used.
39 | 
40 | ``nonheap_memory_percentage``
41 |     Percentage of non-heap memory (e.g. permanent generation) used.
42 | 
43 | ``heartbeat_enabled``
44 |     Boolean indicating whether heartbeat.jsp is enabled (``true``) or not (``false``). If ``/heartbeat.jsp`` cannot be retrieved, the value is ``null``.
45 | 
46 | ``http_errors_per_sec``
47 |     Number of Tomcat HTTP errors per second (all 4xx and 5xx HTTP status codes).
48 | 
49 | ``jobs_enabled``
50 |     Boolean indicating whether jobs are enabled (``true``) or not (``false``). If ``/jobs.monitor`` cannot be retrieved, the value is ``null``.
51 | 
52 | ``requests_per_sec``
53 |     Number of HTTP/AJP requests per second.
54 | 
55 | ``threads``
56 |     Total number of threads.
57 | 
58 | ``time_per_request``
59 |     Average time in milliseconds per HTTP/AJP request.
60 | 


--------------------------------------------------------------------------------
/docs/user/notifications/opsgenie.rst:
--------------------------------------------------------------------------------
 1 | Opsgenie
 2 | --------
 3 | 
 4 | Notify `Opsgenie <https://www.opsgenie.com/>`_ of a new alert status. If alert is **active**, then a new opsgenie alert will be created. If alert is **inactive** then the alert will be closed.
 5 | 
 6 | 
 7 | .. py:function:: notify_opsgenie(message='', teams=None, per_entity=False, priority=None, include_alert=True, description='', custom_fields=None, **kwargs)
 8 | 
 9 |     Send notifications to Opsgenie.
10 | 
11 |     :param message: Alert message. If empty, then a message will be generated from the alert data.
12 |     :type message: str
13 | 
14 |     :param teams: Opsgenie teams to be notified. Value can be a single team or a list of teams.
15 |     :type teams: str | list
16 | 
17 |     :param per_entity: Send new alert per entity. This affects the ``alias`` value and impacts how de-duplication is handled in Opsgenie. Default is ``False``.
18 |     :type per_entity: bool
19 | 
20 |     :param priority: Set Opsgenie priority for this notification. Valid values are ``P1``, ``P2``, ``P3``, ``P4`` or ``P5``.
21 |     :type priority: str
22 | 
23 |     :param include_alert: Include alert data in alert body ``details``. Default is ``True``.
24 |     :type include_alert: bool
25 | 
26 |     :param include_captures: Include captures data in alert body ``details``. Default is ``False``.
27 |     :type include_captures: bool
28 | 
29 |     :param description: An optional description. If present, this is inserted into the opsgenie alert description field.
30 |     :type description: str
31 |     
32 |     :param custom_fields: If present, the given fields are added to the Opsgenie alert ``details`` field.
33 |     :type custom_fields: dict
34 | 
35 | 
36 |     Example:
37 | 
38 |     .. code-block:: python
39 | 
40 |         notify_opsgenie(teams=['zmon', 'ops'], message='Number of failed requests is too high!', include_alert=True)
41 | 
42 | 
43 | .. note::
44 | 
45 |     If ``priority`` is not set, then ZMON will set the priority according to the alert priority.
46 | 


--------------------------------------------------------------------------------
/docs/user/notifications/hipchat.rst:
--------------------------------------------------------------------------------
 1 | Hipchat
 2 | -------
 3 | 
 4 | Notify Hipchat room with alert status.
 5 | 
 6 | .. py:function:: send_hipchat(room=None, message=None, token=None, message_format='html', notify=False, color='red', link=False, link_text='go to alert')
 7 | 
 8 |     Send Hipchat notification to specified room.
 9 | 
10 |     :param room: Room to be notified.
11 |     :type room: str
12 | 
13 |     :param message: Message to be sent. If ``None``, then a message constructed from the alert will be sent.
14 |     :type message: str
15 | 
16 |     :param token: Hipchat API token.
17 |     :type token: str
18 |     
19 |     :param message_format: message format - ``html`` (default) or ``text`` (which correctly treats @mentions).
20 |     :type message_format: str    
21 | 
22 |     :param notify: Hipchat notify flag. Default is False.
23 |     :type notify: bool
24 | 
25 |     :param color: Message color. Default is ``red`` if alert is raised.
26 |     :type color: str
27 | 
28 |     :param link: Add link to Hipchat message. Default is ``False``.
29 |     :type link: bool
30 | 
31 |     :param link_text: if ``link`` param is ``True``, this will be displayed as a link in the hipchat message. Default is  ``go to alert``.
32 |     :type link_text: str
33 | 
34 | .. note::
35 | 
36 |     Message color will be determined based on alert status. If alert has ended, then ``color`` will be ``green``, otherwise ``color`` argument will be used.
37 | 
38 |     Example message - using html format (default):
39 | 
40 |     .. code-block:: json
41 | 
42 |         {
43 |             "message": "NEW ALERT: Requests failing with status 500 on host-production-1-entity",
44 |             "color": "red",
45 |             "notify": true
46 |         }
47 |         
48 |     Example message - using text format with @mention:
49 | 
50 |     .. code-block:: json
51 | 
52 |         {
53 |             "message": "@here NEW ALERT: Requests failing with status 500 on host-production-1-entity",
54 |             "color": "red",
55 |             "notify": true,
56 |             "message_format": "text"
57 |         }
58 |         
59 | 
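The color rule described in the note can be sketched as follows. This is a minimal illustration of the documented behavior, not the plugin's actual code; ``effective_color`` and its ``alert_active`` parameter are hypothetical names.

```python
def effective_color(alert_active, color="red"):
    """Return the message color for an alert state (hypothetical helper).

    An ended alert is always shown in green; otherwise the ``color``
    argument passed to the notification is used.
    """
    return color if alert_active else "green"


raised = effective_color(True, color="yellow")
ended = effective_color(False, color="yellow")
```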


--------------------------------------------------------------------------------
/docs/developer/zmon-python-client.rst:
--------------------------------------------------------------------------------
 1 | .. _zmon-python-client:
 2 | 
 3 | *************
 4 | Python Client
 5 | *************
 6 | 
 7 | ZMON provides a Python client library that can be imported and used in your own software.
 8 | 
 9 | Installation
10 | ------------
11 | 
12 | The ZMON Python client library is part of :ref:`ZMON CLI <zmon-cli>`.
13 | 
14 | .. code-block:: bash
15 | 
16 |     pip3 install --upgrade zmon-cli
17 | 
18 | Usage
19 | -----
20 | 
21 | Using the ZMON client is straightforward.
22 | 
23 | .. code-block:: python
24 | 
25 |     >>> from zmon_cli.client import Zmon
26 | 
27 |     >>> zmon = Zmon('https://zmon.example.org', token='123')
28 | 
29 |     >>> zmon.get_entity('entity-1')
30 |     {
31 |         'id': 'entity-1',
32 |         'team': 'ZMON',
33 |         'type': 'instance',
34 |         'data': {'host': '192.168.20.16', 'port': 8080, 'name': 'entity-1-instance'}
35 |     }
36 | 
37 |     >>> zmon.delete_entity('entity-102')
38 |     True
39 | 
40 |     >>> check = zmon.get_check_definition(123)
41 | 
42 |     >>> check['command']
43 |     "http('http://www.custom-service.example.org/health').code()"
44 | 
45 |     >>> check['command'] = "http('http://localhost:9090/health').code()"
46 | 
47 |     >>> zmon.update_check_definition(check)
48 |     {
49 |         'command': "http('http://localhost:9090/health').code()",
50 |         'description': 'Check service health',
51 |         'entities': [{'application_id': 'custom-service', 'type': 'instance'}],
52 |         'id': 123,
53 |         'interval': 60,
54 |         'last_modified_by': 'admin',
55 |         'name': 'Check service health',
56 |         'owning_team': 'ZMON',
57 |         'potential_analysis': None,
58 |         'potential_impact': None,
59 |         'potential_solution': None,
60 |         'source_url': None,
61 |         'status': 'ACTIVE',
62 |         'technical_details': None
63 |     }
64 | 
65 | Client
66 | ------
67 | 
68 | Exceptions
69 | ==========
70 | 
71 | .. autoclass:: zmon_cli.client.ZmonError
72 |     :members:
73 | 
74 | .. autoclass:: zmon_cli.client.ZmonArgumentError
75 |     :members:
76 | 
77 | Zmon
78 | ====
79 | 
80 | .. autoclass:: zmon_cli.client.Zmon
81 |     :members:
82 | 


--------------------------------------------------------------------------------
/docs/developer/redis.rst:
--------------------------------------------------------------------------------
 1 | ====================
 2 | Redis Data Structure
 3 | ====================
 4 | 
 5 | ZMON stores its primary working data in Redis. This page describes the Redis keys and data structures it uses.
 6 | 
 7 | Queues are Redis keys like ``zmon:queue:<NAME>`` of type "list", e.g. ``zmon:queue:default``.
 8 | 
 9 | New queue items are added by the ZMON Scheduler via the `Redis "rpush" command`_.
10 | 
11 | Important Redis key patterns are:
12 | 
13 | ``zmon:queue:<QUEUE-NAME>``
14 |     List of worker tasks for given queue.
15 | ``zmon:checks``
16 |     Set of all executed check IDs.
17 | ``zmon:checks:<CHECK-ID>``
18 |     Set of entity IDs having check results.
19 | ``zmon:checks:<CHECK-ID>:<ENTITY-ID>``
20 |     List of last N check results. The first list item contains the most recent check result.
21 |     Each check result is a JSON object with the keys ``ts`` (result timestamp), ``td`` (check duration), ``value`` (actual result value) and ``worker`` (ID of worker having produced the check result).
22 | ``zmon:alerts``
23 |     Set of all active alert IDs.
24 | ``zmon:alerts:<ALERT-ID>``
25 |     Set of entity IDs in alert state.
26 | ``zmon:alerts:<ALERT-ID>:entities``
27 |     Hash of entity IDs to alert captures. This hash contains *all* entity IDs matched by the alert, i.e. not only entities in alert state.
28 | ``zmon:alerts:<ALERT-ID>:<ENTITY-ID>``
29 |     Alert detail JSON containing alert start time, captures, worker, etc.
30 | ``zmon:downtimes``
31 |     Set of all alert IDs having downtimes.
32 | ``zmon:downtimes:<ALERT-ID>``
33 |     Set of all entity IDs having a downtime for this alert.
34 | ``zmon:downtimes:<ALERT-ID>:<ENTITY-ID>``
35 |     Hash of downtimes for this entity/alert. Each hash value is a JSON object with keys ``start_time``, ``end_time`` and ``comment``.
36 | ``zmon:active_downtimes``
37 |     Set of currently active downtimes. Each set item has the form ``<ALERT-ID>:<ENTITY-ID>:<DOWNTIME-ID>``.
38 | ``zmon:metrics``
39 |     Set of worker and scheduler IDs with metrics.
40 | ``zmon:metrics:<WORKER-OR-SCHEDULER-ID>:ts``
41 |     Timestamp of last worker or scheduler metrics update.
42 | ``zmon:metrics:<WORKER-OR-SCHEDULER-ID>:check.count``
43 |     Increasing counter of executed (or scheduled) checks.
44 | 
45 | .. _Redis "rpush" command: http://redis.io/commands/rpush
46 | 
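The key patterns above can be illustrated with a few small helpers. This is a sketch using only the standard library and no Redis connection; the helper names are hypothetical, the sample values are made up, and the downtime parser assumes that alert IDs and downtime IDs do not contain a colon (entity IDs may).

```python
import json


def check_results_key(check_id, entity_id):
    """Build the key of the result list for one check/entity pair."""
    return "zmon:checks:{}:{}".format(check_id, entity_id)


def parse_active_downtime(item):
    """Split an item of ``zmon:active_downtimes`` into its three parts.

    Assumes the first and last segments contain no ":" so that entity IDs
    with colons are kept intact.
    """
    alert_id, rest = item.split(":", 1)
    entity_id, downtime_id = rest.rsplit(":", 1)
    return alert_id, entity_id, downtime_id


# A check result as stored in the list: a JSON object with ts, td, value, worker.
raw_result = '{"ts": 1472032348.665247, "td": 0.00037, "value": 51.68, "worker": "plocal.zmon"}'
result = json.loads(raw_result)

key = check_results_key(4, "GLOBAL")
parts = parse_active_downtime("3:host-1:8080:abc123")
```

With a Redis client, the most recent result for the pair would be the first element of the list stored under ``key``.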


--------------------------------------------------------------------------------
/docs/user/check-ref/entities_wrapper.rst:
--------------------------------------------------------------------------------
 1 | Entities
 2 | --------
 3 | 
 4 | Provides access to ZMON entities.
 5 | 
 6 | .. py:function:: entities(service_url, infrastructure_account, verify=True, oauth2=False)
 7 | 
 8 |     Initialize entities wrapper.
 9 | 
10 |     :param service_url: Entities service url.
11 |     :type service_url: str
12 | 
13 |     :param infrastructure_account: Infrastructure account used to filter entities.
14 |     :type infrastructure_account: str
15 | 
16 |     :param verify: Verify SSL connection. Default is ``True``.
17 |     :type verify: bool
18 | 
19 |     :param oauth2: Use OAuth2 for authentication. Default is ``False``.
20 |     :type oauth2: bool
21 | 
22 | .. note::
23 | 
24 |     If `service_url` or `infrastructure_account` is not supplied, the corresponding value from the worker plugin config will be used.
25 | 
26 | 
27 | Methods of Entities
28 | ^^^^^^^^^^^^^^^^^^^
29 | 
30 | .. py:function:: search_local(**kwargs)
31 | 
32 |     Search entities in the local infrastructure account. If `infrastructure_account` is not supplied in kwargs, the search is restricted to entities "local" to your filtered entities by using the same `infrastructure_account` as the default filter.
33 | 
34 |     :param kwargs: Filtering kwargs
35 |     :type kwargs: str
36 | 
37 |     :return: Entities
38 |     :rtype: list
39 | 
40 |     Example searching all ``instance`` entities in local account:
41 | 
42 |     .. code-block:: python
43 | 
44 |         entities().search_local(type='instance')
45 | 
46 | 
47 | .. py:function:: search_all(**kwargs)
48 | 
49 |     Search all entities.
50 | 
51 |     :param kwargs: Filtering kwargs
52 |     :type kwargs: str
53 | 
54 |     :return: Entities
55 |     :rtype: list
56 | 
57 | 
58 | .. py:function:: alert_coverage(**kwargs)
59 | 
60 |     Return alert coverage for infrastructure_account.
61 | 
62 |     :param kwargs: Filtering kwargs
63 |     :type kwargs: str
64 | 
65 |     :return: Alert coverage result.
66 |     :rtype: list
67 | 
68 | 
69 |     .. code-block:: python
70 | 
71 |         entities().alert_coverage(type='instance', infrastructure_account='1052643')
72 | 
73 |         [
74 |             {
75 |                 'alerts': [],
76 |                 'entities': [
77 |                     {'id': 'app-1-instance', 'type': 'instance'}
78 |                 ]
79 |             }
80 |         ]
81 | 


--------------------------------------------------------------------------------
/docs/user/notifications/http.rst:
--------------------------------------------------------------------------------
 1 | HTTP
 2 | ----
 3 | 
 4 | Sends a notification by issuing an HTTP call to a given endpoint. The HTTP notification uses the ``POST`` method.
 5 | 
 6 | 
 7 | .. py:function:: notify_http(url=None, body=None, params=None, headers=None, timeout=5, oauth2=False, include_alert=True)
 8 | 
 9 |     Send HTTP notification to specified endpoint.
10 | 
11 |     :param url: HTTP endpoint URL. If not passed, the default URL from the worker configuration will be used.
12 |     :type url: str
13 | 
14 |     :param body: Request body.
15 |     :type body: dict
16 | 
17 |     :param params: Request URL params.
18 |     :type params: dict
19 | 
20 |     :param headers: HTTP headers.
21 |     :type headers: dict
22 | 
23 |     :param timeout: Request timeout. Default is 5 seconds.
24 |     :type timeout: int
25 | 
26 |     :param oauth2: Add OAUTH2 authentication headers. Default is False.
27 |     :type oauth2: bool
28 | 
29 |     :param include_alert: Include alert data in request body. Default is ``True``.
30 |     :type include_alert: bool
31 | 
32 |     Example:
33 | 
34 |     .. code-block:: python
35 | 
36 |         notify_http('https://some-notification-service/alert', body={'zmon': True}, headers={'X-TOKEN': 1234})
37 | 
38 | 
39 | .. note::
40 | 
41 |     If ``include_alert`` is ``True``, then request body will include alert data. This is usually useful, since it provides valuable info like ``is_alert`` and ``changed`` which can indicate whether the alert has **started** or **ended**.
42 | 
43 |     .. code-block:: json
44 | 
45 |         {
46 |             "body": null,
47 |             "alert": {
48 |                 "is_alert": true,
49 |                 "changed": true,
50 |                 "duration": 2.33,
51 |                 "captures": {},
52 |                 "entity": {"type": "GLOBAL", "id": "GLOBAL"},
53 |                 "worker": "plocal.zmon",
54 |                 "value": {"td": 0.00037, "worker": "plocal.zmon", "ts": 1472032348.665247, "value": 51.67797677979191},
55 |                 "alert_def": {
56 |                     "name": "Random Example Alert", "parameters": null, "check_id": 4, "entities_map": [], "responsible_team": "ZMON", "period": "", "priority": 1,
57 |                     "notifications": ["notify_http()"], "team": "ZMON", "id": 3, "condition": ">40"
58 |                 }
59 |             }
60 |         }
61 | 
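A receiving service could use ``is_alert`` and ``changed`` from the request body to decide whether an alert has started or ended. The following is a minimal sketch under the assumption that ``changed`` marks a state transition and ``is_alert`` carries the new state; ``alert_transition`` is a hypothetical helper on the receiving side, not part of ZMON.

```python
import json


def alert_transition(payload):
    """Classify a notification payload (hypothetical receiver-side helper).

    Assumption: ``changed`` is true only when the alert state flipped, and
    ``is_alert`` holds the state after the change.
    """
    alert = payload["alert"]
    if not alert["changed"]:
        return "unchanged"
    return "started" if alert["is_alert"] else "ended"


# Reduced version of the request body shown above.
body = json.loads('{"body": null, "alert": {"is_alert": true, "changed": true}}')
transition = alert_transition(body)
```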


--------------------------------------------------------------------------------
/docs/installation/components.rst:
--------------------------------------------------------------------------------
 1 | *************************
 2 | Essential ZMON Components
 3 | *************************
 4 | 
 5 | Using ZMON requires these four components: zmon-controller_, zmon-scheduler_, zmon-worker_, and zmon-eventlog-service_.
 6 | 
 7 | .. image:: ../images/components.svg
 8 | 
 9 | Controller
10 | ==========
11 | 
12 | zmon-controller_ runs ZMON's AngularJS frontend and serves as an endpoint for retrieving data and managing your ZMON deployment via REST API (with help from the command line client). It needs a connection configured to:
13 | 
14 |  * PostgreSQL to store/retrieve all kinds of data: entities, checks, dashboards, alerts
15 |  * Redis, to keep the state of ZMON's alerts
16 |  * KairosDB, if you want charts/Grafana
17 | 
18 | To provide a means of authentication and authorization, you can choose between the following options:
19 | 
20 |  * A basic credential file
21 |  * An OAuth2 identity provider, e.g., GitHub
22 | 
23 | Scheduler
24 | =========
25 | 
26 | zmon-scheduler_ is responsible for keeping track of all existing entities, checks and alerts and scheduling checks in time for applicable entities, which are then executed by the worker.
27 | 
28 | Needs connections to:
29 | 
30 |  * Redis, which serves ZMON as a task queue
31 |  * Controller, to get checks/alerts/entities
32 |  * Custom adapters might need connections for entity discovery in your platform
33 | 
34 | Worker
35 | ======
36 | 
37 | zmon-worker_ does the heavy lifting — executing tasks against entities and evaluating all alerts assigned to this check. Tasks are picked up from Redis and the resulting check value plus alert state changes are written back to Redis.
38 | 
39 | Needs connection to:
40 |  * Redis to retrieve tasks and update current state
41 |  * KairosDB if you want to have metrics
42 |  * EventLog service to store history events for alert state changes
43 | 
44 | EventLog Service
45 | ================
46 | 
47 | zmon-eventlog-service_ is our slim implementation of an event store, keeping track of Events related to alert state changes as well as events like alert and check modification by the user.
48 | 
49 | Needs connection to:
50 |  * PostgreSQL to store events using jsonb
51 | 
52 | .. _zmon-controller: https://github.com/zalando-zmon/zmon-controller
53 | .. _zmon-scheduler: https://github.com/zalando-zmon/zmon-scheduler
54 | .. _zmon-worker: https://github.com/zalando-zmon/zmon-worker
55 | .. _zmon-eventlog-service: https://github.com/zalando-zmon/zmon-eventlog-service
56 | 


--------------------------------------------------------------------------------
/docs/apendix/glossary.rst:
--------------------------------------------------------------------------------
 1 | .. _glossary:
 2 | 
 3 | ********
 4 | Glossary
 5 | ********
 6 | 
 7 | .. KEEP IN ALPHABETICAL ORDER!
 8 | 
 9 | .. glossary::
10 | 
11 |     alert condition
12 |         Python expression defining the "threshold" when to trigger an alert. See :ref:`alert-condition`.
13 | 
14 |     alert definition
15 |         Alert definitions define when to trigger an alert and for which entity.
16 |         See :ref:`alert-definitions`
17 | 
18 |     check command
19 |         Python expression defining the value of a check. See :ref:`check-commands`.
20 | 
21 |     check definition
22 |         A check definition provides a source of data for alerts to monitor. See :ref:`check-definitions`
23 | 
24 |     dashboard
25 |         A dashboard is the main monitoring page of ZMON and consists of widgets and the list of active alerts.
26 |         See :ref:`dashboards`
27 | 
28 |     downtime
29 |         In ZMON, downtime refers to a period of time during which certain alerts/entities should not be triggered.
30 |         One use case for downtimes is scheduled maintenance work. See :ref:`downtimes`
31 | 
32 |     entity
33 |         Entities are "objects" to be monitored. Entities can be hosts, Zomcat instances, but they can also be more abstract things like app domains.
34 |         See :ref:`entities`
35 | 
36 |     JSON
37 |         JavaScript Object Notation. A minimal data interchange format. You probably already know it. If you don't, there's good documentation on its `official page <http://json.org/>`_.
38 | 
39 |     Markdown
40 |         A simple markup language that can mostly pass for plain text. There's an `introduction <http://daringfireball.net/projects/markdown/basics>`_ and a `syntax reference <http://daringfireball.net/projects/markdown/syntax>`_ on its official page.
41 | 
42 |     time period
43 |         Alert definition's time period can restrict its active alerting to certain time frames. This allows for alerts to be active e.g. only during work hours.
44 |         See :ref:`time-periods`
45 | 
46 |     YAML
47 |         Not actually Yet Another Markup Language. A powerful but succinct data interchange format. This document should be sufficient to learn how to use YAML in ZMON. In case it isn't, the `Wikipedia entry on YAML <http://en.wikipedia.org/wiki/Yaml>`_ is actually slightly more useful than the `official documentation <http://yaml.org/spec/1.1/#id857168>`_.
48 | 
49 |         Note that YAML is a strict superset of :term:`JSON`. That is, wherever YAML is required, JSON can be used instead.
50 | 
51 | .. KEEP IN ALPHABETICAL ORDER!
52 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/datapipeline_wrapper.rst:
--------------------------------------------------------------------------------
 1 | .. _datapipeline:
 2 | 
 3 | Data Pipeline
 4 | -------------
 5 | 
 6 | If running on AWS you can use ``datapipeline()`` to access AWS Data Pipelines' health easily.
 7 | 
 8 | .. py:function:: datapipeline(region=None)
 9 | 
10 |     Initialize Data Pipeline wrapper.
11 | 
12 |     :param region: AWS region for Data Pipeline queries, e.g. "eu-west-1". Defaults to the region in which the check is being executed. Note that Data Pipeline is not available in "eu-central-1" at time of writing.
13 |     :type region: str
14 | 
15 | 
16 | Methods of Data Pipeline
17 | ^^^^^^^^^^^^^^^^^^^^^^^^
18 | .. py:method:: get_details(pipeline_ids)
19 | 
20 |     Query AWS Data Pipeline IDs supplied as a String (single) or list of Strings (multiple).
21 |     Return a dict of ID(s) and status dicts as described in `describe_pipelines boto documentation`_.
22 | 
23 |     :param pipeline_ids: Data Pipeline IDs. Example ``df-0123456789ABCDEFGHI``
24 |     :type pipeline_ids: Union[str, list]
25 |     :rtype: dict
26 | 
27 |     Example query with single Data Pipeline ID supplied in a list:
28 | 
29 |     .. code-block:: python
30 | 
31 |         datapipeline().get_details(pipeline_ids=['df-exampleA'])
32 |         {
33 |             "df-exampleA": {
34 |                 "@lastActivationTime": "2018-01-30T14:23:52",
35 |                 "pipelineCreator": "ABCDEF:auser",
36 |                 "@scheduledPeriod": "24 hours",
37 |                 "@accountId": "0123456789",
38 |                 "name": "exampleA",
39 |                 "@latestRunTime": "2018-01-04T03:00:00",
40 |                 "@id": "df-0441325MB6VYFI6MUU1",
41 |                 "@healthStatusUpdatedTime": "2018-01-01T10:00:00",
42 |                 "@creationTime": "2018-01-01T10:00:00",
43 |                 "@userId": "0123456789",
44 |                 "@sphere": "PIPELINE",
45 |                 "@nextRunTime": "2018-01-05T03:00:00",
46 |                 "@scheduledStartTime": "2018-01-02T03:00:00",
47 |                 "@healthStatus": "HEALTHY",
48 |                 "uniqueId": "exampleA",
49 |                 "*tags": "[{\"key\":\"DataPipelineName\",\"value\":\"exampleA\"},{\"key\":\"DataPipelineId\",\"value\":\"df-exampleA\"}]",
50 |                 "@version": "2",
51 |                 "@firstActivationTime": "2018-01-01T10:00:00",
52 |                 "@pipelineState": "SCHEDULED"
53 |             }
54 |         }
55 | 
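A result shaped like the one above can be reduced to a list of unhealthy pipelines. The following is a minimal sketch, not part of the wrapper: the helper name ``unhealthy_pipelines`` and the sample data (including the ``ERROR`` status) are assumptions made for illustration.

```python
def unhealthy_pipelines(details):
    """Return the sorted IDs of pipelines whose "@healthStatus" is not "HEALTHY".

    `details` is assumed to be a dict shaped like the get_details() result:
    pipeline ID -> status dict.
    """
    return sorted(
        pipeline_id
        for pipeline_id, status in details.items()
        if status.get("@healthStatus") != "HEALTHY"
    )


# Hand-written sample data (trimmed); not real output.
sample = {
    "df-exampleA": {"@healthStatus": "HEALTHY", "@pipelineState": "SCHEDULED"},
    "df-exampleB": {"@healthStatus": "ERROR", "@pipelineState": "SCHEDULED"},
}

print(unhealthy_pipelines(sample))
```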
56 | .. _describe_pipelines boto documentation: http://boto3.readthedocs.io/en/latest/reference/services/datapipeline.html#DataPipeline.Client.describe_pipelines
57 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/memcached_wrapper.rst:
--------------------------------------------------------------------------------
 1 | Memcached
 2 | ---------
 3 | 
 4 | Read-only access to memcached servers is provided by the :py:func:`memcached` function.
 5 | 
 6 | 
 7 | .. py:function:: memcached([host=some.host], [port=11211])
 8 | 
 9 |     Returns a connection to the Memcached server at :samp:`{<host>}:{<port>}`, where :samp:`{<host>}` is the value
10 |     of the current entity's ``host`` attribute, and :samp:`{<port>}` is the given port (default ``11211``). See
11 |     below for a list of methods provided by the returned connection object.
12 | 
13 | 
14 | Methods of the Memcached Connection
15 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
16 | 
17 | The object returned by the :py:func:`memcached` function provides the following methods:
18 | 
19 | .. py:method:: get(key)
20 | 
21 |     Returns the string stored at `key`. If `key` does not exist an error is raised.
22 | 
23 |     ::
24 | 
25 |         memcached().get("example_memcached_key")
26 | 
27 | 
28 | .. py:method:: json(key)
29 | 
30 |     Returns the value stored at `key` parsed as JSON, i.e. you can store a JSON object as
31 |     the value of the key and get a dict back.
32 | 
33 |     ::
34 | 
35 |         memcached().json("example_memcached_key")
36 | 
37 | 
38 | 
39 | .. py:method:: stats([extra_keys=[STR,STR]])
40 | 
41 |     Returns a ``dict`` with general Memcached statistics such as memory usage and operations/s.
42 |     All values are extracted using the `Memcached STATS command`_.
43 | 
44 |     Any `extra_keys` are returned as well, as reported by the memcached server's `stats`
45 |     command, e.g. `version` or `uptime`.
46 | 
47 |     Example result:
48 | 
49 | .. code-block:: json
50 | 
51 |     {
52 |         "incr_hits_per_sec": 0,
53 |         "incr_misses_per_sec": 0,
54 |         "touch_misses_per_sec": 0,
55 |         "decr_misses_per_sec": 0,
56 |         "touch_hits_per_sec": 0,
57 |         "get_expired_per_sec": 0,
58 |         "get_hits_per_sec": 100.01,
59 |         "cmd_get_per_sec": 119.98,
60 |         "cas_hits_per_sec": 0,
61 |         "cas_badval_per_sec": 0,
62 |         "delete_misses_per_sec": 0,
63 |         "bytes_read_per_sec": 6571.76,
64 |         "auth_errors_per_sec": 0,
65 |         "cmd_set_per_sec": 19.97,
66 |         "bytes_written_per_sec": 6309.17,
67 |         "get_flushed_per_sec": 0,
68 |         "delete_hits_per_sec": 0,
69 |         "cmd_flush_per_sec": 0,
70 |         "curr_items": 37217768,
71 |         "decr_hits_per_sec": 0,
72 |         "connections_per_sec": 0.02,
73 |         "cas_misses_per_sec": 0,
74 |         "cmd_touch_per_sec": 0,
75 |         "bytes": 3902170728,
76 |         "evictions_per_sec": 0,
77 |         "auth_cmds_per_sec": 0,
78 |         "get_misses_per_sec": 19.97
79 |     }
80 | 
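A common use of these statistics is a cache hit ratio. The sketch below is an illustration only: the helper ``get_hit_ratio`` is hypothetical, and it assumes the result contains the ``get_hits_per_sec`` and ``cmd_get_per_sec`` keys shown in the example result.

```python
def get_hit_ratio(stats):
    """Return the fraction of GET commands that were hits, or None without traffic."""
    gets = stats.get("cmd_get_per_sec", 0)
    if not gets:
        # No GET commands in the interval, so a ratio is undefined.
        return None
    return stats.get("get_hits_per_sec", 0) / gets


# Values copied from the example result above.
sample = {
    "get_hits_per_sec": 100.01,
    "get_misses_per_sec": 19.97,
    "cmd_get_per_sec": 119.98,
}

ratio = get_hit_ratio(sample)
print(round(ratio, 2))
```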
81 | 
82 | .. _Memcached documentation: https://lzone.de/cheat-sheet/memcached
83 | .. _Memcached STATS command: https://lzone.de/cheat-sheet/memcached#stats
84 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/history_wrapper.rst:
--------------------------------------------------------------------------------
 1 | History
 2 | --------
 3 | 
 4 | Wrapper for KairosDB to access history data about checks.
 5 | 
 6 | 
 7 | .. py:function:: history(url=None, check_id='', entities=None, oauth2=False)
 8 | 
 9 | 
10 | Methods of History
11 | ^^^^^^^^^^^^^^^^^^^
12 | 
13 | .. py:function:: result(time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)
14 | 
15 |     Return query result.
16 | 
17 |     :param time_from: Relative start time in seconds. Default is ``ONE_WEEK_AND_5MIN``.
18 |     :type time_from: int
19 | 
20 |     :param time_to: Relative end time in seconds. Default is ``ONE_WEEK``.
21 |     :type time_to: int
22 | 
23 |     :return: JSON result
24 |     :rtype: dict
25 | 
26 | .. py:function:: get_one(time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)
27 | 
28 |     Return first result values.
29 | 
30 |     :param time_from: Relative start time in seconds. Default is ``ONE_WEEK_AND_5MIN``.
31 |     :type time_from: int
32 | 
33 |     :param time_to: Relative end time in seconds. Default is ``ONE_WEEK``.
34 |     :type time_to: int
35 | 
36 |     :return: List of values
37 |     :rtype: list
38 | 
39 | .. py:function:: get_aggregated(key, aggregator, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)
40 | 
41 |     Return the values of the first result. If no result matches the ``key`` filter, an empty list is returned.
42 | 
43 |     :param key: Tag key used in filtering the results.
44 |     :type key: str
45 | 
46 |     :param aggregator: Aggregator used in query (e.g. 'avg').
47 |     :type aggregator: str
48 | 
49 |     :param time_from: Relative start time in seconds. Default is ``ONE_WEEK_AND_5MIN``.
50 |     :type time_from: int
51 | 
52 |     :param time_to: Relative end time in seconds. Default is ``ONE_WEEK``.
53 |     :type time_to: int
54 | 
55 |     :return: List of values
56 |     :rtype: list
57 | 
58 | .. py:function:: get_avg(key, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)
59 | 
60 |     Return aggregated average.
61 | 
62 |     :param key: Tag key used in filtering the results.
63 |     :type key: str
64 | 
65 |     :param time_from: Relative start time in seconds. Default is ``ONE_WEEK_AND_5MIN``.
66 |     :type time_from: int
67 | 
68 |     :param time_to: Relative end time in seconds. Default is ``ONE_WEEK``.
69 |     :type time_to: int
70 | 
71 |     :return: List of values
72 |     :rtype: list
73 | 
74 | .. py:function:: get_std_dev(key, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)
75 | 
76 |     Return aggregated standard deviation.
77 | 
78 |     :param key: Tag key used in filtering the results.
79 |     :type key: str
80 | 
81 |     :param time_from: Relative start time in seconds. Default is ``ONE_WEEK_AND_5MIN``.
82 |     :type time_from: int
83 | 
84 |     :param time_to: Relative end time in seconds. Default is ``ONE_WEEK``.
85 |     :type time_to: int
86 | 
87 |     :return: List of values
88 |     :rtype: list
89 | 
90 | .. py:function:: distance(weeks=4, snap_to_bin=True, bin_size='1h', dict_extractor_path='')
91 | 
92 |     For detailed docs on the distance function, please see :ref:`History distance functionality <history-distance-label>`.
93 | 


--------------------------------------------------------------------------------
/docs/user/entities.rst:
--------------------------------------------------------------------------------
 1 | .. _entities:
 2 | 
 3 | ********
 4 | Entities
 5 | ********
 6 | 
 7 | Entities describe what you want to monitor in your infrastructure.
 8 | This can be as basic as a host, with its attributes hostname and IP; or something more complex, like a PostgreSQL sharded cluster with its identifier and set of connection strings.
 9 | 
10 | ZMON gives you two options for automation in/integration with your platform: storing entities via zmon-controller_'s entity service, or discovering them via the adapters in zmon-scheduler_.
11 | At Zalando we use both, connecting ZMON to tools like our CMDB but also pushing entities via REST API.
12 | 
13 | ZMON's entity service describes entities with a single JSON document.
14 | 
15 | - Any entity must contain an ID that is unique within your ZMON deployment. We often use a pattern like ``<hostname>(:<port>)`` to create uniqueness at the host and application levels, but this is up to you.
16 | - Any entity must contain a type which describes the kind of entity, like an object class.
17 | 
18 | At check execution time, ZMON binds entity properties as default values to the functions executed, e.g. the IP is used for relative ``http()`` requests.
19 | 
20 | Format
21 | ------
22 | 
23 | Generally, a ZMON entity is a set of properties that can be represented as a multi-level dictionary. For example:
24 | 
25 | .. code-block:: json
26 | 
27 |     {
28 |         "id":"arbitrary_entity_id",
29 |         "type":"some_type",
30 |         "oneMoreProperty":"foo",
31 |         "nestedProperty": {
32 |             "subProperty1": "foo",
33 |             "subProperty2": "bar"
34 |         }
35 |     }
36 | Two notes to keep in mind:
37 | 
38 | 1. ``id`` and ``type`` properties are **mandatory**.
39 | 2. ZMON filtering (e.g. in ZMON UI) **does not support nested properties**.
40 | 
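The two notes above can be checked before pushing an entity. This is a minimal sketch under stated assumptions: ``missing_mandatory`` and ``top_level_properties`` are hypothetical helper names, not ZMON functions, and the nested-property check only mirrors the rule that filters do not see nested properties.

```python
MANDATORY = ("id", "type")


def missing_mandatory(entity):
    """Return the mandatory property names that are absent or empty."""
    return [name for name in MANDATORY if not entity.get(name)]


def top_level_properties(entity):
    """Return only the non-nested properties, i.e. the ones usable in filters."""
    return {k: v for k, v in entity.items() if not isinstance(v, dict)}


entity = {
    "id": "arbitrary_entity_id",
    "type": "some_type",
    "oneMoreProperty": "foo",
    "nestedProperty": {"subProperty1": "foo", "subProperty2": "bar"},
}

print(missing_mandatory(entity))
print(sorted(top_level_properties(entity)))
```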
41 | 
42 | Examples
43 | --------
44 | 
45 | When working with the Vagrant Box, you can use the scheduler instance entity like this:
46 | 
47 | .. code-block:: json
48 | 
49 |     {
50 |         "id":"localhost:3421",
51 |         "type":"instance",
52 |         "host":"localhost",
53 |         "project":"zmon-scheduler-ng",
54 |         "ports": {"3421":3421}
55 |     }
56 | 
57 | Here, you can use the "ports" dictionary to also describe additional open ports.
58 | As with Spring Boot, a second port is usually added, exposing management features.
59 | 
60 | Now let's look at an example of a PostgreSQL instance:
61 | 
62 | .. code-block:: json
63 | 
64 |     {
65 |         "id":"localhost:5432",
66 |         "type":"database",
67 |         "name":"zmon-cluster",
68 |         "shards": {"zmon":"localhost:5432/local_zmon_db"}
69 |     }
70 | 
71 | The ``shards`` property is used because of how ZMON's worker exposes PostgreSQL clusters to the ``sql()`` function.
72 | 
73 | View more examples here_.
74 | 
75 | If you'd like to create an entity yourself, check the `ZMON CLI tool`_.
76 | 
77 | .. _zmon-controller: https://github.com/zalando-zmon/zmon-controller
78 | .. _zmon-scheduler: https://github.com/zalando-zmon/zmon-scheduler
79 | .. _here: https://github.com/zalando-zmon/zmon-demo/tree/master/bootstrap/entities
80 | .. _ZMON CLI tool: https://docs.zmon.io/en/latest/developer/zmon-cli.html#entities
81 | 


--------------------------------------------------------------------------------
/docs/user/tv-login.rst:
--------------------------------------------------------------------------------
 1 | .. _tv-login:
 2 | 
 3 | *************************
 4 | "Read Only" Display Login
 5 | *************************
 6 | 
 7 | The ZMON front end requires users to log in.
 8 | However, a very common way of deploying dashboards is on TV screens across office spaces, e.g. to render Grafana or ZMON dashboards.
 9 | For this, ZMON provides a way to log in as a read-only authenticated user via one-time tokens.
10 | 
11 | Any real user can create these tokens, either by logging in first and switching to TV mode or via the ZMON CLI.
12 | 
13 | How does it work
14 | ================
15 | 
16 | The first time a valid one-time token is used to log in, we associate a random UUID and the device IP with it.
17 | Both are registered within ZMON to create a persisted session, so the login continues to work after the frontend is redeployed.
18 | 
19 | Tokens can't be reused: once a token has been used, you need to create a new one. You'll need a different token for each additional
20 | device or location. One-time token sessions last up to 365 days.
21 | 
22 | 
23 | Using the menu option
24 | +++++++++++++++++++++
25 | 
26 | First you need to log in using your own personal credentials or Single Sign-On mechanism. After logging in you can use the top right
27 | drop-down menu with your username to reveal the "Switch to TV mode" option.
28 | 
29 | .. image:: /images/switch-tv-mode.png
30 | 
31 | Clicking this option will replace your login session with a new session using a newly created one-time token, but your personal session
32 | will still be valid! You must log out before leaving the device unattended.
33 | 
34 | A pop-up dialog will ask you to take action. If you decide to log out, a new tab will open to log you out. You can safely
35 | close this tab after a successful logout and return to ZMON, which will now be in TV mode.
36 | 
37 | For more information on the Logout URL, please check :doc:`/installation/configuration`.
38 | 
39 | .. image:: /images/tv-mode-logout-dialog.png
40 | 
41 | You'll be able to confirm by checking the username in the drop-down menu where your username used to be present. There will be a new username with
42 | the pattern "ZMON_TV_123abc".
43 | 
44 | .. image:: /images/tv-mode.png
45 | 
46 | After this you can leave the device safely unattended. TV mode allows only read access to ZMON.
47 | 
48 | Using the ZMON CLI
49 | ++++++++++++++++++
50 | 
51 | You can also generate one-time tokens using the command line tool. The tool also allows you to list the tokens you have already generated.
52 | 
53 | Getting a token
54 | ===============
55 | 
56 | .. code-block:: bash
57 | 
58 |     zmon onetime-token get
59 | 
60 |     Retrieving new one-time token ...
61 |     https://zmon.example.org/tv/AocciOWf/
62 |     OK
63 | 
64 | 
65 | Login with token
66 | ================
67 | 
68 | Use the URL in the target browser to log in directly. This will create a read-only session.
69 | 
70 | .. code-block:: bash
71 | 
72 |     https://<your zmon url>/tv/<your token>
73 | 
74 | .. note::
75 | 
76 |     Please make sure you access the generated URL in order to log in. Appending the ``<token>`` to the URL of any other ZMON device or location won't work.
77 | 
78 | Listing existing tokens
79 | =======================
80 | 
81 | .. code-block:: bash
82 | 
83 |     zmon onetime-token list
84 | 
85 |     - bound_at: 2008-05-08 12:16:21.696000
86 |       bound_expires: 1234567800000
87 |       bound_ip: ''
88 |       created: 2008-05-08 12:16:20.533000
89 |       token: 1234abCD
90 | 


--------------------------------------------------------------------------------
/docs/user/alert-definition-parameters.rst:
--------------------------------------------------------------------------------
 1 | .. _alert-definition-parameters:
 2 | 
 3 | 
 4 | Alert Definition Parameters
 5 | ---------------------------
 6 | 
 7 | Alert definition *parameters* allow you to decouple the alert condition from the constants used inside it.
 8 | 
 9 | Use Case: Technical alert condition
10 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
11 | 
12 | If your alert condition is highly technical with a lot of Python code in it, it often makes sense to split the actual calculation from the threshold values and move such constant values into parameters.
13 | 
14 | The same may apply to alert definitions that are created by technical staff and later need to be adjusted by non-technical people. If you split the calculation from the variable definitions, non-technical people can change the values without touching the calculation logic.
15 | 
16 | Use Case: Same alert, different priorities
17 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
18 | 
19 | Another use case where we recommend using parameters is when you need the same alert to come up with a different priority depending on threshold values.
20 | 
21 | In such a case, refer to :ref:`alert inheritance <alert-definition-inheritance>` for configuring inherited alerts.
22 | 
23 | The proposed structure would look like this:
24 | 
25 | * Base alert "A" with alert condition and parameters, check *template* box
26 | * Alert "B1" inherits from "A" specifying *priority* RED and associated parameter values
27 | * Alert "B2" inherits from "A" specifying *priority* YELLOW and associated parameter values
28 | 
29 | An example: Setting a simple parameter in trial run
30 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
31 | 
32 | In the zmon2 web interface, click on the trial run button.
33 | 
34 | 1. In the **Check Command** text box enter::
35 | 
36 |     normalvariate(50, 20)
37 | 
38 | This draws a float from a normal distribution with mean 50 and standard deviation 20, so the value is above 50.0 about half of the time, which makes it good for testing.
39 | 
40 | 2. In the **Alert Condition** enter::
41 | 
42 |     value>capture(threshold=threshold) + len(capture(params=params))
43 | 
44 | 3. In the **Parameters** selector enter two values (by clicking the plus sign):
45 | 
46 |     +------------+------------+-----------+
47 |     | Name       | Value      | Type      |
48 |     +============+============+===========+
49 |     | threshold  | 50.0       | Float     |
50 |     +------------+------------+-----------+
51 |     | anything   | Kartoffel  | String    |
52 |     +------------+------------+-----------+
53 | 
54 | 4. In the **Entity Filter** text box enter:
55 | 
56 |     .. code-block:: json
57 | 
58 |         [
59 |             {
60 |                 "type": "GLOBAL"
61 |             }
62 |         ]
63 | 
64 | 5. In the **Interval** enter: 10
65 | 
66 | If you run this trial you may get an alert or an 'OK', but the interesting part is the **Captures** column.
67 | It shows how the parameters you entered are evaluated in the alert condition with the values you provided.
68 | Notice also that there is a special parameter called **params** that holds a dict with all the parameters you entered. This lets you iterate over all the parameters and make conditional decisions, providing a kind of introspection capability, but it is intended for advanced users only.
69 | 
70 | Last but not least: *most of the time you don't need to capture the parameter values*. We captured them here so you can see that the parameters are evaluated. You can run exactly the same check with this **Alert Condition**::
71 | 
72 |     value>threshold + len(params)
73 | 
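To see what the condition above evaluates to, the trial run can be imitated in plain Python. This is a rough sketch outside ZMON: the function ``alert_condition`` is hypothetical, and it assumes ``params`` holds both entered parameters, so ``len(params)`` is 2 and the effective threshold is 52.0.

```python
import random


def alert_condition(value, threshold, params):
    """Mirror the alert condition: value > threshold + len(params)."""
    return value > threshold + len(params)


# The two parameters entered in the trial run above.
params = {"threshold": 50.0, "anything": "Kartoffel"}

# Stand-in for the check command normalvariate(50, 20).
value = random.normalvariate(50, 20)

is_alert = alert_condition(value, params["threshold"], params)
print(value, is_alert)
```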


--------------------------------------------------------------------------------
/docs/user/grafana.rst:
--------------------------------------------------------------------------------
  1 | .. _grafana:
  2 | 
  3 | *********************
  4 | Grafana3 and KairosDB
  5 | *********************
  6 | 
  7 | Grafana is a powerful open-source tool for creating dashboards to visualize metric data.
  8 | ZMON deploys Grafana 3.x along with the new KairosDB plugin to read metric data from KairosDB.
  9 | Grafana is served directly from the ZMON controller.
 10 | Read requests are proxied through the controller so as not to expose the write/delete API from KairosDB.
 11 | Dashboards are also saved via the controller, so there's no need for any additional data store.
 12 | 
 13 |   http://grafana.org
 14 | 
 15 |  Example of latency and requests charted via Grafana:
 16 | 
 17 |  .. image:: /images/grafana-example1.png
 18 | 
 19 | Check data
 20 | ==========
 21 | 
 22 | Workers will send all their data to KairosDB. Depending on the KairosDB setting, data is stored forever or you may set a TTL in KairosDB. ZMON will not clean up or roll up any data.
 23 | 
 24 | Serialization
 25 | -------------
 26 | 
 27 | In the simplest case you would have a check producing a single numeric value.
 28 | In Zalando's experience this is very rare.
 29 | 
 30 | ZMON also supports arbitrarily nested dictionaries of numeric values.
 31 | Anything that is not a dictionary or a number will be silently dropped.
 32 | The value is flattened into a single-level dictionary such that the elements can be stored in KairosDB (key-value storage).
 33 | 
 34 | .. code-block:: json
 35 | 
 36 |     {
 37 |         "load": {"1min":1,"5min":3,"15min":2},
 38 |         "memory_free": 16000
 39 |     }
 40 | 
 41 | This will be flattened to the equivalent of:
 42 | 
 43 | .. code-block:: json
 44 | 
 45 |     {
 46 |         "load.1min": 1,
 47 |         "load.5min": 3,
 48 |         "load.15min": 2,
 49 |         "memory_free": 16000
 50 |     }
 51 | 
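The flattening described above can be sketched in a few lines of Python. This is an illustration of the rule, not ZMON's actual implementation: the function name ``flatten`` is hypothetical, and treating booleans as non-numeric is an assumption.

```python
from numbers import Number


def flatten(value, prefix=""):
    """Flatten nested dicts of numbers into a single-level dict with dotted keys."""
    flat = {}
    for key, item in value.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(item, dict):
            # Recurse into nested dictionaries, extending the dotted key.
            flat.update(flatten(item, name))
        elif isinstance(item, Number) and not isinstance(item, bool):
            # Assumption: booleans are not treated as numeric values.
            flat[name] = item
        # Anything else (strings, lists, None) is silently dropped.
    return flat


check_value = {
    "load": {"1min": 1, "5min": 3, "15min": 2},
    "memory_free": 16000,
}

print(flatten(check_value))
```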
 52 | You might also want to output a list. The simple workaround is to generate a dictionary whose
 53 | keys are some identifier extracted from the elements.
 54 | 
 55 | For example, transform this list:
 56 | 
 57 | .. code-block:: json
 58 | 
 59 |     {
 60 |       "partitions": [
 61 |         {
 62 |           "count": 2254839,
 63 |           "partition": "0",
 64 |           "stream_id": "55491eb8-3ccc-40c5-b7c6-69bf38df3e16"
 65 |         },
 66 |         {
 67 |           "count": 2029956,
 68 |           "partition": "1",
 69 |           "stream_id": "aa938451-d115-4e90-a5da-1ac4b435a4e9"
 70 |         },
 71 | 
 72 | into the following dictionary:
 73 | 
 74 | .. code-block:: json
 75 | 
 76 |     {
 77 |       "partitions": {
 78 |         "0": {
 79 |           "count": 2254839,
 80 |           "partition": "0",
 81 |           "stream_id": "55491eb8-3ccc-40c5-b7c6-69bf38df3e16"
 82 |         },
 83 |         "1": {
 84 |           "count": 2029956,
 85 |           "partition": "1",
 86 |           "stream_id": "aa938451-d115-4e90-a5da-1ac4b435a4e9"
 87 |         },
 88 | 
 89 | This dictionary is flattened like any other value and stored as follows (remember that strings are dropped):
 90 | 
 91 | .. code-block:: json
 92 | 
 93 |     {
 94 |         "partitions.0.count": 2254839,
 95 |         "partitions.1.count": 2029956
 96 |     }
 97 | 
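The list-to-dictionary workaround can be done inside the check command itself. The following sketch assumes a hypothetical helper ``index_by`` and uses the ``partition`` field as the identifier; any field that is unique per element would work.

```python
def index_by(items, key):
    """Turn a list of dicts into a dict keyed by the given field of each element."""
    return {str(item[key]): item for item in items}


partitions = [
    {"count": 2254839, "partition": "0",
     "stream_id": "55491eb8-3ccc-40c5-b7c6-69bf38df3e16"},
    {"count": 2029956, "partition": "1",
     "stream_id": "aa938451-d115-4e90-a5da-1ac4b435a4e9"},
]

# The value a check would return instead of the raw list.
value = {"partitions": index_by(partitions, "partition")}

print(sorted(value["partitions"]))
```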
 98 | Tagging
 99 | -------
100 | 
101 | KairosDB creates time series with a name and allows us to tag data points with additional (tagname, tagvalue) pairs.
102 | 
103 | ZMON stores all data for a single check in a time series named: "zmon.check.<checkid>".
104 | 
105 | Single data points are then tagged as follows to describe their contents:
106 | 
107 |  * entity: entity instance id (some character replace rules are applied)
108 |  * key: containing the dict key after serialization of check value (see above)
109 |  * metric: contains the last segment of "key" split by "." (making selection easier in tooling)
110 |  * application: the application label attribute of the entity
111 | 
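The naming and tagging scheme above can be illustrated as follows. This is a sketch only: ``series_name`` and ``build_tags`` are hypothetical helpers, the entity ID is passed through unchanged because the character replacement rules are not described here, and omitting the ``application`` tag when it is unset is an assumption.

```python
def series_name(check_id):
    """Name of the time series holding all data of one check."""
    return f"zmon.check.{check_id}"


def build_tags(entity_id, key, application=None):
    """Build the tags for one data point of a flattened check value."""
    tags = {
        "entity": entity_id,  # replacement rules for special characters not applied
        "key": key,
        "metric": key.split(".")[-1],  # last segment of the dotted key
    }
    if application:
        tags["application"] = application
    return tags


print(series_name(1234))
print(build_tags("host01", "partitions.0.count", application="my-app"))
```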


--------------------------------------------------------------------------------
/docs/developer/zmon-cli.rst:
--------------------------------------------------------------------------------
  1 | .. _zmon-cli:
  2 | 
  3 | *******************
  4 | Command Line Client
  5 | *******************
  6 | 
  7 | The command line client makes your life easier when interacting with the REST API. The ZMON scheduler will refresh modified data (checks, alerts, entities) every 60 seconds.
  8 | 
  9 | Installation
 10 | ------------
 11 | 
 12 | .. code-block:: bash
 13 | 
 14 |   pip3 install --upgrade zmon-cli
 15 | 
 16 | Configuration
 17 | ^^^^^^^^^^^^^
 18 | 
 19 | Configure your ZMON CLI by running ``configure``:
 20 | 
 21 | .. code-block:: bash
 22 | 
 23 |   zmon configure
 24 | 
 25 | Authentication
 26 | ^^^^^^^^^^^^^^
 27 | 
 28 | The ZMON CLI tool must authenticate against ZMON. Internally it uses zign to obtain an access token, but you can override that behaviour by exporting the variable ``ZMON_TOKEN``.
 29 | 
 30 | .. code-block:: bash
 31 | 
 32 |   export ZMON_TOKEN=myfancytoken
 33 | 
 34 | If you are using GitHub for authentication, have an unprivileged personal access token ready.
 35 | 
 36 | Entities
 37 | --------
 38 | .. _cli-entities:
 39 | 
 40 | Create or update
 41 | ^^^^^^^^^^^^^^^^
 42 | Pushing entities with the zmon cli is as easy as:
 43 | 
 44 | .. code-block:: bash
 45 | 
 46 |   zmon entities push \
 47 |     '{"id":"localhost:3421","type":"instance","name":"zmon-scheduler-ng","host":"localhost","ports":{"3421":3421}}'
 48 | 
 49 | Existing entities with the same ID will be updated.
 50 | 
 51 | The client also supports loading data from .json and .yaml files. Both may contain a list for creating or updating many entities at once.
 52 | 
 53 | .. code-block:: bash
 54 | 
 55 |   zmon entities push your-entities.yaml
 56 | 
 57 | 
 58 | .. Note::
 59 |     Creating an entity of type GLOBAL is not allowed. GLOBAL as an entity type is reserved for ZMON's internal use.
 60 | 
 61 | 
 62 | .. Tip::
 63 | 
 64 |     All commands and subcommands can be abbreviated, i.e. the following lines are equivalent:
 65 | 
 66 |         .. code-block:: bash
 67 | 
 68 |            $ zmon entities push my-data.yaml
 69 |            $ zmon ent pu my-data.yaml
 70 | 
 71 | Search and filter
 72 | ^^^^^^^^^^^^^^^^^
 73 | 
 74 | Show all entities:
 75 | 
 76 | .. code-block:: bash
 77 | 
 78 |   zmon entities
 79 | 
 80 | Filter by type "instance":
 81 | 
 82 | .. code-block:: bash
 83 | 
 84 |   zmon entities filter type instance
 85 | 
 86 | 
 87 | Check Definitions
 88 | -----------------
 89 | .. _cli-cd:
 90 | 
 91 | Initializing
 92 | ^^^^^^^^^^^^
 93 | 
 94 | When starting from scratch use:
 95 | 
 96 | .. code-block:: bash
 97 | 
 98 |   zmon check-definition init your-new-check.yaml
 99 | 
100 | 
101 | Get
102 | ^^^
103 | 
104 | Retrieve an existing check definition as YAML.
105 | 
106 | .. code-block:: bash
107 | 
108 |   zmon check-definition get 1234
109 | 
110 | Create and Update
111 | ^^^^^^^^^^^^^^^^^
112 | 
113 | Create or update a check from a file. An existing check with the same "owning_team" and "name" will be updated.
114 | 
115 | .. code-block:: bash
116 | 
117 |   zmon check-definition update your-check.yaml
118 | 
119 | Alert Definitions
120 | -----------------
121 | 
122 | Similar to check definitions, you can also manage your alert definitions via the ZMON CLI.
123 | 
124 | Keep in mind that for alerts the same constraints apply as in the UI. For creating/modifying an alert you need to be a member of the team selected for "team" (unlike the responsible team).
125 | 
126 | Init
127 | ^^^^
128 | 
129 | .. code-block:: bash
130 | 
131 |   zmon alert-definition init your-new-alert.yaml
132 | 
133 | Create
134 | ^^^^^^
135 | 
136 | .. code-block:: bash
137 | 
138 |   zmon alert-definition create your-new-alert.yaml
139 | 
140 | Get
141 | ^^^
142 | 
143 | .. code-block:: bash
144 | 
145 |   zmon alert-definition get 1999
146 | 
147 | Update
148 | ^^^^^^
149 | 
150 | .. code-block:: bash
151 | 
152 |   zmon alert-definition update host-load-5.yaml
153 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/s3_wrapper.rst:
--------------------------------------------------------------------------------
  1 | S3
  2 | ---
  3 | 
  4 | Allows data to be pulled from S3 Objects.
  5 | 
  6 | 
  7 | .. py:function:: s3()
  8 | 
  9 | 
 10 | Methods of S3
 11 | ^^^^^^^^^^^^^^
 12 | 
 13 | .. py:function:: get_object_metadata(bucket_name, key)
 14 | 
 15 |     Get the metadata associated with the given ``bucket_name`` and ``key``. The metadata allows you to check for the
 16 |     existence of the key within the bucket and to check how large the object is without reading the whole object into
 17 |     memory.
 18 | 
 19 |     :param bucket_name: the name of the S3 Bucket
 20 |     :param key: the key that identifies the S3 Object within the S3 Bucket
 21 |     :return: an ``S3ObjectMetadata`` object
 22 | 
 23 |     .. py:class:: S3ObjectMetadata
 24 | 
 25 |         .. py:method:: exists()
 26 | 
 27 |             Will return True if the object exists.
 28 | 
 29 |         .. py:method:: size()
 30 | 
 31 |             Returns the size in bytes for the object. Will return -1 for objects that do not exist.
 32 | 
 33 |     Example usage:
 34 | 
 35 |     .. code-block:: python
 36 | 
 37 |         s3().get_object_metadata('my bucket', 'mykeypart1/mykeypart2').exists()
 38 |         s3().get_object_metadata('my bucket', 'mykeypart1/mykeypart2').size()
 39 | 
 40 | 
 41 | .. py:function:: get_object(bucket_name, key)
 42 | 
 43 |     Get the S3 Object associated with the given ``bucket_name`` and ``key``. This method will cause the object to be
 44 |     read into memory.
 45 | 
 46 |     :param bucket_name: the name of the S3 Bucket
 47 |     :param key: the key that identifies the S3 Object within the S3 Bucket
 48 |     :return: an ``S3Object`` object
 49 | 
 50 |     .. py:class:: S3Object
 51 | 
 52 |         .. py:method:: text()
 53 | 
 54 |             Get the S3 Object data
 55 | 
 56 |         .. py:method:: json()
 57 | 
 58 |             If the object exists, parse the object as JSON.
 59 | 
 60 |             :return: a dict containing the parsed JSON or None if the object does not exist.
 61 | 
 62 |         .. py:method:: exists()
 63 | 
 64 |             Will return True if the object exists.
 65 | 
 66 |         .. py:method:: size()
 67 | 
 68 |             Returns the size in bytes for the object. Will return -1 for objects that do not exist.
 69 | 
 70 |     Example usage:
 71 | 
 72 |     .. code-block:: python
 73 | 
 74 |         s3().get_object('my bucket', 'mykeypart1/my_text_doc.txt').text()
 75 | 
 76 |         s3().get_object('my bucket', 'mykeypart1/my_json_doc.json').json()
 77 | 
 78 | 
 79 | .. py:function:: list_bucket(bucket_name, prefix, max_items=100, recursive=True)
 80 | 
 81 |     List the S3 Objects in the given ``bucket_name`` whose keys match ``prefix``.
 82 |     A single listing request returns at most 1000 keys, so pagination is used internally to overcome this limit.
 83 | 
 84 |     :param bucket_name: the name of the S3 Bucket
 85 |     :param prefix: the prefix to search under
 86 |     :param max_items: the maximum number of objects to list.  Defaults to 100.
 87 |     :param recursive: if the listing should contain deeply nested keys. Defaults to True.
 88 |     :return: an ``S3FileList`` object
 89 | 
 90 |     .. py:class:: S3FileList
 91 | 
 92 |         .. py:method:: files()
 93 | 
 94 |             Returns a list of dicts like
 95 | 
 96 |             .. code-block:: json
 97 | 
 98 |                {
 99 |                    "file_name": "foo",
100 |                    "size": 12345,
101 |                    "last_modified": "2017-07-17T01:01:21Z"
102 |                }
103 | 
104 |     Example usage:
105 | 
106 |     .. code-block:: python
107 | 
108 |        s3().list_bucket('my bucket', 'some_prefix').files()
109 | 
110 |        files = s3().list_bucket('my bucket', 'some_prefix', 10000).files()  # for listing a lot of keys
111 |        last_modified = files[0]["last_modified"].isoformat()  # returns a string that can be passed to time()
112 |        age = time() - time(last_modified) 
113 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/scalyr_wrapper.rst:
--------------------------------------------------------------------------------
  1 | Scalyr
  2 | ------
  3 | 
  4 | Wrapper
  5 | ^^^^^^^
  6 | 
  7 | The ``scalyr()`` wrapper enables querying Scalyr from your AWS worker if the credentials have been specified for the worker instance(s).
  8 | For more description of each type of query, please refer to https://www.scalyr.com/help/api .
  9 | 
 10 | Default parameters:
 11 | 
 12 | * ``minutes`` specifies the start time of the query. E.g. "5" means 5 minutes ago.
 13 | * ``end`` specifies the end time of the query. E.g. "2" means until 2 minutes ago. If set to ``None``, the end is set to 24h after ``minutes``. The default "0" means *now*.
 14 | * For ``minutes`` and ``end`` you can also specify absolute times like "2017-10-11T10:45:00+0800".
 15 | 
 16 | .. py:method:: count(query, minutes=5, end=0)
 17 | 
 18 |     Run a count query against Scalyr. Depending on the number of queries, you may run into rate limits.
 19 | 
 20 | 
 21 |     ::
 22 | 
 23 |         scalyr().count(' ERROR ')
 24 | 
 25 | 
 26 | .. py:method:: timeseries(query, minutes=30, end=0)
 27 | 
 28 |     Runs a timeseries query against Scalyr with more generous rate limits. (New time series are created on the fly by Scalyr)
 29 | 
 30 | 
 31 | .. py:method:: facets(filter, field, max_count=5, minutes=30, end=0)
 32 | 
 33 |     This method is used to retrieve the most common values for a field.
 34 | 
 35 | 
 36 | .. py:method:: logs(query, max_count=100, minutes=5, continuation_token=None, columns=None, end=0)
 37 | 
 38 |     Runs a query against Scalyr and returns the logs that match it. At most ``max_count`` log lines are returned.
 39 |     To fetch more, run the same query again and pass the ``continuation_token`` from the last response to the
 40 |     ``logs`` method.
 41 | 
 42 |     Specific columns (as defined in the Scalyr parser) can be returned using the ``columns`` list, e.g. ``columns=['severity','threadName','timestamp']``.
 43 |     If ``columns`` is unspecified, only the message column is returned.
 44 | 
 45 |     An example logs result as JSON:
 46 | 
 47 |     .. code-block:: json
 48 | 
 49 |         {
 50 |             "messages": [
 51 |                "message line 1",
 52 |                "message line 2"
 53 |             ],
 54 |             "continuation_token": "a token"
 55 |         }
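
The paging behaviour described above can be sketched in plain Python. This is a minimal sketch, not part of the wrapper: ``collect_messages`` and ``fake_fetch`` are hypothetical names, and it assumes that the last page carries no (or an empty) ``continuation_token``. Inside a check, the ``fetch_page`` callable could wrap ``scalyr().logs(query, continuation_token=token)``.

```python
def collect_messages(fetch_page, max_pages=5):
    """Collect log messages across pages.

    fetch_page is a hypothetical callable that takes a continuation token
    (or None for the first page) and returns a dict shaped like the
    example logs result above.
    """
    messages = []
    token = None
    for _ in range(max_pages):  # hard cap so a check cannot loop forever
        page = fetch_page(token)
        messages.extend(page.get("messages", []))
        token = page.get("continuation_token")
        if not token:  # assumption: no token means there are no more pages
            break
    return messages


# Stand-in for scalyr().logs(...), so the sketch runs without a worker.
_PAGES = {
    None: {"messages": ["message line 1", "message line 2"], "continuation_token": "t1"},
    "t1": {"messages": ["message line 3"], "continuation_token": None},
}


def fake_fetch(token):
    return _PAGES[token]


all_messages = collect_messages(fake_fetch)
```

Capping the number of pages keeps the number of Scalyr requests per check run bounded, which matters given the rate limits mentioned for the other query types.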
 56 | 
 57 | 
 58 | .. py:method:: power_query(query, minutes=5, end=0)
 59 | 
 60 |     Runs a power query against Scalyr and returns the results. You can also create and test power queries in the `UI <https://eu.scalyr.com/query>`__. More information on power queries can be found `here <https://eu.scalyr.com/help/power-queries>`__.
 61 | 
 62 |     An example response as JSON:
 63 | 
 64 |     .. code-block:: json
 65 | 
 66 |         {
 67 |             "columns": [
 68 |                 {
 69 |                     "name": "cluster"
 70 |                 },
 71 |                 {
 72 |                     "name": "application"
 73 |                 },
 74 |                 {
 75 |                     "name": "volume"
 76 |                 }
 77 |             ],
 78 |             "warnings": [],
 79 |             "values": [
 80 |                 [
 81 |                     "cluster-1-eu-central-1:kube-1",
 82 |                     "application-2",
 83 |                     9481810.0
 84 |                 ],
 85 |                 [
 86 |                     "cluster-2-eu-central-1:kube-1",
 87 |                     "application-1",
 88 |                     8109726.0
 89 |                 ]
 90 |             ],
 91 |             "matchingEvents": 8123.0,
 92 |             "status": "success",
 93 |             "omittedEvents": 0.0
 94 |         }
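
A response of this shape is often easier to use in an alert condition once each row is keyed by column name. The following is a minimal sketch under that assumption; ``power_query_rows`` is a hypothetical helper, not a wrapper method, and the sample data is a trimmed copy of the example response above.

```python
def power_query_rows(response):
    """Turn the "columns"/"values" layout into a list of dicts."""
    names = [column["name"] for column in response["columns"]]
    return [dict(zip(names, row)) for row in response["values"]]


# Trimmed copy of the example response shown above.
response = {
    "columns": [{"name": "cluster"}, {"name": "application"}, {"name": "volume"}],
    "values": [
        ["cluster-1-eu-central-1:kube-1", "application-2", 9481810.0],
        ["cluster-2-eu-central-1:kube-1", "application-1", 8109726.0],
    ],
    "status": "success",
}

rows = power_query_rows(response)
largest = max(rows, key=lambda row: row["volume"])
```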
 95 | 
 96 | 
 97 | Custom Scalyr Region
 98 | ^^^^^^^^^^^^^^^^^^^^
 99 | 
100 | By default, the Scalyr wrapper uses the https://www.scalyr.com/ region. To use the Europe environment at https://eu.scalyr.com/ instead, pass ``scalyr(scalyr_region='eu')``.
101 | 
102 | 
103 |     ::
104 | 
105 |         scalyr(scalyr_region='eu').count(' ERROR ')
106 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/kairosdb_wrapper.rst:
--------------------------------------------------------------------------------
  1 | .. _check-kairosdb:
  2 | 
  3 | KairosDB
  4 | --------
  5 | 
  6 | Provides read access to the target KairosDB.
  7 | 
  8 | 
  9 | .. py:function:: kairosdb(url, oauth2=False)
 10 | 
 11 | 
 12 | Methods of KairosDB
 13 | ^^^^^^^^^^^^^^^^^^^
 14 | 
 15 | .. py:function:: query(name, group_by=None, tags=None, start=5, end=0, time_unit='minutes', aggregators=None, start_absolute=None, end_absolute=None)
 16 | 
 17 |     Query kairosdb.
 18 | 
 19 |     :param name: Metric name.
 20 |     :type name: str
 21 | 
 22 |     :param group_by: List of fields to group by.
 23 |     :type group_by: list
 24 | 
 25 |     :param tags: Filtering tags. Example of `tags` object:
 26 | 
 27 |         .. code-block:: python
 28 | 
 29 |             {
 30 |                 "key": ["max"]
 31 |             }
 32 | 
 33 |     :type tags: dict
 34 | 
 35 |     :param start: Relative start time. Default is 5. Should be greater than or equal to 1.
 36 |     :type start: int
 37 | 
 38 |     :param end: End time. Default is 0. If not 0, then it should be greater than or equal to 1.
 39 |     :type end: int
 40 | 
 41 |     :param time_unit: Time unit ('seconds', 'minutes', 'hours'). Default is 'minutes'.
 42 |     :type time_unit: str
 43 | 
 44 |     :param aggregators: List of aggregators. Aggregator is an object that looks like
 45 | 
 46 |         .. code-block:: python
 47 | 
 48 |             {
 49 |                 "name": "max",
 50 |                 "sampling": {
 51 |                     "value": "1",
 52 |                     "unit": "minutes"
 53 |                 },
 54 |                 "align_sampling": True
 55 |             }
 56 | 
 57 |     :type aggregators: list
 58 | 
 59 |     :param start_absolute: Absolute start time in milliseconds, overrides the start parameter which is relative
 60 |     :type start_absolute: long
 61 | 
 62 |     :param end_absolute: Absolute end time in milliseconds, overrides the end parameter which is relative
 63 |     :type end_absolute: long
 64 | 
 65 |     :return: Result queries.
 66 |     :rtype: dict
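
Because ``start_absolute`` and ``end_absolute`` are expressed in milliseconds, a small conversion step is usually needed. The sketch below is plain Python 3 that runs outside of ZMON; ``to_epoch_millis`` is a hypothetical helper, and interpreting the values as milliseconds since the Unix epoch is an assumption based on the parameter descriptions above.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(moment.timestamp() * 1000)


# A fixed 30 minute window, used purely as an illustration.
end = datetime(2017, 10, 11, 10, 45, tzinfo=timezone.utc)
start = end - timedelta(minutes=30)

window = {
    "start_absolute": to_epoch_millis(start),
    "end_absolute": to_epoch_millis(end),
}
```

The resulting ``window`` dict could then be passed as keyword arguments, e.g. ``query(name, **window)``.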
 67 | 
 68 | 
 69 | .. py:function:: query_batch(metrics, start=5, end=0, time_unit='minutes', start_absolute=None, end_absolute=None)
 70 | 
 71 |     Query KairosDB for several metrics at once.
 72 | 
 73 |     :param metrics: list of KairosDB metric queries, one query per metric name, e.g.
 74 | 
 75 |         .. code-block:: python
 76 | 
 77 |             [
 78 |                 {
 79 |                     'name': 'metric_name',      # name of the metric
 80 |                     'group_by': ['foo'],        # list of fields to group by
 81 |                     'aggregators': [            # list of aggregator objects
 82 |                         {                       # structure of a single aggregator
 83 |                             'name': 'max',
 84 |                             'sampling': {
 85 |                                 'value': '1',
 86 |                                 'unit': 'minutes'
 87 |                             },
 88 |                             'align_sampling': True
 89 |                         }
 90 |                     ],
 91 |                     'tags': {                   # dict with filtering tags
 92 |                         'key': ['max']          # a key is a tag name, list of values is used to filter
 93 |                                                 # all the records with given tag and given values
 94 |                     }
 95 |                 }
 96 |             ]
 97 | 
 98 |     :type metrics: list
 99 | 
100 |     :param start: Relative start time. Default is 5.
101 |     :type start: int
102 | 
103 |     :param end: End time. Default is 0.
104 |     :type end: int
105 | 
106 |     :param time_unit: Time unit ('seconds', 'minutes', 'hours'). Default is 'minutes'.
107 |     :type time_unit: str
108 | 
109 |     :param start_absolute: Absolute start time in milliseconds, overrides the start parameter which is relative
110 |     :type start_absolute: long
111 | 
112 |     :return: Array of results for each queried metric
113 |     :rtype: list
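
When several metrics share the same aggregator layout, the ``metrics`` list can be assembled with a small helper. This is a sketch only: ``metric_query`` is a hypothetical function, and it simply reproduces the structure documented above with a single aggregator per metric.

```python
def metric_query(name, group_by, tags, aggregator="max", sampling_minutes=1):
    """Build one entry of the ``metrics`` list in the documented layout."""
    return {
        "name": name,
        "group_by": list(group_by),
        "aggregators": [
            {
                "name": aggregator,
                "sampling": {"value": str(sampling_minutes), "unit": "minutes"},
                "align_sampling": True,
            }
        ],
        "tags": dict(tags),
    }


metrics = [
    metric_query("metric_name", ["foo"], {"key": ["max"]}),
    # Second metric name and tag values are made up for illustration.
    metric_query("other_metric", ["foo"], {"key": ["avg"]}, aggregator="avg", sampling_minutes=5),
]
```

The list would then be passed as the first argument, e.g. ``kairosdb(url).query_batch(metrics)``.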
114 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/snmp_wrapper.rst:
--------------------------------------------------------------------------------
  1 | SNMP
  2 | ----
  3 | 
  4 | Provides a wrapper for the SNMP functions listed below. SNMP checks require
  5 | specifying hosts in the entities filter. The partial object ``snmp()`` accepts a
  6 | ``timeout=seconds`` parameter; the default timeout is 5 seconds. **NOTE**: this timeout
  7 | is per answer, so multiple answers add up and may block the whole check.
  8 | 
  9 | .. py:method:: memory()
 10 | 
 11 |     ::
 12 | 
 13 |         snmp().memory()
 14 | 
 15 |     Returns host's memory usage statistics. All values are in KiB (1024 Bytes).
 16 | 
 17 |     Example check result as JSON:
 18 | 
 19 |     .. code-block:: json
 20 | 
 21 |         {
 22 |             "ram_buffer": 359404,
 23 |             "ram_cache": 6478944,
 24 |             "ram_free": 20963524,
 25 |             "ram_shared": 0,
 26 |             "ram_total": 37066332,
 27 |             "ram_total_free": 22963392,
 28 |             "swap_free": 1999868,
 29 |             "swap_min": 16000,
 30 |             "swap_total": 1999868
 31 |         }
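
A check result like this is typically reduced to a single number before alerting. The sketch below shows one way to do that; ``memory_used_percent`` is a hypothetical helper, and treating buffers and cache as reclaimable memory is an assumption rather than something the wrapper prescribes.

```python
def memory_used_percent(memory):
    """Share of RAM in use, counting buffers and cache as reclaimable."""
    total = memory["ram_total"]
    reclaimable = memory["ram_free"] + memory["ram_buffer"] + memory["ram_cache"]
    return round(100.0 * (total - reclaimable) / total, 1)


# Values copied from the example check result above (all in KiB).
memory = {
    "ram_buffer": 359404,
    "ram_cache": 6478944,
    "ram_free": 20963524,
    "ram_total": 37066332,
    "swap_free": 1999868,
    "swap_total": 1999868,
}

used_percent = memory_used_percent(memory)
swap_in_use = memory["swap_total"] - memory["swap_free"]
```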
 32 | 
 33 | .. py:method:: load()
 34 | 
 35 |     ::
 36 | 
 37 |         snmp().load()
 38 | 
 39 |     Returns host's CPU load average (1 minute, 5 minute and 15 minute averages).
 40 | 
 41 |     Example check result as JSON:
 42 | 
 43 |     .. code-block:: json
 44 | 
 45 |         {"load1": 0.95, "load5": 0.69, "load15": 0.72}
 46 | 
 47 | .. py:method:: cpu()
 48 | 
 49 |     ::
 50 | 
 51 |         snmp().cpu()
 52 | 
 53 |     Returns host's CPU usage in percent.
 54 | 
 55 |     Example check result as JSON:
 56 | 
 57 |     .. code-block:: json
 58 | 
 59 |         {"cpu_system": 0, "cpu_user": 17, "cpu_idle": 81}
 60 | 
 61 | 
 62 | .. py:method:: df()
 63 | 
 64 |     ::
 65 | 
 66 |         snmp().df()
 67 | 
 68 |     Example check result as JSON:
 69 | 
 70 |     .. code-block:: json
 71 | 
 72 |         {
 73 |             "/data/postgres-wal-nfs-example": {
 74 |                 "available_space": 524287840,
 75 |                 "device": "example0-2-stp-123:/vol/example_pgwal",
 76 |                 "percentage_inodes_used": 0,
 77 |                 "percentage_space_used": 0,
 78 |                 "total_size": 524288000,
 79 |                 "used_space": 160
 80 |             }
 81 |         }
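
Since the result is keyed by mount point, an alert usually needs to scan all entries. A minimal sketch follows, assuming a result shaped like the example above; ``full_filesystems`` is a hypothetical helper and the ``/var/log`` entry is made-up sample data.

```python
def full_filesystems(df_result, threshold=90):
    """Return mount points whose space or inode usage reaches the threshold."""
    return sorted(
        mount
        for mount, stats in df_result.items()
        if stats["percentage_space_used"] >= threshold
        or stats["percentage_inodes_used"] >= threshold
    )


df_result = {
    "/data/postgres-wal-nfs-example": {
        "percentage_inodes_used": 0,
        "percentage_space_used": 0,
    },
    # Hypothetical second mount point, added only to exercise the filter.
    "/var/log": {
        "percentage_inodes_used": 12,
        "percentage_space_used": 95,
    },
}

nearly_full = full_filesystems(df_result)
```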
 82 | 
 83 | .. py:method:: logmatch()
 84 | 
 85 |     ::
 86 | 
 87 |         snmp().logmatch()
 88 | 
 89 | .. py:method:: interfaces()
 90 | 
 91 |     ::
 92 | 
 93 |         snmp().interfaces()
 94 | 
 95 |     Example check result as JSON:
 96 | 
 97 |     .. code-block:: json
 98 | 
 99 |         {
100 |             "lo": {
101 |                 "in_octets": 63481918397415,
102 |                 "in_discards": 11,
103 |                 "adStatus": 1,
104 |                 "out_octets": 63481918397415,
105 |                 "opStatus": 1,
106 |                 "out_discards": 0,
107 |                 "speed": "10",
108 |                 "in_error": 0,
109 |                 "out_error": 0
110 |             },
111 |             "eth1": {
112 |                 "in_octets": 55238870608924,
113 |                 "in_discards": 8344,
114 |                 "adStatus": 1,
115 |                 "out_octets": 6801703429894,
116 |                 "opStatus": 1,
117 |                 "out_discards": 0,
118 |                 "speed": "10000",
119 |                 "in_error": 0,
120 |                 "out_error": 0
121 |             },
122 |             "eth0": {
123 |                 "in_octets": 3538944286327,
124 |                 "in_discards": 1130,
125 |                 "adStatus": 1,
126 |                 "out_octets": 16706789573119,
127 |                 "opStatus": 1,
128 |                 "out_discards": 0,
129 |                 "speed": "10000",
130 |                 "in_error": 0,
131 |                 "out_error": 0
132 |             }
133 |         }
134 | 
135 | .. py:method:: get()
136 | 
137 |     ::
138 | 
139 |         snmp().get('iso.3.6.1.4.1.42253.1.2.3.1.4.7.47.98.105.110.47.115.104', 'stunnel', int)
140 | 
141 |     Example check result as JSON:
142 | 
143 |     .. code-block:: json
144 | 
145 |         {
146 |             "stunnel": 0
147 |         }
148 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/elastic_search_wrapper.rst:
--------------------------------------------------------------------------------
  1 | Elasticsearch
  2 | -------------
  3 | 
  4 | Provides search queries and health checks against an Elasticsearch cluster.
  5 | 
  6 | 
  7 | .. py:function:: elasticsearch(url=None, timeout=10, oauth2=False)
  8 | 
  9 | .. note::
 10 | 
 11 |     If ``url`` is **None**, then the plugin will use the default Elasticsearch cluster set in worker configuration.
 12 | 
 13 | Methods of Elasticsearch
 14 | ^^^^^^^^^^^^^^^^^^^^^^^^
 15 | 
 16 | .. py:function:: search(indices=None, q='', body=None, source=True, size=DEFAULT_SIZE)
 17 | 
 18 |     Searches the ES cluster using URI search or request body search. If ``body`` is ``None``, a GET request is used.
 19 | 
 20 |     :param indices: List of indices to search. Limited to only 10 indices. ['_all'] will search all available
 21 |                     indices, which effectively leads to same results as `None`. Indices can accept wildcard form.
 22 |     :type indices: list
 23 | 
 24 |     :param q: Search query string. Will be ignored if ``body`` is not None.
 25 |     :type  q: str
 26 | 
 27 |     :param body: Dict holding an ES query DSL.
 28 |     :type body: dict
 29 | 
 30 |     :param source: Whether to include `_source` field in query response.
 31 |     :type source: bool
 32 | 
 33 |     :param size: Number of hits to return. Maximum value is 1000. Set to 0 if interested in hits count only.
 34 |     :type size: int
 35 | 
 36 |     :return: ES query result.
 37 |     :rtype: dict
 38 | 
 39 |     Example query:
 40 | 
 41 |     .. code-block:: python
 42 | 
 43 |         elasticsearch('http://es-cluster').search(indices=['logstash-*'], q='client:192.168.20.* AND http_status:500', size=0, source=False)
 44 | 
 45 |         {
 46 |             "_shards": {
 47 |                 "failed": 0,
 48 |                 "successful": 5,
 49 |                 "total": 5
 50 |             },
 51 |             "hits": {
 52 |                 "hits": [],
 53 |                 "max_score": 0.0,
 54 |                 "total": 1
 55 |             },
 56 |             "timed_out": false,
 57 |             "took": 2
 58 |         }
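
For alerting on the number of hits, only ``hits.total`` is needed. The sketch below extracts it with a hypothetical ``total_hits`` helper. It also accepts the object form (``{"value": ..., "relation": ...}``) that Elasticsearch 7 and later return, which is not shown in the example above.

```python
def total_hits(result):
    """Return the hit count from a search result."""
    total = result["hits"]["total"]
    if isinstance(total, dict):  # Elasticsearch 7+ reports an object here
        return total["value"]
    return total


# Shape of the example result above.
legacy_result = {"hits": {"hits": [], "max_score": 0.0, "total": 1}}
# Newer shape, added for illustration.
newer_result = {"hits": {"hits": [], "total": {"value": 3, "relation": "eq"}}}

counts = (total_hits(legacy_result), total_hits(newer_result))
```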
 59 | 
 60 | .. py:function:: count(indices=None, q='', body=None)
 61 | 
 62 |     Return ES count of matching query.
 63 | 
 64 |     :param indices: List of indices to search. Limited to only 10 indices. ['_all'] will search all available
 65 |                     indices, which effectively leads to same results as `None`. Indices can accept wildcard form.
 66 |     :type indices: list
 67 | 
 68 |     :param q: Search query string. Will be ignored if ``body`` is not None.
 69 |     :type  q: str
 70 | 
 71 |     :param body: Dict holding an ES query DSL.
 72 |     :type body: dict
 73 | 
 74 |     :return: ES query result.
 75 |     :rtype: dict
 76 | 
 77 |     Example query:
 78 | 
 79 |     .. code-block:: python
 80 | 
 81 |         elasticsearch('http://es-cluster').count(indices=['logstash-*'], q='client:192.168.20.* AND http_status:500')
 82 | 
 83 |         {
 84 |             "_shards": {
 85 |                 "failed": 0,
 86 |                 "successful": 16,
 87 |                 "total": 16
 88 |             },
 89 |             "count": 12
 90 |         }
 91 | 
 92 | .. py:method:: health()
 93 | 
 94 |     Return ES cluster health.
 95 | 
 96 |     :return: Cluster health result.
 97 |     :rtype: dict
 98 | 
 99 |     .. code-block:: python
100 | 
101 |         elasticsearch('http://es-cluster').health()
102 | 
103 |         {
104 |             "active_primary_shards": 11,
105 |             "active_shards": 11,
106 |             "active_shards_percent_as_number": 50.0,
107 |             "cluster_name": "big-logs-cluster",
108 |             "delayed_unassigned_shards": 0,
109 |             "initializing_shards": 0,
110 |             "number_of_data_nodes": 1,
111 |             "number_of_in_flight_fetch": 0,
112 |             "number_of_nodes": 1,
113 |             "number_of_pending_tasks": 0,
114 |             "relocating_shards": 0,
115 |             "status": "yellow",
116 |             "task_max_waiting_in_queue_millis": 0,
117 |             "timed_out": false,
118 |             "unassigned_shards": 11
119 |         }
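
Alert conditions compare numbers more easily than strings, so the ``status`` field can be mapped to a severity first. This is a sketch with hypothetical names (``SEVERITY``, ``health_severity``); treating an unknown status as the worst case is a design choice, not wrapper behaviour.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def health_severity(health):
    """Map the cluster status to a number; unknown statuses count as worst."""
    return SEVERITY.get(health.get("status"), 2)


# Fields taken from the example health result above.
health = {"status": "yellow", "unassigned_shards": 11, "active_shards": 11}

severity = health_severity(health)
```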
120 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/http_wrapper.rst:
--------------------------------------------------------------------------------
  1 | HTTP
  2 | ----
  3 | 
  4 | Access to HTTP (and HTTPS) endpoints is provided by the :py:func:`http` function.
  5 | 
  6 | .. py:function:: http(url, [method='GET'], [timeout=10], [max_retries=0], [verify=True], [oauth2=False], [oauth2_token_name='uid'], [allow_redirects=None], [headers=None])
  7 | 
  8 |     :param str url: The URL that is to be queried. See below for details.
  9 |     :param str method: The HTTP request method. Allowed values are ``GET`` or ``HEAD``.
 10 |     :param float timeout: The timeout for the HTTP request, in seconds. Defaults to :py:obj:`10`.
 11 |     :param int max_retries: The number of times the HTTP request should be retried if it fails. Defaults to :py:obj:`0`.
 12 |     :param bool verify: Can be set to :py:obj:`False` to disable SSL certificate verification.
 13 |     :param bool oauth2: Can be set to :py:obj:`True` to inject an OAuth 2 ``Bearer`` access token in the outgoing request.
 14 |     :param str oauth2_token_name: The name of the OAuth 2 token. Default is ``uid``.
 15 |     :param bool allow_redirects: Follow request redirects. If ``None`` then it will be set to :py:obj:`True` in case of ``GET`` and :py:obj:`False` in case of ``HEAD`` request.
 16 |     :param dict headers: The headers to be used in the HTTP request.
 17 |     :return: An object encapsulating the response from the server. See below.
 18 | 
 19 |         For checks on entities that define the attributes :py:attr:`url` or :py:attr:`host`, the given URL may be relative. In that case, the URL :samp:`http://<{value}><{url}>` is queried, where :samp:`<{value}>` is the value of that attribute, and :samp:`<{url}>` is the URL passed to this function. If an entity defines both :py:attr:`url` and :py:attr:`host`, the former is used.
 20 | 
 21 |     This function cannot query URLs using a scheme other than HTTP or HTTPS; URLs that do not start with :samp:`http://` or :samp:`https://` are considered to be relative.
 22 | 
 23 |     Example:
 24 | 
 25 |         .. code-block:: python
 26 | 
 27 |             http('http://www.example.org/data?fetch=json').json()
 28 | 
 29 |             # Avoid raising an error when the response has an error status (e.g. 500 or 503)
 30 |             # but you are still interested in the response JSON.
 31 |             http('http://www.example.org/data?fetch=json').json(raise_error=False)
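
The relative URL rule described above can be illustrated in isolation. The following is a sketch of that rule as stated in the text, not the wrapper's actual code; ``resolve_url`` is a hypothetical function and the entity dicts are made-up sample data.

```python
def resolve_url(entity, url):
    """Apply the documented rule for absolute and relative URLs."""
    if url.startswith(("http://", "https://")):
        return url  # absolute URLs are queried as given
    # Relative URL: prefer the entity's "url" attribute over "host".
    value = entity.get("url") or entity.get("host")
    if not value:
        raise ValueError("relative URL needs an entity with 'url' or 'host'")
    return "http://{}{}".format(value, url)


host_only = {"host": "app01.example.org"}
both = {"host": "app01.example.org", "url": "app01.example.org:8080"}

resolved = resolve_url(host_only, "/heartbeat.jsp")
```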
 32 | 
 33 | 
 34 | HTTP Responses
 35 | ^^^^^^^^^^^^^^
 36 | 
 37 | The object returned by the :py:func:`http` function provides methods: :py:meth:`json`, :py:meth:`text`, :py:meth:`headers`, :py:meth:`cookies`, :py:meth:`content_size`, :py:meth:`time` and :py:meth:`code`.
 38 | 
 39 | .. py:method:: json(raise_error=True)
 40 | 
 41 |     This method returns an object representing the content of the JSON response from the queried endpoint. Usually, this will be a map (represented by a Python :py:obj:`dict`), but, depending on the endpoint, it may also be a list, string, integer, floating-point number, or Boolean.
 42 | 
 43 | .. py:method:: text(raise_error=True)
 44 | 
 45 |     Returns the text response from queried endpoint::
 46 | 
 47 |         http("/heartbeat.jsp", timeout=5).text().strip()=='OK: JVM is running'
 48 | 
 49 |     Since we’re using a relative URL, this check has to be defined for
 50 |     specific entities (e.g. type=zomcat will run it on all zomcat
 51 |     instances). The strip function removes all leading and trailing
 52 |     whitespace.
 53 | 
 54 | .. py:method:: headers(raise_error=True)
 55 | 
 56 |     Returns the response headers in a case-insensitive dict-like object::
 57 | 
 58 |         http("/api/json", timeout=5).headers()['content-type']=='application/json'
 59 | 
 60 | .. py:method:: cookies(raise_error=True)
 61 | 
 62 |     Returns the response cookies in a dict-like object::
 63 | 
 64 |         http("/heartbeat.jsp", timeout=5).cookies()['my_custom_cookie'] == 'custom_cookie_value'
 65 | 
 66 | .. py:method:: content_size(raise_error=True)
 67 | 
 68 |     Returns the length of the response content::
 69 | 
 70 |         http("/heartbeat.jsp", timeout=5).content_size() > 1024
 71 | 
 72 | .. py:method:: time(raise_error=True)
 73 | 
 74 |     Returns the elapsed time in seconds until the response was received::
 75 | 
 76 |         http("/heartbeat.jsp", timeout=5).time() > 1.5
 77 | 
 78 | .. py:method:: code()
 79 | 
 80 |     Returns the HTTP status code from the queried endpoint::
 81 | 
 82 |         http("/heartbeat.jsp", timeout=5).code()
 83 | 
 84 | .. _http-actuator:
 85 | 
 86 | .. py:method:: actuator_metrics(prefix='zmon.response.', raise_error=True)
 87 | 
 88 |     Parses the JSON result of a metrics endpoint into a map ep->method->status->metric::
 89 | 
 90 |         http("/metrics", timeout=5).actuator_metrics()
 91 | 
 92 | .. _http-prometheus:
 93 | 
 94 | .. py:method:: prometheus()
 95 | 
 96 |     Parses the text result according to the Prometheus specs using their prometheus_client::
 97 | 
 98 |         http("/metrics", timeout=5).prometheus()
 99 | 
100 | .. _http-prometheus_flat:
101 | 
102 | .. py:method:: prometheus_flat()
103 | 
104 |     Parses the text result according to the Prometheus specs using their prometheus_client
105 |     and flattens the outcome::
106 | 
107 |         http("/metrics", timeout=5).prometheus_flat()
108 | 
109 | .. _http-jolokia:
110 | 
111 | .. py:method:: jolokia(read_requests, raise_error=False)
112 | 
113 |     Sends a POST request to the endpoint given in the wrapper, validating the endpoint and forcing
114 |     the request to be read-only.
115 | 
116 |     :param read_requests: see https://jolokia.org/reference/html/protocol.html#post-request
117 |     :type read_requests: list
118 |     :param raise_error: bool
119 |     :return: Jolokia response
120 | 
121 |     Example:
122 | 
123 |         .. code-block:: python
124 | 
125 |             requests = [
126 |                 {'mbean': 'org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency'},
127 |                 {'mbean': 'org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency'},
128 |             ]
129 |             results = http('http://{}:8778/jolokia/'.format(entity['ip']), timeout=15).jolokia(requests)
130 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/appdynamics_wrapper.rst:
--------------------------------------------------------------------------------
  1 | AppDynamics
  2 | -------------
  3 | 
  4 | Enables checks on AppDynamics Healthrule violations and *optionally* queries raw logs from the underlying Elasticsearch cluster.
  5 | 
  6 | .. py:function:: appdynamics(url=None, username=None, password=None, es_url=None, index_prefix='')
  7 | 
  8 |     Initialize AppDynamics wrapper.
  9 | 
 10 |     :param url: AppDynamics URL.
 11 |     :type url: str
 12 | 
 13 |     :param username: AppDynamics username.
 14 |     :type username: str
 15 | 
 16 |     :param password: AppDynamics password.
 17 |     :type password: str
 18 | 
 19 |     :param es_url: AppDynamics Elasticsearch cluster URL.
 20 |     :type es_url: str
 21 | 
 22 |     :param index_prefix: AppDynamics Elasticsearch cluster logs index prefix.
 23 |     :type index_prefix: str
 24 | 
 25 | .. note::
 26 | 
 27 |     If ``username`` and ``password`` are not supplied, then OAuth 2 will be used.
 28 | 
 29 |     If ``appdynamics()`` is initialized with no args, then plugin configuration values will be used.
 30 | 
 31 | Methods of AppDynamics
 32 | ^^^^^^^^^^^^^^^^^^^^^^
 33 | 
 34 | .. py:function:: healthrule_violations(application, time_range_type=BEFORE_NOW, duration_in_mins=5, start_time=None, end_time=None, severity=None)
 35 | 
 36 |     Return Healthrule violations for AppDynamics application.
 37 | 
 38 |     :param application: Application name or ID
 39 |     :type application: str
 40 | 
 41 |     :param time_range_type: Valid time range type. Valid range types are BEFORE_NOW, BEFORE_TIME, AFTER_TIME and BETWEEN_TIMES. Default is BEFORE_NOW.
 42 |     :type time_range_type: str
 43 | 
 44 |     :param duration_in_mins: Time duration in mins. Required for BEFORE_NOW, AFTER_TIME, BEFORE_TIME range types. Default is 5 mins.
 45 |     :type duration_in_mins: int
 46 | 
 47 |     :param start_time: Start time (in milliseconds) from which the metric data is returned. Default is 5 mins ago.
 48 |     :type start_time: int
 49 | 
 50 |     :param end_time: End time (in milliseconds) until which the metric data is returned. Default is now.
 51 |     :type end_time: int
 52 | 
 53 |     :param severity: Filter results based on severity. Valid values are CRITICAL or WARNING.
 54 |     :type severity: str
 55 | 
 56 |     :return: List of healthrule violations
 57 |     :rtype: list
 58 | 
 59 |     Example query:
 60 | 
 61 |     .. code-block:: python
 62 | 
 63 |         appdynamics('https://appdynamics/controller/rest').healthrule_violations('49', time_range_type='BEFORE_NOW', duration_in_mins=5)
 64 | 
 65 |         [
 66 |             {
 67 |                 "affectedEntityDefinition": {
 68 |                     "entityId": 408,
 69 |                     "entityType": "BUSINESS_TRANSACTION",
 70 |                     "name": "/error"
 71 |                 },
 72 |                 "detectedTimeInMillis": 0,
 73 |                 "endTimeInMillis": 0,
 74 |                 "id": 39637,
 75 |                 "incidentStatus": "OPEN",
 76 |                 "name": "Backend errors (percentage)",
 77 |                 "severity": "CRITICAL",
 78 |                 "startTimeInMillis": 1462244635000
 79 |             }
 80 |         ]
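
An alert on this result usually only cares about violations that are still open. The sketch below filters a list shaped like the example result; ``open_violations`` is a hypothetical helper, and the second entry (including its ``RESOLVED`` status) is made-up sample data.

```python
def open_violations(violations, severity="CRITICAL"):
    """Names of violations that are still open at the given severity."""
    return [
        violation["name"]
        for violation in violations
        if violation["incidentStatus"] == "OPEN" and violation["severity"] == severity
    ]


violations = [
    {"id": 39637, "incidentStatus": "OPEN", "name": "Backend errors (percentage)", "severity": "CRITICAL"},
    # Hypothetical second entry, added only to exercise the filter.
    {"id": 39638, "incidentStatus": "RESOLVED", "name": "Slow responses", "severity": "CRITICAL"},
]

still_open = open_violations(violations)
```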
 81 | 
 82 | .. py:function:: metric_data(application, metric_path, time_range_type=BEFORE_NOW, duration_in_mins=5, start_time=None, end_time=None, rollup=True)
 83 | 
 84 |     AppDynamics's metric-data API
 85 | 
 86 |     :param application: Application name or ID
 87 |     :type application: str
 88 | 
 89 |     :param metric_path: The path to the metric in the metric hierarchy
 90 |     :type metric_path: str
 91 | 
 92 |     :param time_range_type: Valid time range type. Valid range types are BEFORE_NOW, BEFORE_TIME, AFTER_TIME and
 93 |                             BETWEEN_TIMES. Default is BEFORE_NOW.
 94 |     :type time_range_type: str
 95 | 
 96 |     :param duration_in_mins: Time duration in mins. Required for BEFORE_NOW, AFTER_TIME, BEFORE_TIME range types.
 97 |     :type duration_in_mins: int
 98 | 
 99 |     :param start_time: Start time (in milliseconds) from which the metric data is returned. Default is 5 mins ago.
100 |     :type start_time: int
101 | 
102 |     :param end_time: End time (in milliseconds) until which the metric data is returned. Default is now.
103 |     :type end_time: int
104 | 
105 |     :param rollup: By default, the values of the returned metrics are rolled up into a single data point
106 |                    (rollup=True). To get separate results for all values within the time range, set the
107 |                    ``rollup`` parameter to ``False``.
108 |     :type rollup: bool
109 | 
110 |     :return: metric values for a metric
111 |     :rtype: list
112 | 
113 | .. py:function:: query_logs(q='', body=None, size=100, source_type=SOURCE_TYPE_APPLICATION_LOG, duration_in_mins=5)
114 | 
115 |     Perform search query on AppDynamics ES logs.
116 | 
117 |     :param q: Query string used in search.
118 |     :type q: str
119 | 
120 |     :param body: Dict holding an ES query DSL.
121 |     :type body: dict
122 | 
123 |     :param size: Number of hits to return. Default is 100.
124 |     :type size: int
125 | 
126 |     :param source_type: ``sourceType`` field filtering. Defaults to ``application-log``, and will be part of ``q``.
127 |     :type source_type: str
128 | 
129 |     :param duration_in_mins: Duration in mins before current time. Default is 5 mins.
130 |     :type duration_in_mins: int
131 | 
132 |     :return: ES query result ``hits``.
133 |     :rtype: list
134 | 
135 | .. py:function:: count_logs(q='', body=None, source_type=SOURCE_TYPE_APPLICATION_LOG, duration_in_mins=5)
136 | 
137 |     Perform count query on AppDynamics ES logs.
138 | 
139 |     :param q: Query string used in search. Will be ignored if ``body`` is not None.
140 |     :type q: str
141 | 
142 |     :param body: Dict holding an ES query DSL.
143 |     :type body: dict
144 | 
145 |     :param source_type: ``sourceType`` field filtering. Defaults to ``application-log``, and will be part of ``q``.
146 |     :type source_type: str
147 | 
148 |     :param duration_in_mins: Duration in mins before current time. Default is 5 mins. Will be ignored if ``body`` is not None.
149 |     :type duration_in_mins: int
150 | 
151 |     :return: Query match count.
152 |     :rtype: int
153 | 
154 | .. note::
155 | 
156 |     When passing an ES query DSL in ``body``, all filter parameters must be explicitly added to the query body (e.g. ``eventTimestamp``, ``application_id``, ``sourceType``).
157 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/redis_wrapper.rst:
--------------------------------------------------------------------------------
  1 | Redis
  2 | -----
  3 | 
  4 | Read-only access to Redis servers is provided by the :py:func:`redis` function.
  5 | 
  6 | 
  7 | .. py:function:: redis([port=6379], [db=0], [socket_connect_timeout=1], [socket_timeout=5], [ssl=False],
  8 |                  [ssl_cert_reqs='required'])
  9 | 
 10 |     Returns a connection to the Redis server at :samp:`{<host>}:{<port>}`, where :samp:`{<host>}` is the value
 11 |     of the current entity's ``host`` attribute, and :samp:`{<port>}` is the given port (default ``6379``). See
 12 |     below for a list of methods provided by the returned connection object.
 13 | 
 14 |     :param host: Redis host.
 15 |     :type host: str
 16 | 
 17 |     :param password: If set, enables authentication against the destination Redis server with the given password. Default is None.
 18 |     :type password: str
 19 | 
 20 | .. note::
 21 | 
 22 |     If ``password`` param is not supplied, then plugin configuration values will be used.
 23 |     You can use ``plugin.redis.password`` to configure redis password authentication for zmon-worker.
 24 | 
 25 | Please also have a look at the `Redis documentation`_.
 26 | 
 27 | 
 28 | Methods of the Redis Connection
 29 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 30 | 
 31 | The object returned by the :py:func:`redis` function provides the following methods:
 32 | 
 33 | 
 34 | .. py:method:: llen(key)
 35 | 
 36 |     Returns the length of the list stored at `key`. If `key` does not exist, its value is treated as if it were
 37 |     an empty list, and 0 is returned. If `key` exists but is not a list, an error is raised.
 38 | 
 39 |     ::
 40 | 
 41 |         redis().llen("prod_eventlog_queue")
 42 | 
 43 | 
 44 | .. py:method:: lrange(key, start, stop)
 45 | 
 46 |     Returns the elements of the list stored at `key` in the range [`start`, `stop`]. If `key` does not
 47 |     exist, its value is treated as if it were an empty list. If `key` exists but is not a list, an
 48 |     error is raised.
 49 | 
 50 |     The parameters `start` and `stop` are zero-based indexes. Negative numbers are converted to indexes
 51 |     by adding the length of the list, so that ``-1`` is the last element of the list, ``-2`` the
 52 |     second-to-last element of the list, and so on.
 53 | 
 54 |     Indexes outside the range of the list are not an error: If both `start` and `stop` are less than 0 or
 55 |     greater than or equal to the length of the list, an empty list is returned. Otherwise, if `start` is
 56 |     less than 0, it is treated as if it were 0, and if `stop` is greater than or equal to the length
 57 |     of the list, it is treated as if it were equal to the length of the list minus 1. If `start` is
 58 |     greater than `stop`, an empty list is returned.
 59 | 
 60 |     Note that this method is subtly different from Python's list slicing syntax, where ``list[start:stop]``
 61 |     returns elements in the range [`start`, `stop`).
 62 | 
 63 |     ::
 64 | 
 65 |         redis().lrange("prod_eventlog_queue", 0, 9)   # Returns *ten* elements!
 66 |         redis().lrange("prod_eventlog_queue", 0, -1)  # Returns the entire list.
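
The index rules above can be checked against a plain Python list. This is a sketch that emulates the described behaviour locally; ``lrange`` here is a stand-alone function written for illustration, not the wrapper method.

```python
def lrange(items, start, stop):
    """Emulate the inclusive range rules described above on a Python list."""
    length = len(items)
    # Negative indexes count from the end of the list.
    if start < 0:
        start += length
    if stop < 0:
        stop += length
    # Clamp to the list boundaries.
    if start < 0:
        start = 0
    if stop >= length:
        stop = length - 1
    if start > stop:
        return []
    return items[start:stop + 1]  # stop is inclusive, unlike Python slicing


queue = list(range(20))
first_ten = lrange(queue, 0, 9)
everything = lrange(queue, 0, -1)
```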
 67 | 
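    |     The index rules above can be tried out on a plain Python list. The following is a minimal sketch,
    |     not ZMON or Redis code: ``lrange_like`` is a hypothetical helper that only mirrors the
    |     behaviour described in this section.
    |
    |     .. code-block:: python
    |
    |         def lrange_like(items, start, stop):
    |             """Emulate the index handling described above on a plain Python list."""
    |             length = len(items)
    |             # Negative indexes count from the end of the list.
    |             if start < 0:
    |                 start += length
    |             if stop < 0:
    |                 stop += length
    |             # A start that is still negative is clamped to the first element.
    |             if start < 0:
    |                 start = 0
    |             # Nothing to return if the range is empty or begins past the end.
    |             if start > stop or start >= length:
    |                 return []
    |             # A stop past the end is clamped to the last element.
    |             if stop >= length:
    |                 stop = length - 1
    |             # Both bounds are inclusive, hence the "+ 1" when slicing.
    |             return items[start:stop + 1]
    |
    |         queue = list(range(20))
    |         print(len(lrange_like(queue, 0, 9)))  # 10, because both bounds are inclusive
    |         print(len(queue[0:9]))                # 9, because a Python slice excludes the stop index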
 68 | 
 69 | .. py:method:: get(key)
 70 | 
 71 |     Returns the string stored at `key`. If `key` does not exist, returns ``None``. If `key` exists
 72 |     but is not a string, an error is raised.
 73 | 
 74 |     ::
 75 | 
 76 |         redis().get("example_redis_key")
 77 | 
 78 | 
 79 | .. py:method:: keys(pattern)
 80 | 
 81 |     Returns a list of the keys in Redis that match `pattern`.
 82 | 
 83 |     ::
 84 | 
 85 |         redis().keys("*downtime*")
 86 | 
 87 | 
 88 | .. py:method:: hget(key, field)
 89 | 
 90 |     Returns the value of the field `field` of the hash `key`. If `key` does not exist or does not have
 91 |     a field named `field`, returns ``None``. If `key` exists but is not a hash, an error is raised.
 92 | 
 93 |     ::
 94 | 
 95 |         redis().hget("example_hash_key", "example_field_name")
 96 | 
 97 | 
 98 | .. py:method:: hgetall(key)
 99 | 
100 |     Returns a ``dict`` of all fields of the hash `key`. If `key` does not exist, returns an empty ``dict``.
101 |     If `key` exists but is not a hash, an error is raised.
102 | 
103 |     ::
104 | 
105 |         redis().hgetall("example_hash_key")
106 | 
107 | 
108 | .. py:method:: hlen(key)
109 | 
110 |     Returns the number of fields in the hash ``key``.
111 | 
112 |     ::
113 | 
114 |         redis().hlen("example_hash_key")
115 |         
116 | 
117 | .. py:method:: scan(cursor, [match=None], [count=None])
118 | 
119 |     Returns a ``set`` with the next cursor and the results from this scan.
120 |     Please see the Redis documentation on how to use this function exactly: http://redis.io/commands/scan
121 | 
122 |     ::
123 | 
124 |         redis().scan(0, 'prefix*', 10)
125 | 
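    |     A single call returns only one batch of results, so a full iteration has to follow the cursor
    |     until the server reports ``0`` again. The following sketch illustrates that protocol. It assumes
    |     that ``scan`` returns a ``(next_cursor, keys)`` pair, as the underlying Redis client does;
    |     ``scan_all`` is a hypothetical helper and not part of ZMON. Check commands are single
    |     expressions, so the loop is shown for illustration only.
    |
    |     .. code-block:: python
    |
    |         def scan_all(connection, match=None, count=None):
    |             """Collect all keys matching ``match`` by following the SCAN cursor."""
    |             keys = set()   # SCAN may report a key more than once, so deduplicate
    |             cursor = 0     # a cursor of 0 starts a new iteration
    |             while True:
    |                 # Assumption: scan() returns a (next_cursor, keys) pair.
    |                 cursor, batch = connection.scan(cursor, match, count)
    |                 keys.update(batch)
    |                 if int(cursor) == 0:   # 0 signals that the iteration is complete
    |                     return keys
    |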
126 | 
127 | .. py:method:: smembers(key)
128 | 
129 |     Returns members of set ``key`` in Redis.
130 | 
131 |     ::
132 | 
133 |         redis().smembers("zmon:alert:1")
134 | 
135 | 
136 | .. py:method:: ttl(key)
137 | 
138 |     Returns the time to live (in seconds) of an expiring key.
139 | 
140 |     ::
141 | 
142 |         redis().ttl('lock')
143 | 
144 | .. py:method:: scard(key)
145 | 
146 |     Returns the number of elements in the set ``key``.
147 | 
148 |     ::
149 | 
150 |         redis().scard("example_set_key")
151 | 
152 | 
153 | .. py:method:: zcard(key)
154 | 
155 |     Returns the number of elements in the sorted set ``key``.
156 | 
157 |     ::
158 | 
159 |         redis().zcard("example_sorted_set_key")
160 | 
161 | 
162 | .. py:method:: info([section])
163 | 
164 |     Returns a ``dict`` containing all information exposed by the `Redis INFO command`_.
165 | 
166 | .. py:method:: statistics()
167 | 
168 |     Returns a ``dict`` with general Redis statistics such as memory usage and operations/s.
169 |     All values are extracted using the `Redis INFO command`_.
170 | 
171 |     Example result:
172 | 
173 |     .. code-block:: json
174 | 
175 |         {
176 |             "blocked_clients": 2,
177 |             "commands_processed_per_sec": 15946.48,
178 |             "connected_clients": 162,
179 |             "connected_slaves": 0,
180 |             "connections_received_per_sec": 0.5,
181 |             "dbsize": 27351,
182 |             "evicted_keys_per_sec": 0.0,
183 |             "expired_keys_per_sec": 0.0,
184 |             "instantaneous_ops_per_sec": 29626,
185 |             "keyspace_hits_per_sec": 1195.43,
186 |             "keyspace_misses_per_sec": 1237.99,
187 |             "used_memory": 50781216,
188 |             "used_memory_rss": 63475712
189 |         }
190 | 
191 |     Note that the values of both `used_memory` and `used_memory_rss` are in bytes.
192 | 
193 | .. _Redis documentation: http://redis.io/
194 | .. _Redis INFO command: http://redis.io/commands/info
195 | 


--------------------------------------------------------------------------------
/docs/getting-started.rst:
--------------------------------------------------------------------------------
  1 | ***************
  2 | Getting Started
  3 | ***************
  4 | 
  5 | To quickly get started with ZMON, use the preconfigured Vagrant box featured on the `main ZMON repository`_.
  6 | Make sure you've installed Vagrant *(at least 1.7.4)* and a Vagrant provider like VirtualBox on your machine.
  7 | Clone the repository with Git:
  8 | 
  9 | .. code-block:: bash
 10 | 
 11 |    $ git clone https://github.com/zalando/zmon.git
 12 |    $ cd zmon/
 13 | 
 14 | From within the cloned repository, run:
 15 | 
 16 | .. code-block:: bash
 17 | 
 18 |    $ vagrant up
 19 | 
 20 | Bootstrapping the image for the first time will take a bit of time.
 21 | You might want to grab some coffee while you wait. :)
 22 | 
 23 | When it's finally up, Vagrant will report on how to reach the ZMON web interface:
 24 | 
 25 | .. code-block:: bash
 26 | 
 27 |     ==> default: ZMON installation is done!
 28 |     ==> default: Goto: https://localhost:8443
 29 |     ==> default: Login with your GitHub credentials
 30 | 
 31 | Creating Your First Alert
 32 | =========================
 33 | 
 34 | Log In
 35 | ------
 36 | 
 37 | Open your web browser and navigate to the URL reported by Vagrant: e.g. https://localhost:8443/.
 38 | Click on *Sign In*. This will redirect you to GitHub, where you sign in and authorize the ZMON app.
 39 | GitHub then takes you back to ZMON, where you are now logged in.
 40 | 
 41 | .. note::
 42 | 
 43 |   For your own deployment, create your own app in GitHub with your redirect URL.
 44 |   In ZMON you can then restrict access to members of your GitHub organization.
 45 | 
 46 | Checks and Alerts
 47 | -----------------
 48 | 
 49 | An alert shown on ZMON's dashboard typically consists of two parts: the *check-definition*, which is responsible for
 50 | fetching the underlying data; and the *alert-definition*, which defines the condition under which the alert will trigger.
 51 | Multiple alerts with different alert conditions can operate on the same check, fetching data only once.
 52 | 
 53 | Let's explore this concept now by creating a simple check and defining some alerts on it.
 54 | 
 55 | Create a New Check
 56 | ------------------
 57 | 
 58 | One way to create a new check from scratch is via the :ref:`cli-usage`.
 59 | A more convenient way, however, is to use the "Trial Run" feature.
 60 | It enables you to develop checks and alerts, execute them immediately, and inspect the result.
 61 | Once you are happy with your check command and filter, you can save it from the Trial Run directly.
 62 | Some users prefer to download the YAML definition from there to store and maintain it in Git.
 63 | 
 64 | Create an Alert
 65 | ---------------
 66 | 
 67 | In the top navigation of ZMON's web interface, select `Check defs <https://localhost:8443/#/check-definitions>`_ from the list and click on *Website HTTP status*.
 68 | Then click *"Add New Alert Definition"* to create a new alert for this particular check.
 69 | Fill out the form (see example values below), and hit *"Save"*:
 70 | 
 71 | ==================== ==========================
 72 | **Name**             Oops ... website is gone!
 73 | -------------------- --------------------------
 74 | **Description**      Website was not reachable.
 75 | -------------------- --------------------------
 76 | **Priority**         Priority 1 (red)
 77 | -------------------- --------------------------
 78 | **Alert Condition**  value != 200
 79 | -------------------- --------------------------
 80 | **Team**             Team 1
 81 | -------------------- --------------------------
 82 | **Responsible Team** Team 1
 83 | -------------------- --------------------------
 84 | **Status**           ACTIVE
 85 | ==================== ==========================
 86 | 
 87 | After you hit save, it will take a few seconds until it is picked up and executed.
 88 | 
 89 | View Dashboard
 90 | --------------
 91 | 
 92 | If the alert's condition evaluates to anything but ``False``, the alert will appear on the dashboard.
 93 | That includes not only ``True`` but also exceptions raised during evaluation, e.g. due to timeouts or a failure to connect.
 94 | Currently there's only one dashboard, and it is configured to show all present alerts.
 95 | To view the dashboard, select `Dashboards <https://localhost:8443/#/dashboards>`_ from the main menu and click on *Example Dashboard*.
 96 | 
 97 | To see the alert, you must simulate the error condition: modify either the alert's condition or the check-definition so that it returns an error code.
 98 | To do the latter, set the URL in the check command to http://httpstat.us/500.
 99 | (The number in the URL represents the HTTP error code you will get.)
100 | 
101 | To see the actual error code in the alert, you might want to create/modify it like this:
102 | 
103 | ==================== ================================
104 | **Name**             Website gone with status {code}
105 | -------------------- --------------------------------
106 | **Description**      Website was not reachable.
107 | -------------------- --------------------------------
108 | **Priority**         Priority 1 (red)
109 | -------------------- --------------------------------
110 | **Alert Condition**  capture(code=value)!=200
111 | -------------------- --------------------------------
112 | **Team**             Team 1
113 | -------------------- --------------------------------
114 | **Responsible Team** Team 1
115 | -------------------- --------------------------------
116 | **Status**           ACTIVE
117 | ==================== ================================
118 | 
119 | .. _cli-usage:
120 | 
121 | Using the CLI
122 | =============
123 | 
124 | The ZMON Vagrant box comes preinstalled with *zmon-cli*.
125 | To use the CLI, log in to the running Vagrant box with:
126 | 
127 | .. code-block:: bash
128 | 
129 |    $ vagrant ssh
130 | 
131 | The Vagrant box also contains some sample YAML files for creating entities, checks and alerts.
132 | You can find these in */vagrant/examples*.
133 | 
134 | As an example of using ZMON's CLI, let's create a check to verify that a website is reachable.
135 | *cd* to */vagrant/examples/check-definitions* and, using zmon-cli, create a new check-definition:
136 | 
137 | .. code-block:: bash
138 | 
139 |    $ cd /vagrant/examples/check-definitions
140 |    $ zmon check-definitions init website-availability.yaml
141 |    $ vim website-availability.yaml
142 | 
143 | Edit the newly created *website-availability.yaml* to contain the following code (type :kbd:`i` to enter insert mode):
144 | 
145 | .. code-block:: yaml
146 | 
147 |    name: "Website HTTP status"
148 |    owning_team: "Team 1"
149 |    command: http("http://httpstat.us/200", timeout=5).code()
150 |    description: "Returns current http status code for Website"
151 |    interval: 60
152 |    entities:
153 |     - type: GLOBAL
154 |    status: ACTIVE
155 | 
156 | Type :kbd:`ESC :wq RETURN` to save the file.
157 | 
158 | To push the updated check definition to ZMON, run:
159 | 
160 | .. code-block:: bash
161 | 
162 |    $ zmon check-definitions update website-availability.yaml
163 |    Updating check definition... http://localhost:8080/#/check-definitions/view/2
164 | 
165 | Find more detailed information here: :ref:`zmon-cli`.
166 | 
167 | .. _main ZMON repository: https://github.com/zalando/zmon
168 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/sql_wrappers.rst:
--------------------------------------------------------------------------------
  1 | .. _sql-function:
  2 | 
  3 | SQL
  4 | ---
  5 | 
  6 | .. py:function:: sql([shard])
  7 | 
  8 |     Provides a wrapper for connections to a PostgreSQL database and allows
  9 |     executing queries. All queries are executed in read-only transactions.
 10 |     The connection wrapper requires one parameter: a list of shard connections.
 11 |     The shard connections must come from the entity definition (see :ref:`database-entities`).
 12 |     Example query for a log database that returns a primitive long value:
 13 | 
 14 |     .. code-block:: python
 15 | 
 16 |         sql().execute("SELECT count(*) FROM zl_data.log WHERE log_created > now() - '1 hour'::interval").result()
 17 | 
 18 |     Example query which will return a single dict with keys ``a`` and ``b``::
 19 | 
 20 |         sql().execute('SELECT 1 AS a, 2 AS b').result()
 21 | 
 22 |     The SQL wrapper will automatically sum up values over all shards::
 23 | 
 24 |         sql().execute('SELECT count(1) FROM zc_data.customer').result() # will return a single integer value (sum over all shards)
 25 | 
 26 |     It's also possible to query a single shard by providing its name::
 27 | 
 28 |         sql(shard='customer1').execute('SELECT COUNT(1) AS c FROM zc_data.customer').results() # returns list of values from a single shard
 29 |     
 30 |     It's also possible to query another database on the same server by overriding the shard information::
 31 |     
 32 |         sql(shards={'customer_db' : entity['host'] + ':' + str(entity['port']) + '/another_db'}).execute('SELECT COUNT(1) AS c FROM my_table').results()
 33 | 
 34 |     To execute a SQL statement on all LIVE customer shards, for example, use the following entity filter:
 35 | 
 36 |     .. code-block:: json
 37 | 
 38 |         [
 39 |             {
 40 |                 "type":        "database",
 41 |                 "name":        "customer",
 42 |                 "environment": "live",
 43 |                 "role":        "master"
 44 |             }
 45 |         ]
 46 | 
 47 |     The check command will have the form:
 48 | 
 49 |     .. code-block:: python
 50 | 
 51 |         >>> sql().execute('SELECT 1 AS a').result()
 52 |         8
 53 |         # Returns a single value: the sum over the result from all shards
 54 | 
 55 |         >>> sql().execute('SELECT 1 AS a').results()
 56 |         [{'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}]
 57 |         # Returns a list of the results from all shards
 58 | 
 59 |         >>> sql(shard='customer1').execute('SELECT 1 AS a').results()
 60 |         [{'a': 1}]
 61 |         # Returns the result from the specified shard in a list of length one
 62 | 
 63 |         >>> sql().execute('SELECT 1 AS a, 2 AS b').result()
 64 |         {'a': 8, 'b': 16}
 65 |         # Returns a dict of the two values, which are each the sum over the result from all shards
 66 | 
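    |     The way ``result()`` combines the per-shard rows can be pictured with plain Python. This is only
    |     a sketch that reproduces the output shown above, not the wrapper's actual implementation;
    |     ``combine_result`` is a hypothetical name.
    |
    |     .. code-block:: python
    |
    |         def combine_result(shard_rows):
    |             """Sum one row per shard, column by column, as the output above suggests."""
    |             totals = {}
    |             for row in shard_rows:
    |                 for column, value in row.items():
    |                     totals[column] = totals.get(column, 0) + value
    |             # A single column collapses to a bare value; several columns stay a dict.
    |             if len(totals) == 1:
    |                 return next(iter(totals.values()))
    |             return totals
    |
    |         print(combine_result([{'a': 1}] * 8))          # 8
    |         print(combine_result([{'a': 1, 'b': 2}] * 8))  # {'a': 8, 'b': 16}
    |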
 67 |     The ``results()`` method accepts several additional parameters::
 68 | 
 69 |         sql().execute('SELECT 1 AS ONE, 2 AS TWO FROM dual').results([max_results=100], [raise_if_limit_exceeded=True])
 70 | 
 71 |     ``max_results``
 72 |         The maximum number of rows you expect to get from the call. If not specified, defaults to 100. You cannot have an
 73 |         unlimited number of rows. There is an absolute maximum of 1,000,000 results that cannot be overridden.
 74 |         Note: If you need to process larger datasets, consider
 75 |         revisiting the architecture of your monitoring subsystem and possibly moving the calculation logic
 76 |         into an external web service that ZMON 2.0 can call.
 77 | 
 78 |     ``raise_if_limit_exceeded``
 79 |         Raises an exception if the limit of rows would have been exceeded by the issued query.
 80 | 
 81 | .. py:function:: orasql()
 82 | 
 83 |     Provides a wrapper for connections to an Oracle database and allows
 84 |     executing queries. All queries are executed in read-only transactions.
 85 |     The connection wrapper requires three parameters: host, port and sid,
 86 |     which must come from the entity definition (see :ref:`database-entities`).
 87 |     One idiosyncrasy to be aware of: when your query produces
 88 |     more than one value, you get a dict whose keys are the column names
 89 |     or aliases you used in your query, and those keys are always returned
 90 |     *in uppercase*. For clarity, we recommend that you write all aliases
 91 |     and column names in uppercase, to avoid confusion due to case changes.
 92 |     Example of the simplest query, which returns a single value:
 93 | 
 94 |     .. code-block:: python
 95 | 
 96 |         orasql().execute("SELECT 'OK' from dual").result()
 97 | 
 98 |     Example query which will return a single dict with keys ``ONE`` and ``TWO``::
 99 | 
100 |         orasql().execute('SELECT 1 AS ONE, 2 AS TWO from dual').result()
101 | 
102 |     To execute a SQL statement on a LIVE server, tagged with the name business_intelligence, for example,
103 |     use the following entity filter:
104 | 
105 |     .. code-block:: json
106 | 
107 |         [
108 |             {
109 |                 "type":        "oracledb",
110 |                 "name":        "business_intelligence",
111 |                 "environment": "live",
112 |                 "role":        "master"
113 |             }
114 |         ]
115 | 
116 | 
117 | .. py:function:: exacrm()
118 | 
119 |     Provides a wrapper for connections to the CRM Exasol database and allows
120 |     executing queries.
121 |     The connection wrapper requires one parameter: the query.
122 | 
123 |     Example query:
124 | 
125 |     .. code-block:: python
126 | 
127 |         exacrm().execute("SELECT 'OK';").result()
128 | 
129 |     To execute a SQL statement on the itr-crmexa* servers use the following
130 |     entity filter:
131 | 
132 |     .. code-block:: json
133 | 
134 |         [
135 |             {
136 |                 "type": "host",
137 |                 "host_role_id": "117"
138 |             }
139 |         ]
140 | 
141 | .. py:function:: mysql([shard])
142 | 
143 |     Provides a wrapper for connections to a MySQL database and allows
144 |     executing queries.
145 |     The connection wrapper requires one parameter: a list of shard connections.
146 |     The shard connections must come from the entity definition (see :ref:`database-entities`).
147 |     Example of the simplest query, which returns a single value:
148 | 
149 |     .. code-block:: python
150 | 
151 |         mysql().execute("SELECT count(*) FROM mysql.user").result()
152 | 
153 |     Example query which will return a single dict with keys ``h`` and ``u``::
154 | 
155 |         mysql().execute('SELECT host AS h, user AS u FROM mysql.user').result()
156 | 
157 |     The SQL wrapper will automatically sum up values over all shards::
158 | 
159 |         mysql().execute('SELECT count(1) FROM zc_data.customer').result() # will return a single integer value (sum over all shards)
160 | 
161 |     It's also possible to query a single shard by providing its name::
162 | 
163 |         mysql(shard='customer1').execute('SELECT COUNT(1) AS c FROM zc_data.customer').results() # returns list of values from a single shard
164 | 
165 |     To execute a SQL statement on all LIVE customer shards, for example, use the following entity filter:
166 | 
167 |     .. code-block:: json
168 | 
169 |         [
170 |             {
171 |                 "type":        "mysqldb",
172 |                 "name":        "lounge",
173 |                 "environment": "live",
174 |                 "role":        "master"
175 |             }
176 |         ]
177 | 


--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
  1 | # Makefile for Sphinx documentation
  2 | #
  3 | 
  4 | # You can set these variables from the command line.
  5 | SPHINXOPTS    =
  6 | SPHINXBUILD   = sphinx-build
  7 | PAPER         =
  8 | BUILDDIR      = _build
  9 | 
 10 | # User-friendly check for sphinx-build
 11 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
 12 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
 13 | endif
 14 | 
 15 | # Internal variables.
 16 | PAPEROPT_a4     = -D latex_paper_size=a4
 17 | PAPEROPT_letter = -D latex_paper_size=letter
 18 | ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
 19 | # the i18n builder cannot share the environment and doctrees with the others
 20 | I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
 21 | 
 22 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext
 23 | 
 24 | help:
 25 | 	@echo "Please use \`make <target>' where <target> is one of"
 26 | 	@echo "  html       to make standalone HTML files"
 27 | 	@echo "  dirhtml    to make HTML files named index.html in directories"
 28 | 	@echo "  singlehtml to make a single large HTML file"
 29 | 	@echo "  pickle     to make pickle files"
 30 | 	@echo "  json       to make JSON files"
 31 | 	@echo "  htmlhelp   to make HTML files and a HTML help project"
 32 | 	@echo "  qthelp     to make HTML files and a qthelp project"
 33 | 	@echo "  devhelp    to make HTML files and a Devhelp project"
 34 | 	@echo "  epub       to make an epub"
 35 | 	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
 36 | 	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
 37 | 	@echo "  latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
 38 | 	@echo "  text       to make text files"
 39 | 	@echo "  man        to make manual pages"
 40 | 	@echo "  texinfo    to make Texinfo files"
 41 | 	@echo "  info       to make Texinfo files and run them through makeinfo"
 42 | 	@echo "  gettext    to make PO message catalogs"
 43 | 	@echo "  changes    to make an overview of all changed/added/deprecated items"
 44 | 	@echo "  xml        to make Docutils-native XML files"
 45 | 	@echo "  pseudoxml  to make pseudoxml-XML files for display purposes"
 46 | 	@echo "  linkcheck  to check all external links for integrity"
 47 | 	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"
 48 | 
 49 | clean:
 50 | 	rm -rf $(BUILDDIR)/*
 51 | 
 52 | html:
 53 | 	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
 54 | 	@echo
 55 | 	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
 56 | 
 57 | dirhtml:
 58 | 	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
 59 | 	@echo
 60 | 	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
 61 | 
 62 | singlehtml:
 63 | 	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
 64 | 	@echo
 65 | 	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
 66 | 
 67 | pickle:
 68 | 	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
 69 | 	@echo
 70 | 	@echo "Build finished; now you can process the pickle files."
 71 | 
 72 | json:
 73 | 	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
 74 | 	@echo
 75 | 	@echo "Build finished; now you can process the JSON files."
 76 | 
 77 | htmlhelp:
 78 | 	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
 79 | 	@echo
 80 | 	@echo "Build finished; now you can run HTML Help Workshop with the" \
 81 | 	      ".hhp project file in $(BUILDDIR)/htmlhelp."
 82 | 
 83 | qthelp:
 84 | 	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
 85 | 	@echo
 86 | 	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
 87 | 	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
 88 | 	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/PyScaffold.qhcp"
 89 | 	@echo "To view the help file:"
 90 | 	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/PyScaffold.qhc"
 91 | 
 92 | devhelp:
 93 | 	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
 94 | 	@echo
 95 | 	@echo "Build finished."
 96 | 	@echo "To view the help file:"
 97 | 	@echo "# mkdir -p $$HOME/.local/share/devhelp/PyScaffold"
 98 | 	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/PyScaffold"
 99 | 	@echo "# devhelp"
100 | 
101 | epub:
102 | 	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
103 | 	@echo
104 | 	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
105 | 
106 | latex:
107 | 	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
108 | 	@echo
109 | 	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
110 | 	@echo "Run \`make' in that directory to run these through (pdf)latex" \
111 | 	      "(use \`make latexpdf' here to do that automatically)."
112 | 
113 | latexpdf:
114 | 	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
115 | 	@echo "Running LaTeX files through pdflatex..."
116 | 	$(MAKE) -C $(BUILDDIR)/latex all-pdf
117 | 	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
118 | 
119 | latexpdfja:
120 | 	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
121 | 	@echo "Running LaTeX files through platex and dvipdfmx..."
122 | 	$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
123 | 	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
124 | 
125 | text:
126 | 	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
127 | 	@echo
128 | 	@echo "Build finished. The text files are in $(BUILDDIR)/text."
129 | 
130 | man:
131 | 	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
132 | 	@echo
133 | 	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
134 | 
135 | texinfo:
136 | 	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
137 | 	@echo
138 | 	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
139 | 	@echo "Run \`make' in that directory to run these through makeinfo" \
140 | 	      "(use \`make info' here to do that automatically)."
141 | 
142 | info:
143 | 	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
144 | 	@echo "Running Texinfo files through makeinfo..."
145 | 	$(MAKE) -C $(BUILDDIR)/texinfo info
146 | 	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
147 | 
148 | gettext:
149 | 	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
150 | 	@echo
151 | 	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
152 | 
153 | changes:
154 | 	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
155 | 	@echo
156 | 	@echo "The overview file is in $(BUILDDIR)/changes."
157 | 
158 | linkcheck:
159 | 	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
160 | 	@echo
161 | 	@echo "Link check complete; look for any errors in the above output " \
162 | 	      "or in $(BUILDDIR)/linkcheck/output.txt."
163 | 
164 | doctest:
165 | 	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
166 | 	@echo "Testing of doctests in the sources finished, look at the " \
167 | 	      "results in $(BUILDDIR)/doctest/output.txt."
168 | 
169 | xml:
170 | 	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
171 | 	@echo
172 | 	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
173 | 
174 | pseudoxml:
175 | 	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
176 | 	@echo
177 | 	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
178 | 


--------------------------------------------------------------------------------
/docs/developer/python-tutorial.rst:
--------------------------------------------------------------------------------
  1 | .. _python-tutorial:
  2 | 
  3 | ***********************
  4 | A Short Python Tutorial
  5 | ***********************
  6 | 
  7 | This tutorial explains by example how to process a :py:obj:`dict` using Python's list comprehension facilities.
  8 | 
  9 | Suppose we're interested in the total number of order failures.
 10 | 
 11 | #. First, we need to query the appropriate endpoint to get the data, and call the :py:meth:`json` method. ::
 12 | 
 13 |         http('http://www.example.com/foo/bar/data.json').json()
 14 | 
 15 |    This endpoint returns JSON data that is structured as follows (with much of the data omitted)::
 16 | 
 17 |         {
 18 |             ...
 19 |             "itr-http04_orderfails": [1, 0],
 20 |             "itr-http05_addtocart": [0.05, 0.0875],
 21 |             "http17_addtocart": [0.075, 0.066667],
 22 |             "http27_requests": [14.666667, 12.195833],
 23 |             "http13_orderfails": [null, 2],
 24 |             ...
 25 |         }
 26 | 
 27 |    The parsed object will therefore be a :py:obj:`dict` mapping strings to lists of numbers, which may contain :py:obj:`None` values.
 28 | 
 29 | #. We need to find all entries ending in :samp:`_orderfails`. In Python, we can transform a :py:obj:`dict` into a list of tuples :samp:`({key}, {value})` using the :py:meth:`items` method::
 30 | 
 31 |         http(...).json().items()
 32 | 
 33 |    We now need to filter this list to include only order failure information. Using a loop and an if statement, this could be accomplished like this::
 34 | 
 35 |         result = []
 36 |         for key, value in http(...).json().items():
 37 |             if key.endswith('_orderfails'):
 38 |                 result.append(value)
 39 | 
 40 |    (Note how the tuples in the list returned by :py:meth:`items` are automatically "unpacked", their elements being assigned to :py:obj:`key` and :py:obj:`value`, respectively.)
 41 | 
 42 |    Since the check command needs to be a single expression, not a series of statements, this is unfortunately not an option. Fortunately, Python provides a feature called list comprehension, which allows us to express the code above as follows::
 43 | 
 44 |         [value for key, value in http(...).json().items() if key.endswith('_orderfails')]
 45 | 
 46 |    That is, code of the form ::
 47 | 
 48 |         result = []
 49 |         for ELEMENT in LIST:
 50 |             if CONDITION:
 51 |                 result.append(RESULT_ELEMENT)
 52 | 
 53 |    becomes ::
 54 | 
 55 |         [RESULT_ELEMENT for ELEMENT in LIST if CONDITION]
 56 | 
 57 |    (The ``if CONDITION`` part is optional.)
 58 | 
 59 |    We now have a list of lists ``[[1, 0], [None, 2]]``.
 60 | 
 61 | #. In order to sum the list, we'd need to flatten it first, so that it has the form ``[1, 0, None, 2]``. This can be accomplished with the :py:func:`chain` function. Given one or more iterable objects (such as lists), :py:func:`chain` returns a new iterable object produced by concatenating the given objects. That is ::
 62 | 
 63 |         chain([1, 0], [None, 2])
 64 | 
 65 |    would return ::
 66 | 
 67 |         [1, 0, None, 2]
 68 | 
 69 |    Unfortunately, the lists we want to chain are themselves elements of a list, and calling ``chain([[1, 0], [None, 2]])`` would just concatenate the list with nothing and return it unchanged. We therefore need to tell Python to unpack the list, so that each of its elements becomes a new argument for the invocation of :py:func:`chain`.
 70 | 
 71 |    This can be accomplished by the ``*`` operator::
 72 | 
 73 |         chain(*[[1, 0], [None, 2]])
 74 | 
 75 |    That is, our expression is now ::
 76 | 
 77 |         chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')])
 78 | 
 79 | #. Now we need to remove that pesky :py:obj:`None` from the list. This could be accomplished with another list comprehension::
 80 | 
 81 |         [value for value in chain(...) if value is not None]
 82 | 
 83 |    For didactic reasons, we shall use the :py:func:`filter` function instead. :py:func:`filter` takes two arguments: a function that is called for each element in the filtered list and indicates whether that element should be in the resulting list, and the list that is to be filtered itself. We can create an anonymous function for this purpose using a lambda expression::
 84 | 
 85 |         filter(lambda element: element is not None, chain(...))
 86 | 
 87 |    In this case, we can use a somewhat obscure shortcut, though. If the function given to :py:func:`filter` is :py:obj:`None`, the identity function is used. Therefore, objects will be included in the resulting list if and only if they are "truthy", which :py:obj:`None` isn't. The integer :py:obj:`0` isn't truthy either, but this isn't a problem in this case since the presence or absence of zeros does not affect the sum. Therefore, we can use the expression ::
 88 | 
 89 |         filter(None, chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')]))
 90 | 
 91 | #. Finally, we need to sum the elements of the list. For that, we can just use the :py:func:`sum` function, so that the expression is now ::
 92 | 
 93 |         sum(filter(None, chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')])))
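
The complete expression can be tried outside ZMON with sample data. The following is a minimal sketch under stated assumptions: the ``payload`` dictionary is a made-up stand-in for the result of ``http(...).json()``, and :py:func:`chain` is imported explicitly, which a plain Python interpreter needs.

```python
from itertools import chain

# Hypothetical stand-in for http(...).json(); keys and values are made up.
payload = {
    'shop_a_orderfails': [1, 0],
    'shop_b_orderfails': [None, 2],
    'shop_a_orders': [40, 50],  # ignored: key does not end with '_orderfails'
}

# Step 1: keep only the lists whose key ends with '_orderfails'.
selected = [value for key, value in payload.items() if key.endswith('_orderfails')]

# Step 2: flatten the list of lists by unpacking it into chain().
flattened = list(chain(*selected))

# Step 3 and 4: drop falsy elements (None and 0) and sum the rest.
total = sum(filter(None, flattened))
print(total)
```

Dropping the zeros along with :py:obj:`None` is harmless here only because the values are summed; a count or an average would need the explicit ``is not None`` test instead.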
 94 | 
 95 | 
 96 | Python Recipes
 97 | ==============
 98 | 
 99 | .. describe:: Merging Data Into One Result
100 | 
101 |     You can merge heterogeneous data into a single result object:
102 | 
103 |     .. code-block:: python
104 | 
105 |         {
106 |             'http_data': http(...).json()[...],
107 |             'jmx_data':  jmx().query(...).results()[...],
108 |             'sql_data':  sql().execute(...)[...],
109 |         }
110 | 
111 | 
112 | .. describe:: Mapping SQL Results by ID
113 | 
114 |     The SQL ``results()`` method returns a list of maps (``[{'id': 1, 'data': 1000}, {'id': 2, 'data': 2000}]``). You can convert this to a single map (``{1: 1000, 2: 2000}``) like this:
115 | 
116 |     .. code-block:: python
117 | 
118 |         { row['id']: row['data'] for row in sql().execute(...).results() }
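
    To see the conversion in isolation, here is a small sketch that replaces the query with a literal list. The name ``rows`` is an assumption standing in for the value returned by ``sql().execute(...).results()``.

```python
# Hypothetical stand-in for sql().execute(...).results()
rows = [{'id': 1, 'data': 1000}, {'id': 2, 'data': 2000}]

# Build one map keyed by the 'id' column.
by_id = {row['id']: row['data'] for row in rows}
print(by_id)
```

    If two rows share the same ``id``, the later row silently overwrites the earlier one, so this recipe fits best when ``id`` is unique.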
119 | 
120 | 
121 | .. describe:: Using Multiple Captures
122 | 
123 |     If you have an alert condition such as
124 | 
125 |     .. code-block:: python
126 | 
127 |         FOO > 10 or BAR > 10
128 | 
129 |     adding captures is a bit tricky. If you use
130 | 
131 |     .. code-block:: python
132 | 
133 |         capture(foo=FOO) > 10 or capture(bar=BAR) > 10
134 | 
135 |     and both ``FOO`` and ``BAR`` are greater than 10, only ``foo`` will be captured because the ``or`` uses short-circuit evaluation (``True or X`` is true for all ``X``, so ``X`` doesn't need to be evaluated). Instead, you can use
136 | 
137 |     .. code-block:: python
138 | 
139 |         any([capture(foo=FOO) > 10, capture(bar=BAR) > 10])
140 | 
141 |     which will always evaluate both comparisons and thus capture both values.
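
    The difference can be reproduced in a plain Python interpreter. The sketch below uses a hypothetical ``capture`` stub that only records the named value and returns it; it is not ZMON's implementation, and the values of ``FOO`` and ``BAR`` are made up.

```python
captured = {}


def capture(**kwargs):
    """Hypothetical stub: remember one named value and return it."""
    (name, value), = kwargs.items()
    captured[name] = value
    return value


FOO, BAR = 20, 30

# With "or", the second operand is never evaluated once the first is true.
captured.clear()
result_or = capture(foo=FOO) > 10 or capture(bar=BAR) > 10
captured_with_or = dict(captured)

# The list is built completely before any() looks at it.
captured.clear()
result_any = any([capture(foo=FOO) > 10, capture(bar=BAR) > 10])
captured_with_any = dict(captured)

print(captured_with_or, captured_with_any)
```

    Note that the square brackets matter: a generator expression passed to ``any`` would short-circuit just like ``or``.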
142 | 
143 | 
144 | .. describe:: Defining Temporary Variables
145 | 
146 |     You aren't supposed to be able to define variables, but you can work around this restriction as follows:
147 | 
148 |     .. code-block:: python
149 | 
150 |         (lambda x:
151 |             # Some complex operation using x multiple times
152 |         )(
153 |             x = sql().execute(...)  # Some complex or expensive query
154 |         )
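
    A concrete, runnable instance of this pattern might look as follows. ``expensive_query`` is a made-up stand-in for a call such as ``sql().execute(...)``; the counter only shows that the call happens once even though ``x`` is used twice.

```python
calls = []


def expensive_query():
    """Hypothetical stand-in for an expensive call such as sql().execute(...)."""
    calls.append(1)
    return [3, 1, 2]


# x is bound once and then used twice inside the lambda body.
result = (lambda x: max(x) - min(x))(x=expensive_query())
print(result, len(calls))
```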
155 | 
156 | 
157 | .. describe:: Defining Functions
158 | 
159 |     Since you can define variables with the trick above, you can also define functions:
160 | 
161 |     .. code-block:: python
162 | 
163 |         (lambda f:
164 |             # Some complex operation calling f multiple times
165 |         )(
166 |             f = lambda a, b, c: sql().execute(...)  # Some code using the arguments a, b, and c
167 |         )
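
    As a small illustration of the same trick with a function, the sketch below binds a helper to ``f`` and calls it twice. The arithmetic is arbitrary and only chosen to make the result easy to check.

```python
# f is defined once as a lambda and reused inside the outer lambda body.
result = (lambda f: f(1, 2, 3) + f(4, 5, 6))(
    f=lambda a, b, c: a * b + c
)
print(result)
```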
168 | 
169 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/cloudwatch_wrapper.rst:
--------------------------------------------------------------------------------
  1 | .. _cloudwatch:
  2 | 
  3 | CloudWatch
  4 | ----------
  5 | 
  6 | If running on AWS you can use ``cloudwatch()`` to access AWS metrics easily.
  7 | 
  8 | .. py:function:: cloudwatch(region=None, assume_role_arn=None)
  9 | 
 10 |     Initialize CloudWatch wrapper.
 11 | 
 12 |     :param region: AWS region for CloudWatch queries. Will be auto-detected if not supplied.
 13 |     :type region: str
 14 | 
 15 |     :param assume_role_arn: AWS IAM role ARN to be assumed. This can be useful in cross-account CloudWatch queries.
 16 |     :type assume_role_arn: str
 17 | 
 18 | 
 19 | Methods of CloudWatch
 20 | ^^^^^^^^^^^^^^^^^^^^^
 21 | 
 22 | .. py:method:: query_one(dimensions, metric_name, statistics, namespace, period=60, minutes=5, start=None, end=None, extended_statistics=None)
 23 | 
 24 |     Query a single AWS CloudWatch metric and return a single scalar value (float).
 25 |     Metric will be aggregated over the last five minutes using the provided aggregation type.
 26 | 
 27 |     This method is a more low-level variant of the ``query`` method: all parameters, including all dimensions, need to be known.
 28 | 
 29 |     :param dimensions: Cloudwatch dimensions. Example ``{'LoadBalancerName': 'my-elb-name'}``
 30 |     :type dimensions: dict
 31 | 
 32 |     :param metric_name: Cloudwatch metric. Example ``'Latency'``.
 33 |     :type metric_name: str
 34 | 
 35 |     :param statistics: Cloudwatch metric statistics. Example ``'Sum'``
 36 |     :type statistics: list
 37 | 
 38 |     :param namespace: Cloudwatch namespace. Example ``'AWS/ELB'``
 39 |     :type namespace: str
 40 | 
 41 |     :param period: Cloudwatch statistics granularity in seconds. Default is 60.
 42 |     :type period: int
 43 | 
 44 |     :param minutes: Used to determine ``start`` time of the Cloudwatch query. Default is 5. Ignored if ``start`` is supplied.
 45 |     :type minutes: int
 46 | 
 47 |     :param start: Cloudwatch start timestamp. Default is ``None``.
 48 |     :type start: int
 49 | 
 50 |     :param end: Cloudwatch end timestamp. Default is ``None``. If not supplied, then end time is now.
 51 |     :type end: int
 52 | 
 53 |     :param extended_statistics: Cloudwatch ExtendedStatistics for percentiles query. Example ``['p95', 'p99']``.
 54 |     :type extended_statistics: list
 55 | 
 56 |     :return: Return a float if single value, dict otherwise.
 57 |     :rtype: float, dict
 58 | 
 59 | 
 60 |     Example query with percentiles for AWS ALB:
 61 | 
 62 |     .. code-block:: python
 63 | 
 64 |         cloudwatch().query_one({'LoadBalancer': 'app/my-alb/1234'}, 'TargetResponseTime', 'Average', 'AWS/ApplicationELB', extended_statistics=['p95', 'p99', 'p99.45'])
 65 |         {
 66 |             'Average': 0.224,
 67 |             'p95': 0.245,
 68 |             'p99': 0.300,
 69 |             'p99.45': 0.500
 70 |         }
 71 | 
 72 | .. note::
 73 | 
 74 |    In very rare cases, e.g. for ELB metrics, you may see only 1/2 or 1-2/3 of the value in ZMON due to a race condition of what data is already present in CloudWatch.
 75 |    To fix this, click "evaluate" on the alert; this will trigger the check and move its execution time to a new start time.
 76 | 
 77 | .. py:method:: query(dimensions, metric_name, statistics='Sum', namespace=None, period=60, minutes=5)
 78 | 
 79 |   Query AWS CloudWatch for metrics. Metrics will be aggregated over the last five minutes using the provided aggregation type (default "Sum").
 80 | 
 81 |   *dimensions* is a dictionary to filter the metrics to query. See the `list_metrics boto documentation`_.
 82 |   You can provide the special value "NOT_SET" for a dimension to only query metrics where the given key is not set.
 83 |   This makes sense e.g. for ELB metrics as they are available both per AZ ("AvailabilityZone" has a value) and aggregated over all AZs ("AvailabilityZone" not set).
 84 |   Additionally you can include the special "*" character in a dimension value to do fuzzy (shell globbing) matching.
 85 | 
 86 |   *metric_name* is the name of the metric to filter against (e.g. "RequestCount").
 87 | 
 88 |   *namespace* is an optional namespace filter (e.g. "AWS/EC2").
 89 | 
 90 |   To query an ELB for requests per second:
 91 | 
 92 |   .. code-block:: python
 93 | 
 94 |         # both using special "NOT_SET" and "*" in dimensions here:
 95 |         val = cloudwatch().query({'AvailabilityZone': 'NOT_SET', 'LoadBalancerName': 'pierone-*'}, 'RequestCount', 'Sum')['RequestCount']
 96 |         requests_per_second = val / 60
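
  To make the two special dimension values more tangible, the following sketch shows one way such a filter could be evaluated against a metric's dimensions. It is an illustration only and not the wrapper's actual code: the function ``dimensions_match`` and the sample dictionaries are assumptions, and the globbing uses the standard library's ``fnmatch``.

```python
from fnmatch import fnmatch


def dimensions_match(wanted, actual):
    """Return True if the metric's dimensions satisfy the filter (illustrative only)."""
    for key, pattern in wanted.items():
        if pattern == 'NOT_SET':
            # The key must be absent from the metric's dimensions.
            if key in actual:
                return False
        elif key not in actual or not fnmatch(actual[key], pattern):
            # The key must be present and match the shell-style pattern.
            return False
    return True


wanted = {'AvailabilityZone': 'NOT_SET', 'LoadBalancerName': 'pierone-*'}
aggregated = {'LoadBalancerName': 'pierone-live'}
per_az = {'LoadBalancerName': 'pierone-live', 'AvailabilityZone': 'eu-west-1a'}

print(dimensions_match(wanted, aggregated), dimensions_match(wanted, per_az))
```

  Under these assumptions, only the metric aggregated over all availability zones is selected, which is the case the ELB example above relies on.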
 97 | 
 98 | You can find existing metrics with the AWS CLI tools:
 99 | 
100 | .. code-block:: bash
101 | 
102 |     $ aws cloudwatch list-metrics --namespace "AWS/EC2"
103 | 
104 | Use the "dimensions" argument to select on what dimension(s) to aggregate over:
105 | 
106 | .. code-block:: bash
107 | 
108 |     $ aws cloudwatch list-metrics --namespace "AWS/EC2" --dimensions Name=AutoScalingGroupName,Value=my-asg-FEYBCZF
109 | 
110 | The desired metric can now be queried in ZMON:
111 | 
112 | .. code-block:: python
113 | 
114 |     cloudwatch().query({'AutoScalingGroupName': 'my-asg-*'}, 'DiskReadBytes', 'Sum')
115 | 
116 | 
117 | .. _list_metrics boto documentation: http://boto.readthedocs.org/en/latest/ref/cloudwatch.html#boto.ec2.cloudwatch.CloudWatchConnection.list_metrics
118 | 
119 | 
120 | .. py:method:: alarms(alarm_names=None, alarm_name_prefix=None, state_value=STATE_ALARM, action_prefix=None, max_records=50)
121 | 
122 |     Retrieve cloudwatch alarms filtered by state value.
123 | 
124 |     See `describe_alarms boto documentation`_ for more details.
125 | 
126 |     :param alarm_names: List of alarm names.
127 |     :type alarm_names: list
128 | 
129 |     :param alarm_name_prefix: Prefix of alarms. Cannot be specified if ``alarm_names`` is specified.
130 |     :type alarm_name_prefix: str
131 | 
132 |     :param state_value: State value used in alarm filtering. Available values are ``OK``, ``ALARM`` (default) and ``INSUFFICIENT_DATA``.
133 |     :type state_value: str
134 | 
135 |     :param action_prefix: Action name prefix. Example ``arn:aws:autoscaling:`` to filter results for all autoscaling related alarms.
136 |     :type action_prefix: str
137 | 
138 |     :param max_records: Maximum records to be returned. Default is 50.
139 |     :type max_records: int
140 | 
141 |     :return: List of MetricAlarms.
142 |     :rtype: list
143 | 
144 | 
145 | .. _describe_alarms boto documentation: http://boto3.readthedocs.io/en/latest/reference/services/cloudwatch.html#CloudWatch.Client.describe_alarms
146 | 
147 | .. code-block:: python
148 | 
149 |     cloudwatch().alarms(state_value='ALARM')[0]
150 |     {
151 |         'ActionsEnabled': True,
152 |         'AlarmActions': ['arn:aws:autoscaling:...'],
153 |         'AlarmArn': 'arn:aws:cloudwatch:...',
154 |         'AlarmConfigurationUpdatedTimestamp': datetime.datetime(2016, 5, 12, 10, 44, 15, 707000, tzinfo=tzutc()),
155 |         'AlarmDescription': 'Scale-down if CPU < 50% for 10.0 minutes (Average)',
156 |         'AlarmName': 'metric-alarm-for-service-x',
157 |         'ComparisonOperator': 'LessThanThreshold',
158 |         'Dimensions': [
159 |             {
160 |                 'Name': 'AutoScalingGroupName',
161 |                 'Value': 'service-x-asg'
162 |             }
163 |         ],
164 |         'EvaluationPeriods': 2,
165 |         'InsufficientDataActions': [],
166 |         'MetricName': 'CPUUtilization',
167 |         'Namespace': 'AWS/EC2',
168 |         'OKActions': [],
169 |         'Period': 300,
170 |         'StateReason': 'Threshold Crossed: 1 datapoint (36.1) was less than the threshold (50.0).',
171 |         'StateReasonData': '{...}',
172 |         'StateUpdatedTimestamp': datetime.datetime(2016, 5, 12, 10, 44, 16, 294000, tzinfo=tzutc()),
173 |         'StateValue': 'ALARM',
174 |         'Statistic': 'Average',
175 |         'Threshold': 50.0
176 |     }
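
Because the result is a plain list of dictionaries, it can be post-processed with ordinary Python. The sketch below works on a shortened, partly made-up sample of such a list (the second alarm is hypothetical) and collects the names of all alarms in one namespace.

```python
# Shortened sample of an alarms() result; the second entry is made up.
alarms = [
    {'AlarmName': 'metric-alarm-for-service-x', 'Namespace': 'AWS/EC2', 'StateValue': 'ALARM'},
    {'AlarmName': 'latency-alarm-for-service-y', 'Namespace': 'AWS/ELB', 'StateValue': 'ALARM'},
]

# Keep only the names of alarms that belong to the EC2 namespace.
ec2_alarm_names = [alarm['AlarmName'] for alarm in alarms if alarm['Namespace'] == 'AWS/EC2']
print(ec2_alarm_names)
```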
177 | 


--------------------------------------------------------------------------------
/docs/conf.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: utf-8 -*-
  3 | 
  4 | #
  5 | # ZMON documentation build configuration file, created by
  6 | # sphinx-quickstart on Fri Jan 24 19:24:14 2014.
  7 | #
  8 | # This file is execfile()d with the current directory set to its containing dir.
  9 | #
 10 | # Note that not all possible configuration values are present in this
 11 | # autogenerated file.
 12 | #
 13 | # All configuration values have a default; values that are commented out
 14 | # serve to show the default.
 15 | 
 16 | import sys
 17 | import os
 18 | import sphinx_rtd_theme
 19 | 
 20 | # If extensions (or modules to document with autodoc) are in another directory,
 21 | # add these directories to sys.path here. If the directory is relative to the
 22 | # documentation root, use os.path.abspath to make it absolute, like shown here.
 23 | #sys.path.insert(0, os.path.abspath('../../zmon-worker/src'))
 24 | 
 25 | # -- General configuration -----------------------------------------------------
 26 | 
 27 | # If your documentation needs a minimal Sphinx version, state it here.
 28 | # needs_sphinx = '1.0'
 29 | 
 30 | # Add any Sphinx extension module names here, as strings. They can be extensions
 31 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
 32 | extensions = ['sphinx.ext.viewcode', 'sphinx.ext.autodoc', 'sphinx.ext.intersphinx']
 33 | 
 34 | # Add any paths that contain templates here, relative to this directory.
 35 | templates_path = ['_templates']
 36 | 
 37 | # The suffix of source filenames.
 38 | source_suffix = '.rst'
 39 | 
 40 | # The encoding of source files.
 41 | # source_encoding = 'utf-8-sig'
 42 | 
 43 | # The master toctree document.
 44 | master_doc = 'index'
 45 | 
 46 | # General information about the project.
 47 | project = u'ZMON'
 48 | copyright = u'2014, Zalando SE'
 49 | 
 50 | # The version info for the project you're documenting, acts as replacement for
 51 | # |version| and |release|, also used in various other places throughout the
 52 | # built documents.
 53 | #
 54 | # The short X.Y version.
 55 | version = '2.0'
 56 | # The full version, including alpha/beta/rc tags.
 57 | release = '2.0'
 58 | 
 59 | # The language for content autogenerated by Sphinx. Refer to documentation
 60 | # for a list of supported languages.
 61 | # language = None
 62 | 
 63 | # There are two options for replacing |today|: either, you set today to some
 64 | # non-false value, then it is used:
 65 | # today = ''
 66 | # Else, today_fmt is used as the format for a strftime call.
 67 | # today_fmt = '%B %d, %Y'
 68 | 
 69 | # List of patterns, relative to source directory, that match files and
 70 | # directories to ignore when looking for source files.
 71 | exclude_patterns = ['_build']
 72 | 
 73 | # The reST default role (used for this markup: `text`) to use for all documents.
 74 | # default_role = None
 75 | 
 76 | # If true, '()' will be appended to :func: etc. cross-reference text.
 77 | # add_function_parentheses = True
 78 | 
 79 | # If true, the current module name will be prepended to all description
 80 | # unit titles (such as .. function::).
 81 | # add_module_names = True
 82 | 
 83 | # If true, sectionauthor and moduleauthor directives will be shown in the
 84 | # output. They are ignored by default.
 85 | # show_authors = False
 86 | 
 87 | # The name of the Pygments (syntax highlighting) style to use.
 88 | pygments_style = 'sphinx'
 89 | 
 90 | # A list of ignored prefixes for module index sorting.
 91 | # modindex_common_prefix = []
 92 | 
 93 | # -- Options for HTML output ---------------------------------------------------
 94 | 
 95 | # The theme to use for HTML and HTML Help pages.  See the documentation for
 96 | # a list of builtin themes.
 97 | html_theme = 'sphinx_rtd_theme'
 98 | #html_style = 'zmon.css'
 99 | 
100 | # Theme options are theme-specific and customize the look and feel of a theme
101 | # further.  For a list of options available for each theme, see the
102 | # documentation.
103 | 
104 | # Add any paths that contain custom themes here, relative to this directory.
105 | # html_theme_path = []
106 | 
107 | # The name for this set of Sphinx documents.  If None, it defaults to
108 | # "<project> v<release> documentation".
109 | # html_title = None
110 | 
111 | # A shorter title for the navigation bar.  Default is the same as html_title.
112 | # html_short_title = None
113 | 
114 | # The name of an image file (relative to this directory) to place at the top
115 | # of the sidebar.
116 | # html_logo = None
117 | 
118 | # The name of an image file (within the static path) to use as favicon of the
119 | # docs.  This file should be a Windows icon file (.ico) being 16x16 or 32x32
120 | # pixels large.
121 | # html_favicon = None
122 | 
123 | # Add any paths that contain custom static files (such as style sheets) here,
124 | # relative to this directory. They are copied after the builtin static files,
125 | # so a file named "default.css" will overwrite the builtin "default.css".
126 | html_static_path = ['_static']
127 | 
128 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
129 | # using the given strftime format.
130 | # html_last_updated_fmt = '%b %d, %Y'
131 | 
132 | # If true, SmartyPants will be used to convert quotes and dashes to
133 | # typographically correct entities.
134 | # html_use_smartypants = True
135 | 
136 | # Custom sidebar templates, maps document names to template names.
137 | # html_sidebars = {}
138 | 
139 | # Additional templates that should be rendered to pages, maps page names to
140 | # template names.
141 | # html_additional_pages = {}
142 | 
143 | # If false, no module index is generated.
144 | # html_domain_indices = True
145 | 
146 | # If false, no index is generated.
147 | # html_use_index = True
148 | 
149 | # If true, the index is split into individual pages for each letter.
150 | # html_split_index = False
151 | 
152 | # If true, links to the reST sources are added to the pages.
153 | # html_show_sourcelink = True
154 | 
155 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
156 | # html_show_sphinx = True
157 | 
158 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
159 | # html_show_copyright = True
160 | 
161 | # If true, an OpenSearch description file will be output, and all pages will
162 | # contain a <link> tag referring to it.  The value of this option must be the
163 | # base URL from which the finished HTML is served.
164 | # html_use_opensearch = ''
165 | 
166 | # This is the file name suffix for HTML files (e.g. ".xhtml").
167 | # html_file_suffix = None
168 | 
169 | # Output file base name for HTML help builder.
170 | htmlhelp_basename = 'ZMONdoc'
171 | 
172 | # -- Options for LaTeX output --------------------------------------------------
173 | 
174 | latex_elements = {}
175 | # The paper size ('letterpaper' or 'a4paper').
176 | # 'papersize': 'letterpaper',
177 | 
178 | # The font size ('10pt', '11pt' or '12pt').
179 | # 'pointsize': '10pt',
180 | 
181 | # Additional stuff for the LaTeX preamble.
182 | # 'preamble': '',
183 | 
184 | # Grouping the document tree into LaTeX files. List of tuples
185 | # (source start file, target name, title, author, documentclass [howto/manual]).
186 | latex_documents = [(
187 |     'index',
188 |     'ZMON.tex',
189 |     u'ZMON Documentation',
190 |     u'Henning Jacobs',
191 |     'manual',
192 | )]
193 | 
194 | # The name of an image file (relative to this directory) to place at the top of
195 | # the title page.
196 | # latex_logo = None
197 | 
198 | # For "manual" documents, if this is true, then toplevel headings are parts,
199 | # not chapters.
200 | # latex_use_parts = False
201 | 
202 | # If true, show page references after internal links.
203 | # latex_show_pagerefs = False
204 | 
205 | # If true, show URL addresses after external links.
206 | # latex_show_urls = False
207 | 
208 | # Documents to append as an appendix to all manuals.
209 | # latex_appendices = []
210 | 
211 | # If false, no module index is generated.
212 | # latex_domain_indices = True
213 | 
214 | # -- Options for manual page output --------------------------------------------
215 | 
216 | # One entry per manual page. List of tuples
217 | # (source start file, name, description, authors, manual section).
218 | man_pages = [(
219 |     'index',
220 |     'zmon',
221 |     u'ZMON Documentation',
222 |     [u'Henning Jacobs'],
223 |     1,
224 | )]
225 | 
226 | # If true, show URL addresses after external links.
227 | # man_show_urls = False
228 | 
229 | # -- Options for Texinfo output ------------------------------------------------
230 | 
231 | # Grouping the document tree into Texinfo files. List of tuples
232 | # (source start file, target name, title, author,
233 | #  dir menu entry, description, category)
234 | texinfo_documents = [(
235 |     'index',
236 |     'ZMON',
237 |     u'ZMON Documentation',
238 |     u'Henning Jacobs',
239 |     'ZMON',
240 |     'One line description of project.',
241 |     'Miscellaneous',
242 | )]
243 | 
244 | # Documents to append as an appendix to all manuals.
245 | # texinfo_appendices = []
246 | 
247 | # If false, no module index is generated.
248 | # texinfo_domain_indices = True
249 | 
250 | # How to display URL addresses: 'footnote', 'no', or 'inline'.
251 | # texinfo_show_urls = 'footnote'
252 | 
253 | intersphinx_mapping = {'python': ('http://docs.python.org/2', None)}
254 | 


--------------------------------------------------------------------------------
/docs/intro.rst:
--------------------------------------------------------------------------------
  1 | ************
  2 | Introduction
  3 | ************
  4 | 
  5 | ZMON is a flexible and extensible open-source platform monitoring tool developed at Zalando_ and has been in production use since early 2014.
  6 | It offers proven scaling with its distributed nature and fast storage with KairosDB on top of Cassandra.
  7 | ZMON splits checking (data acquisition) from the alerting responsibilities and uses abstract entities to describe what's being monitored.
  8 | Its checks and alerts rely on Python expressions, giving the user a lot of power and connectivity.
  9 | Besides the UI it provides RESTful APIs to manage and configure most properties automatically.
 10 | 
 11 | Anyone can use ZMON, but it offers particular advantages for technical organizations with many autonomous teams.
 12 | Its front end (see Demo_ / Bootstrap_ / Kubernetes_ / Vagrant_) comes with Grafana3 "built-in," enabling teams to create and manage their own data-driven dashboards alongside ZMON's own team/personal dashboards for alerts and custom widgets.
 13 | Being able to inherit and clone alerts makes it easier for teams to reuse and share code.
 14 | Alerts can trigger HipChat, Slack, and E-Mail notifications.
 15 | iOS and Android clients are works in progress, but push notifications are already implemented.
 16 | 
 17 | ZMON also enables painless integration with CMDBs and deployment tools.
 18 | It also supports service discovery via custom adapters or its built-in entity service's REST API.
 19 | For an example, see zmon-aws-agent_ to learn how we connect AWS service discovery with our monitoring in the cloud.
 20 | 
 21 | Feel free to contact us via `slack.zmon.io`_.
 22 | 
 23 | ZMON Components
 24 | ===============
 25 | 
 26 | .. image:: images/components.svg
 27 | 
 28 | A minimum ZMON setup requires these four components:
 29 | 
 30 | - zmon-controller_: UI/Grafana/OAuth2 Login/GitHub Login
 31 | - zmon-scheduler_: Scheduling check/alert evaluation
 32 | - zmon-worker_: Doing the heavy lifting
 33 | - zmon-eventlog-service_: History for state changes and modifications
 34 | 
 35 | Plus the storage covered in the :ref:`requirements` section.
 36 | 
 37 | The following components are optional:
 38 | 
 39 | - zmon-cli_: A command line client for managing entities/checks/alerts if needed
 40 | - zmon-aws-agent_: Works with the AWS API to retrieve "known" applications
 41 | - zmon-data-service_: API for multi DC federation: receiver for remote workers primarily
 42 | - zmon-metric-cache_: Small scale special purpose metric store for API metrics in ZMON's cloud UI
 43 | - zmon-notification-service_: Provides mobile API and push notification support for GCM to Android/iOS app
 44 | - zmon-android_: An Android client for ZMON monitoring
 45 | - zmon-ios_: An iOS client for ZMON monitoring
 46 | 
 47 | ZMON Origins
 48 | ============
 49 | 
 50 | ZMON was born in late 2013 during Zalando's annual `Hack Week`_, when a group of Zalando engineers aimed to develop a replacement for ICINGA.
 51 | Scalability, manageability and flexibility were all critical, as Zalando's small teams needed to be able to monitor their services independent of each other.
 52 | In early 2014, Zalando teams began migrating all checks to ZMON, which continues to serve Zalando Tech.
 53 | 
 54 | Entities
 55 | ========
 56 | 
 57 | ZMON uses entities to describe your infrastructure or platform, and to bind check variables to fixed values.
 58 | 
 59 | .. code-block:: json
 60 | 
 61 |   {
 62 | 	"type":"host",
 63 | 	"id":"cassandra01",
 64 | 	"host":"cassandra01",
 65 | 	"role":"cassandra-host",
 66 | 	"ip":"192.168.1.17",
 67 | 	"dc":"data-center-1"
 68 |   }
 69 | 
 70 | Or more abstract objects:
 71 | 
 72 | .. code-block:: json
 73 | 
 74 |   {
 75 |   	"type":"postgresql-cluster",
 76 |   	"id":"article-cluster",
 77 |   	"name":"article-cluster",
 78 |   	"shards": {
 79 | 		"shard1":"articledb01:5432/shard1",
 80 | 		"shard2":"articledb02:5432/shard2"
 81 |   	}
 82 |   }
 83 | 
 84 | Entity properties are not defined in any schema, so you can add properties as you see fit. This enables finer-grained filtering or selection of entities later on. As an example, host entities can include a physical model to later select the proper hardware checks.
 85 | 
 86 | Below you see an example of the entity view with alerts per entity.
 87 | 
 88 | .. image:: images/entities.png
 89 | 
 90 | Checks
 91 | ======
 92 | 
 93 | A check describes how data is acquired. Its key properties are: a command to execute and an entity filter. The filter selects a subset of entities by requiring an overlap on specified properties. An example:
 94 | 
 95 | .. code-block:: json
 96 | 
 97 |   {
 98 |     "type":"postgresql-cluster", "name":"article-cluster"
 99 |   }
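
The overlap rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ZMON's scheduler code: the helper ``matches`` is hypothetical, and the two entities are shortened versions of the examples from the Entities section.

```python
def matches(entity, entity_filter):
    """True if every property named in the filter has the same value on the entity."""
    return all(entity.get(key) == value for key, value in entity_filter.items())


entities = [
    {'type': 'host', 'id': 'cassandra01', 'role': 'cassandra-host'},
    {'type': 'postgresql-cluster', 'id': 'article-cluster', 'name': 'article-cluster'},
]
entity_filter = {'type': 'postgresql-cluster', 'name': 'article-cluster'}

selected_ids = [entity['id'] for entity in entities if matches(entity, entity_filter)]
print(selected_ids)
```

Properties that the filter does not mention are ignored, which is why entities can carry arbitrary extra properties without affecting existing checks.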
100 | 
101 | The check command itself is an executable Python_ expression. ZMON provides many custom wrappers that bind to the selected entity. The following example uses a PostgreSQL wrapper to execute a query on every shard defined above:
102 | 
103 | .. code-block:: python
104 | 
105 |   # sql() in this context is aware of the "shards" property
106 | 
107 |   sql().execute('SELECT count(1) FROM articles "total"').result()
108 | 
109 | A check command always returns a value to the alert. This can be of any Python type.
110 | 
111 | Not familiar with Python's functional expressions? No worries: ZMON allows you to define a top-level function and define your command in an easier, less functional way:
112 | 
113 | .. code-block:: python
114 | 
115 |   def check():
116 |     # sql() binds to the entity used and thus knows the connection URLs
117 |     return sql().execute('SELECT count(1) FROM articles "total"').result()
118 | 
119 | Alerts
120 | ======
121 | 
122 | A basic alert consists of an alert condition, an entity filter, and a team.
123 | An alert has only two states: up or down.
124 | An alert is up if it yields anything but False; this also includes exceptions thrown during evaluation of the check or alert, e.g. in the event of connection problems.
125 | ZMON does not support levels of criticality, or something like "unknown", but you have a color option to customize sort and style on your dashboard (red, orange, yellow).
126 | 
127 | Let's revisit the PostgreSQL check above. The alert below would pop up either if no articles are found or if we get an exception connecting to the PostgreSQL database.
128 | 
129 | .. code-block:: yaml
130 | 
131 |   team: database
132 |   entities:
133 |     - type: postgresql-cluster
134 |   alert_condition: |
135 |     value <= 0
136 | 
137 | Alerts raised by exceptions are marked in the dashboard with a "!".
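
The up/down rule described above can be modelled roughly as follows. This is only a sketch of the stated rule, not the worker's actual evaluation code: the function ``alert_is_up`` is hypothetical, the alert condition is passed in as a Python callable, and ``None`` stands in for a check value that makes the condition fail.

```python
def alert_is_up(condition, value):
    """Model of the rule: up unless the condition yields False; exceptions count as up."""
    try:
        return condition(value) is not False
    except Exception:
        return True


def no_articles(value):
    # Mirrors the alert_condition "value <= 0" from the example above.
    return value <= 0


print(alert_is_up(no_articles, 0))     # no articles found
print(alert_is_up(no_articles, 5))     # articles present
print(alert_is_up(no_articles, None))  # comparison raises TypeError in Python 3
```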
138 | 
139 | Via ZMON's UI, alerts support parameters to the alert condition.
140 | This makes it easy for teams/users to implement different thresholds, and — with the priority field defining the dashboard color — render their dashboards to reflect their priorities.
141 | 
142 | Dashboards
143 | ==========
144 | 
145 | Dashboards include a widget area where you can render important data with charts, gauges, or plain text.
146 | Another section features rendering of all active alerts for the team filter, defined at the dashboard level.
147 | Using the team filter, select the alerts you want your dashboard to include.
148 | Specify multiple teams, if necessary. TAGs are supported to subselect topics.
149 | 
150 | .. image:: images/dashboard.png
151 | 
152 | REST API and CLI
153 | ================
154 | 
155 | To make your life easier, ZMON's REST API manages all the essential moving parts to support your daily work — creating and updating entities to allow for sync-up with your existing infrastructure.
156 | When you create and modify checks and alerts, the scheduler will quickly pick up these changes so you won't have to restart or deploy anything.
157 | 
158 | And ZMON's command line client - a slim wrapper around the REST API - also adds usability by making it simpler to work with YAML files or push collections of entities.
159 | 
160 | Development Status
161 | ==================
162 | The team behind ZMON continues to improve performance and functionality. Please let us know via GitHub's issues tracker if you find any bugs or issues.
163 | 
164 | .. _Python: http://www.python.org
165 | .. _Zalando: https://tech.zalando.de/
166 | .. _zmon-controller: https://github.com/zalando-zmon/zmon-controller
167 | .. _Demo: https://demo.zmon.io
168 | .. _Bootstrap: https://github.com/zalando-zmon/zmon-demo
169 | .. _Vagrant: https://github.com/zalando/zmon
170 | .. _zmon-scheduler: https://github.com/zalando-zmon/zmon-scheduler
171 | .. _zmon-worker: https://github.com/zalando-zmon/zmon-worker
172 | .. _zmon-eventlog-service: https://github.com/zalando-zmon/zmon-eventlog-service
173 | .. _zmon-android: https://github.com/zalando-zmon/zmon-android
174 | .. _zmon-ios: https://github.com/zalando-zmon/zmon-ios
175 | .. _zmon-cli: https://github.com/zalando-zmon/zmon-cli
176 | .. _zmon-actuator: https://github.com/zalando-zmon/zmon-actuator
177 | .. _zmon-aws-agent: https://github.com/zalando-zmon/zmon-aws-agent
178 | .. _zmon-data-service: https://github.com/zalando-zmon/zmon-data-service
179 | .. _zmon-notification-service: https://github.com/zalando-zmon/zmon-notification-service
180 | .. _zmon-metric-cache: https://github.com/zalando-zmon/zmon-metric-cache
181 | .. _Hack Week: https://tech.zalando.de/blog/?tags=Hack%20Week
182 | .. _slack.zmon.io: https://slack.zmon.io
183 | .. _Kubernetes: https://github.com/zalando-zmon/zmon-kubernetes
184 | 


--------------------------------------------------------------------------------
/docs/user/check-ref/kubernetes_wrapper.rst:
--------------------------------------------------------------------------------
  1 | Kubernetes
  2 | ----------
  3 | 
  4 | Provides a wrapper for querying Kubernetes cluster resources.
  5 | 
  6 | 
  7 | .. py:function:: kubernetes(namespace='default')
  8 | 
  9 |     If ``namespace`` is ``None`` then **all** namespaces will be queried. This, however, will increase the number of calls to the Kubernetes API server.
 10 | 
 11 | .. note::
 12 | 
 13 |     - Kubernetes wrapper will authenticate using service account, which assumes the worker is running in a Kubernetes cluster.
 14 |     - All Kubernetes wrapper calls are scoped to the Kubernetes cluster hosting the worker. It is not intended to be used in querying multiple clusters.
 15 | 
 16 | .. _labelSelectors:
 17 | 
 18 | Label Selectors
 19 | ^^^^^^^^^^^^^^^
 20 | 
 21 | Kubernetes API provides a way to filter resources using `labelSelector <https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/>`_. Kubernetes wrapper provides a friendly syntax for filtering.
 22 | 
 23 | The following examples show different usage of the Kubernetes wrapper utilizing label filtering:
 24 | 
 25 | .. code-block:: python
 26 | 
 27 |     # Get all pods with label ``application`` equal to ``zmon-worker``
 28 |     kubernetes().pods(application='zmon-worker')
 29 |     kubernetes().pods(application__eq='zmon-worker')
 30 | 
 31 | 
 32 |     # Get all pods with label ``application`` **not equal to** ``zmon-worker``
 33 |     kubernetes().pods(application__neq='zmon-worker')
 34 | 
 35 | 
 36 |     # Get all pods with label ``application`` **any of** ``zmon-worker`` or ``zmon-agent``
 37 |     kubernetes().pods(application__in=['zmon-worker', 'zmon-agent'])
 38 | 
 39 |     # Get all pods with label ``application`` **not any of** ``zmon-worker`` or ``zmon-agent``
 40 |     kubernetes().pods(application__notin=['zmon-worker', 'zmon-agent'])
 41 | 
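As a rough illustration of what the keyword syntax above expresses, the sketch below maps ``label__op=value`` filters onto the plain Kubernetes ``labelSelector`` string syntax. The helper name ``build_label_selector`` is hypothetical and this is not the wrapper's actual implementation; it only assumes the four operators shown in the examples (``eq``, ``neq``, ``in``, ``notin``).

```python
# Hypothetical helper for illustration only; NOT the wrapper's real code.
def build_label_selector(**kwargs):
    """Translate ``label__op=value`` keyword filters into a labelSelector string."""
    parts = []
    for key, value in sorted(kwargs.items()):
        # Split "application__in" into ("application", "in"); no suffix means equality.
        label, _, op = key.partition('__')
        op = op or 'eq'
        if op == 'eq':
            parts.append('{}={}'.format(label, value))
        elif op == 'neq':
            parts.append('{}!={}'.format(label, value))
        elif op in ('in', 'notin'):
            # Set-based selectors take a comma-separated list in parentheses.
            parts.append('{} {} ({})'.format(label, op, ','.join(value)))
        else:
            raise ValueError('unsupported operator: {}'.format(op))
    # Multiple requirements are combined with commas (logical AND).
    return ','.join(parts)


print(build_label_selector(application__in=['zmon-worker', 'zmon-agent']))
```

Multiple keyword filters passed together would be combined as a logical AND, which is how comma-separated requirements behave in a Kubernetes label selector.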
 42 | 
 43 | Methods of Kubernetes
 44 | ^^^^^^^^^^^^^^^^^^^^^
 45 | 
 46 | .. py:function:: pods(name=None, phase=None, ready=None, **kwargs)
 47 | 
 48 |         Return list of `Pods <https://kubernetes.io/docs/user-guide/pods/>`_.
 49 | 
 50 |         :param name: Pod name.
 51 |         :type name: str
 52 | 
 53 |         :param phase: Pod status phase. Valid values are: Pending, Running, Failed, Succeeded or Unknown.
 54 |         :type phase: str
 55 | 
 56 |         :param ready: Pod readiness status. If ``None`` then all pods are returned.
 57 |         :type ready: bool
 58 | 
 59 |         :param kwargs: Pod :ref:`labelSelectors` filters.
 60 |         :type kwargs: dict
 61 | 
 62 |         :return: List of pods. Typical pod has "metadata", "status" and "spec" fields.
 63 |         :rtype: list
 64 | 
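A check typically reduces the returned list to a small summary before alerting on it. The sketch below counts pods per phase; it assumes each pod is a plain dict mirroring the Kubernetes API object (with ``status.phase``), and it uses hand-written sample data in place of a real ``kubernetes().pods(...)`` call. The function name ``summarize_pods`` is illustrative.

```python
from collections import Counter


def summarize_pods(pods):
    """Count pods per status phase; pods without a phase are counted as 'Unknown'."""
    phases = Counter(pod.get('status', {}).get('phase', 'Unknown') for pod in pods)
    return dict(phases)


# Sample data standing in for the result of kubernetes().pods(application='zmon-worker').
sample_pods = [
    {'metadata': {'name': 'zmon-worker-1'}, 'status': {'phase': 'Running'}, 'spec': {}},
    {'metadata': {'name': 'zmon-worker-2'}, 'status': {'phase': 'Running'}, 'spec': {}},
    {'metadata': {'name': 'zmon-worker-3'}, 'status': {'phase': 'Pending'}, 'spec': {}},
]

print(summarize_pods(sample_pods))
```

An alert condition could then compare, for example, the ``Pending`` count against a threshold.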
 65 | .. py:function:: nodes(name=None, **kwargs)
 66 | 
 67 |         Return list of `Nodes <https://kubernetes.io/docs/admin/node/>`_. Namespace does not apply.
 68 | 
 69 |         :param name: Node name.
 70 |         :type name: str
 71 | 
 72 |         :param kwargs: Node :ref:`labelSelectors` filters.
 73 |         :type kwargs: dict
 74 | 
 75 |         :return: List of nodes. Typical node has "metadata", "status" and "spec" fields.
 76 |         :rtype: list
 77 | 
 78 | .. py:function:: services(name=None, **kwargs)
 79 | 
 80 |         Return list of `Services <https://kubernetes.io/docs/user-guide/services/>`_.
 81 | 
 82 |         :param name: Service name.
 83 |         :type name: str
 84 | 
 85 |         :param kwargs: Service :ref:`labelSelectors` filters.
 86 |         :type kwargs: dict
 87 | 
 88 |         :return: List of services. Typical service has "metadata", "status" and "spec" fields.
 89 |         :rtype: list
 90 | 
 91 | .. py:function:: endpoints(name=None, **kwargs)
 92 | 
 93 |         Return list of Endpoints.
 94 | 
 95 |         :param name: Endpoint name.
 96 |         :type name: str
 97 | 
 98 |         :param kwargs: Endpoint :ref:`labelSelectors` filters.
 99 |         :type kwargs: dict
100 | 
101 |         :return: List of Endpoints. Typical Endpoint has "metadata" and "subsets" fields.
102 |         :rtype: list
103 | 
104 | .. py:function:: ingresses(name=None, **kwargs)
105 | 
106 |         Return list of `Ingresses <https://kubernetes.io/docs/user-guide/ingress/>`_.
107 | 
108 |         :param name: Ingress name.
109 |         :type name: str
110 | 
111 |         :param kwargs: Ingress :ref:`labelSelectors` filters.
112 |         :type kwargs: dict
113 | 
114 |         :return: List of Ingresses. Typical Ingress has "metadata", "spec" and "status" fields.
115 |         :rtype: list
116 | 
117 | .. py:function:: statefulsets(name=None, replicas=None, **kwargs)
118 | 
119 |         Return list of `Statefulsets <https://kubernetes.io/docs/user-guide/petset/>`_.
120 | 
121 |         :param name: Statefulset name.
122 |         :type name: str
123 | 
124 |         :param replicas: Statefulset replicas.
125 |         :type replicas: int
126 | 
127 |         :param kwargs: Statefulset :ref:`labelSelectors` filters.
128 |         :type kwargs: dict
129 | 
130 |         :return: List of Statefulsets. Typical Statefulset has "metadata", "status" and "spec" fields.
131 |         :rtype: list
132 | 
133 | .. py:function:: daemonsets(name=None, **kwargs)
134 | 
135 |         Return list of `Daemonsets <https://kubernetes.io/docs/admin/daemons/>`_.
136 | 
137 |         :param name: Daemonset name.
138 |         :type name: str
139 | 
140 |         :param kwargs: Daemonset :ref:`labelSelectors` filters.
141 |         :type kwargs: dict
142 | 
143 |         :return: List of Daemonsets. Typical Daemonset has "metadata", "status" and "spec" fields.
144 |         :rtype: list
145 | 
146 | .. py:function:: replicasets(name=None, replicas=None, **kwargs)
147 | 
148 |         Return list of `ReplicaSets <https://kubernetes.io/docs/user-guide/replicasets/>`_.
149 | 
150 |         :param name: ReplicaSet name.
151 |         :type name: str
152 | 
153 |         :param replicas: ReplicaSet replicas.
154 |         :type replicas: int
155 | 
156 |         :param kwargs: ReplicaSet :ref:`labelSelectors` filters.
157 |         :type kwargs: dict
158 | 
159 |         :return: List of ReplicaSets. Typical ReplicaSet has "metadata", "status" and "spec" fields.
160 |         :rtype: list
161 | 
162 | .. py:function:: deployments(name=None, replicas=None, ready=None, **kwargs)
163 | 
164 |         Return list of `Deployments <https://kubernetes.io/docs/user-guide/deployments/>`_.
165 | 
166 |         :param name: Deployment name.
167 |         :type name: str
168 | 
169 |         :param replicas: Deployment replicas.
170 |         :type replicas: int
171 | 
172 |         :param ready: Deployment readiness status.
173 |         :type ready: bool
174 | 
175 |         :param kwargs: Deployment :ref:`labelSelectors` filters.
176 |         :type kwargs: dict
177 | 
178 |         :return: List of Deployments. Typical Deployment has "metadata", "status" and "spec" fields.
179 |         :rtype: list
180 | 
181 | .. py:function:: configmaps(name=None, **kwargs)
182 | 
183 |         Return list of `ConfigMaps <https://kubernetes.io/docs/user-guide/configmap/>`_.
184 | 
185 |         :param name: ConfigMap name.
186 |         :type name: str
187 | 
188 |         :param kwargs: ConfigMap :ref:`labelSelectors` filters.
189 |         :type kwargs: dict
190 | 
191 |         :return: List of ConfigMaps. Typical ConfigMap has "metadata" and "data".
192 |         :rtype: list
193 | 
194 | .. py:function:: persistentvolumeclaims(name=None, phase=None, **kwargs)
195 | 
196 |         Return list of `PersistentVolumeClaims <https://kubernetes.io/docs/user-guide/persistent-volumes/>`_.
197 | 
198 |         :param name: PersistentVolumeClaim name.
199 |         :type name: str
200 | 
201 |         :param phase: Volume phase.
202 |         :type phase: str
203 | 
204 |         :param kwargs: PersistentVolumeClaim :ref:`labelSelectors` filters.
205 |         :type kwargs: dict
206 | 
207 |         :return: List of PersistentVolumeClaims. Typical PersistentVolumeClaim has "metadata", "status" and "spec" fields.
208 |         :rtype: list
209 | 
210 | .. py:function:: persistentvolumes(name=None, phase=None, **kwargs)
211 | 
212 |         Return list of `PersistentVolumes <https://kubernetes.io/docs/user-guide/persistent-volumes/>`_.
213 | 
214 |         :param name: PersistentVolume name.
215 |         :type name: str
216 | 
217 |         :param phase: Volume phase.
218 |         :type phase: str
219 | 
220 |         :param kwargs: PersistentVolume :ref:`labelSelectors` filters.
221 |         :type kwargs: dict
222 | 
223 |         :return: List of PersistentVolumes. Typical PersistentVolume has "metadata", "status" and "spec" fields.
224 |         :rtype: list
225 | 
226 | .. py:function:: jobs(name=None, **kwargs)
227 | 
228 |         Return list of `Jobs <https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/>`_.
229 | 
230 |         :param name: Job name.
231 |         :type name: str
232 | 
233 |         :param kwargs: Job :ref:`labelSelectors` filters.
234 |         :type kwargs: dict
235 | 
236 |         :return: List of Jobs. Typical Job has "metadata", "status" and "spec".
237 |         :rtype: list
238 | 
239 | .. py:function:: cronjobs(name=None, **kwargs)
240 | 
241 |         Return list of `CronJobs <https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/>`_.
242 | 
243 |         :param name: CronJob name.
244 |         :type name: str
245 | 
246 |         :param kwargs: CronJob :ref:`labelSelectors` filters.
247 |         :type kwargs: dict
248 | 
249 |         :return: List of CronJobs. Typical CronJob has "metadata", "status" and "spec".
250 |         :rtype: list
251 | 
252 | .. py:function:: metrics()
253 | 
254 |         Return API server metrics in Prometheus format.
255 | 
256 |         :return: Cluster metrics.
257 |         :rtype: dict
258 | 


--------------------------------------------------------------------------------
/docs/user/monitoringonaws.rst:
--------------------------------------------------------------------------------
  1 | .. _monitoringonaws:
  2 | 
  3 | *****************
  4 | Monitoring on AWS
  5 | *****************
  6 | 
  7 | This section assumes that you're running zmon-aws-agent_, which automatically discovers your EC2 instances, auto-scaling groups, ELBs, and more.
  8 | 
  9 | ZMON AWS agent syncs the following entities from AWS infrastructure:
 10 | 
 11 | - EC2 instances
 12 | - Auto-Scaling groups
 13 | - ELBs (classic and ELBv2)
 14 | - ElastiCache clusters
 15 | - RDS instances
 16 | - DynamoDB tables
 17 | - IAM/ACM certificates
 18 | 
 19 | .. note::
 20 | 
 21 |     ZMON AWS Agent can also be deployed via a single `appliance`_, which runs AWS Agent, `ZMON worker`_ and `ZMON scheduler`_.
 22 | 
 23 | CloudWatch Metrics
 24 | ------------------
 25 | You can achieve most basic monitoring with AWS CloudWatch_. CloudWatch EC2 metrics contain the following information:
 26 | 
 27 | - CPU Utilization
 28 | - Network traffic
 29 | - Disk throughput/operations per second (only for ephemeral storage; EBS volumes are not included)
 30 | 
 31 | ZMON allows querying arbitrary CloudWatch metrics using the :ref:`cloudwatch() <cloudwatch>` wrapper.
 32 | 
 33 | Security Groups
 34 | ---------------
 35 | 
 36 | Depending on your AWS setup, you'll probably have to open particular ports on your instances for access from ZMON. Using a limited set of ports to expose management APIs and the Prometheus node exporter will make your life easier. ZMON allows parsing of Prometheus metrics via :ref:`http().prometheus() <http-prometheus>`.
 37 | 
 38 | You can deploy ZMON into each of your AWS accounts to allow cross-team monitoring and dashboards. Make sure that your security groups allow ZMON to connect to port 9100 of your monitored instances.
 39 | 
 40 | When security groups are not configured properly, this usually shows up as getting no results at all, because packets are dropped by the EC2 instance rather than, for example, the connection being refused.
 41 | 
 42 | Low-Level or Basic Properties
 43 | -----------------------------
 44 | 
 45 | EC2 Instances
 46 | =============
 47 | 
 48 | Having enough **diskspace** on your instance is important; `here's a sample check`_. By default, you can only get space used from CloudWatch_. Using Amazon's own script, you can push free space to CloudWatch and pull this data via ZMON. Alternatively, you can run the `Prometheus Node exporter`_ to pull disk space data from the EC2 node itself via HTTP.
 49 | 
 50 | Similarly, you can pull CPU-related metrics from CloudWatch. The Prometheus Node exporter also exposes these metrics.
 51 | 
 52 | You also need enough available **inodes**.
 53 | 
 54 | Regarding **memory**, you can either query via CloudWatch, use Prometheus Node exporter to feed ZMON, or go with low-level ``snmp()`` [not recommended].
 55 | 
 56 | The following block shows *part* of EC2 instance entity properties:
 57 | 
 58 | .. code-block:: yaml
 59 | 
 60 |     id: a-app-1-2QBrR1[aws:123456789:eu-west-1]
 61 |     type: instance
 62 |     aws_id: i-87654321
 63 |     created_by: agent
 64 |     host: 172.33.173.201
 65 |     infrastructure_account: aws:123456789
 66 |     instance_type: t2.medium
 67 |     ip: 172.33.173.201
 68 |     ports:
 69 |       '5432': 5432
 70 |       '8008': 8008
 71 |     region: eu-west-1
 72 | 
 73 | An example check using :ref:`cloudwatch wrapper <cloudwatch>` and entity properties would look like the following:
 74 | 
 75 | .. code-block:: python
 76 | 
 77 |     cloudwatch().query_one({'InstanceId': entity['aws_id']}, 'CPUUtilization', 'Average', 'AWS/EC2', period=120)
 78 | 
 79 | 
 80 | Elastic Load Balancers
 81 | ======================
 82 | 
 83 | You can query AWS CloudWatch to get ELB-specific metrics. The ZMON agent will put data into the ELB entity, allowing you to monitor instance and healthy instance count.
 84 | 
 85 | .. code-block:: yaml
 86 | 
 87 |     id: elb-a-app-1[aws:123456789:eu-west-1]
 88 |     type: elb
 89 |     elb_type: classic
 90 |     active_members: 1
 91 |     created_by: agent
 92 |     dns_name: internal-a-app-1.eu-west-1.elb.amazonaws.com
 93 |     host: internal-a-app-1.eu-west-1.elb.amazonaws.com
 94 |     infrastructure_account: aws:123456789
 95 |     members: 3
 96 |     region: eu-west-1
 97 |     scheme: internal
 98 | 
 99 | ZMON AWS agent will detect both types of ELB: classic and application load balancers. Entities for both are created in ZMON with ``type:elb``. To distinguish between them in your checks, use the ``elb_type`` property, which holds either ``classic`` or ``application``.
100 | 
101 | Since CloudWatch metrics are different for each ELB type, please check `CloudWatch ELB metrics`_ for detailed reference. An example check using :ref:`CloudWatch wrapper <cloudwatch>` and entity properties would look like the following:
102 | 
103 | .. code-block:: python
104 | 
105 |     # Classic ELB
106 |     lb_name = entity['name']
107 |     key = 'LoadBalancerName'
108 |     namespace = 'AWS/ELB'
109 | 
110 |     # Check if Application ELBv2 entity
111 |     if entity.get('elb_type') == 'application':
112 |         lb_name = entity['cloudwatch_name']
113 |         key = 'LoadBalancer'
114 |         namespace = 'AWS/ApplicationELB'
115 | 
116 |     cloudwatch().query_one({key: lb_name}, 'RequestCount', 'Sum', namespace)
117 | 
118 | .. note::
119 | 
120 |     ELB entities contain a special flag ``dns_traffic``, which indicates whether the load balancer is actively serving traffic.
121 | 
122 | Auto-Scaling Groups
123 | ===================
124 | 
125 | ZMON's agent creates an auto-scaling group entity that provides you with the number of desired instances and the number of instances in a healthy state. This enables you to monitor whether the ASG actually works and hosts spawn into a productive state.
126 | 
127 | .. code-block:: yaml
128 | 
129 |     id: asg-proxy-1[aws:123456789:eu-central-1]
130 |     type: asg
131 |     name: proxy-1
132 |     created_by: agent
133 |     desired_capacity: 2
134 |     dns_traffic: 'true'
135 |     dns_weight: 200
136 |     infrastructure_account: aws:123456789
137 |     instances:
138 |     - aws_id: i-123456
139 |       ip: 172.33.109.201
140 |     - aws_id: i-654321
141 |       ip: 172.33.109.202
142 |     max_size: 4
143 |     min_size: 2
144 |     region: eu-central-1
145 | 
146 | RDS Instances
147 | =============
148 | 
149 | ZMON AWS agent will detect RDS instances and store them as entities with type ``database``.
150 | 
151 | .. code-block:: yaml
152 | 
153 |     id: rds-db-1[aws:123456789]
154 |     type: database
155 |     name: db-1
156 |     created_by: agent
157 |     engine: postgres
158 |     host: db-1.rds.amazonaws.com
159 |     infrastructure_account: aws:123456789
160 |     port: 5432
161 |     region: eu-west-1
162 | 
163 | .. code-block:: python
164 | 
165 |     cloudwatch().query_one({'DBInstanceIdentifier': entity['name']}, 'DatabaseConnections', 'Sum', 'AWS/RDS')
166 | 
167 | ElastiCache Redis
168 | =================
169 | 
170 | ElastiCache instances are stored as entities with type ``elc``.
171 | 
172 | .. code-block:: yaml
173 | 
174 |     id: elc-redis-1[aws:123456789:eu-central-1]
175 |     type: elc
176 |     cluster_id: all-redis-001
177 |     cluster_num_nodes: 1
178 |     created_by: agent
179 |     engine: redis
180 |     host: redis-1.cache.amazonaws.com
181 |     infrastructure_account: aws:123456789
182 |     port: 6379
183 |     region: eu-central-1
184 | 
185 | IAM/ACM Certificates
186 | ====================
187 | 
188 | ZMON AWS agent will also sync IAM/ACM SSL certificates as entities with type ``certificate``. Certificate entities can be used, for instance, to create an alert when a certificate is about to expire.
189 | 
190 | .. code-block:: yaml
191 | 
192 |     id: cert-acm-example.org[aws:123456789:eu-central-1]
193 |     type: certificate
194 |     name: '*.example.org'
195 |     status: ISSUED
196 |     arn: arn:aws:acm:eu-central-1:123456789:certificate/123456-123456-123456-123456
197 |     certificate_type: acm
198 |     created_by: agent
199 |     expiration: '2017-07-28T12:00:00+00:00'
200 |     infrastructure_account: aws:123456789
201 |     region: eu-central-1
202 | 
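A certificate expiry alert boils down to comparing the entity's ``expiration`` timestamp with the current time. The following is a minimal sketch, assuming the ``expiration`` property is an ISO 8601 string as in the entity above. The helper name ``days_until_expiry`` is hypothetical, and the code targets Python 3.7+ (``datetime.fromisoformat``), so it may need adapting to the Python runtime of your worker.

```python
from datetime import datetime, timezone


def days_until_expiry(entity, now=None):
    """Return the number of whole days until the certificate entity expires."""
    now = now or datetime.now(timezone.utc)
    # The expiration string carries a UTC offset, so the result is timezone-aware.
    expires = datetime.fromisoformat(entity['expiration'])
    return (expires - now).days


# Subset of the certificate entity shown above.
certificate = {
    'type': 'certificate',
    'name': '*.example.org',
    'expiration': '2017-07-28T12:00:00+00:00',
}

# Fixed reference time so the example is reproducible.
reference = datetime(2017, 7, 1, 12, 0, tzinfo=timezone.utc)
remaining = days_until_expiry(certificate, now=reference)
print(remaining)
```

A check returning this number could be paired with an alert condition such as ``< 30``.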
203 | 
204 | Application API Monitoring
205 | --------------------------
206 | 
207 | When monitoring an application, you'll usually want to check the number of received requests, latency patterns, and the number of returned status codes.
208 | These data points form a pretty clear picture of what is going on with the application.
209 | 
210 | Additional metrics will help you find problems as well as opportunities for improvement.
211 | Assuming that your applications provide HTTP APIs hidden behind ELBs, you can use ZMON to gather this data from CloudWatch.
212 | 
213 | For more detailed data, ZMON offers options for different languages and frameworks.
214 | One is zmon-actuator_ for Spring Boot.
215 | ZMON gathers the data by querying a JSON endpoint ``/metrics`` adhering to the DropWizard metrics layout with some convention on the naming of timers.
216 | Basically, there is one timer per API path and status code.
217 | 
218 | We also recommend checking out Friboo_ for working with Clojure, the Python/Flask framework Connexion_ or Markscheider_ for Play/Scala development.
219 | 
220 | The :ref:`http(url=...).actuator_metrics() <http-actuator>` will parse the data into a Python dict that allows you to easily monitor and alert on changes in API behavior.
221 | 
222 | This also drives ZMON's cloud UI.
223 | 
224 | .. image:: /images/cloud1.png
225 | 
226 | .. _appliance: https://github.com/zalando-zmon/zmon-appliance
227 | .. _CloudWatch ELB metrics: http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/elb-metricscollected.html
228 | .. _CloudWatch: https://aws.amazon.com/cloudwatch/
229 | .. _Connexion: https://github.com/zalando/connexion
230 | .. _Friboo: https://github.com/zalando-stups/friboo
231 | .. _here's a sample check: https://github.com/zalando/zmon/tree/master/examples/check-definitions/11-ec2-diskspace.yaml
232 | .. _Prometheus Node exporter: https://github.com/prometheus/node_exporter
233 | .. _ZMON scheduler: https://github.com/zalando-zmon/zmon-scheduler
234 | .. _ZMON worker: https://github.com/zalando-zmon/zmon-worker
235 | .. _zmon-actuator: https://github.com/zalando-zmon/zmon-actuator
236 | .. _zmon-aws-agent: https://github.com/zalando-zmon/zmon-aws-agent
237 | .. _markscheider: https://github.com/zalando-incubator/markscheider
238 | 


--------------------------------------------------------------------------------
/docs/user/alert-definitions.rst:
--------------------------------------------------------------------------------
  1 | .. _alert-definitions:
  2 | 
  3 | *****************
  4 | Alert Definitions
  5 | *****************
  6 | 
  7 | Alert definitions specify when (condition, time period) and who (team) to notify for a desired monitoring event.
  8 | Alert definitions can be defined in the ZMON web frontend and via the :ref:`ZMON CLI <zmon-cli>`.
  9 | 
 10 | The following fields exist for alert definitions:
 11 | 
 12 |     name
 13 |         The alert's display name on the dashboard.
 14 |         This field can contain curly-brace variables like ``{mycapture}`` that are replaced by the capture's value when the alert is triggered.
 15 |         It's also possible to format decimal precision (e.g. "My alert ``{mycapture:.2f}``" would show as "My alert ``123.46``" if mycapture is ``123.456789``).
 16 |         To include a comma separated list of entities as part of the alert's name, just use the special placeholder ``{entities}``.
 17 | 
 18 |     description
 19 |         Meaningful text for people trying to handle the alert, e.g. incident support.
 20 | 
 21 |     priority
 22 |         The alert's dashboard priority. This defines color and sort order on the dashboard.
 23 | 
 24 |     condition
 25 |         Valid Python expression to return true when alert should be triggered.
 26 | 
 27 |     parameters
 28 |         You may apply parameters to your alert condition using variables. More details :ref:`here <alert-definition-parameters>`
 29 | 
 30 |     entities filter
 31 |         Additional filter to apply the alert definition only to a subset of entities.
 32 | 
 33 |     notifications
 34 |         List of :ref:`notification <notifications>` commands, e.g. to send out emails.
 35 | 
 36 |     time_period
 37 |         Notification time period.
 38 | 
 39 |     team
 40 |         Team dashboard to show alert on.
 41 | 
 42 |     responsible_team
 43 |         Additional team field to allow delegating alert monitoring to other teams.
 44 |         The responsible team's name will be shown on the dashboard.
 45 | 
 46 |     status
 47 |         Alerts will only be triggered if status is "ACTIVE".
 48 | 
 49 |     template
 50 |         A template is an alert definition that is not evaluated and can only be used for extension. More details :ref:`here <alert-definition-inheritance>`
 51 | 
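The placeholder syntax described for the ``name`` field follows Python's format-string conventions. The sketch below shows how such a template could be rendered with ``str.format``; the capture value, the variable names, and the way entities are joined are assumptions for illustration, not ZMON's actual rendering code.

```python
# Alert name template using a formatted capture and the special {entities} placeholder.
name_template = 'Low Orders/m: {percentage_wow:.1f}% of last weeks on {entities}'

# Illustrative capture values and affected entities (assumed for this example).
captures = {'percentage_wow': 84.9185496584}
entities = ['GLOBAL']

# ":.1f" rounds the capture to one decimal place; entities are comma separated.
rendered = name_template.format(entities=', '.join(entities), **captures)
print(rendered)
```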
 52 | .. _alert-condition:
 53 | 
 54 | Condition
 55 | ---------
 56 | Simple expressions can start directly with an operator. To trigger an alert if the check result value is larger than zero:
 57 | 
 58 | .. code-block:: python
 59 | 
 60 |     > 0
 61 | 
 62 | You can use the ``value`` variable to create more complex conditions:
 63 | 
 64 | .. code-block:: python
 65 | 
 66 |     value >= 10 and value <= 100
 67 | 
 68 | Some more examples of valid conditions:
 69 | 
 70 | .. code-block:: python
 71 | 
 72 |     == 'OK'
 73 |     != False
 74 |     value in ('banana', 'apple')
 75 | 
 76 | If the value already is a dictionary (hash map), we can apply all the Python magic to it:
 77 | 
 78 | .. code-block:: python
 79 | 
 80 |     ['mykey'] > 100                                       # check a specific dict value
 81 |     'error-message' in value                              # trigger alert if key is present
 82 |     not empty([ k for k, v in value.items() if v > 100 ]) # trigger alert if some dict value is > 100
 83 | 
 84 | .. _captures:
 85 | 
 86 | Captures
 87 | --------
 88 | 
 89 | You can capture intermediate results in alert conditions by using the
 90 | ``capture`` function. This allows easier debugging of complex alert
 91 | conditions.
 92 | 
 93 | .. code-block:: python
 94 | 
 95 |     capture(value["a"]/value["b"]) > 0
 96 |     capture(myval=value["a"]/value["b"]) > 0
 97 |     any([capture(foo=FOO) > 10, capture(bar=BAR) > 10])
 98 | 
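To make the behaviour of ``capture`` concrete, here is a minimal stand-in that returns its argument unchanged while recording it as a side effect. This is a sketch under stated assumptions, not ZMON's implementation: the ``captures`` dict and the ``capture_N`` naming of unnamed captures are invented for illustration.

```python
# Collected captures, keyed by name (illustrative storage, not ZMON's own).
captures = {}


def capture(value=None, **named):
    """Record a value (optionally under a name) and return it unchanged."""
    if named:
        # capture(myval=expr): store the value under the given keyword name.
        name, value = next(iter(named.items()))
        captures[name] = value
    else:
        # capture(expr): store the value under a generated name.
        captures['capture_{}'.format(len(captures) + 1)] = value
    return value


value = {'a': 10, 'b': 4}
triggered = capture(myval=value['a'] / value['b']) > 0
print(triggered, captures)
```

Because the function returns the captured value, it can be wrapped around any sub-expression without changing the result of the condition.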
 99 | Please refer to Recipes section in :ref:`Python Tutorial <python-tutorial>` for some Python tricks you may use.
100 | 
101 | Named captures can be used to customize the alert display on the :term:`dashboard` by using template substitution in the alert name.
102 | 
103 | If you call your capture *dashboard*, it will be shown on the dashboard next to the entity name instead of the entity value.
104 | For example, if you have a host-based alert that fails on z-host1 and z-host2, you would normally see something like this:
105 | 
106 | ALERT TITLE (N)
107 | z-host1 (value1), z-host2 (value2)
108 | 
109 | Once you introduce a capture called *dashboard*, you will get something like this:
110 | 
111 | ALERT TITLE (N)
112 | z-host1 (capturevalue1), z-host2 (capturevalue2)
113 | 
114 | where capturevalue1 is the value of the "dashboard" capture evaluated against z-host1.
115 | 
116 | Example alert condition (based on PF/System check for diskspace)
117 | 
118 | .. code-block:: python
119 | 
120 |     "ERROR" not in value
121 |     and
122 |     capture(dashboard=(lambda d: '{}:{}'.format(d.keys()[0], d[d.keys()[0]]['percentage_space_used']) if d else d)(dict((k, v) for k,v in value.iteritems() if v.get('percentage_space_used', 0) >= 90)))
123 | 
124 | Entity (Exclude) Filter
125 | -----------------------
126 | 
127 | The :ref:`check definition <check-definitions>` already defines on what entities the checks should run.
128 | Usually the check definition's ``entities`` are broader than you want.
129 | A diskspace check might be defined for all hosts, but you want to trigger alerts only for hosts you are interested in.
130 | The alert definition's ``entities`` field allows you to filter entities by their attributes.
131 | 
132 | See :ref:`entities` for details on supported entities and their attributes.
133 | 
134 | Note: The entity name can be included in the alert message by using the special placeholder ``{entities}`` in the alert name.
135 | 
136 | Notifications
137 | -------------
138 | 
139 | ZMON notifications let you know when you have a new alert without checking the web UI.
140 | This section will explain how to use the different options available to notify about changes in alert states.
141 | We support E-Mail, HipChat, Slack and one SMS provider that we have been using.
142 | 
143 | The notifications field is a list of function calls (see below for examples), calling one of the following methods of notification:
144 | 
145 | .. py:function:: send_email(email*, [subject, message, repeat])
146 | .. py:function:: send_sms(number*, [message, repeat])
147 | .. py:function:: send_push([message, repeat, url, key])
148 | .. py:function:: send_slack([channel, message, repeat, token])
149 | .. py:function:: send_hipchat([room, message, color='red', repeat, token, message_format='html', notify=False])
150 | 
151 | If the alert has the top priority and should be handled immediately, you can specify the repeat interval for each notification.
152 | In this case, you will be notified periodically, according to the specified interval, while the alert persists.
153 | The interval is specified in seconds.
154 | 
155 | To receive push notifications, you need one of the ZMON mobile apps (configured for your deployment), and you must subscribe to the relevant alert IDs.
156 | 
157 | In addition, you may use :ref:`notification-groups` to configure groups of people with associated **emails** and/or **phone numbers** and use these groups in notifications like this:
158 | 
159 | Example JSON email and SMS configuration using groups:
160 | 
161 | .. code-block:: yaml
162 | 
163 |    [
164 |       "send_sms('active:2nd-database')",
165 |       "send_email('group:2nd-database')"
166 |    ]
167 | 
168 | In the above example, an SMS is sent to the **active** member of the **2nd-database** group and an email is sent to **all members** of the group.
169 | 
170 | Example JSON email configuration:
171 | 
172 | .. code-block:: yaml
173 | 
174 |    [
175 |       "send_mail('a@example.org', 'b@example.org')",
176 |       "send_mail('a@example.com', 'b@example.com', subject='Critical Alert please do something!')",
177 |       "send_mail('c@example.com', repeat=60)"
178 |    ]
179 | 
180 | Example JSON Slack configuration:
181 | 
182 | .. code-block:: yaml
183 | 
184 |    [
185 |       "send_slack()",
186 |       "send_slack(channel='#incidents')",
187 |       "send_slack(channel='#incidents', token='your-token')"
188 |    ]
189 | 
190 | Example JSON HipChat configuration:
191 | 
192 | .. code-block:: yaml
193 | 
194 |    [
195 |       "send_hipchat()",
196 |       "send_hipchat(room='#incidents', color='red')",
197 |       "send_hipchat(room='#incidents', token='your-token')",
198 |       "send_hipchat(room='#incidents', token='your-token', notify=True)",
199 |       "send_hipchat(room='#incidents', token='your-token', notify=True, message='@here Plz check it', message_format='text')"
200 |    ]
201 | 
202 | Example JSON Push configuration:
203 | 
204 | .. code-block:: yaml
205 | 
206 |    [
207 |       "send_push()"
208 |    ]
209 | 
210 | Example JSON SMS configuration:
211 | 
212 | .. code-block:: yaml
213 | 
214 |    [
215 |       "send_sms('0049123555555', '0123111111')",
216 |       "send_sms('0049123555555', '0123111111', message='Critical Alert please do something!')",
217 |       "send_sms('0029123555556', repeat=300)"
218 |    ]
219 | 
220 | Example email:
221 | 
222 | ::
223 | 
224 |    From: ZMON <zmon@example.com>
225 |    Date: 2014-05-28 18:37 GMT+01:00
226 |    Subject: NEW ALERT: Low Orders/m: 84.9% of last weeks on GLOBAL
227 |    To: Undisclosed Recipients <zmon@example.com>
228 | 
229 |    New alert on GLOBAL: Low Orders/m: {percentage_wow:.1f}% of last weeks
230 | 
231 | 
232 |    Current value: {'2w_ago': 188.8, 'now': 180.8, '1w_ago': 186.6, '3w_ago': 196.4, '4w_ago': 208.8}
233 | 
234 | 
235 |    Captures:
236 | 
237 |    percentage_wow: 184.9185496584
238 | 
239 |    last_weeks_avg: 195.15
240 | 
241 | 
242 | 
243 |    Alert Definition
244 |    Name (ID):     Low Orders/m: {percentage_wow:.1f}% of last weeks (ID: 190)
245 |    Priority:      1
246 |    Check ID:      203
247 |    Condition      capture(percentage_wow=100. * value['now']/capture(last_weeks_avg=(value['1w_ago'] + value['2w_ago'] + value['3w_ago'] + value['4w_ago'])/4. )) < 85
248 |    Team:          Platform/Software
249 |    Resp. Team:    Platform/Software
250 |    Notifications: [u"send_mail('example@example.com')"]
251 | 
252 |    Entity
253 | 
254 |    id: GLOBAL
255 | 
256 |    type: GLOBAL
257 | 
258 |    percentage_wow: 184.9185496584
259 | 
260 |    last_weeks_avg: 195.15
261 | 
262 | Example SMS:
263 | 
264 | ::
265 | 
266 |    Message details:
267 |       Type: Text Message
268 |       From: zmon2
269 |    Message text:
270 |       NEW ALERT: DB instances test alert on all shards on customer-integration-master
271 | 
272 | 
273 | .. _time-periods:
274 | 
275 | Time periods
276 | ------------
277 | 
278 | ZMON 2.0 allows specifying time periods (in UTC) in alert definitions.
279 | When specified, the user is notified about the alert only when it occurs during the given period.
280 | The examples below cover the most common use cases of time period definitions.
281 | 
282 | To specify a time period from Monday through Friday, 9:00 to 17:00, use a
283 | period such as
284 | 
285 |         wd {Mon-Fri} hr {9-16}
286 | 
287 | When specifying a range by using -, it is best to think of - as meaning through.
288 | It is 9:00 through 16:00, which is just before 17:00 (16:59:59).
289 | 
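The inclusive "through" semantics can be illustrated with a small stand-alone check. This is a simplified sketch using only the hour component, not the period parser ZMON uses; the function name ``in_hour_range`` is hypothetical.

```python
from datetime import datetime


def in_hour_range(moment, first_hour, last_hour):
    """Return True if the hour of ``moment`` lies in first_hour..last_hour, inclusive."""
    return first_hour <= moment.hour <= last_hour


# hr {9-16} covers 09:00:00 through 16:59:59.
print(in_hour_range(datetime(2018, 1, 8, 16, 59, 59), 9, 16))  # still inside the period
print(in_hour_range(datetime(2018, 1, 8, 17, 0, 0), 9, 16))    # first moment outside it
```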
290 | To specify a time period from Monday through Friday, 9:00 to 17:00 on Monday, Wednesday, and Friday, and 9:00 to 15:00 on Tuesday and Thursday, use a period such as
291 | 
292 |         wd {Mon Wed Fri} hr {9-16}, wd{Tue Thu} hr {9-14}
293 | 
294 | To specify a time period that extends Mon-Fri 9-16, but alternates weeks in a month, use a period such as
295 | 
296 |         wk {1 3 5} wd {Mon Wed Fri} hr {9-16}
297 | 
298 | A period that specifies winter in the northern hemisphere:
299 | 
300 |         mo {Nov-Feb}
301 | 
302 | This is equivalent to the previous example:
303 | 
304 |         mo {Jan-Feb Nov-Dec}
305 | 
306 | As is
307 | 
308 |         mo {jan feb nov dec}
309 | 
310 | And this is too:
311 | 
312 |         mo {Jan Feb}, mo {Nov Dec}
313 | 
314 | To specify a period that describes every other half-hour, use something like:
315 | 
316 |         minute { 0-29 }
317 | 
318 | To specify the morning, use
319 | 
320 |         hour { 0-11 }
321 | 
322 | Remember, 11 is not 11:00:00, but rather 11:00:00 - 11:59:59.
323 | 
324 | 5 second blocks:
325 | 
326 |         sec {0-4 10-14 20-24 30-34 40-44 50-54}
327 | 
328 | To specify every first half-hour on alternating week days, and the
329 | second half-hour the rest of the week, use the period
330 | 
331 |         wd {1 3 5 7} min {0-29}, wd {2 4 6} min {30-59}
332 | 
333 | For more examples and a syntax reference, please refer to this `documentation <http://search.cpan.org/~pryan/Period-1.20/Period.pm#PERIOD_EXAMPLES>`_.
334 | Note that suffixes like `am` or `pm` for hours are **not** supported, only
335 | integers between 0 and 23. When in doubt, try calling ``in_period`` from Python with your period definition,
336 | like
337 | 
338 | .. code-block:: python
339 | 
340 |     from timeperiod import in_period
341 |     in_period('hr { 0 - 23 }')
342 | 
343 | This should not raise an exception. The
344 | timeperiod module in use is `timeperiod2 <https://pypi.python.org/pypi/timeperiod2>`_.
345 | The `in_period` function accepts a second parameter, which is a
346 | `datetime <https://docs.python.org/2/library/datetime.html#datetime-objects>`_, like
347 | 
348 | .. code-block:: python
349 | 
350 |     from datetime import datetime
351 |     from timeperiod import in_period
352 |     in_period('hr { 7 - 23 }', datetime(2018, 1, 8, 2, 15)) # check 2018-01-08 02:15:00
353 | 
354 | 
355 | .. include:: alert-definition-inheritance.rst
356 | .. include:: alert-definition-parameters.rst
357 | .. include:: downtimes.rst
358 | .. include:: comments.rst
359 | 


--------------------------------------------------------------------------------
/docs/installation/configuration.rst:
--------------------------------------------------------------------------------
  1 | ***********************
  2 | Component Configuration
  3 | ***********************
  4 | 
  5 | In this section we assume that you want to use Docker as a means of deployment.
  6 | The ZMON Docker images in Zalando's Open Source registry are exactly the ones we use ourselves, injecting all configuration via environment variables.
  7 | 
  8 | If this does not fit your needs, you can run the artifacts directly and either use environment variables or modify the example config files.
  9 | 
 10 | At this point we also assume the requirements in terms of PostgreSQL, Redis and KairosDB are available and you have the credentials at hand.
 11 | If not, see :ref:`requirements`. The minimal configuration options below are taken from the Demo's Bootstrap_ script!
 12 | 
 13 | Authentication
 14 | ==============
 15 | 
 16 | We assume that the ZMON controller is publicly accessible.
 17 | Thus both the UI and the REST API always require users to log in.
 18 | The REST API relies on tokens via the ``Authorization: Bearer <token>`` header to allow access.
 19 | For environments where you have no OAuth2 setup, you can configure pre-shared keys for API access.
 20 | 
 21 | .. note::
 22 | 
 23 |    Feel free to look at Zalando's `Plan-B <http://planb.readthedocs.io/en/latest/>`_, which is a freely available OAuth2 provider we use for our platform to secure service to service communication.
 24 | 
 25 | You can create a pre-shared token like this and then add it to the controller configuration:
 26 | 
 27 | .. code-block:: bash
 28 | 
 29 |   SCHEDULER_TOKEN=$(makepasswd --string=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --chars 32)
 30 | 
 31 | .. warning::
 32 | 
 33 |     Because of the way environment variables are matched, the token must be ALL UPPERCASE.
 34 | 
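    | If ``makepasswd`` is not installed, a token over the same alphabet can be drawn from ``/dev/urandom``. This is only a sketch using standard tools, not part of the Demo's Bootstrap script:
    | 
    | .. code-block:: bash
    | 
    |   # 32 characters from A-Z and 0-9, so the token is already ALL UPPERCASE
    |   SCHEDULER_TOKEN=$(LC_ALL=C tr -dc 'A-Z0-9' < /dev/urandom | head -c 32)
    | 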
 35 | The scheduler and the worker both call the controller's REST API at times, so you need to configure tokens for them.
 36 | For now we assume that the scheduler, KairosDB, eventlog-service and metric-cache (if deployed) are private.
 37 | These services are accessed only by the worker and the controller and do not need to be public.
 38 | The same is true for Redis, PostgreSQL and Cassandra.
 39 | In general, however, we advise you to set up proper credentials and roles where possible.
 40 | 
 41 | Running Docker
 42 | ==============
 43 | 
 44 | First we need to figure out which tags to run.
 45 | The bash snippet below helps you retrieve and set the latest available tags.
 46 | 
 47 | .. code-block:: bash
 48 | 
 49 |   function get_latest () {
 50 |       name=$1
 51 |       # REST API returns tags sorted by time
 52 |       tag=$(curl --silent https://registry.opensource.zalan.do/teams/stups/artifacts/$name/tags | jq -r '.[].name' | tail -n 1)
 53 |       echo "$name:$tag"
 54 |   }
 55 | 
 56 |   echo "Retrieving latest versions.."
 57 |   REPO=registry.opensource.zalan.do/stups
 58 |   POSTGRES_IMAGE=$REPO/postgres:9.4.5-1
 59 |   REDIS_IMAGE=$REPO/redis:3.2.0-alpine
 60 |   CASSANDRA_IMAGE=$REPO/cassandra:2.1.5-1
 61 |   ZMON_KAIROSDB_IMAGE=$REPO/$(get_latest kairosdb)
 62 |   ZMON_EVENTLOG_SERVICE_IMAGE=$REPO/$(get_latest zmon-eventlog-service)
 63 |   ZMON_CONTROLLER_IMAGE=$REPO/$(get_latest zmon-controller)
 64 |   ZMON_SCHEDULER_IMAGE=$REPO/$(get_latest zmon-scheduler)
 65 |   ZMON_WORKER_IMAGE=$REPO/$(get_latest zmon-worker)
 66 |   ZMON_METRIC_CACHE=$REPO/$(get_latest zmon-metric-cache)
 67 | 
 68 | To run the selected images use Docker's run command together with the options explained below.
 69 | We use the following wrapper for this:
 70 | 
 71 | .. code-block:: bash
 72 | 
 73 |   function run_docker () {
 74 |       name=$1
 75 |       shift 1
 76 |       echo "Starting Docker container ${name}.."
 77 |       # ignore non-existing containers
 78 |       docker kill "$name" &> /dev/null || true
 79 |       docker rm -f "$name" &> /dev/null || true
 80 |       docker run --restart "on-failure:10" --net zmon-demo -d --name "$name" "$@"
 81 |   }
 82 | 
 83 |   run_docker zmon-controller \
 84 |               # -e ......... \
 85 |               # -e ......... \
 86 |              $ZMON_CONTROLLER_IMAGE
 87 | 
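    | The wrapper attaches every container to a Docker network named ``zmon-demo``, so that network has to exist before the first call. A minimal sketch, assuming the network is not created elsewhere; the scheduler options shown are the ones from the Scheduler section of this page:
    | 
    | .. code-block:: bash
    | 
    |   # create the network once; ignore the error if it already exists
    |   docker network create zmon-demo &> /dev/null || true
    | 
    |   run_docker zmon-scheduler \
    |       -e SCHEDULER_REDIS_HOST=zmon-redis \
    |       -e SCHEDULER_REDIS_PORT=6379 \
    |       $ZMON_SCHEDULER_IMAGE
    | 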
 88 | Controller
 89 | ==========
 90 | 
 91 | Authentication
 92 | ^^^^^^^^^^^^^^
 93 | 
 94 | Configure your GitHub application:
 95 | 
 96 | .. code-block:: bash
 97 | 
 98 |     -e SPRING_PROFILES_ACTIVE=github \
 99 |     -e ZMON_OAUTH2_SSO_CLIENT_ID=64210244ddd8378699d6 \
100 |     -e ZMON_OAUTH2_SSO_CLIENT_SECRET=48794a58705d1ba66ec9b0f06a3a44ecb273c048 \
101 | 
102 | Make everyone admin for now:
103 | 
104 | .. code-block:: bash
105 | 
106 |     -e ZMON_AUTHORITIES_SIMPLE_ADMINS=* \
107 | 
108 | 
109 | Logout URL
110 | ^^^^^^^^^^
111 | 
112 | Set the logout URL to enable the pop-up dialog described in :doc:`/user/tv-login`.
113 | When switching to TV mode, the dialog opens the logout URL in a new tab to terminate the user's session.
114 | 
115 | .. code-block:: bash
116 | 
117 |     -e ZMON_LOGOUT_URL="https://example.com/logout"
118 | 
119 | Dependencies
120 | ^^^^^^^^^^^^
121 | 
122 | Configure PostgreSQL access:
123 | 
124 | .. code-block:: bash
125 | 
126 |     -e POSTGRES_URL=jdbc:postgresql://$PGHOST:5432/local_zmon_db \
127 |     -e POSTGRES_PASSWORD=$PGPASSWORD \
128 | 
129 | Set up the Redis connection:
130 | 
131 | .. code-block:: bash
132 | 
133 |     -e REDIS_HOST=zmon-redis \
134 |     -e REDIS_PORT=6379 \
135 | 
136 | Set CORS allowed origins:
137 | 
138 | .. code-block:: bash
139 | 
140 |     -e ENDPOINTS_CORS_ALLOWED_ORIGINS=https://demo.zmon.io \
141 | 
142 | Set up URLs for other services:
143 | 
144 | .. code-block:: bash
145 | 
146 |     -e ZMON_EVENTLOG_URL=http://zmon-eventlog-service:8081/ \
147 |     -e ZMON_KAIROSDB_URL=http://zmon-kairosdb:8083/ \
148 |     -e ZMON_METRICCACHE_URL=http://zmon-metric-cache:8086/ \
149 |     -e ZMON_SCHEDULER_URL=http://zmon-scheduler:8085/ \
150 | 
151 | Last but not least, configure a pre-shared token to allow the scheduler and worker to access the REST API. Remember that tokens need to be all uppercase here.
152 | 
153 | .. code-block:: bash
154 | 
155 |     -e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_UID=zmon-scheduler \
156 |     -e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_EXPIRES_AT=1758021422 \
157 |     -e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_AUTHORITY=user
158 | 
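    | The worker uses a token as well (``$WORKER_TOKEN`` in the Worker section). The following sketch registers one in the same way. The UID ``zmon-worker`` and the helper variable ``EXPIRES_AT`` are assumptions for illustration, and the expiry is assumed to be a Unix timestamp in seconds, as the example value above suggests:
    | 
    | .. code-block:: bash
    | 
    |   WORKER_TOKEN=$(makepasswd --string=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --chars 32)
    |   # assumed format: Unix epoch seconds, here one year from now (GNU date syntax)
    |   EXPIRES_AT=$(date -d '+1 year' +%s)
    | 
    |   # additional options for the zmon-controller container
    |   -e PRESHARED_TOKENS_${WORKER_TOKEN}_UID=zmon-worker \
    |   -e PRESHARED_TOKENS_${WORKER_TOKEN}_EXPIRES_AT=$EXPIRES_AT \
    |   -e PRESHARED_TOKENS_${WORKER_TOKEN}_AUTHORITY=user \
    | 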
159 | Firebase and Webpush
160 | ^^^^^^^^^^^^^^^^^^^^
161 | 
162 | Enable desktop push notification UI with the following options:
163 | 
164 | .. code-block:: bash
165 | 
166 |     -e ZMON_ENABLE_FIREBASE=true \
167 |     -e ZMON_NOTIFICATIONSERVICE_URL=http://zmon-notification-service:8087/ \
168 |     -e ZMON_FIREBASE_API_KEY="AIzaSyBM1ktKS5u_d2jxWPHVU7Xk39s-PG5gy7c" \
169 |     -e ZMON_FIREBASE_AUTH_DOMAIN="zmon-demo.firebaseapp.com" \
170 |     -e ZMON_FIREBASE_DATABASE_URL="https://zmon-demo.firebaseio.com" \
171 |     -e ZMON_FIREBASE_STORAGE_BUCKET="zmon-demo.appspot.com" \
172 |     -e ZMON_FIREBASE_MESSAGING_SENDER_ID="280881042812" \
173 | 
174 | This feature requires additional configuration for the worker and a running notification service.
175 | 
176 | Scheduler
177 | =========
178 | 
179 | Specify the Redis server you want to use:
180 | 
181 | .. code-block:: bash
182 | 
183 |    -e SCHEDULER_REDIS_HOST=zmon-redis \
184 |    -e SCHEDULER_REDIS_PORT=6379 \
185 | 
186 | Set up access to the controller and entity service (both provided by the controller).
187 | Note the reuse of the pre-shared key defined above:
188 | 
189 | .. code-block:: bash
190 | 
191 |    -e SCHEDULER_OAUTH2_STATIC_TOKEN=$SCHEDULER_TOKEN \
192 |    -e SCHEDULER_URLS_WITHOUT_REST=true \
193 |    -e SCHEDULER_ENTITY_SERVICE_URL=http://zmon-controller:8080/ \
194 |    -e SCHEDULER_CONTROLLER_URL=http://zmon-controller:8080/ \
195 | 
196 | If you need separate queues or different levels of parallelism, e.g. to limit the number of queries run against MySQL/PostgreSQL databases, use the following as an example:
197 | 
198 | .. code-block:: bash
199 | 
200 |     -e SPRING_APPLICATION_JSON='{"scheduler":{"queue_property_mapping":{"zmon:queue:mysql":[{"type":"mysql"}]}}}'
201 | 
202 | This will route checks against entities of type "mysql" to another queue.
203 | 
204 | Worker
205 | ======
206 | 
207 | The worker configuration is split into essential options, such as Redis and KairosDB, and plugin configuration, e.g. PostgreSQL credentials.
208 | 
209 | Essential Options
210 | ^^^^^^^^^^^^^^^^^
211 | 
212 | Configure Redis Access:
213 | 
214 | .. code-block:: bash
215 | 
216 |   -e WORKER_REDIS_SERVERS=zmon-redis:6379 \
217 | 
218 | Configure parallelism and throughput:
219 | 
220 | .. code-block:: bash
221 | 
222 |   -e WORKER_ZMON_QUEUES=zmon:queue:default/25,zmon:queue:mysql/3
223 | 
224 | The number after each queue name specifies how many worker processes poll that queue and execute its tasks.
225 | You can list multiple queues, separated by commas.
226 | 
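    | As far as the examples on this page suggest, the queue names listed here have to match the queues the scheduler routes checks to. Putting the two options from this page side by side (each one belongs to a different container):
    | 
    | .. code-block:: bash
    | 
    |   # zmon-scheduler: send checks on entities of type "mysql" to zmon:queue:mysql
    |   -e SPRING_APPLICATION_JSON='{"scheduler":{"queue_property_mapping":{"zmon:queue:mysql":[{"type":"mysql"}]}}}'
    | 
    |   # zmon-worker: 25 processes on the default queue, only 3 on the mysql queue
    |   -e WORKER_ZMON_QUEUES=zmon:queue:default/25,zmon:queue:mysql/3
    | 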
227 | Configure KairosDB:
228 | 
229 | .. code-block:: bash
230 | 
231 |   -e WORKER_KAIROSDB_HOST=zmon-kairosdb \
232 | 
233 | Configure EventLog service:
234 | 
235 | .. code-block:: bash
236 | 
237 |   -e WORKER_EVENTLOG_HOST=zmon-eventlog-service \
238 |   -e WORKER_EVENTLOG_PORT=8081 \
239 | 
240 | Configure the worker token to access the controller API (this relies on the Python tokens library):
241 | 
242 | .. code-block:: bash
243 | 
244 |   -e  OAUTH2_ACCESS_TOKENS=uid=$WORKER_TOKEN \
245 | 
246 | Configure Worker named tokens to access external APIs:
247 | 
248 | .. code-block:: bash
249 | 
250 |   -e WORKER_PLUGIN_HTTP_OAUTH2_TOKENS=token_name1=scope1,scope2,scope3:token_name2=scope1,scope2
251 | 
252 | Configure Metric Cache (optional):
253 | 
254 | .. code-block:: bash
255 | 
256 |   -e WORKER_METRICCACHE_URL=http://zmon-metric-cache:8086/api/v1/rest-api-metrics/ \
257 |   -e WORKER_METRICCACHE_CHECK_ID=9 \
258 | 
259 | .. _notification-options-label:
260 | 
261 | Notification Options
262 | ^^^^^^^^^^^^^^^^^^^^
263 | 
264 | Firebase and Webpush
265 | --------------------
266 | To trigger notifications for desktop web and mobile apps, set the following parameters to point to the notification service.
267 | 
268 | ``WORKER_NOTIFICATION_SERVICE_URL``
269 |     Notification service base url
270 | 
271 | ``WORKER_NOTIFICATION_SERVICE_KEY``
272 |     (optional, if not using oauth2) A shared key configured in the notification service
273 | 
274 | 
275 | Hipchat
276 | -------
277 | ``WORKER_NOTIFICATIONS_HIPCHAT_TOKEN``
278 |     Access token for HipChat notifications.
279 | ``WORKER_NOTIFICATIONS_HIPCHAT_URL``
280 |     URL of HipChat server.
281 | 
282 | HTTP
283 | ----
284 | 
285 | This allows triggering HTTP POST calls to arbitrary services.
286 | 
287 | ``WORKER_NOTIFICATIONS_HTTP_DEFAULT_URL``
288 |     HTTP endpoint default URL.
289 | ``WORKER_NOTIFICATIONS_HTTP_WHITELIST_URLS``
290 |     List of whitelisted URL endpoints. If a URL is not in this list, an exception is raised.
291 | ``WORKER_NOTIFICATIONS_HTTP_ALLOW_ALL``
292 |     Allow any URL to be used in HTTP notification.
293 | ``WORKER_NOTIFICATIONS_HTTP_HEADERS``
294 |     Default headers to be used in HTTP requests.
295 | 
296 | Mail
297 | ----
298 | ``WORKER_NOTIFICATIONS_MAIL_HOST``
299 |     SMTP host for email notifications.
300 | ``WORKER_NOTIFICATIONS_MAIL_PORT``
301 |     SMTP port for email notifications.
302 | ``WORKER_NOTIFICATIONS_MAIL_SENDER``
303 |     Sender address for email notifications.
304 | ``WORKER_NOTIFICATIONS_MAIL_USER``
305 |     SMTP user for email notifications.
306 | ``WORKER_NOTIFICATIONS_MAIL_PASSWORD``
307 |     SMTP password for email notifications.
308 | 
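    | Like all other worker options, these are passed as environment variables. A sketch with placeholder values only; the host, port, addresses and the ``$SMTP_PASSWORD`` variable are assumptions to be replaced with your own:
    | 
    | .. code-block:: bash
    | 
    |   # placeholder values, not defaults
    |   -e WORKER_NOTIFICATIONS_MAIL_HOST=smtp.example.com \
    |   -e WORKER_NOTIFICATIONS_MAIL_PORT=587 \
    |   -e WORKER_NOTIFICATIONS_MAIL_SENDER=zmon@example.com \
    |   -e WORKER_NOTIFICATIONS_MAIL_USER=zmon \
    |   -e WORKER_NOTIFICATIONS_MAIL_PASSWORD=$SMTP_PASSWORD \
    | 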
309 | Slack
310 | -----
311 | ``WORKER_NOTIFICATIONS_SLACK_WEBHOOK``
312 |     Slack webhook for channel notifications.
313 | 
314 | Twilio
315 | ------
316 | ``WORKER_NOTIFICATION_SERVICE_URL``
317 |     URL of notification service (needs to be publicly accessible)
318 | ``WORKER_NOTIFICATION_SERVICE_KEY``
319 |     (optional, if not using oauth2) Preshared key to call notification service
320 | 
321 | Pagerduty
322 | ---------
323 | ``WORKER_NOTIFICATIONS_PAGERDUTY_SERVICEKEY``
324 |     Routing key for a Pagerduty service
325 | 
326 | 
327 | 
328 | Plug-In Options
329 | ^^^^^^^^^^^^^^^
330 | 
331 | All plug-in options have the prefix ``WORKER_PLUGIN_<plugin-name>_``. For example, to set option "bar" of the plugin "foo" to "123" via an environment variable:
332 | 
333 | .. code-block:: bash
334 | 
335 |     WORKER_PLUGIN_FOO_BAR=123
336 | 
337 | If you plan to access your PostgreSQL cluster, specify the credentials below. We suggest using a distinct user for ZMON with limited, read-only privileges.
338 | 
339 | .. code-block:: bash
340 | 
341 |    WORKER_PLUGIN_SQL_USER
342 |    WORKER_PLUGIN_SQL_PASS
343 | 
344 | If you need to access MySQL, specify the user credentials below. Again, we suggest using a user with limited privileges.
345 | 
346 | .. code-block:: bash
347 | 
348 |    WORKER_PLUGIN_MYSQL_USER
349 |    WORKER_PLUGIN_MYSQL_PASS
350 | 
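    | Combined with the ``run_docker`` wrapper from above, a worker start with SQL credentials might look like the following sketch. The shell variables ``$ZMON_SQL_USER`` and ``$ZMON_SQL_PASSWORD`` are hypothetical names for wherever you keep the credentials:
    | 
    | .. code-block:: bash
    | 
    |   run_docker zmon-worker \
    |       -e WORKER_REDIS_SERVERS=zmon-redis:6379 \
    |       -e WORKER_PLUGIN_SQL_USER=$ZMON_SQL_USER \
    |       -e WORKER_PLUGIN_SQL_PASS=$ZMON_SQL_PASSWORD \
    |       $ZMON_WORKER_IMAGE
    | 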
351 | 
352 | .. _Bootstrap: https://github.com/zalando-zmon/zmon-demo
353 | 
354 | 
355 | Notification Service
356 | ====================
357 | 
358 | Optional component that serves the mobile API, push notifications and Twilio notifications.
359 | 
360 | Authentication
361 | ^^^^^^^^^^^^^^
362 | 
363 | ``SPRING_APPLICATION_JSON``
364 |     Use this to define pre-shared keys if not using OAuth2. Specify key and max validity.
365 | 
366 |     .. code-block:: json
367 | 
368 |         {"notifications":{"shared_keys":{"<your random key>": 1504981053654}}}
369 | 
370 | 
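    | The validity in the example above has 13 digits, which suggests a Unix timestamp in milliseconds. Under that assumption, a key valid for one year could be generated as sketched below; ``NOTIFICATION_KEY`` and ``VALID_UNTIL`` are hypothetical helper variables:
    | 
    | .. code-block:: bash
    | 
    |   NOTIFICATION_KEY=$(makepasswd --chars 32)
    |   # assumed format: Unix epoch milliseconds, here 365 days from now
    |   VALID_UNTIL=$(( ($(date +%s) + 365 * 24 * 3600) * 1000 ))
    | 
    |   # notification service container
    |   -e SPRING_APPLICATION_JSON="{\"notifications\":{\"shared_keys\":{\"$NOTIFICATION_KEY\":$VALID_UNTIL}}}" \
    | 
    |   # worker container: the same key, see the worker's Notification Options
    |   -e WORKER_NOTIFICATION_SERVICE_KEY=$NOTIFICATION_KEY \
    | 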
371 | Firebase and Web Push
372 | ^^^^^^^^^^^^^^^^^^^^^
373 | 
374 | ``NOTIFICATIONS_GOOGLE_PUSH_SERVICE_API_KEY``
375 |     Private Firebase messaging server key
376 | 
377 | ``NOTIFICATIONS_ZMON_URL``
378 |     ZMON's base URL
379 | 
380 | 
381 | Twilio options
382 | ^^^^^^^^^^^^^^
383 | 
384 | ``NOTIFICATIONS_TWILIO_API_KEY``
385 |     Private API Key
386 | ``NOTIFICATIONS_TWILIO_USER``
387 |     User
388 | ``NOTIFICATIONS_TWILIO_PHONE_NUMBER``
389 |     Phone number to use
390 | ``NOTIFICATIONS_DOMAIN``
391 |     Domain under which notification service is reachable
392 | 


--------------------------------------------------------------------------------