├── README.adoc ├── alertmanager ├── Dockerfile ├── README.adoc ├── alertmanager.yaml ├── build.sh └── config.yml ├── install_everything.yml ├── node-exporter ├── Dockerfile ├── README.adoc ├── build.sh ├── dockerinfo │ ├── DockerStatus.json │ ├── README.adoc │ ├── dockerinfo.cron │ ├── dockerinfo.sh │ └── install_dockerinfo.yml ├── native │ ├── README.adoc │ ├── install_node-exporter.yml │ └── node_exporter.conf ├── node-exporter.yaml ├── setup.sh └── update_firewall.yml ├── prometheus-image ├── Dockerfile ├── build.sh └── config.yml ├── prometheus.yaml ├── setup.sh └── setup_infranodes.yml /README.adoc: -------------------------------------------------------------------------------- 1 | # Prometheus on OpenShift 2 | 3 | This repository contains definitions and tools to run Prometheus and its associated ecosystem on Red Hat OpenShift. 4 | 5 | ## Components 6 | 7 | The following components are available: 8 | 9 | * link:https://prometheus.io/docs/introduction/overview/[Prometheus] 10 | * link:https://prometheus.io/docs/instrumenting/exporters/[node-exporter] 11 | * link:https://prometheus.io/docs/alerting/alertmanager/[Alertmanager] 12 | 13 | ## Project Organization 14 | 15 | A new project called _prometheus_ will be created to contain the entire ecosystem. 16 | 17 | Execute the following command to create the project: 18 | 19 | [source,bash] 20 | ---- 21 | oc new-project prometheus --display-name="Prometheus Monitoring" 22 | ---- 23 | 24 | Make sure that there is not a default node selector on the project: 25 | 26 | [source,bash] 27 | ---- 28 | oc annotate namespace prometheus openshift.io/node-selector="" 29 | ---- 30 | 31 | ## Deploy Prometheus 32 | 33 | Starting with OpenShift 3.6 the OpenShift routers expose a metrics endpoint on port 1936. For Prometheus to be able to monitor the routers this port needs to be open. 34 | 35 | Additionally Prometheus does not work with remote volumes (NFS, EBS, ...) but needs local disk storage as well. This means we need to create a directory on (one of) the infranodes. The Prometheus template includes a Node Selector `prometheus-host=true` - so we need to set the correct label on the infranode(s) as well. 36 | 37 | Run the following Ansible playbook to configure the infranodes: 38 | 39 | [source,bash] 40 | ---- 41 | ansible-playbook -i /etc/ansible/hosts ./setup_infranodes.yml 42 | ---- 43 | 44 | The router also requires basic authentication to be allowed to scrape the metrics. Find the router password by executing the following command: 45 | 46 | [source,bash] 47 | ---- 48 | oc set env dc router -n default --list|grep STATS_PASSWORD|awk -F"=" '{print $2}' 49 | ---- 50 | 51 | An OpenShift template has been provided to streamline the deployment to OpenShift. 
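Before instantiating the template you can optionally verify that the retrieved credentials work against the router's metrics endpoint. The sketch below assumes the router's default stats user (taken from the `STATS_USERNAME` environment variable) and uses `infranode1.example.com` as a stand-in for one of your infranodes; substitute the values for your environment:

[source,bash]
----
# Retrieve the router stats credentials from the router DeploymentConfig
ROUTER_USER=$(oc set env dc router -n default --list | grep STATS_USERNAME | awk -F"=" '{print $2}')
ROUTER_PASSWORD=$(oc set env dc router -n default --list | grep STATS_PASSWORD | awk -F"=" '{print $2}')

# The endpoint should return Prometheus-formatted metrics (e.g. haproxy_* series)
curl -s -u ${ROUTER_USER}:${ROUTER_PASSWORD} http://infranode1.example.com:1936/metrics | head
----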
52 | 53 | Execute the following command to instantiate the Prometheus template using the previously retrieved router password as a parameter: 54 | 55 | [source,bash] 56 | ---- 57 | oc new-app -f prometheus.yaml --param ROUTER_PASSWORD=<password> 58 | ---- 59 | 60 | Since Prometheus needs to use a local disk to write its metrics, add the `privileged` SCC to the *prometheus* service account: 61 | 62 | [source,bash] 63 | ---- 64 | oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:prometheus 65 | ---- 66 | 67 | Make sure your Prometheus pod is running (on an Infranode): 68 | 69 | [source,bash] 70 | ---- 71 | oc get pod -o wide 72 | ---- 73 | 74 | ## Next Steps 75 | 76 | Please refer to the following to enhance the functionality of Prometheus: 77 | 78 | * link:alertmanager[Alertmanager] 79 | * link:node-exporter[Node exporter] 80 | * link:https://github.com/wkulhanek/docker-openshift-grafana[Grafana] 81 | 82 | ## Cleanup 83 | 84 | Delete the project and the cluster-reader binding (which gets created by the template but doesn't get deleted as part of the project): 85 | 86 | [source,bash] 87 | ---- 88 | oc delete project prometheus 89 | oc delete clusterrolebinding prometheus-cluster-reader 90 | oc adm policy remove-scc-from-user privileged system:serviceaccount:prometheus:prometheus 91 | ---- 92 | 93 | You will also need to clean up the directory /var/lib/prometheus-data on the Infranode(s) and remove the label `prometheus-host=true` from the Infranode(s). 94 | -------------------------------------------------------------------------------- /alertmanager/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM docker.io/centos:7 2 | LABEL maintainer="Wolfgang Kulhanek " 3 | 4 | ENV ALERT_MANAGER_VERSION=0.16.1 5 | 6 | RUN yum -y update && yum -y upgrade && \ 7 | yum -y clean all && \ 8 | curl -L -o /tmp/alert_manager.tar.gz https://github.com/prometheus/alertmanager/releases/download/v$ALERT_MANAGER_VERSION/alertmanager-$ALERT_MANAGER_VERSION.linux-amd64.tar.gz && \ 9 | tar -xzf /tmp/alert_manager.tar.gz && \ 10 | mv ./alertmanager-$ALERT_MANAGER_VERSION.linux-amd64/alertmanager /bin && \ 11 | rm -rf ./alertmanager-$ALERT_MANAGER_VERSION.linux-amd64 && \ 12 | rm /tmp/alert_manager.tar.gz 13 | 14 | COPY config.yml /etc/alertmanager/config.yml 15 | 16 | EXPOSE 9093 17 | USER nobody 18 | VOLUME [ "/alertmanager" ] 19 | WORKDIR /alertmanager 20 | ENTRYPOINT [ "/bin/alertmanager" ] 21 | CMD [ "--config.file=/etc/alertmanager/config.yml", \ 22 | "--storage.path=/alertmanager" ] 23 | -------------------------------------------------------------------------------- /alertmanager/README.adoc: -------------------------------------------------------------------------------- 1 | # OpenShift Prometheus Alert Manager 2 | 3 | The Alert Manager will be connected to Prometheus and send alerts when specific alerting rules fire. 4 | 5 | The example implementation makes use of a link:https://prometheus.io/docs/alerting/configuration/#[webhook] to notify a target system. 6 | 7 | The Alertmanager can be configured to support other or additional alerting mechanisms by editing the ConfigMap `alertmanager` either in the template or after the Alertmanager has been deployed (in that case it needs to be restarted to pick up the new configuration). 
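For example, to change the configuration after deployment, you can edit the ConfigMap and trigger a new deployment so the pod picks up the new file (a minimal sketch; the object names match the template in this directory and the _prometheus_ project used throughout this repository):

[source,bash]
----
# Edit the Alertmanager configuration stored in the ConfigMap
oc edit configmap alertmanager -n prometheus

# Roll out a new deployment so the Alertmanager restarts with the new configuration
oc rollout latest dc/alertmanager -n prometheus
----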
8 | 9 | ## Deployment to OpenShift 10 | 11 | Use the following steps to deploy Alertmanager to OpenShift: 12 | 13 | [source,bash] 14 | ---- 15 | oc new-app -f alertmanager.yaml -p VOLUME_CAPACITY=4Gi -p "WEBHOOK_URL=" 16 | ---- 17 | 18 | NOTE: Be sure to substitute the value of the Webhook URL when instantiating the template. 19 | -------------------------------------------------------------------------------- /alertmanager/alertmanager.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Template 3 | metadata: 4 | name: alertmanager 5 | annotations: 6 | "openshift.io/display-name": Prometheus Alert Manager 7 | description: | 8 | A monitoring solution for an OpenShift cluster - collect and gather metrics from nodes, services, and the infrastructure. This component provides the Alert Manager. 9 | iconClass: icon-cogs 10 | tags: "monitoring,prometheus,alertmanager,time-series" 11 | parameters: 12 | - description: The location of the prometheus image 13 | name: IMAGE_ALERTMANAGER 14 | value: wkulhanek/alertmanager:latest 15 | - name: VOLUME_CAPACITY 16 | displayName: Volume Capacity 17 | description: Volume space available for data, e.g. 512Mi, 2Gi. 18 | value: 4Gi 19 | required: true 20 | - name: WEBHOOK_URL 21 | displayName: Webhook URL 22 | description: URL for the Webhook to send alerts. 23 | required: true 24 | objects: 25 | - apiVersion: v1 26 | kind: Route 27 | metadata: 28 | name: alertmanager 29 | spec: 30 | to: 31 | name: alertmanager 32 | - apiVersion: v1 33 | kind: Service 34 | metadata: 35 | annotations: 36 | prometheus.io/scrape: "true" 37 | prometheus.io/scheme: http 38 | labels: 39 | name: alertmanager 40 | name: alertmanager 41 | spec: 42 | ports: 43 | - name: alertmanager 44 | port: 9093 45 | protocol: TCP 46 | targetPort: 9093 47 | selector: 48 | app: alertmanager 49 | - apiVersion: v1 50 | kind: DeploymentConfig 51 | metadata: 52 | labels: 53 | app: alertmanager 54 | name: alertmanager 55 | spec: 56 | replicas: 1 57 | selector: 58 | app: alertmanager 59 | deploymentconfig: alertmanager 60 | template: 61 | metadata: 62 | labels: 63 | app: alertmanager 64 | deploymentconfig: alertmanager 65 | name: alertmanager 66 | spec: 67 | containers: 68 | - name: alertmanager 69 | args: 70 | - --config.file=/etc/alertmanager/config.yml 71 | - --storage.path=/alertmanager 72 | image: ${IMAGE_ALERTMANAGER} 73 | imagePullPolicy: IfNotPresent 74 | livenessProbe: 75 | failureThreshold: 3 76 | httpGet: 77 | path: /#/status 78 | port: 9093 79 | scheme: HTTP 80 | initialDelaySeconds: 2 81 | periodSeconds: 10 82 | successThreshold: 1 83 | timeoutSeconds: 1 84 | readinessProbe: 85 | failureThreshold: 3 86 | httpGet: 87 | path: /#/status 88 | port: 9093 89 | scheme: HTTP 90 | initialDelaySeconds: 2 91 | periodSeconds: 10 92 | successThreshold: 1 93 | timeoutSeconds: 1 94 | 95 | volumeMounts: 96 | - mountPath: /etc/alertmanager 97 | name: config-volume 98 | - mountPath: /alertmanager 99 | name: data-volume 100 | restartPolicy: Always 101 | volumes: 102 | - name: data-volume 103 | persistentVolumeClaim: 104 | claimName: alertmanager-data-pvc 105 | - configMap: 106 | name: alertmanager 107 | name: config-volume 108 | - apiVersion: v1 109 | kind: PersistentVolumeClaim 110 | metadata: 111 | name: alertmanager-data-pvc 112 | spec: 113 | accessModes: 114 | - ReadWriteMany 115 | resources: 116 | requests: 117 | storage: "${VOLUME_CAPACITY}" 118 | - apiVersion: v1 119 | kind: ConfigMap 120 | metadata: 121 | name: alertmanager 122 | data: 123 | 
config.yml: | 124 | global: 125 | 126 | # The root route on which each incoming alert enters. 127 | route: 128 | # The root route must not have any matchers as it is the entry point for 129 | # all alerts. It needs to have a receiver configured so alerts that do not 130 | # match any of the sub-routes are sent to someone. 131 | receiver: 'webhook' 132 | 133 | # The labels by which incoming alerts are grouped together. For example, 134 | # multiple alerts coming in for cluster=A and alertname=LatencyHigh would 135 | # be batched into a single group. 136 | group_by: ['alertname', 'cluster'] 137 | 138 | # When a new group of alerts is created by an incoming alert, wait at 139 | # least 'group_wait' to send the initial notification. 140 | # This way ensures that you get multiple alerts for the same group that start 141 | # firing shortly after another are batched together on the first 142 | # notification. 143 | group_wait: 30s 144 | 145 | # When the first notification was sent, wait 'group_interval' to send a batch 146 | # of new alerts that started firing for that group. 147 | group_interval: 5m 148 | 149 | # If an alert has successfully been sent, wait 'repeat_interval' to 150 | # resend them. 151 | repeat_interval: 3h 152 | 153 | receivers: 154 | - name: 'webhook' 155 | webhook_configs: 156 | - send_resolved: true 157 | url: '${WEBHOOK_URL}' 158 | -------------------------------------------------------------------------------- /alertmanager/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export VERSION=0.16.1 3 | docker build . -t wkulhanek/alertmanager:latest 4 | docker tag wkulhanek/alertmanager:latest wkulhanek/alertmanager:${VERSION} 5 | docker push wkulhanek/alertmanager:latest 6 | docker push wkulhanek/alertmanager:${VERSION} 7 | -------------------------------------------------------------------------------- /alertmanager/config.yml: -------------------------------------------------------------------------------- 1 | global: 2 | # The smarthost and SMTP sender used for mail notifications. 3 | smtp_smarthost: 'localhost:25' 4 | smtp_from: 'alertmanager@openshift.opentlc.com' 5 | 6 | # The root route on which each incoming alert enters. 7 | route: 8 | # The root route must not have any matchers as it is the entry point for 9 | # all alerts. It needs to have a receiver configured so alerts that do not 10 | # match any of the sub-routes are sent to someone. 11 | receiver: 'team-X-mails' 12 | 13 | # The labels by which incoming alerts are grouped together. For example, 14 | # multiple alerts coming in for cluster=A and alertname=LatencyHigh would 15 | # be batched into a single group. 16 | group_by: ['alertname', 'cluster'] 17 | 18 | # When a new group of alerts is created by an incoming alert, wait at 19 | # least 'group_wait' to send the initial notification. 20 | # This way ensures that you get multiple alerts for the same group that start 21 | # firing shortly after another are batched together on the first 22 | # notification. 23 | group_wait: 30s 24 | 25 | # When the first notification was sent, wait 'group_interval' to send a batch 26 | # of new alerts that started firing for that group. 27 | group_interval: 5m 28 | 29 | # If an alert has successfully been sent, wait 'repeat_interval' to 30 | # resend them. 31 | repeat_interval: 3h 32 | 33 | # All the above attributes are inherited by all child routes and can 34 | # overwritten on each. 35 | 36 | # The child route trees. 
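# Worked example (based solely on the routes defined below): an alert carrying
# the labels service="foo1", severity="critical" matches the first child route
# and then its 'critical' sub-route, so it is sent to 'team-X-pager'. The same
# alert with severity="warning" matches no sub-route and falls back to the
# child route's receiver 'team-X-mails'.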
37 | routes: 38 | # This routes performs a regular expression match on alert labels to 39 | # catch alerts that are related to a list of services. 40 | - match_re: 41 | service: ^(foo1|foo2|baz)$ 42 | receiver: team-X-mails 43 | 44 | # The service has a sub-route for critical alerts, any alerts 45 | # that do not match, i.e. severity != critical, fall-back to the 46 | # parent node and are sent to 'team-X-mails' 47 | routes: 48 | - match: 49 | severity: critical 50 | receiver: team-X-pager 51 | 52 | - match: 53 | service: files 54 | receiver: team-Y-mails 55 | 56 | routes: 57 | - match: 58 | severity: critical 59 | receiver: team-Y-pager 60 | 61 | # This route handles all alerts coming from a database service. If there's 62 | # no team to handle it, it defaults to the DB team. 63 | - match: 64 | service: database 65 | 66 | receiver: team-DB-pager 67 | # Also group alerts by affected database. 68 | group_by: [alertname, cluster, database] 69 | 70 | routes: 71 | - match: 72 | owner: team-X 73 | receiver: team-X-pager 74 | 75 | - match: 76 | owner: team-Y 77 | receiver: team-Y-pager 78 | 79 | 80 | # Inhibition rules allow to mute a set of alerts given that another alert is 81 | # firing. 82 | # We use this to mute any warning-level notifications if the same alert is 83 | # already critical. 84 | inhibit_rules: 85 | - source_match: 86 | severity: 'critical' 87 | target_match: 88 | severity: 'warning' 89 | # Apply inhibition if the alertname is the same. 90 | equal: ['alertname'] 91 | 92 | 93 | receivers: 94 | - name: 'team-X-mails' 95 | email_configs: 96 | - to: 'team-X+alerts@example.org' 97 | 98 | - name: 'team-X-pager' 99 | email_configs: 100 | - to: 'team-X+alerts-critical@example.org' 101 | pagerduty_configs: 102 | - service_key: 103 | 104 | - name: 'team-Y-mails' 105 | email_configs: 106 | - to: 'team-Y+alerts@example.org' 107 | 108 | - name: 'team-Y-pager' 109 | pagerduty_configs: 110 | - service_key: 111 | 112 | - name: 'team-DB-pager' 113 | pagerduty_configs: 114 | - service_key: 115 | -------------------------------------------------------------------------------- /install_everything.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # This Playbook sets up Prometheus on an OpenTLC Cluster 3 | - hosts: localhost 4 | gather_facts: false 5 | tasks: 6 | # Find out which nodes are Infranodes and add them to a new group: infranodes 7 | - name: Add Infranodes to a new infranodes Group 8 | add_host: 9 | name: "{{ item }}" 10 | groups: infranodes 11 | with_items: "{{ groups['nodes'] }}" 12 | when: 13 | - item | match("^infranode.*") 14 | - name: Add Masters to a new masters Group 15 | add_host: 16 | name: "{{ item }}" 17 | groups: masters 18 | with_items: "{{ groups['nodes'] }}" 19 | when: 20 | - item | match("^master.*") 21 | 22 | - hosts: infranodes[0] 23 | remote_user: ec2_user 24 | become: yes 25 | become_user: root 26 | tasks: 27 | # OpenShift Routers expose /metrics on port 1936. Therefore we need to open 28 | # the port for both future and current sessions so that Prometheus can access 29 | # the router metrics. 30 | # Open Firewall Port 1936 for future sessions by adding the rule to 31 | # the iptables file. 
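    # (Optional manual check, assuming shell access to the infranode: after this
    # play has run, `iptables -nL OS_FIREWALL_ALLOW | grep 1936` should show the
    # ACCEPT rule for the current session, and `grep 1936 /etc/sysconfig/iptables`
    # should show the persisted line.)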
32 | - name: Open Firewall port 1936 for future sessions 33 | lineinfile: 34 | dest: /etc/sysconfig/iptables 35 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 36 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 1936 -j ACCEPT' 37 | state: present 38 | tags: 39 | - prometheus 40 | # Open Firewall Port 1936 for current session by adding the rule to the 41 | # current iptables configuration. We won't need to restart the iptables 42 | # service - which will ensure all OpenShift rules stay in place. 43 | - name: Open Firewall Port 1936 for current session 44 | iptables: 45 | action: insert 46 | protocol: tcp 47 | destination_port: 1936 48 | state: present 49 | chain: OS_FIREWALL_ALLOW 50 | jump: ACCEPT 51 | tags: 52 | - prometheus 53 | # Create Directory /var/lib/prometheus-data with correct permissions 54 | # Make sure the directory has SELinux Type svirt_sandbox_file_t otherwise 55 | # there is a permissions problem trying to mount it into the pod. 56 | # If there are more than one infranodes this directory will be created on all 57 | # infranodes - but only used on the first one 58 | - name: Create directory /var/lib/prometheus-data 59 | file: 60 | path: /var/lib/prometheus-data 61 | state: directory 62 | group: root 63 | owner: root 64 | mode: 0777 65 | setype: svirt_sandbox_file_t 66 | tags: 67 | - prometheus 68 | 69 | # Configure all Nodes (including Infranodes and Masters) for monitoring 70 | - hosts: nodes 71 | remote_user: ec2_user 72 | become: yes 73 | become_user: root 74 | tasks: 75 | # Node Exporters on all Nodes liston on port 9100. 76 | # Open Firewall Port 9100 for future sessions by adding the rule to 77 | # the iptables file. 78 | - name: Open Firewall port 9100 for future sessions 79 | lineinfile: 80 | dest: /etc/sysconfig/iptables 81 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 82 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT' 83 | state: present 84 | tags: 85 | - prometheus 86 | # Open Firewall Port 9100 for current session by adding the rule to the 87 | # current iptables configuration. We won't need to restart the iptables 88 | # service - which will ensure all OpenShift rules stay in place. 89 | - name: Open Firewall Port 9100 for current session 90 | iptables: 91 | action: insert 92 | protocol: tcp 93 | destination_port: 9100 94 | state: present 95 | chain: OS_FIREWALL_ALLOW 96 | jump: ACCEPT 97 | tags: 98 | - prometheus 99 | # The Node Exporter reads information from the Nodes. In addition it can 100 | # read arbitrary information from a (properly formatted) text file. 101 | # We have a shell script that puts information about Docker into a textfile 102 | # to be read by the Node Exporter. Therefore we need to create the directory 103 | # where the text file is to be written. 
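    # For reference, the text file uses the Prometheus text exposition format,
    # i.e. one "<metric_name> <value>" pair per line, for example:
    #   node_docker_running_containers 12
    # (see node-exporter/dockerinfo/dockerinfo.sh for the metrics actually
    # written; the value 12 above is only an illustrative sample).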
104 | - name: Create textfile_collector directory 105 | file: 106 | path: /var/lib/node_exporter/textfile_collector 107 | state: directory 108 | owner: root 109 | group: root 110 | mode: 0775 111 | tags: 112 | - prometheus 113 | # Copy the shell script to the nodes to collect docker information and write 114 | # to a text file 115 | - name: Copy dockerinfo to node 116 | get_url: 117 | url: https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/node-exporter/dockerinfo/dockerinfo.sh 118 | dest: /usr/local/bin/dockerinfo.sh 119 | owner: root 120 | group: root 121 | mode: 0755 122 | tags: 123 | - prometheus 124 | # Create a cron job to run the dockerinfo shell script periodically. 125 | - name: Copy cron.d/docker_info.cron to node 126 | get_url: 127 | url: https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/node-exporter/dockerinfo/dockerinfo.cron 128 | dest: /etc/cron.d/dockerinfo.cron 129 | owner: root 130 | group: root 131 | mode: 0644 132 | tags: 133 | - prometheus 134 | # Restart crond service to pick up new cron job 135 | - name: Restart crond service on node 136 | systemd: 137 | name: crond 138 | state: restarted 139 | tags: 140 | - prometheus 141 | 142 | # Finally create all the necessary OpenShift objects. This happens via the 143 | # oc binary on the (first) master host. 144 | - hosts: masters[0] 145 | remote_user: ec2_user 146 | become: yes 147 | become_user: root 148 | vars: 149 | # The Web Hook URL for the Alert Manager to send alerts to Rocket Chat 150 | # The default is #gpte-devops-prometheus channel on the Red Hat Chat instance 151 | webhook_url: https://chat.consulting.redhat.com/hooks/LTtLntjbTBNvij6br/Rfa6WZSANJJBB8QDsu5nhynQ2sJG2wrSL3BLzbxYJqit3EGk 152 | tasks: 153 | # Add label "prometheus-host=true" to the first infranode 154 | - name: Label Infranodes with prometheus-host=true 155 | shell: oc label node {{ groups['infranodes'][0] }} prometheus-host=true --overwrite 156 | tags: 157 | - prometheus 158 | # Check if there is already a prometheus project 159 | - name: Check for prometheus project 160 | command: "oc get project prometheus" 161 | register: prometheus_project_present 162 | ignore_errors: true 163 | # Create the Prometheus Project if it's not there yet 164 | - name: Create Prometheus Project 165 | shell: oc new-project prometheus --display-name="Prometheus Monitoring" 166 | when: prometheus_project_present | failed 167 | tags: 168 | - prometheus 169 | - name: Set Node Selectors to empty on Prometheus Project 170 | shell: oc annotate namespace prometheus openshift.io/node-selector="" 171 | when: prometheus_project_present | failed 172 | tags: 173 | - prometheus 174 | - name: Determine Router Password 175 | shell: oc set env dc router -n default --list|grep STATS_PASSWORD|awk -F"=" '{print $2}' 176 | when: prometheus_project_present | failed 177 | register: router_password 178 | tags: 179 | - prometheus 180 | - name: Deploy Prometheus 181 | shell: oc new-app -f https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/prometheus.yaml --param ROUTER_PASSWORD={{ router_password.stdout }} 182 | when: prometheus_project_present | failed 183 | tags: 184 | - prometheus 185 | - name: Grant privileged SCC to Prometheus Service account 186 | shell: oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:prometheus 187 | when: prometheus_project_present | failed 188 | tags: 189 | - prometheus 190 | - name: Grant privileged SCC to default service account for Node Exporter 191 | shell: oc adm policy 
add-scc-to-user privileged system:serviceaccount:prometheus:default 192 | when: prometheus_project_present | failed 193 | tags: 194 | - prometheus 195 | - name: Deploy Node Exporter Daemon Set 196 | shell: oc new-app -f https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/node-exporter/node-exporter.yaml 197 | when: prometheus_project_present | failed 198 | tags: 199 | - prometheus 200 | - name: Deploy Alertmanager 201 | shell: oc new-app -f https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/alertmanager/alertmanager.yaml -p "WEBHOOK_URL={{ webhook_url }}" 202 | when: prometheus_project_present | failed 203 | tags: 204 | - prometheus 205 | - alertmanager 206 | - name: Move Alertmanager to an Infranode 207 | command: "oc patch dc alertmanager --patch '{ \"spec\": { \"template\": { \"spec\": { \"nodeSelector\": { \"env\":\"infra\"}}}}}' " 208 | when: prometheus_project_present | failed 209 | tags: 210 | - prometheus 211 | - alertmanager 212 | - name: Deploy Grafana 213 | shell: oc new-app -f https://raw.githubusercontent.com/wkulhanek/docker-openshift-grafana/master/grafana.yaml 214 | when: prometheus_project_present | failed 215 | tags: 216 | - prometheus 217 | - grafana 218 | - name: Move Grafana to an Infranode 219 | command: "oc patch dc grafana --patch '{ \"spec\": { \"template\": { \"spec\": { \"nodeSelector\": { \"env\":\"infra\"}}}}}' " 220 | when: prometheus_project_present | failed 221 | tags: 222 | - prometheus 223 | - grafana 224 | -------------------------------------------------------------------------------- /node-exporter/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM docker.io/centos:7 2 | LABEL maintainer="Wolfgang Kulhanek " 3 | 4 | ENV NODE_EXPORTER_VERSION=0.17.0 5 | 6 | RUN yum -y update && yum -y upgrade && yum clean all && \ 7 | curl -L -o /tmp/node_exporter.tar.gz https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz && \ 8 | tar -xzf /tmp/node_exporter.tar.gz && \ 9 | mv ./node_exporter-$NODE_EXPORTER_VERSION.linux-amd64/node_exporter /bin && \ 10 | rm -rf ./node_exporter-$NODE_EXPORTER_VERSION.linux-amd64 && \ 11 | rm /tmp/node_exporter.tar.gz && \ 12 | mkdir /textfile_collector 13 | 14 | EXPOSE 9100 15 | USER nobody 16 | ENTRYPOINT [ "/bin/node_exporter" ] 17 | -------------------------------------------------------------------------------- /node-exporter/README.adoc: -------------------------------------------------------------------------------- 1 | # OpenShift Prometheus Node Exporter 2 | 3 | The Node Exporter provides OS and system level metrics for Prometheus. The exporter runs as a link:https://docs.openshift.com/container-platform/latest/dev_guide/daemonsets.html[DaemonSet] to guarantee each node runs at least one copy of the application. 4 | 5 | ## Deployment to OpenShift 6 | 7 | To deploy the node-exporter to OpenShift, complete the following steps. 8 | 9 | ### Open Firewall Ports 10 | 11 | #### Ansible Version 12 | 13 | The easy way to open the firewall ports on all nodes is to run the provided Ansible playbook. It will open port 9100 and then restart the IP tables service. 14 | 15 | [source,bash] 16 | ---- 17 | ansible-playbook -i /etc/ansible/hosts update_firewall.yml 18 | ---- 19 | 20 | #### Manual Version 21 | 22 | To manually configure the firewall follow the following steps: 23 | 24 | . 
Port 9100 needs to be opened on each OpenShift host in order for the Prometheus server to scrape the metrics. 25 | + 26 | Add the following line to `/etc/sysconfig/iptables`: 27 | + 28 | [source,bash] 29 | ---- 30 | -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT 31 | ---- 32 | + 33 | . Restart _iptables_ and OpenShift services in order to properly rebuild the rules 34 | + 35 | NOTE: The following commands will cause all running containers on the node to stop and restart 36 | + 37 | [source,bash] 38 | ---- 39 | systemctl reload iptables 40 | systemctl restart iptables.service 41 | systemctl restart docker 42 | systemctl restart atomic-openshift-node.service 43 | ---- 44 | 45 | ### Elevated Access 46 | 47 | Since the node exporter will be accessing resources from each host, the service account being used to run the pod must be granted elevated access. Execute the following command to add the _default_ Service Account in the _prometheus_ project to the _privileged_ SCC: 48 | 49 | [source,bash] 50 | ---- 51 | oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:default 52 | ---- 53 | 54 | ### Node Selector 55 | 56 | If there is a default project node selector for the OpenShift cluster (e.g. *env=users*) it is necessary to set en empty node selector for the Prometheus project. Otherwise the Daemon Set for creating the node-exporter pods will fail with a `MatchNodeSelector` error on nodes that don't have that particular label. 57 | 58 | [NOTE] 59 | If your OpenShift Cluster does not have a default project node selector you can skip this section. 60 | 61 | To set an empty node selector: 62 | 63 | [source,bash] 64 | ---- 65 | oc annotate namespace prometheus openshift.io/node-selector="" 66 | ---- 67 | 68 | ### Instantiate the Template 69 | 70 | Execute the following command to add and instantiate the node-exporter template 71 | 72 | [source,bash] 73 | ---- 74 | oc new-app -f node-exporter.yaml 75 | ---- 76 | -------------------------------------------------------------------------------- /node-exporter/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export VERSION=0.17.0 3 | docker build . 
-t wkulhanek/node-exporter:latest 4 | docker tag wkulhanek/node-exporter:latest wkulhanek/node-exporter:${VERSION} 5 | docker push wkulhanek/node-exporter:latest 6 | docker push wkulhanek/node-exporter:${VERSION} 7 | -------------------------------------------------------------------------------- /node-exporter/dockerinfo/DockerStatus.json: -------------------------------------------------------------------------------- 1 | { 2 | "__inputs": [ 3 | { 4 | "name": "DS_DS-PROMETHEUS", 5 | "label": "DS-Prometheus", 6 | "description": "", 7 | "type": "datasource", 8 | "pluginId": "prometheus", 9 | "pluginName": "Prometheus" 10 | } 11 | ], 12 | "__requires": [ 13 | { 14 | "type": "grafana", 15 | "id": "grafana", 16 | "name": "Grafana", 17 | "version": "4.4.1" 18 | }, 19 | { 20 | "type": "panel", 21 | "id": "graph", 22 | "name": "Graph", 23 | "version": "" 24 | }, 25 | { 26 | "type": "datasource", 27 | "id": "prometheus", 28 | "name": "Prometheus", 29 | "version": "1.0.0" 30 | }, 31 | { 32 | "type": "panel", 33 | "id": "singlestat", 34 | "name": "Singlestat", 35 | "version": "" 36 | }, 37 | { 38 | "type": "panel", 39 | "id": "table", 40 | "name": "Table", 41 | "version": "" 42 | } 43 | ], 44 | "annotations": { 45 | "list": [] 46 | }, 47 | "description": "Displays status of Docker Service on Nodes", 48 | "editable": true, 49 | "gnetId": null, 50 | "graphTooltip": 0, 51 | "hideControls": false, 52 | "id": null, 53 | "links": [], 54 | "refresh": "10s", 55 | "rows": [ 56 | { 57 | "collapse": false, 58 | "height": 328, 59 | "panels": [ 60 | { 61 | "aliasColors": {}, 62 | "bars": true, 63 | "dashLength": 10, 64 | "dashes": false, 65 | "datasource": "${DS_DS-PROMETHEUS}", 66 | "description": "Percentage of usage of the Docker Volume Group on each host.", 67 | "fill": 1, 68 | "id": 4, 69 | "legend": { 70 | "avg": false, 71 | "current": false, 72 | "max": false, 73 | "min": false, 74 | "show": false, 75 | "total": false, 76 | "values": false 77 | }, 78 | "lines": false, 79 | "linewidth": 1, 80 | "links": [], 81 | "nullPointMode": "null", 82 | "percentage": false, 83 | "pointradius": 5, 84 | "points": false, 85 | "renderer": "flot", 86 | "seriesOverrides": [], 87 | "spaceLength": 10, 88 | "span": 6, 89 | "stack": false, 90 | "steppedLine": false, 91 | "targets": [ 92 | { 93 | "expr": "node_docker_volume_data_percent_full{instance !~ \"^master1.*\"}", 94 | "format": "time_series", 95 | "intervalFactor": 2, 96 | "legendFormat": "", 97 | "metric": "node_docker_v", 98 | "refId": "A", 99 | "step": 10 100 | } 101 | ], 102 | "thresholds": [ 103 | { 104 | "colorMode": "critical", 105 | "fill": true, 106 | "line": true, 107 | "op": "gt", 108 | "value": 80 109 | } 110 | ], 111 | "timeFrom": null, 112 | "timeShift": null, 113 | "title": "Docker Volume % Used", 114 | "tooltip": { 115 | "shared": false, 116 | "sort": 0, 117 | "value_type": "individual" 118 | }, 119 | "type": "graph", 120 | "xaxis": { 121 | "buckets": null, 122 | "mode": "series", 123 | "name": null, 124 | "show": false, 125 | "values": [ 126 | "current" 127 | ] 128 | }, 129 | "yaxes": [ 130 | { 131 | "format": "percent", 132 | "label": "", 133 | "logBase": 1, 134 | "max": "100", 135 | "min": "0", 136 | "show": true 137 | }, 138 | { 139 | "format": "short", 140 | "label": null, 141 | "logBase": 1, 142 | "max": null, 143 | "min": null, 144 | "show": false 145 | } 146 | ] 147 | }, 148 | { 149 | "aliasColors": {}, 150 | "bars": true, 151 | "dashLength": 10, 152 | "dashes": false, 153 | "datasource": "${DS_DS-PROMETHEUS}", 154 | "fill": 1, 155 | "id": 5, 156 
| "legend": { 157 | "avg": false, 158 | "current": false, 159 | "max": false, 160 | "min": false, 161 | "show": false, 162 | "total": false, 163 | "values": false 164 | }, 165 | "lines": false, 166 | "linewidth": 1, 167 | "links": [], 168 | "nullPointMode": "null", 169 | "percentage": false, 170 | "pointradius": 5, 171 | "points": false, 172 | "renderer": "flot", 173 | "seriesOverrides": [], 174 | "spaceLength": 10, 175 | "span": 6, 176 | "stack": false, 177 | "steppedLine": false, 178 | "targets": [ 179 | { 180 | "expr": "((node_filesystem_size{mountpoint=\"/\", device=\"rootfs\"}-node_filesystem_avail{mountpoint=\"/\", device=\"rootfs\"}) / node_filesystem_size{mountpoint=\"/\", device=\"rootfs\"} ) * 100", 181 | "format": "time_series", 182 | "intervalFactor": 2, 183 | "refId": "A", 184 | "step": 10 185 | } 186 | ], 187 | "thresholds": [ 188 | { 189 | "colorMode": "critical", 190 | "fill": true, 191 | "line": true, 192 | "op": "gt", 193 | "value": 80 194 | } 195 | ], 196 | "timeFrom": null, 197 | "timeShift": null, 198 | "title": "Root Filesystem % Used", 199 | "tooltip": { 200 | "shared": false, 201 | "sort": 0, 202 | "value_type": "individual" 203 | }, 204 | "type": "graph", 205 | "xaxis": { 206 | "buckets": null, 207 | "mode": "series", 208 | "name": null, 209 | "show": false, 210 | "values": [ 211 | "current" 212 | ] 213 | }, 214 | "yaxes": [ 215 | { 216 | "format": "percent", 217 | "label": "", 218 | "logBase": 1, 219 | "max": "100", 220 | "min": "0", 221 | "show": true 222 | }, 223 | { 224 | "format": "short", 225 | "label": null, 226 | "logBase": 1, 227 | "max": null, 228 | "min": null, 229 | "show": false 230 | } 231 | ] 232 | } 233 | ], 234 | "repeat": null, 235 | "repeatIteration": null, 236 | "repeatRowId": null, 237 | "showTitle": true, 238 | "title": "Docker Volume Availabe", 239 | "titleSize": "h4" 240 | }, 241 | { 242 | "collapse": false, 243 | "height": 250, 244 | "panels": [ 245 | { 246 | "aliasColors": {}, 247 | "bars": false, 248 | "dashLength": 10, 249 | "dashes": false, 250 | "datasource": "${DS_DS-PROMETHEUS}", 251 | "fill": 0, 252 | "id": 8, 253 | "legend": { 254 | "avg": false, 255 | "current": false, 256 | "max": false, 257 | "min": false, 258 | "show": true, 259 | "total": false, 260 | "values": false 261 | }, 262 | "lines": true, 263 | "linewidth": 3, 264 | "links": [], 265 | "nullPointMode": "null", 266 | "percentage": false, 267 | "pointradius": 5, 268 | "points": false, 269 | "renderer": "flot", 270 | "seriesOverrides": [], 271 | "spaceLength": 10, 272 | "span": 6, 273 | "stack": false, 274 | "steppedLine": false, 275 | "targets": [ 276 | { 277 | "expr": "node_docker_running_containers", 278 | "format": "time_series", 279 | "intervalFactor": 2, 280 | "metric": "node_docker_running_containers", 281 | "refId": "A", 282 | "step": 10 283 | } 284 | ], 285 | "thresholds": [], 286 | "timeFrom": null, 287 | "timeShift": null, 288 | "title": "Docker Containers Running", 289 | "tooltip": { 290 | "shared": true, 291 | "sort": 0, 292 | "value_type": "individual" 293 | }, 294 | "type": "graph", 295 | "xaxis": { 296 | "buckets": null, 297 | "mode": "time", 298 | "name": null, 299 | "show": true, 300 | "values": [] 301 | }, 302 | "yaxes": [ 303 | { 304 | "format": "short", 305 | "label": null, 306 | "logBase": 1, 307 | "max": null, 308 | "min": null, 309 | "show": true 310 | }, 311 | { 312 | "format": "short", 313 | "label": null, 314 | "logBase": 1, 315 | "max": null, 316 | "min": null, 317 | "show": false 318 | } 319 | ] 320 | }, 321 | { 322 | "cacheTimeout": null, 323 
| "colorBackground": false, 324 | "colorValue": false, 325 | "colors": [ 326 | "rgba(245, 54, 54, 0.9)", 327 | "rgba(237, 129, 40, 0.89)", 328 | "rgba(50, 172, 45, 0.97)" 329 | ], 330 | "datasource": "${DS_DS-PROMETHEUS}", 331 | "format": "s", 332 | "gauge": { 333 | "maxValue": 100, 334 | "minValue": 0, 335 | "show": false, 336 | "thresholdLabels": false, 337 | "thresholdMarkers": true 338 | }, 339 | "id": 7, 340 | "interval": null, 341 | "links": [], 342 | "mappingType": 1, 343 | "mappingTypes": [ 344 | { 345 | "name": "value to text", 346 | "value": 1 347 | }, 348 | { 349 | "name": "range to text", 350 | "value": 2 351 | } 352 | ], 353 | "maxDataPoints": 100, 354 | "nullPointMode": "connected", 355 | "nullText": null, 356 | "postfix": "", 357 | "postfixFontSize": "50%", 358 | "prefix": "", 359 | "prefixFontSize": "50%", 360 | "rangeMaps": [ 361 | { 362 | "from": "null", 363 | "text": "N/A", 364 | "to": "null" 365 | } 366 | ], 367 | "span": 6, 368 | "sparkline": { 369 | "fillColor": "rgba(31, 118, 189, 0.18)", 370 | "full": false, 371 | "lineColor": "rgb(31, 120, 193)", 372 | "show": false 373 | }, 374 | "tableColumn": "Time", 375 | "targets": [ 376 | { 377 | "expr": "time(node_docker_last_successful_update)", 378 | "format": "table", 379 | "interval": "", 380 | "intervalFactor": 2, 381 | "metric": "node_docker_last_successful_update", 382 | "refId": "A", 383 | "step": 60 384 | } 385 | ], 386 | "thresholds": "", 387 | "title": "Last Successful Docker Update", 388 | "type": "singlestat", 389 | "valueFontSize": "80%", 390 | "valueMaps": [ 391 | { 392 | "op": "=", 393 | "text": "N/A", 394 | "value": "null" 395 | } 396 | ], 397 | "valueName": "current" 398 | } 399 | ], 400 | "repeat": null, 401 | "repeatIteration": null, 402 | "repeatRowId": null, 403 | "showTitle": false, 404 | "title": "Dashboard Row", 405 | "titleSize": "h6" 406 | }, 407 | { 408 | "collapse": false, 409 | "height": 250, 410 | "panels": [ 411 | { 412 | "columns": [], 413 | "fontSize": "100%", 414 | "id": 9, 415 | "links": [], 416 | "pageSize": null, 417 | "scroll": true, 418 | "showHeader": true, 419 | "sort": { 420 | "col": 2, 421 | "desc": true 422 | }, 423 | "span": 12, 424 | "styles": [ 425 | { 426 | "alias": "", 427 | "colorMode": null, 428 | "colors": [ 429 | "rgba(245, 54, 54, 0.9)", 430 | "rgba(237, 129, 40, 0.89)", 431 | "rgba(50, 172, 45, 0.97)" 432 | ], 433 | "decimals": 2, 434 | "pattern": "/.*/", 435 | "thresholds": [], 436 | "type": "number", 437 | "unit": "short" 438 | } 439 | ], 440 | "targets": [ 441 | { 442 | "expr": "node_docker_last_successful_update", 443 | "format": "table", 444 | "intervalFactor": 2, 445 | "metric": "node_docker_last_successful_update_month", 446 | "refId": "A", 447 | "step": 4 448 | } 449 | ], 450 | "title": "Panel Title", 451 | "transform": "table", 452 | "type": "table" 453 | } 454 | ], 455 | "repeat": null, 456 | "repeatIteration": null, 457 | "repeatRowId": null, 458 | "showTitle": false, 459 | "title": "Dashboard Row", 460 | "titleSize": "h6" 461 | } 462 | ], 463 | "schemaVersion": 14, 464 | "style": "dark", 465 | "tags": [], 466 | "templating": { 467 | "list": [] 468 | }, 469 | "time": { 470 | "from": "now-1h", 471 | "to": "now" 472 | }, 473 | "timepicker": { 474 | "refresh_intervals": [ 475 | "5s", 476 | "10s", 477 | "30s", 478 | "1m", 479 | "5m", 480 | "15m", 481 | "30m", 482 | "1h", 483 | "2h", 484 | "1d" 485 | ], 486 | "time_options": [ 487 | "5m", 488 | "15m", 489 | "1h", 490 | "6h", 491 | "12h", 492 | "24h", 493 | "2d", 494 | "7d", 495 | "30d" 496 | ] 497 | }, 498 | 
"timezone": "browser", 499 | "title": "Docker Status", 500 | "version": 12 501 | } -------------------------------------------------------------------------------- /node-exporter/dockerinfo/README.adoc: -------------------------------------------------------------------------------- 1 | == Docker Info 2 | 3 | This directory contains utility scripts to run on every Node with Docker to add information on space in the Docker Volume Group (which is a RAW device that is not mounted and therefore is not available to node exporter). 4 | 5 | It contains of a shell script that collects the available space and writes into a directory on the host. This directory is mounted into the node exporter pods so that the node exporter can report on the contents of any file in that directory. 6 | There is also a cron script that runs every 5 minutes. 7 | 8 | To install run the following command from your OpenShift bastion host (where you have your Ansible hosts file with a list of all nodes): 9 | 10 | [source,bash] 11 | ---- 12 | ansible-playbook -i /etc/ansible/hosts install_dockerinfo.yml 13 | ---- 14 | 15 | There is a Grafana Dashboard (DockerStatus.json) that reads the values from this utility script and displays them in a graphical fashion. Simply import the Dashboard into Grafana once Grafana has been set up and connected to Prometheus. 16 | -------------------------------------------------------------------------------- /node-exporter/dockerinfo/dockerinfo.cron: -------------------------------------------------------------------------------- 1 | */1 * * * * root /usr/local/bin/dockerinfo.sh 2 | -------------------------------------------------------------------------------- /node-exporter/dockerinfo/dockerinfo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # Collect Docker Volume Group Space Information 4 | /sbin/lvs docker-vg --units b |grep docker-pool|awk -c '{print "node_docker_volume_size_bytes " substr($4, 1, length($4)-1)}' >/tmp/docker_info.prom 5 | /sbin/lvs docker-vg --units b |grep docker-pool|awk -c '{print "node_docker_volume_data_percent_full " $5}' >>/tmp/docker_info.prom 6 | /sbin/lvs docker-vg --units b |grep docker-pool|awk -c '{print "node_docker_volume_meta_percent_full " $6}' >>/tmp/docker_info.prom 7 | echo "node_docker_running_containers " $(docker ps -q |wc -l) >>/tmp/docker_info.prom 8 | echo "node_docker_last_successful_update{year=\"`date +%Y`\", month=\"`date +%m`\", day=\"`date +%d`\", hour=\"`date +%H`\", minute=\"`date +%M`\", second=\"`date +%S`\"} " `date +%s` >>/tmp/docker_info.prom 9 | 10 | mv /tmp/docker_info.prom /var/lib/node_exporter/textfile_collector 11 | -------------------------------------------------------------------------------- /node-exporter/dockerinfo/install_dockerinfo.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - hosts: nodes 3 | remote_user: ec2_user 4 | become: yes 5 | become_user: root 6 | tasks: 7 | - name: Create textfile_collector directory 8 | file: 9 | path: /var/lib/node_exporter/textfile_collector 10 | state: directory 11 | owner: root 12 | group: root 13 | mode: 0775 14 | - name: Copy dockerinfo to node 15 | copy: 16 | src: dockerinfo.sh 17 | dest: /usr/local/bin/dockerinfo.sh 18 | owner: root 19 | group: root 20 | mode: 0755 21 | - name: Copy cron.d/docker_info.cron to node 22 | copy: 23 | src: dockerinfo.cron 24 | dest: /etc/cron.d/dockerinfo.cron 25 | owner: root 26 | group: root 27 | mode: 0644 28 | - name: Restart crond 
service on node 29 | systemd: 30 | name: crond 31 | state: restarted 32 | -------------------------------------------------------------------------------- /node-exporter/native/README.adoc: -------------------------------------------------------------------------------- 1 | # Installing node-exporter natively on an OpenShift Cluster 2 | 3 | If you would like to run node-exporter natively on your OpenShift cluster you can use the playbook to deploy it to all the nodes in your cluster. 4 | 5 | Run the provided Ansible playbook on your OpenShift bastion host and point it to your Ansible hosts file for your OpenShift installation (default location: `/etc/ansible/hosts`). 6 | 7 | This will install the `node_exporter` binary in all hosts that are in the `nodes` group in the referenced hosts file. It will also set up a system service to automatically start the node_exporter. 8 | 9 | Finally the playbook configures the firewall on all nodes to allow for inbound traffic on port 9100. 10 | 11 | [source,bash] 12 | ---- 13 | ansible-playbook -i /etc/ansible/hosts install_node-exporter.yml 14 | ---- 15 | -------------------------------------------------------------------------------- /node-exporter/native/install_node-exporter.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # Get latest Node Exporter software 3 | - hosts: localhost 4 | tasks: 5 | - name: Download latest node_exporter binary 6 | get_url: 7 | url: https://github.com/prometheus/node_exporter/releases/download/v0.15.0/node_exporter-0.15.0.linux-amd64.tar.gz 8 | dest: /tmp/node_exporter.tgz 9 | - name: Untar node_exporter binary 10 | unarchive: 11 | src: /tmp/node_exporter.tgz 12 | dest: /tmp 13 | copy: no 14 | 15 | # Now set up all nodes running OpenShift software with the Node Exporter 16 | - hosts: nodes 17 | remote_user: ec2_user 18 | become: yes 19 | become_user: root 20 | tasks: 21 | - name: Copy node-exporter binary to node 22 | copy: 23 | src: /tmp/node_exporter-0.15.0.linux-amd64/node_exporter 24 | dest: /usr/local/bin/node_exporter 25 | owner: root 26 | group: root 27 | mode: 0755 28 | - name: Set up Service for node-exporter 29 | copy: 30 | src: node_exporter.conf 31 | dest: /etc/systemd/system/node_exporter.service 32 | owner: root 33 | group: root 34 | mode: 0644 35 | - name: Start node_exporter service on node 36 | systemd: 37 | name: node_exporter 38 | enabled: yes 39 | state: started 40 | daemon_reload: yes 41 | # Open Firewall Port 9100 for future sessions by adding the rule to 42 | # the iptables file. 43 | - name: Open Firewall port 9100 for future sessions 44 | lineinfile: 45 | dest: /etc/sysconfig/iptables 46 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 47 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT' 48 | state: present 49 | # Open Firewall Port 9100 for current session by adding the rule to the 50 | # current iptables configuration. 
We won't need to restart the iptables 51 | # service - which will ensure all OpenShift rules stay in place.- name: Restart IP Tables service 52 | - name: Open Firewall Port 9100 for current session 53 | iptables: 54 | action: insert 55 | protocol: tcp 56 | destination_port: 9100 57 | state: present 58 | chain: OS_FIREWALL_ALLOW 59 | jump: ACCEPT 60 | 61 | # Clean up on Localhost 62 | # Delete the downloaded and unarchived files 63 | - hosts: localhost 64 | tasks: 65 | - name: Cleanup localhost /tmp 66 | file: 67 | path: /tmp/node_exporter.tgz 68 | state: absent 69 | - name: Cleanup localhost unarchived directory 70 | file: 71 | path: /tmp/node_exporter-0.15.0.linux-amd64 72 | state: absent 73 | -------------------------------------------------------------------------------- /node-exporter/native/node_exporter.conf: -------------------------------------------------------------------------------- 1 | [Unit] 2 | Description=Node Exporter 3 | 4 | [Service] 5 | User=root 6 | ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory /var/lib/node_exporter/textfile_collector 7 | 8 | [Install] 9 | WantedBy=default.target 10 | -------------------------------------------------------------------------------- /node-exporter/node-exporter.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Template 3 | metadata: 4 | name: node-exporter 5 | annotations: 6 | "openshift.io/display-name": Node Exporter 7 | description: | 8 | Node Exporter for use with Prometheus on Red Hat OpenShift - collect and gather metrics from nodes, services, and the infrastructure. 9 | iconClass: icon-cogs 10 | tags: "monitoring,node-exporter,time-series" 11 | parameters: 12 | - description: The location of the prometheus image 13 | name: IMAGE_NODE_EXPORTER 14 | value: wkulhanek/node-exporter:latest 15 | objects: 16 | 17 | - apiVersion: extensions/v1beta1 18 | kind: DaemonSet 19 | metadata: 20 | name: node-exporter 21 | spec: 22 | template: 23 | metadata: 24 | labels: 25 | app: node-exporter 26 | name: node-exporter 27 | annotations: 28 | prometheus.io/scrape: "true" 29 | prometheus.io/port: "9100" 30 | spec: 31 | hostPID: true 32 | hostNetwork: true 33 | containers: 34 | - image: ${IMAGE_NODE_EXPORTER} 35 | imagePullPolicy: IfNotPresent 36 | name: node-exporter 37 | ports: 38 | - containerPort: 9100 39 | hostPort: 9100 40 | protocol: TCP 41 | name: scrape 42 | securityContext: 43 | privileged: true 44 | args: 45 | - --path.procfs 46 | - /host/proc 47 | - --path.sysfs 48 | - /host/sys 49 | - --collector.filesystem.ignored-mount-points 50 | - '"^/(sys|proc|dev|host|etc)($|/)"' 51 | - --collector.textfile.directory 52 | - /textfile_collector 53 | volumeMounts: 54 | - name: dev 55 | mountPath: /host/dev 56 | - name: proc 57 | mountPath: /host/proc 58 | - name: sys 59 | mountPath: /host/sys 60 | - name: rootfs 61 | mountPath: /rootfs 62 | - name: textfile-collector 63 | mountPath: /textfile_collector 64 | volumes: 65 | - name: proc 66 | hostPath: 67 | path: /proc 68 | - name: dev 69 | hostPath: 70 | path: /dev 71 | - name: sys 72 | hostPath: 73 | path: /sys 74 | - name: rootfs 75 | hostPath: 76 | path: / 77 | - name: textfile-collector 78 | hostPath: 79 | path: /var/lib/node_exporter/textfile_collector 80 | -------------------------------------------------------------------------------- /node-exporter/setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # you should set securityContext privileged true 
on container part of node-exporter.yaml file to collect filesystem metrics! 4 | # Change into the Prometheus project 5 | oc project prometheus 6 | 7 | # The Service Account running the node selector pods needs the SCC `hostaccess`: 8 | 9 | # oc adm policy add-scc-to-user privileged system:serviceaccount:: 10 | oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:default 11 | 12 | # Import the template and create the DaemonSet 13 | oc create -f node-exporter-template.yaml 14 | oc new-app --template=node-exporter 15 | 16 | -------------------------------------------------------------------------------- /node-exporter/update_firewall.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - hosts: nodes 3 | remote_user: ec2_user 4 | become: yes 5 | become_user: root 6 | tasks: 7 | # Open Firewall Port 9100 for future sessions by adding the rule to 8 | # the iptables file. 9 | - name: Open Firewall port 9100 for future sessions 10 | lineinfile: 11 | dest: /etc/sysconfig/iptables 12 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 13 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT' 14 | state: present 15 | # Open Firewall Port 9100 for current session by adding the rule to the 16 | # current iptables configuration. We won't need to restart the iptables 17 | # service - which will ensure all OpenShift rules stay in place.- name: Restart IP Tables service 18 | - name: Open Firewall Port 9100 for current session 19 | iptables: 20 | action: insert 21 | protocol: tcp 22 | destination_port: 9100 23 | state: present 24 | chain: OS_FIREWALL_ALLOW 25 | jump: ACCEPT 26 | -------------------------------------------------------------------------------- /prometheus-image/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM docker.io/centos:7 2 | LABEL maintainer="Wolfgang Kulhanek " 3 | 4 | ENV PROMETHEUS_VERSION=2.8.0 5 | 6 | RUN yum -y update && yum -y upgrade && \ 7 | yum -y clean all && \ 8 | rm -rf /var/cache/yum && \ 9 | curl -L -o /tmp/prometheus.tar.gz https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz && \ 10 | tar -xmzf /tmp/prometheus.tar.gz && \ 11 | mv ./prometheus-$PROMETHEUS_VERSION.linux-amd64 /usr/share/prometheus && \ 12 | ln -s /usr/share/prometheus/prometheus /bin/prometheus && \ 13 | ln -s /usr/share/prometheus/promtool /bin/promtool && \ 14 | mkdir -p /etc/prometheus && \ 15 | mkdir -p /prometheus && \ 16 | chmod 777 /prometheus && \ 17 | rm /tmp/prometheus.tar.gz 18 | 19 | COPY config.yml /etc/prometheus/prometheus.yml 20 | 21 | EXPOSE 9090 22 | USER nobody 23 | VOLUME [ "/prometheus" ] 24 | WORKDIR /prometheus 25 | ENTRYPOINT [ "/bin/prometheus" ] 26 | CMD [ "--config.file=/etc/prometheus/prometheus.yml", \ 27 | "--web.listen-address=:9090", \ 28 | "--web.console.templates=/usr/share/prometheus/consoles", \ 29 | "--web.console.libraries=/usr/share/prometheus/console_libraries", \ 30 | "--storage.tsdb.path=/prometheus", \ 31 | "--storage.tsdb.retention=24h", \ 32 | "--storage.tsdb.min-block-duration=15m", \ 33 | "--storage.tsdb.max-block-duration=60m" \ 34 | ] 35 | -------------------------------------------------------------------------------- /prometheus-image/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export VERSION=2.8.0 3 | docker build . 
-t wkulhanek/prometheus:latest 4 | docker tag wkulhanek/prometheus:latest wkulhanek/prometheus:${VERSION} 5 | docker push wkulhanek/prometheus:latest 6 | docker push wkulhanek/prometheus:${VERSION} 7 | -------------------------------------------------------------------------------- /prometheus-image/config.yml: -------------------------------------------------------------------------------- 1 | rule_files: 2 | - 'prometheus.rules' 3 | 4 | # A scrape configuration for running Prometheus on a Kubernetes cluster. 5 | # This uses separate scrape configs for cluster components (i.e. API server, node) 6 | # and services to allow each to use different authentication configs. 7 | # 8 | # Kubernetes labels will be added as Prometheus labels on metrics via the 9 | # `labelmap` relabeling action. 10 | 11 | # Scrape config for API servers. 12 | # 13 | # Kubernetes exposes API servers as endpoints to the default/kubernetes 14 | # service so this uses `endpoints` role and uses relabelling to only keep 15 | # the endpoints associated with the default/kubernetes service using the 16 | # default named port `https`. This works for single API server deployments as 17 | # well as HA API server deployments. 18 | scrape_configs: 19 | - job_name: 'kubernetes-apiservers' 20 | 21 | kubernetes_sd_configs: 22 | - role: endpoints 23 | 24 | scheme: https 25 | tls_config: 26 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 27 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 28 | 29 | # Keep only the default/kubernetes service endpoints for the https port. This 30 | # will add targets for each API server which Kubernetes adds an endpoint to 31 | # the default/kubernetes service. 32 | relabel_configs: 33 | - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 34 | action: keep 35 | regex: default;kubernetes;https 36 | 37 | # Scrape config for nodes. 38 | # 39 | # Each node exposes a /metrics endpoint that contains operational metrics for 40 | # the Kubelet and other components. 41 | - job_name: 'kubernetes-nodes' 42 | 43 | scheme: https 44 | tls_config: 45 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 46 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 47 | 48 | kubernetes_sd_configs: 49 | - role: node 50 | 51 | relabel_configs: 52 | - action: labelmap 53 | regex: __meta_kubernetes_node_label_(.+) 54 | 55 | # Scrape config for controllers. 56 | # 57 | # Each master node exposes a /metrics endpoint on :8444 that contains operational metrics for 58 | # the controllers. 59 | # 60 | # TODO: move this to a pure endpoints based metrics gatherer when controllers are exposed via 61 | # endpoints. 62 | - job_name: 'kubernetes-controllers' 63 | 64 | scheme: https 65 | tls_config: 66 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 67 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 68 | 69 | kubernetes_sd_configs: 70 | - role: endpoints 71 | 72 | # Keep only the default/kubernetes service endpoints for the https port, and then 73 | # set the port to 8444. This is the default configuration for the controllers on OpenShift 74 | # masters. 
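# Illustrative example (192.0.2.10 is a placeholder address): an endpoint
# address of 192.0.2.10:443 is kept by the first rule below and rewritten by
# the second rule to 192.0.2.10:8444, which is where the controllers serve
# /metrics.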
75 | relabel_configs: 76 | - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 77 | action: keep 78 | regex: default;kubernetes;https 79 | - source_labels: [__address__] 80 | action: replace 81 | target_label: __address__ 82 | regex: (.+)(?::\d+) 83 | replacement: $1:8444 84 | 85 | # Scrape config for cAdvisor. 86 | # 87 | # Beginning in Kube 1.7, each node exposes a /metrics/cadvisor endpoint that 88 | # reports container metrics for each running pod. Scrape those by default. 89 | - job_name: 'kubernetes-cadvisor' 90 | 91 | scheme: https 92 | tls_config: 93 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 94 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 95 | 96 | metrics_path: /metrics/cadvisor 97 | 98 | kubernetes_sd_configs: 99 | - role: node 100 | 101 | relabel_configs: 102 | - action: labelmap 103 | regex: __meta_kubernetes_node_label_(.+) 104 | 105 | # Scrape config for service endpoints. 106 | # 107 | # The relabeling allows the actual service scrape endpoint to be configured 108 | # via the following annotations: 109 | # 110 | # * `prometheus.io/scrape`: Only scrape services that have a value of `true` 111 | # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need 112 | # to set this to `https` & most likely set the `tls_config` of the scrape config. 113 | # * `prometheus.io/path`: If the metrics path is not `/metrics` override this. 114 | # * `prometheus.io/port`: If the metrics are exposed on a different port to the 115 | # service then set this appropriately. 116 | - job_name: 'kubernetes-service-endpoints' 117 | 118 | tls_config: 119 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 120 | # TODO: this should be per target 121 | insecure_skip_verify: true 122 | 123 | kubernetes_sd_configs: 124 | - role: endpoints 125 | 126 | 127 | relabel_configs: 128 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] 129 | action: keep 130 | regex: true 131 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] 132 | action: replace 133 | target_label: __scheme__ 134 | regex: (https?) 
135 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] 136 | action: replace 137 | target_label: __metrics_path__ 138 | regex: (.+) 139 | - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] 140 | action: replace 141 | target_label: __address__ 142 | regex: (.+)(?::\d+);(\d+) 143 | replacement: $1:$2 144 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_username] 145 | action: replace 146 | target_label: __basic_auth_username__ 147 | regex: (.+) 148 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_password] 149 | action: replace 150 | target_label: __basic_auth_password__ 151 | regex: (.+) 152 | - action: labelmap 153 | regex: __meta_kubernetes_service_label_(.+) 154 | - source_labels: [__meta_kubernetes_namespace] 155 | action: replace 156 | target_label: kubernetes_namespace 157 | - source_labels: [__meta_kubernetes_service_name] 158 | action: replace 159 | target_label: kubernetes_name 160 | 161 | alerting: 162 | alertmanagers: 163 | - scheme: http 164 | static_configs: 165 | - targets: 166 | - "localhost:9093" -------------------------------------------------------------------------------- /prometheus.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Template 3 | metadata: 4 | name: prometheus 5 | annotations: 6 | "openshift.io/display-name": Prometheus 7 | description: | 8 | A monitoring solution for an OpenShift cluster - collect and gather metrics from nodes, services, and the infrastructure. 9 | iconClass: icon-cogs 10 | tags: "monitoring,prometheus,time-series" 11 | parameters: 12 | - description: The namespace to instantiate prometheus under. Defaults to 'prometheus'. 13 | name: NAMESPACE 14 | value: prometheus 15 | - description: The location of the prometheus image 16 | name: IMAGE_PROMETHEUS 17 | value: wkulhanek/prometheus:latest 18 | - description: The scheme to communicate with the Alertmanager. Defaults to 'http'. 19 | name: ALERT_MANAGER_SCHEME 20 | value: http 21 | - description: Alertmanager Hostname and Port. Defaults to 'alertmanager:9093'. 
22 | name: ALERT_MANAGER_HOST_PORT 23 | value: alertmanager:9093 24 | - description: Router Password (oc set env dc router -n default --list|grep STATS_PASSWORD) 25 | name: ROUTER_PASSWORD 26 | required: true 27 | 28 | objects: 29 | 30 | - apiVersion: v1 31 | kind: ServiceAccount 32 | metadata: 33 | name: prometheus 34 | 35 | - apiVersion: v1 36 | kind: ClusterRoleBinding 37 | metadata: 38 | name: prometheus-cluster-reader 39 | roleRef: 40 | name: cluster-reader 41 | subjects: 42 | - kind: ServiceAccount 43 | name: prometheus 44 | namespace: ${NAMESPACE} 45 | 46 | - apiVersion: v1 47 | kind: Route 48 | metadata: 49 | name: prometheus 50 | spec: 51 | to: 52 | name: prometheus 53 | 54 | - apiVersion: v1 55 | kind: Service 56 | metadata: 57 | annotations: 58 | prometheus.io/scrape: "true" 59 | prometheus.io/scheme: http 60 | labels: 61 | name: prometheus 62 | name: prometheus 63 | spec: 64 | ports: 65 | - name: prometheus 66 | port: 9090 67 | protocol: TCP 68 | targetPort: 9090 69 | selector: 70 | app: prometheus 71 | 72 | - apiVersion: v1 73 | kind: DeploymentConfig 74 | metadata: 75 | name: prometheus 76 | labels: 77 | app: prometheus 78 | spec: 79 | replicas: 1 80 | selector: 81 | app: prometheus 82 | deploymentconfig: prometheus 83 | template: 84 | metadata: 85 | labels: 86 | app: prometheus 87 | deploymentconfig: prometheus 88 | name: prometheus 89 | spec: 90 | serviceAccountName: prometheus 91 | securityContext: 92 | privileged: true 93 | nodeSelector: 94 | prometheus-host: "true" 95 | containers: 96 | - name: prometheus 97 | args: 98 | - --config.file=/etc/prometheus/prometheus.yml 99 | - --web.listen-address=:9090 100 | - --storage.tsdb.retention=6h 101 | - --storage.tsdb.min-block-duration=15m 102 | - --storage.tsdb.max-block-duration=60m 103 | image: ${IMAGE_PROMETHEUS} 104 | imagePullPolicy: IfNotPresent 105 | livenessProbe: 106 | failureThreshold: 3 107 | httpGet: 108 | path: /status 109 | port: 9090 110 | scheme: HTTP 111 | initialDelaySeconds: 2 112 | periodSeconds: 10 113 | successThreshold: 1 114 | timeoutSeconds: 1 115 | readinessProbe: 116 | failureThreshold: 3 117 | httpGet: 118 | path: /status 119 | port: 9090 120 | scheme: HTTP 121 | initialDelaySeconds: 2 122 | periodSeconds: 10 123 | successThreshold: 1 124 | timeoutSeconds: 1 125 | # Set Requests & Limits 126 | # Prometheus uses 2Gi memory by default with 50% headroom 127 | # required. 
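# Note: requests are set equal to limits below, which gives the pod the
# Guaranteed QoS class and makes it the last candidate for eviction under node
# memory pressure. This is useful because the pod stores its data on a host
# path on the infranode.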
128 | resources: 129 | requests: 130 | cpu: 500m 131 | memory: 3Gi 132 | limits: 133 | cpu: 500m 134 | memory: 3Gi 135 | volumeMounts: 136 | - mountPath: /etc/prometheus 137 | name: config-volume 138 | - mountPath: /prometheus 139 | name: data-volume 140 | - mountPath: /etc/prometheus-rules 141 | name: rules-volume 142 | restartPolicy: Always 143 | volumes: 144 | - name: data-volume 145 | hostPath: 146 | path: /var/lib/prometheus-data 147 | type: Directory 148 | - name: config-volume 149 | configMap: 150 | defaultMode: 420 151 | name: prometheus 152 | - name: rules-volume 153 | configMap: 154 | defaultMode: 420 155 | name: prometheus-rules 156 | 157 | - apiVersion: v1 158 | kind: ConfigMap 159 | metadata: 160 | name: prometheus 161 | data: 162 | prometheus.yml: | 163 | global: 164 | scrape_interval: 1m 165 | scrape_timeout: 10s 166 | evaluation_interval: 1m 167 | alerting: 168 | alertmanagers: 169 | - scheme: ${ALERT_MANAGER_SCHEME} 170 | static_configs: 171 | - targets: 172 | - "${ALERT_MANAGER_HOST_PORT}" 173 | 174 | rule_files: 175 | - /etc/prometheus-rules/*.rules 176 | 177 | # A scrape configuration for running Prometheus on a Kubernetes cluster. 178 | # This uses separate scrape configs for cluster components (i.e. API server, node) 179 | # and services to allow each to use different authentication configs. 180 | # 181 | # Kubernetes labels will be added as Prometheus labels on metrics via the 182 | # `labelmap` relabeling action. 183 | 184 | scrape_configs: 185 | # Scrape config for API servers. 186 | # 187 | # Kubernetes exposes API servers as endpoints to the default/kubernetes 188 | # service so this uses `endpoints` role and uses relabelling to only keep 189 | # the endpoints associated with the default/kubernetes service using the 190 | # default named port `https`. This works for single API server deployments as 191 | # well as HA API server deployments. 192 | - job_name: kubernetes-apiservers 193 | scrape_interval: 1m 194 | scrape_timeout: 10s 195 | metrics_path: /metrics 196 | scheme: https 197 | kubernetes_sd_configs: 198 | - api_server: null 199 | role: endpoints 200 | namespaces: 201 | names: [] 202 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 203 | tls_config: 204 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 205 | insecure_skip_verify: false 206 | relabel_configs: 207 | - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 208 | separator: ; 209 | regex: default;kubernetes;https 210 | replacement: $1 211 | action: keep 212 | 213 | # Scrape config for controllers. 214 | # 215 | # Each master node exposes a /metrics endpoint on :8444 that contains operational metrics for 216 | # the controllers. 217 | # 218 | # TODO: move this to a pure endpoints based metrics gatherer when controllers are exposed via 219 | # endpoints. 220 | - job_name: kubernetes-controllers 221 | scrape_interval: 1m 222 | scrape_timeout: 10s 223 | metrics_path: /metrics 224 | scheme: https 225 | kubernetes_sd_configs: 226 | - api_server: null 227 | role: endpoints 228 | namespaces: 229 | names: [] 230 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 231 | tls_config: 232 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 233 | insecure_skip_verify: false 234 | 235 | # Keep only the default/kubernetes service endpoints for the https port, and then 236 | # set the port to 8444. This is the default configuration for the controllers on OpenShift 237 | # masters. 
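# (For a quick manual check, assuming access to a master and a sufficiently
# privileged token, something like
#   curl -k -H "Authorization: Bearer $(oc whoami -t)" https://<master>:8444/metrics
# should return the controller metrics.)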
238 | relabel_configs: 239 | - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 240 | separator: ; 241 | regex: default;kubernetes;https 242 | replacement: $1 243 | action: keep 244 | - source_labels: [__address__] 245 | separator: ; 246 | regex: (.+)(?::\d+) 247 | target_label: __address__ 248 | replacement: $1:8444 249 | action: replace 250 | 251 | # Scrape config for nodes. 252 | # 253 | # Each node exposes a /metrics endpoint that contains operational metrics for 254 | # the Kubelet and other components. 255 | - job_name: kubernetes-nodes 256 | scrape_interval: 1m 257 | scrape_timeout: 10s 258 | metrics_path: /metrics 259 | scheme: https 260 | kubernetes_sd_configs: 261 | - api_server: null 262 | role: node 263 | namespaces: 264 | names: [] 265 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 266 | tls_config: 267 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 268 | insecure_skip_verify: false 269 | relabel_configs: 270 | - separator: ; 271 | regex: __meta_kubernetes_node_label_(.+) 272 | replacement: $1 273 | action: labelmap 274 | 275 | # Scrape config for cAdvisor. 276 | # 277 | # Beginning in Kube 1.7, each node exposes a /metrics/cadvisor endpoint that 278 | # reports container metrics for each running pod. Scrape those by default. 279 | - job_name: kubernetes-cadvisor 280 | scrape_interval: 1m 281 | scrape_timeout: 10s 282 | metrics_path: /metrics/cadvisor 283 | scheme: https 284 | kubernetes_sd_configs: 285 | - api_server: null 286 | role: node 287 | namespaces: 288 | names: [] 289 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 290 | tls_config: 291 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 292 | insecure_skip_verify: false 293 | relabel_configs: 294 | - separator: ; 295 | regex: __meta_kubernetes_node_label_(.+) 296 | replacement: $1 297 | action: labelmap 298 | 299 | # Scrape config for service endpoints. 300 | # 301 | # The relabeling allows the actual service scrape endpoint to be configured 302 | # via the following annotations: 303 | # 304 | # * `prometheus.io/scrape`: Only scrape services that have a value of `true` 305 | # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need 306 | # to set this to `https` & most likely set the `tls_config` of the scrape config. 307 | # * `prometheus.io/path`: If the metrics path is not `/metrics` override this. 308 | # * `prometheus.io/port`: If the metrics are exposed on a different port to the 309 | # service then set this appropriately. 310 | - job_name: kubernetes-service-endpoints 311 | scrape_interval: 1m 312 | scrape_timeout: 10s 313 | metrics_path: /metrics 314 | scheme: http 315 | kubernetes_sd_configs: 316 | - api_server: null 317 | role: endpoints 318 | namespaces: 319 | names: [] 320 | tls_config: 321 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 322 | insecure_skip_verify: true 323 | relabel_configs: 324 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] 325 | separator: ; 326 | regex: "true" 327 | replacement: $1 328 | action: keep 329 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] 330 | separator: ; 331 | regex: (https?) 
332 | target_label: __scheme__ 333 | replacement: $1 334 | action: replace 335 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] 336 | separator: ; 337 | regex: (.+) 338 | target_label: __metrics_path__ 339 | replacement: $1 340 | action: replace 341 | - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] 342 | separator: ; 343 | regex: (.+)(?::\d+);(\d+) 344 | target_label: __address__ 345 | replacement: $1:$2 346 | action: replace 347 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_username] 348 | separator: ; 349 | regex: (.+) 350 | target_label: __basic_auth_username__ 351 | replacement: $1 352 | action: replace 353 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_password] 354 | separator: ; 355 | regex: (.+) 356 | target_label: __basic_auth_password__ 357 | replacement: $1 358 | action: replace 359 | - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] 360 | separator: ; 361 | regex: (.+)(?::\d+);(\d+) 362 | target_label: __address__ 363 | replacement: $1:$2 364 | action: replace 365 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_username] 366 | separator: ; 367 | regex: (.+) 368 | target_label: __basic_auth_username__ 369 | replacement: $1 370 | action: replace 371 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_password] 372 | separator: ; 373 | regex: (.+) 374 | target_label: __basic_auth_password__ 375 | replacement: $1 376 | action: replace 377 | - separator: ; 378 | regex: __meta_kubernetes_service_label_(.+) 379 | replacement: $1 380 | action: labelmap 381 | - source_labels: [__meta_kubernetes_namespace] 382 | separator: ; 383 | regex: (.*) 384 | target_label: kubernetes_namespace 385 | replacement: $1 386 | action: replace 387 | - source_labels: [__meta_kubernetes_service_name] 388 | separator: ; 389 | regex: (.*) 390 | target_label: kubernetes_name 391 | replacement: $1 392 | 393 | # Scrape config for node-exporter, which is expected to be running on port 9100.
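# The relabeling in this job rewrites each discovered node's kubelet address to
# the node-exporter port, e.g. a hypothetical node address ip-10-0-1-23:10250
# becomes ip-10-0-1-23:9100.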
394 | - job_name: node-exporters 395 | scrape_interval: 30s 396 | scrape_timeout: 30s 397 | metrics_path: /metrics 398 | scheme: http 399 | kubernetes_sd_configs: 400 | - api_server: null 401 | role: node 402 | namespaces: 403 | names: [] 404 | tls_config: 405 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 406 | insecure_skip_verify: true 407 | relabel_configs: 408 | - separator: ; 409 | regex: __meta_kubernetes_node_label_(.+) 410 | replacement: $1 411 | action: labelmap 412 | - source_labels: [__meta_kubernetes_role] 413 | separator: ; 414 | regex: (.*) 415 | target_label: kubernetes_role 416 | replacement: $1 417 | action: replace 418 | - source_labels: [__address__] 419 | separator: ; 420 | regex: (.*):10250 421 | target_label: __address__ 422 | replacement: ${1}:9100 423 | action: replace 424 | 425 | # Scrape config for the template service broker 426 | - job_name: 'openshift-template-service-broker' 427 | scheme: https 428 | tls_config: 429 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt 430 | server_name: apiserver.openshift-template-service-broker.svc 431 | bearer_token_file: /var/run/secrets/kubernetes.io/scraper/token 432 | kubernetes_sd_configs: 433 | - role: endpoints 434 | namespaces: 435 | names: 436 | - openshift-template-service-broker 437 | relabel_configs: 438 | - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 439 | action: keep 440 | regex: api-server;https 441 | 442 | - job_name: openshift-routers 443 | scrape_interval: 30s 444 | scrape_timeout: 30s 445 | metrics_path: /metrics 446 | scheme: http 447 | static_configs: 448 | - targets: 449 | - router.default.svc.cluster.local:1936 450 | basic_auth: 451 | username: admin 452 | password: ${ROUTER_PASSWORD} 453 | 454 | - apiVersion: v1 455 | kind: ConfigMap 456 | metadata: 457 | name: prometheus-rules 458 | data: 459 | alerting.rules: | 460 | groups: 461 | - name: example-rules 462 | interval: 30s # defaults to global interval 463 | rules: 464 | - alert: Node Down 465 | expr: up{job="kubernetes-nodes"} == 0 466 | annotations: 467 | miqTarget: "ContainerNode" 468 | severity: "HIGH" 469 | message: "{{$labels.instance}} is down" 470 | 471 | recording.rules: | 472 | groups: 473 | - name: aggregate_container_resources 474 | rules: 475 | - record: container_cpu_usage_rate 476 | expr: sum without (cpu) (rate(container_cpu_usage_seconds_total[5m])) 477 | - record: container_memory_rss_by_type 478 | expr: container_memory_rss{id=~"/|/system.slice|/kubepods.slice"} > 0 479 | - record: container_cpu_usage_percent_by_host 480 | expr: sum by (kubernetes_io_hostname,type)(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / on (kubernetes_io_hostname,type) machine_cpu_cores 481 | - record: apiserver_request_count_rate_by_resources 482 | expr: sum without (client,instance,contentType) (rate(apiserver_request_count[5m])) 483 | -------------------------------------------------------------------------------- /setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | oc project prometheus 3 | if [ "$?" 
!= "0" ]; then 4 | oc new-project prometheus --display-name="Prometheus Monitoring" 5 | oc annotate namespace prometheus openshift.io/node-selector="" 6 | fi 7 | oc new-app -f prometheus.yaml --param ROUTER_PASSWORD=$(oc set env dc router -n default --list|grep STATS_PASSWORD|awk -F"=" '{print $2}') 8 | oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:prometheus 9 | -------------------------------------------------------------------------------- /setup_infranodes.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - hosts: localhost 3 | gather_facts: false 4 | tasks: 5 | # Find out which nodes are Infranodes and add them to a new group: infranodes 6 | - name: Add Infranodes to a new infranodes Group 7 | add_host: 8 | name: "{{ item }}" 9 | groups: infranodes 10 | with_items: "{{ groups['nodes'] }}" 11 | when: 12 | - item | match("^infranode.*") 13 | - name: Add Masters to a new masters Group 14 | add_host: 15 | name: "{{ item }}" 16 | groups: masters 17 | with_items: "{{ groups['nodes'] }}" 18 | when: 19 | - item | match("^master.*") 20 | 21 | - hosts: infranodes 22 | remote_user: ec2_user 23 | become: yes 24 | become_user: root 25 | tasks: 26 | # OpenShift Routers expose /metrics on port 1936. Therefore we need to open 27 | # the port for both future and current sessions so that Prometheus can access 28 | # the router metrics. 29 | # Open Firewall Port 1936 for future sessions by adding the rule to 30 | # the iptables file. 31 | - name: Open Firewall port 1936 for future sessions 32 | lineinfile: 33 | dest: /etc/sysconfig/iptables 34 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 35 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 1936 -j ACCEPT' 36 | state: present 37 | # Open Firewall Port 1936 for current session by adding the rule to the 38 | # current iptables configuration. We won't need to restart the iptables 39 | # service - which will ensure all OpenShift rules stay in place. 40 | - name: Open Firewall Port 1936 for current session 41 | iptables: 42 | action: insert 43 | protocol: tcp 44 | destination_port: 1936 45 | state: present 46 | chain: OS_FIREWALL_ALLOW 47 | jump: ACCEPT 48 | # Create Directory /var/lib/prometheus-data with correct permissions 49 | # Make sure the directory has SELinux Type svirt_sandbox_file_t otherwise 50 | # there is a permissions problem trying to mount it into the pod. 51 | - name: Create directory /var/lib/prometheus-data 52 | file: 53 | path: /var/lib/prometheus-data 54 | state: directory 55 | group: root 56 | owner: root 57 | mode: 0777 58 | setype: svirt_sandbox_file_t 59 | 60 | # Add label "prometheus-host=true" to our infranodes 61 | # Do that via the oc label command from the first master host 62 | - hosts: masters[0] 63 | remote_user: ec2_user 64 | become: yes 65 | become_user: root 66 | tasks: 67 | - name: Label Infranodes with prometheus-host=true 68 | shell: oc label node {{ item }} prometheus-host=true --overwrite 69 | with_items: "{{ groups['infranodes'] }}" 70 | --------------------------------------------------------------------------------