├── README.adoc ├── alertmanager ├── Dockerfile ├── README.adoc ├── alertmanager.yaml ├── build.sh └── config.yml ├── install_everything.yml ├── node-exporter ├── Dockerfile ├── README.adoc ├── build.sh ├── dockerinfo │ ├── DockerStatus.json │ ├── README.adoc │ ├── dockerinfo.cron │ ├── dockerinfo.sh │ └── install_dockerinfo.yml ├── native │ ├── README.adoc │ ├── install_node-exporter.yml │ └── node_exporter.conf ├── node-exporter.yaml ├── setup.sh └── update_firewall.yml ├── prometheus-image ├── Dockerfile ├── build.sh └── config.yml ├── prometheus.yaml ├── setup.sh └── setup_infranodes.yml /README.adoc: -------------------------------------------------------------------------------- 1 | # Prometheus on OpenShift 2 | 3 | This repository contains definitions and tools to run Prometheus and its associated ecosystem on Red Hat OpenShift. 4 | 5 | ## Components 6 | 7 | The following components are available: 8 | 9 | * link:https://prometheus.io/docs/introduction/overview/[Prometheus] 10 | * link:https://prometheus.io/docs/instrumenting/exporters/[node-exporter] 11 | * link:https://prometheus.io/docs/alerting/alertmanager/[Alertmanager] 12 | 13 | ## Project Organization 14 | 15 | A new project called _prometheus_ will be created to contain the entire ecosystem. 16 | 17 | Execute the following command to create the project: 18 | 19 | [source,bash] 20 | ---- 21 | oc new-project prometheus --display-name="Prometheus Monitoring" 22 | ---- 23 | 24 | Make sure that there is not a default node selector on the project: 25 | 26 | [source,bash] 27 | ---- 28 | oc annotate namespace prometheus openshift.io/node-selector="" 29 | ---- 30 | 31 | ## Deploy Prometheus 32 | 33 | Starting with OpenShift 3.6 the OpenShift routers expose a metrics endpoint on port 1936. For Prometheus to be able to monitor the routers this port needs to be open. 34 | 35 | Additionally Prometheus does not work with remote volumes (NFS, EBS, ...) but needs local disk storage as well. This means we need to create a directory on (one of) the infranodes. The Prometheus template includes a Node Selector `prometheus-host=true` - so we need to set the correct label on the infranode(s) as well. 36 | 37 | Run the following Ansible playbook to configure the infranodes: 38 | 39 | [source,bash] 40 | ---- 41 | ansible-playbook -i /etc/ansible/hosts ./setup_infranodes.yml 42 | ---- 43 | 44 | The router also requires basic authentication to be allowed to scrape the metrics. Find the router password by executing the following command: 45 | 46 | [source,bash] 47 | ---- 48 | oc set env dc router -n default --list|grep STATS_PASSWORD|awk -F"=" '{print $2}' 49 | ---- 50 | 51 | An OpenShift template has been provided to streamline the deployment to OpenShift. 
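Before instantiating the template you can optionally verify that the retrieved credentials work against the router's metrics endpoint. The sketch below assumes the router's default stats user (taken from the `STATS_USERNAME` environment variable) and uses `infranode1.example.com` as a stand-in for one of your infranodes; substitute the values for your environment:

[source,bash]
----
# Retrieve the router stats credentials from the router DeploymentConfig
ROUTER_USER=$(oc set env dc router -n default --list | grep STATS_USERNAME | awk -F"=" '{print $2}')
ROUTER_PASSWORD=$(oc set env dc router -n default --list | grep STATS_PASSWORD | awk -F"=" '{print $2}')

# The endpoint should return Prometheus-formatted metrics (e.g. haproxy_* series)
curl -s -u ${ROUTER_USER}:${ROUTER_PASSWORD} http://infranode1.example.com:1936/metrics | head
----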
52 | 53 | Execute the following command to instantiate the Prometheus template using the previously retrieved router password as a parameter: 54 | 55 | [source,bash] 56 | ---- 57 | oc new-app -f prometheus.yaml --param ROUTER_PASSWORD=<password> 58 | ---- 59 | 60 | Since Prometheus needs to use a local disk to write its metrics, add the `privileged` SCC to the *prometheus* service account: 61 | 62 | [source,bash] 63 | ---- 64 | oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:prometheus 65 | ---- 66 | 67 | Make sure your Prometheus pod is running (on an Infranode): 68 | 69 | [source,bash] 70 | ---- 71 | oc get pod -o wide 72 | ---- 73 | 74 | ## Next Steps 75 | 76 | Please refer to the following to enhance the functionality of Prometheus: 77 | 78 | * link:alertmanager[Alertmanager] 79 | * link:node-exporter[Node exporter] 80 | * link:https://github.com/wkulhanek/docker-openshift-grafana[Grafana] 81 | 82 | ## Cleanup 83 | 84 | Delete the project and the cluster-reader binding (which gets created by the template but doesn't get deleted as part of the project): 85 | 86 | [source,bash] 87 | ---- 88 | oc delete project prometheus 89 | oc delete clusterrolebinding prometheus-cluster-reader 90 | oc adm policy remove-scc-from-user privileged system:serviceaccount:prometheus:prometheus 91 | ---- 92 | 93 | You will also need to clean up the directory /var/lib/prometheus-data on the Infranode(s) and remove the label `prometheus-host=true` from the Infranode(s). 94 | -------------------------------------------------------------------------------- /alertmanager/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM docker.io/centos:7 2 | LABEL maintainer="Wolfgang Kulhanek " 3 | 4 | ENV ALERT_MANAGER_VERSION=0.16.1 5 | 6 | RUN yum -y update && yum -y upgrade && \ 7 | yum -y clean all && \ 8 | curl -L -o /tmp/alert_manager.tar.gz https://github.com/prometheus/alertmanager/releases/download/v$ALERT_MANAGER_VERSION/alertmanager-$ALERT_MANAGER_VERSION.linux-amd64.tar.gz && \ 9 | tar -xzf /tmp/alert_manager.tar.gz && \ 10 | mv ./alertmanager-$ALERT_MANAGER_VERSION.linux-amd64/alertmanager /bin && \ 11 | rm -rf ./alertmanager-$ALERT_MANAGER_VERSION.linux-amd64 && \ 12 | rm /tmp/alert_manager.tar.gz 13 | 14 | COPY config.yml /etc/alertmanager/config.yml 15 | 16 | EXPOSE 9093 17 | USER nobody 18 | VOLUME [ "/alertmanager" ] 19 | WORKDIR /alertmanager 20 | ENTRYPOINT [ "/bin/alertmanager" ] 21 | CMD [ "--config.file=/etc/alertmanager/config.yml", \ 22 | "--storage.path=/alertmanager" ] 23 | -------------------------------------------------------------------------------- /alertmanager/README.adoc: -------------------------------------------------------------------------------- 1 | # OpenShift Prometheus Alert Manager 2 | 3 | The Alert Manager will be connected to Prometheus and send alerts when specific alerting rules fire. 4 | 5 | The example implementation makes use of a link:https://prometheus.io/docs/alerting/configuration/#[webhook] to notify a target system. 6 | 7 | The Alertmanager can be configured to support other or additional alerting mechanisms by editing the ConfigMap `alertmanager` either in the template or after the Alertmanager has been deployed (in that case it needs to be restarted to pick up the new configuration). 
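For example, to change the configuration after deployment, you can edit the ConfigMap and trigger a new deployment so the pod picks up the new file (a minimal sketch; the object names match the template in this directory and the _prometheus_ project used throughout this repository):

[source,bash]
----
# Edit the Alertmanager configuration stored in the ConfigMap
oc edit configmap alertmanager -n prometheus

# Roll out a new deployment so the Alertmanager restarts with the new configuration
oc rollout latest dc/alertmanager -n prometheus
----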
8 | 9 | ## Deployment to OpenShift 10 | 11 | Use the following steps to deploy Alertmanager to OpenShift: 12 | 13 | [source,bash] 14 | ---- 15 | oc new-app -f alertmanager.yaml -p VOLUME_CAPACITY=4Gi -p "WEBHOOK_URL=" 16 | ---- 17 | 18 | NOTE: Be sure to substitute the value of the Webhook URL when instantiating the template. 19 | -------------------------------------------------------------------------------- /alertmanager/alertmanager.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Template 3 | metadata: 4 | name: alertmanager 5 | annotations: 6 | "openshift.io/display-name": Prometheus Alert Manager 7 | description: | 8 | A monitoring solution for an OpenShift cluster - collect and gather metrics from nodes, services, and the infrastructure. This component provides the Alert Manager. 9 | iconClass: icon-cogs 10 | tags: "monitoring,prometheus,alertmanager,time-series" 11 | parameters: 12 | - description: The location of the prometheus image 13 | name: IMAGE_ALERTMANAGER 14 | value: wkulhanek/alertmanager:latest 15 | - name: VOLUME_CAPACITY 16 | displayName: Volume Capacity 17 | description: Volume space available for data, e.g. 512Mi, 2Gi. 18 | value: 4Gi 19 | required: true 20 | - name: WEBHOOK_URL 21 | displayName: Webhook URL 22 | description: URL for the Webhook to send alerts. 23 | required: true 24 | objects: 25 | - apiVersion: v1 26 | kind: Route 27 | metadata: 28 | name: alertmanager 29 | spec: 30 | to: 31 | name: alertmanager 32 | - apiVersion: v1 33 | kind: Service 34 | metadata: 35 | annotations: 36 | prometheus.io/scrape: "true" 37 | prometheus.io/scheme: http 38 | labels: 39 | name: alertmanager 40 | name: alertmanager 41 | spec: 42 | ports: 43 | - name: alertmanager 44 | port: 9093 45 | protocol: TCP 46 | targetPort: 9093 47 | selector: 48 | app: alertmanager 49 | - apiVersion: v1 50 | kind: DeploymentConfig 51 | metadata: 52 | labels: 53 | app: alertmanager 54 | name: alertmanager 55 | spec: 56 | replicas: 1 57 | selector: 58 | app: alertmanager 59 | deploymentconfig: alertmanager 60 | template: 61 | metadata: 62 | labels: 63 | app: alertmanager 64 | deploymentconfig: alertmanager 65 | name: alertmanager 66 | spec: 67 | containers: 68 | - name: alertmanager 69 | args: 70 | - --config.file=/etc/alertmanager/config.yml 71 | - --storage.path=/alertmanager 72 | image: ${IMAGE_ALERTMANAGER} 73 | imagePullPolicy: IfNotPresent 74 | livenessProbe: 75 | failureThreshold: 3 76 | httpGet: 77 | path: /#/status 78 | port: 9093 79 | scheme: HTTP 80 | initialDelaySeconds: 2 81 | periodSeconds: 10 82 | successThreshold: 1 83 | timeoutSeconds: 1 84 | readinessProbe: 85 | failureThreshold: 3 86 | httpGet: 87 | path: /#/status 88 | port: 9093 89 | scheme: HTTP 90 | initialDelaySeconds: 2 91 | periodSeconds: 10 92 | successThreshold: 1 93 | timeoutSeconds: 1 94 | 95 | volumeMounts: 96 | - mountPath: /etc/alertmanager 97 | name: config-volume 98 | - mountPath: /alertmanager 99 | name: data-volume 100 | restartPolicy: Always 101 | volumes: 102 | - name: data-volume 103 | persistentVolumeClaim: 104 | claimName: alertmanager-data-pvc 105 | - configMap: 106 | name: alertmanager 107 | name: config-volume 108 | - apiVersion: v1 109 | kind: PersistentVolumeClaim 110 | metadata: 111 | name: alertmanager-data-pvc 112 | spec: 113 | accessModes: 114 | - ReadWriteMany 115 | resources: 116 | requests: 117 | storage: "${VOLUME_CAPACITY}" 118 | - apiVersion: v1 119 | kind: ConfigMap 120 | metadata: 121 | name: alertmanager 122 | data: 123 | 
config.yml: | 124 | global: 125 | 126 | # The root route on which each incoming alert enters. 127 | route: 128 | # The root route must not have any matchers as it is the entry point for 129 | # all alerts. It needs to have a receiver configured so alerts that do not 130 | # match any of the sub-routes are sent to someone. 131 | receiver: 'webhook' 132 | 133 | # The labels by which incoming alerts are grouped together. For example, 134 | # multiple alerts coming in for cluster=A and alertname=LatencyHigh would 135 | # be batched into a single group. 136 | group_by: ['alertname', 'cluster'] 137 | 138 | # When a new group of alerts is created by an incoming alert, wait at 139 | # least 'group_wait' to send the initial notification. 140 | # This way ensures that you get multiple alerts for the same group that start 141 | # firing shortly after another are batched together on the first 142 | # notification. 143 | group_wait: 30s 144 | 145 | # When the first notification was sent, wait 'group_interval' to send a batch 146 | # of new alerts that started firing for that group. 147 | group_interval: 5m 148 | 149 | # If an alert has successfully been sent, wait 'repeat_interval' to 150 | # resend them. 151 | repeat_interval: 3h 152 | 153 | receivers: 154 | - name: 'webhook' 155 | webhook_configs: 156 | - send_resolved: true 157 | url: '${WEBHOOK_URL}' 158 | -------------------------------------------------------------------------------- /alertmanager/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export VERSION=0.16.1 3 | docker build . -t wkulhanek/alertmanager:latest 4 | docker tag wkulhanek/alertmanager:latest wkulhanek/alertmanager:${VERSION} 5 | docker push wkulhanek/alertmanager:latest 6 | docker push wkulhanek/alertmanager:${VERSION} 7 | -------------------------------------------------------------------------------- /alertmanager/config.yml: -------------------------------------------------------------------------------- 1 | global: 2 | # The smarthost and SMTP sender used for mail notifications. 3 | smtp_smarthost: 'localhost:25' 4 | smtp_from: 'alertmanager@openshift.opentlc.com' 5 | 6 | # The root route on which each incoming alert enters. 7 | route: 8 | # The root route must not have any matchers as it is the entry point for 9 | # all alerts. It needs to have a receiver configured so alerts that do not 10 | # match any of the sub-routes are sent to someone. 11 | receiver: 'team-X-mails' 12 | 13 | # The labels by which incoming alerts are grouped together. For example, 14 | # multiple alerts coming in for cluster=A and alertname=LatencyHigh would 15 | # be batched into a single group. 16 | group_by: ['alertname', 'cluster'] 17 | 18 | # When a new group of alerts is created by an incoming alert, wait at 19 | # least 'group_wait' to send the initial notification. 20 | # This way ensures that you get multiple alerts for the same group that start 21 | # firing shortly after another are batched together on the first 22 | # notification. 23 | group_wait: 30s 24 | 25 | # When the first notification was sent, wait 'group_interval' to send a batch 26 | # of new alerts that started firing for that group. 27 | group_interval: 5m 28 | 29 | # If an alert has successfully been sent, wait 'repeat_interval' to 30 | # resend them. 31 | repeat_interval: 3h 32 | 33 | # All the above attributes are inherited by all child routes and can 34 | # overwritten on each. 35 | 36 | # The child route trees. 
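# Worked example (based solely on the routes defined below): an alert carrying
# the labels service="foo1", severity="critical" matches the first child route
# and then its 'critical' sub-route, so it is sent to 'team-X-pager'. The same
# alert with severity="warning" matches no sub-route and falls back to the
# child route's receiver 'team-X-mails'.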
37 | routes: 38 | # This routes performs a regular expression match on alert labels to 39 | # catch alerts that are related to a list of services. 40 | - match_re: 41 | service: ^(foo1|foo2|baz)$ 42 | receiver: team-X-mails 43 | 44 | # The service has a sub-route for critical alerts, any alerts 45 | # that do not match, i.e. severity != critical, fall-back to the 46 | # parent node and are sent to 'team-X-mails' 47 | routes: 48 | - match: 49 | severity: critical 50 | receiver: team-X-pager 51 | 52 | - match: 53 | service: files 54 | receiver: team-Y-mails 55 | 56 | routes: 57 | - match: 58 | severity: critical 59 | receiver: team-Y-pager 60 | 61 | # This route handles all alerts coming from a database service. If there's 62 | # no team to handle it, it defaults to the DB team. 63 | - match: 64 | service: database 65 | 66 | receiver: team-DB-pager 67 | # Also group alerts by affected database. 68 | group_by: [alertname, cluster, database] 69 | 70 | routes: 71 | - match: 72 | owner: team-X 73 | receiver: team-X-pager 74 | 75 | - match: 76 | owner: team-Y 77 | receiver: team-Y-pager 78 | 79 | 80 | # Inhibition rules allow to mute a set of alerts given that another alert is 81 | # firing. 82 | # We use this to mute any warning-level notifications if the same alert is 83 | # already critical. 84 | inhibit_rules: 85 | - source_match: 86 | severity: 'critical' 87 | target_match: 88 | severity: 'warning' 89 | # Apply inhibition if the alertname is the same. 90 | equal: ['alertname'] 91 | 92 | 93 | receivers: 94 | - name: 'team-X-mails' 95 | email_configs: 96 | - to: 'team-X+alerts@example.org' 97 | 98 | - name: 'team-X-pager' 99 | email_configs: 100 | - to: 'team-X+alerts-critical@example.org' 101 | pagerduty_configs: 102 | - service_key: 103 | 104 | - name: 'team-Y-mails' 105 | email_configs: 106 | - to: 'team-Y+alerts@example.org' 107 | 108 | - name: 'team-Y-pager' 109 | pagerduty_configs: 110 | - service_key: 111 | 112 | - name: 'team-DB-pager' 113 | pagerduty_configs: 114 | - service_key: 115 | -------------------------------------------------------------------------------- /install_everything.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # This Playbook sets up Prometheus on an OpenTLC Cluster 3 | - hosts: localhost 4 | gather_facts: false 5 | tasks: 6 | # Find out which nodes are Infranodes and add them to a new group: infranodes 7 | - name: Add Infranodes to a new infranodes Group 8 | add_host: 9 | name: "{{ item }}" 10 | groups: infranodes 11 | with_items: "{{ groups['nodes'] }}" 12 | when: 13 | - item | match("^infranode.*") 14 | - name: Add Masters to a new masters Group 15 | add_host: 16 | name: "{{ item }}" 17 | groups: masters 18 | with_items: "{{ groups['nodes'] }}" 19 | when: 20 | - item | match("^master.*") 21 | 22 | - hosts: infranodes[0] 23 | remote_user: ec2_user 24 | become: yes 25 | become_user: root 26 | tasks: 27 | # OpenShift Routers expose /metrics on port 1936. Therefore we need to open 28 | # the port for both future and current sessions so that Prometheus can access 29 | # the router metrics. 30 | # Open Firewall Port 1936 for future sessions by adding the rule to 31 | # the iptables file. 
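    # (Optional manual check, assuming shell access to the infranode: after this
    # play has run, `iptables -nL OS_FIREWALL_ALLOW | grep 1936` should show the
    # ACCEPT rule for the current session, and `grep 1936 /etc/sysconfig/iptables`
    # should show the persisted line.)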
32 | - name: Open Firewall port 1936 for future sessions 33 | lineinfile: 34 | dest: /etc/sysconfig/iptables 35 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 36 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 1936 -j ACCEPT' 37 | state: present 38 | tags: 39 | - prometheus 40 | # Open Firewall Port 1936 for current session by adding the rule to the 41 | # current iptables configuration. We won't need to restart the iptables 42 | # service - which will ensure all OpenShift rules stay in place. 43 | - name: Open Firewall Port 1936 for current session 44 | iptables: 45 | action: insert 46 | protocol: tcp 47 | destination_port: 1936 48 | state: present 49 | chain: OS_FIREWALL_ALLOW 50 | jump: ACCEPT 51 | tags: 52 | - prometheus 53 | # Create Directory /var/lib/prometheus-data with correct permissions 54 | # Make sure the directory has SELinux Type svirt_sandbox_file_t otherwise 55 | # there is a permissions problem trying to mount it into the pod. 56 | # If there are more than one infranodes this directory will be created on all 57 | # infranodes - but only used on the first one 58 | - name: Create directory /var/lib/prometheus-data 59 | file: 60 | path: /var/lib/prometheus-data 61 | state: directory 62 | group: root 63 | owner: root 64 | mode: 0777 65 | setype: svirt_sandbox_file_t 66 | tags: 67 | - prometheus 68 | 69 | # Configure all Nodes (including Infranodes and Masters) for monitoring 70 | - hosts: nodes 71 | remote_user: ec2_user 72 | become: yes 73 | become_user: root 74 | tasks: 75 | # Node Exporters on all Nodes liston on port 9100. 76 | # Open Firewall Port 9100 for future sessions by adding the rule to 77 | # the iptables file. 78 | - name: Open Firewall port 9100 for future sessions 79 | lineinfile: 80 | dest: /etc/sysconfig/iptables 81 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 82 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT' 83 | state: present 84 | tags: 85 | - prometheus 86 | # Open Firewall Port 9100 for current session by adding the rule to the 87 | # current iptables configuration. We won't need to restart the iptables 88 | # service - which will ensure all OpenShift rules stay in place. 89 | - name: Open Firewall Port 9100 for current session 90 | iptables: 91 | action: insert 92 | protocol: tcp 93 | destination_port: 9100 94 | state: present 95 | chain: OS_FIREWALL_ALLOW 96 | jump: ACCEPT 97 | tags: 98 | - prometheus 99 | # The Node Exporter reads information from the Nodes. In addition it can 100 | # read arbitrary information from a (properly formatted) text file. 101 | # We have a shell script that puts information about Docker into a textfile 102 | # to be read by the Node Exporter. Therefore we need to create the directory 103 | # where the text file is to be written. 
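    # For reference, the text file uses the Prometheus text exposition format,
    # i.e. one "<metric_name> <value>" pair per line, for example:
    #   node_docker_running_containers 12
    # (see node-exporter/dockerinfo/dockerinfo.sh for the metrics actually
    # written; the value 12 above is only an illustrative sample).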
104 | - name: Create textfile_collector directory 105 | file: 106 | path: /var/lib/node_exporter/textfile_collector 107 | state: directory 108 | owner: root 109 | group: root 110 | mode: 0775 111 | tags: 112 | - prometheus 113 | # Copy the shell script to the nodes to collect docker information and write 114 | # to a text file 115 | - name: Copy dockerinfo to node 116 | get_url: 117 | url: https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/node-exporter/dockerinfo/dockerinfo.sh 118 | dest: /usr/local/bin/dockerinfo.sh 119 | owner: root 120 | group: root 121 | mode: 0755 122 | tags: 123 | - prometheus 124 | # Create a cron job to run the dockerinfo shell script periodically. 125 | - name: Copy cron.d/docker_info.cron to node 126 | get_url: 127 | url: https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/node-exporter/dockerinfo/dockerinfo.cron 128 | dest: /etc/cron.d/dockerinfo.cron 129 | owner: root 130 | group: root 131 | mode: 0644 132 | tags: 133 | - prometheus 134 | # Restart crond service to pick up new cron job 135 | - name: Restart crond service on node 136 | systemd: 137 | name: crond 138 | state: restarted 139 | tags: 140 | - prometheus 141 | 142 | # Finally create all the necessary OpenShift objects. This happens via the 143 | # oc binary on the (first) master host. 144 | - hosts: masters[0] 145 | remote_user: ec2_user 146 | become: yes 147 | become_user: root 148 | vars: 149 | # The Web Hook URL for the Alert Manager to send alerts to Rocket Chat 150 | # The default is #gpte-devops-prometheus channel on the Red Hat Chat instance 151 | webhook_url: https://chat.consulting.redhat.com/hooks/LTtLntjbTBNvij6br/Rfa6WZSANJJBB8QDsu5nhynQ2sJG2wrSL3BLzbxYJqit3EGk 152 | tasks: 153 | # Add label "prometheus-host=true" to the first infranode 154 | - name: Label Infranodes with prometheus-host=true 155 | shell: oc label node {{ groups['infranodes'][0] }} prometheus-host=true --overwrite 156 | tags: 157 | - prometheus 158 | # Check if there is already a prometheus project 159 | - name: Check for prometheus project 160 | command: "oc get project prometheus" 161 | register: prometheus_project_present 162 | ignore_errors: true 163 | # Create the Prometheus Project if it's not there yet 164 | - name: Create Prometheus Project 165 | shell: oc new-project prometheus --display-name="Prometheus Monitoring" 166 | when: prometheus_project_present | failed 167 | tags: 168 | - prometheus 169 | - name: Set Node Selectors to empty on Prometheus Project 170 | shell: oc annotate namespace prometheus openshift.io/node-selector="" 171 | when: prometheus_project_present | failed 172 | tags: 173 | - prometheus 174 | - name: Determine Router Password 175 | shell: oc set env dc router -n default --list|grep STATS_PASSWORD|awk -F"=" '{print $2}' 176 | when: prometheus_project_present | failed 177 | register: router_password 178 | tags: 179 | - prometheus 180 | - name: Deploy Prometheus 181 | shell: oc new-app -f https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/prometheus.yaml --param ROUTER_PASSWORD={{ router_password.stdout }} 182 | when: prometheus_project_present | failed 183 | tags: 184 | - prometheus 185 | - name: Grant privileged SCC to Prometheus Service account 186 | shell: oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:prometheus 187 | when: prometheus_project_present | failed 188 | tags: 189 | - prometheus 190 | - name: Grant privileged SCC to default service account for Node Exporter 191 | shell: oc adm policy 
add-scc-to-user privileged system:serviceaccount:prometheus:default 192 | when: prometheus_project_present | failed 193 | tags: 194 | - prometheus 195 | - name: Deploy Node Exporter Daemon Set 196 | shell: oc new-app -f https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/node-exporter/node-exporter.yaml 197 | when: prometheus_project_present | failed 198 | tags: 199 | - prometheus 200 | - name: Deploy Alertmanager 201 | shell: oc new-app -f https://raw.githubusercontent.com/wkulhanek/openshift-prometheus/master/alertmanager/alertmanager.yaml -p "WEBHOOK_URL={{ webhook_url }}" 202 | when: prometheus_project_present | failed 203 | tags: 204 | - prometheus 205 | - alertmanager 206 | - name: Move Alertmanager to an Infranode 207 | command: "oc patch dc alertmanager --patch '{ \"spec\": { \"template\": { \"spec\": { \"nodeSelector\": { \"env\":\"infra\"}}}}}' " 208 | when: prometheus_project_present | failed 209 | tags: 210 | - prometheus 211 | - alertmanager 212 | - name: Deploy Grafana 213 | shell: oc new-app -f https://raw.githubusercontent.com/wkulhanek/docker-openshift-grafana/master/grafana.yaml 214 | when: prometheus_project_present | failed 215 | tags: 216 | - prometheus 217 | - grafana 218 | - name: Move Grafana to an Infranode 219 | command: "oc patch dc grafana --patch '{ \"spec\": { \"template\": { \"spec\": { \"nodeSelector\": { \"env\":\"infra\"}}}}}' " 220 | when: prometheus_project_present | failed 221 | tags: 222 | - prometheus 223 | - grafana 224 | -------------------------------------------------------------------------------- /node-exporter/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM docker.io/centos:7 2 | LABEL maintainer="Wolfgang Kulhanek " 3 | 4 | ENV NODE_EXPORTER_VERSION=0.17.0 5 | 6 | RUN yum -y update && yum -y upgrade && yum clean all && \ 7 | curl -L -o /tmp/node_exporter.tar.gz https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz && \ 8 | tar -xzf /tmp/node_exporter.tar.gz && \ 9 | mv ./node_exporter-$NODE_EXPORTER_VERSION.linux-amd64/node_exporter /bin && \ 10 | rm -rf ./node_exporter-$NODE_EXPORTER_VERSION.linux-amd64 && \ 11 | rm /tmp/node_exporter.tar.gz && \ 12 | mkdir /textfile_collector 13 | 14 | EXPOSE 9100 15 | USER nobody 16 | ENTRYPOINT [ "/bin/node_exporter" ] 17 | -------------------------------------------------------------------------------- /node-exporter/README.adoc: -------------------------------------------------------------------------------- 1 | # OpenShift Prometheus Node Exporter 2 | 3 | The Node Exporter provides OS and system level metrics for Prometheus. The exporter runs as a link:https://docs.openshift.com/container-platform/latest/dev_guide/daemonsets.html[DaemonSet] to guarantee each node runs at least one copy of the application. 4 | 5 | ## Deployment to OpenShift 6 | 7 | To deploy the node-exporter to OpenShift, complete the following steps. 8 | 9 | ### Open Firewall Ports 10 | 11 | #### Ansible Version 12 | 13 | The easy way to open the firewall ports on all nodes is to run the provided Ansible playbook. It will open port 9100 and then restart the IP tables service. 14 | 15 | [source,bash] 16 | ---- 17 | ansible-playbook -i /etc/ansible/hosts update_firewall.yml 18 | ---- 19 | 20 | #### Manual Version 21 | 22 | To manually configure the firewall follow the following steps: 23 | 24 | . 
Port 9100 needs to be opened on each OpenShift host in order for the Prometheus server to scrape the metrics. 25 | + 26 | Add the following line to `/etc/sysconfig/iptables`: 27 | + 28 | [source,bash] 29 | ---- 30 | -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT 31 | ---- 32 | + 33 | . Restart _iptables_ and OpenShift services in order to properly rebuild the rules 34 | + 35 | NOTE: The following commands will cause all running containers on the node to stop and restart 36 | + 37 | [source,bash] 38 | ---- 39 | systemctl reload iptables 40 | systemctl restart iptables.service 41 | systemctl restart docker 42 | systemctl restart atomic-openshift-node.service 43 | ---- 44 | 45 | ### Elevated Access 46 | 47 | Since the node exporter will be accessing resources from each host, the service account being used to run the pod must be granted elevated access. Execute the following command to add the _default_ Service Account in the _prometheus_ project to the _privileged_ SCC: 48 | 49 | [source,bash] 50 | ---- 51 | oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:default 52 | ---- 53 | 54 | ### Node Selector 55 | 56 | If there is a default project node selector for the OpenShift cluster (e.g. *env=users*) it is necessary to set en empty node selector for the Prometheus project. Otherwise the Daemon Set for creating the node-exporter pods will fail with a `MatchNodeSelector` error on nodes that don't have that particular label. 57 | 58 | [NOTE] 59 | If your OpenShift Cluster does not have a default project node selector you can skip this section. 60 | 61 | To set an empty node selector: 62 | 63 | [source,bash] 64 | ---- 65 | oc annotate namespace prometheus openshift.io/node-selector="" 66 | ---- 67 | 68 | ### Instantiate the Template 69 | 70 | Execute the following command to add and instantiate the node-exporter template 71 | 72 | [source,bash] 73 | ---- 74 | oc new-app -f node-exporter.yaml 75 | ---- 76 | -------------------------------------------------------------------------------- /node-exporter/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export VERSION=0.17.0 3 | docker build . 
-t wkulhanek/node-exporter:latest 4 | docker tag wkulhanek/node-exporter:latest wkulhanek/node-exporter:${VERSION} 5 | docker push wkulhanek/node-exporter:latest 6 | docker push wkulhanek/node-exporter:${VERSION} 7 | -------------------------------------------------------------------------------- /node-exporter/dockerinfo/DockerStatus.json: -------------------------------------------------------------------------------- 1 | { 2 | "__inputs": [ 3 | { 4 | "name": "DS_DS-PROMETHEUS", 5 | "label": "DS-Prometheus", 6 | "description": "", 7 | "type": "datasource", 8 | "pluginId": "prometheus", 9 | "pluginName": "Prometheus" 10 | } 11 | ], 12 | "__requires": [ 13 | { 14 | "type": "grafana", 15 | "id": "grafana", 16 | "name": "Grafana", 17 | "version": "4.4.1" 18 | }, 19 | { 20 | "type": "panel", 21 | "id": "graph", 22 | "name": "Graph", 23 | "version": "" 24 | }, 25 | { 26 | "type": "datasource", 27 | "id": "prometheus", 28 | "name": "Prometheus", 29 | "version": "1.0.0" 30 | }, 31 | { 32 | "type": "panel", 33 | "id": "singlestat", 34 | "name": "Singlestat", 35 | "version": "" 36 | }, 37 | { 38 | "type": "panel", 39 | "id": "table", 40 | "name": "Table", 41 | "version": "" 42 | } 43 | ], 44 | "annotations": { 45 | "list": [] 46 | }, 47 | "description": "Displays status of Docker Service on Nodes", 48 | "editable": true, 49 | "gnetId": null, 50 | "graphTooltip": 0, 51 | "hideControls": false, 52 | "id": null, 53 | "links": [], 54 | "refresh": "10s", 55 | "rows": [ 56 | { 57 | "collapse": false, 58 | "height": 328, 59 | "panels": [ 60 | { 61 | "aliasColors": {}, 62 | "bars": true, 63 | "dashLength": 10, 64 | "dashes": false, 65 | "datasource": "${DS_DS-PROMETHEUS}", 66 | "description": "Percentage of usage of the Docker Volume Group on each host.", 67 | "fill": 1, 68 | "id": 4, 69 | "legend": { 70 | "avg": false, 71 | "current": false, 72 | "max": false, 73 | "min": false, 74 | "show": false, 75 | "total": false, 76 | "values": false 77 | }, 78 | "lines": false, 79 | "linewidth": 1, 80 | "links": [], 81 | "nullPointMode": "null", 82 | "percentage": false, 83 | "pointradius": 5, 84 | "points": false, 85 | "renderer": "flot", 86 | "seriesOverrides": [], 87 | "spaceLength": 10, 88 | "span": 6, 89 | "stack": false, 90 | "steppedLine": false, 91 | "targets": [ 92 | { 93 | "expr": "node_docker_volume_data_percent_full{instance !~ \"^master1.*\"}", 94 | "format": "time_series", 95 | "intervalFactor": 2, 96 | "legendFormat": "", 97 | "metric": "node_docker_v", 98 | "refId": "A", 99 | "step": 10 100 | } 101 | ], 102 | "thresholds": [ 103 | { 104 | "colorMode": "critical", 105 | "fill": true, 106 | "line": true, 107 | "op": "gt", 108 | "value": 80 109 | } 110 | ], 111 | "timeFrom": null, 112 | "timeShift": null, 113 | "title": "Docker Volume % Used", 114 | "tooltip": { 115 | "shared": false, 116 | "sort": 0, 117 | "value_type": "individual" 118 | }, 119 | "type": "graph", 120 | "xaxis": { 121 | "buckets": null, 122 | "mode": "series", 123 | "name": null, 124 | "show": false, 125 | "values": [ 126 | "current" 127 | ] 128 | }, 129 | "yaxes": [ 130 | { 131 | "format": "percent", 132 | "label": "", 133 | "logBase": 1, 134 | "max": "100", 135 | "min": "0", 136 | "show": true 137 | }, 138 | { 139 | "format": "short", 140 | "label": null, 141 | "logBase": 1, 142 | "max": null, 143 | "min": null, 144 | "show": false 145 | } 146 | ] 147 | }, 148 | { 149 | "aliasColors": {}, 150 | "bars": true, 151 | "dashLength": 10, 152 | "dashes": false, 153 | "datasource": "${DS_DS-PROMETHEUS}", 154 | "fill": 1, 155 | "id": 5, 156 
| "legend": { 157 | "avg": false, 158 | "current": false, 159 | "max": false, 160 | "min": false, 161 | "show": false, 162 | "total": false, 163 | "values": false 164 | }, 165 | "lines": false, 166 | "linewidth": 1, 167 | "links": [], 168 | "nullPointMode": "null", 169 | "percentage": false, 170 | "pointradius": 5, 171 | "points": false, 172 | "renderer": "flot", 173 | "seriesOverrides": [], 174 | "spaceLength": 10, 175 | "span": 6, 176 | "stack": false, 177 | "steppedLine": false, 178 | "targets": [ 179 | { 180 | "expr": "((node_filesystem_size{mountpoint=\"/\", device=\"rootfs\"}-node_filesystem_avail{mountpoint=\"/\", device=\"rootfs\"}) / node_filesystem_size{mountpoint=\"/\", device=\"rootfs\"} ) * 100", 181 | "format": "time_series", 182 | "intervalFactor": 2, 183 | "refId": "A", 184 | "step": 10 185 | } 186 | ], 187 | "thresholds": [ 188 | { 189 | "colorMode": "critical", 190 | "fill": true, 191 | "line": true, 192 | "op": "gt", 193 | "value": 80 194 | } 195 | ], 196 | "timeFrom": null, 197 | "timeShift": null, 198 | "title": "Root Filesystem % Used", 199 | "tooltip": { 200 | "shared": false, 201 | "sort": 0, 202 | "value_type": "individual" 203 | }, 204 | "type": "graph", 205 | "xaxis": { 206 | "buckets": null, 207 | "mode": "series", 208 | "name": null, 209 | "show": false, 210 | "values": [ 211 | "current" 212 | ] 213 | }, 214 | "yaxes": [ 215 | { 216 | "format": "percent", 217 | "label": "", 218 | "logBase": 1, 219 | "max": "100", 220 | "min": "0", 221 | "show": true 222 | }, 223 | { 224 | "format": "short", 225 | "label": null, 226 | "logBase": 1, 227 | "max": null, 228 | "min": null, 229 | "show": false 230 | } 231 | ] 232 | } 233 | ], 234 | "repeat": null, 235 | "repeatIteration": null, 236 | "repeatRowId": null, 237 | "showTitle": true, 238 | "title": "Docker Volume Availabe", 239 | "titleSize": "h4" 240 | }, 241 | { 242 | "collapse": false, 243 | "height": 250, 244 | "panels": [ 245 | { 246 | "aliasColors": {}, 247 | "bars": false, 248 | "dashLength": 10, 249 | "dashes": false, 250 | "datasource": "${DS_DS-PROMETHEUS}", 251 | "fill": 0, 252 | "id": 8, 253 | "legend": { 254 | "avg": false, 255 | "current": false, 256 | "max": false, 257 | "min": false, 258 | "show": true, 259 | "total": false, 260 | "values": false 261 | }, 262 | "lines": true, 263 | "linewidth": 3, 264 | "links": [], 265 | "nullPointMode": "null", 266 | "percentage": false, 267 | "pointradius": 5, 268 | "points": false, 269 | "renderer": "flot", 270 | "seriesOverrides": [], 271 | "spaceLength": 10, 272 | "span": 6, 273 | "stack": false, 274 | "steppedLine": false, 275 | "targets": [ 276 | { 277 | "expr": "node_docker_running_containers", 278 | "format": "time_series", 279 | "intervalFactor": 2, 280 | "metric": "node_docker_running_containers", 281 | "refId": "A", 282 | "step": 10 283 | } 284 | ], 285 | "thresholds": [], 286 | "timeFrom": null, 287 | "timeShift": null, 288 | "title": "Docker Containers Running", 289 | "tooltip": { 290 | "shared": true, 291 | "sort": 0, 292 | "value_type": "individual" 293 | }, 294 | "type": "graph", 295 | "xaxis": { 296 | "buckets": null, 297 | "mode": "time", 298 | "name": null, 299 | "show": true, 300 | "values": [] 301 | }, 302 | "yaxes": [ 303 | { 304 | "format": "short", 305 | "label": null, 306 | "logBase": 1, 307 | "max": null, 308 | "min": null, 309 | "show": true 310 | }, 311 | { 312 | "format": "short", 313 | "label": null, 314 | "logBase": 1, 315 | "max": null, 316 | "min": null, 317 | "show": false 318 | } 319 | ] 320 | }, 321 | { 322 | "cacheTimeout": null, 323 
| "colorBackground": false, 324 | "colorValue": false, 325 | "colors": [ 326 | "rgba(245, 54, 54, 0.9)", 327 | "rgba(237, 129, 40, 0.89)", 328 | "rgba(50, 172, 45, 0.97)" 329 | ], 330 | "datasource": "${DS_DS-PROMETHEUS}", 331 | "format": "s", 332 | "gauge": { 333 | "maxValue": 100, 334 | "minValue": 0, 335 | "show": false, 336 | "thresholdLabels": false, 337 | "thresholdMarkers": true 338 | }, 339 | "id": 7, 340 | "interval": null, 341 | "links": [], 342 | "mappingType": 1, 343 | "mappingTypes": [ 344 | { 345 | "name": "value to text", 346 | "value": 1 347 | }, 348 | { 349 | "name": "range to text", 350 | "value": 2 351 | } 352 | ], 353 | "maxDataPoints": 100, 354 | "nullPointMode": "connected", 355 | "nullText": null, 356 | "postfix": "", 357 | "postfixFontSize": "50%", 358 | "prefix": "", 359 | "prefixFontSize": "50%", 360 | "rangeMaps": [ 361 | { 362 | "from": "null", 363 | "text": "N/A", 364 | "to": "null" 365 | } 366 | ], 367 | "span": 6, 368 | "sparkline": { 369 | "fillColor": "rgba(31, 118, 189, 0.18)", 370 | "full": false, 371 | "lineColor": "rgb(31, 120, 193)", 372 | "show": false 373 | }, 374 | "tableColumn": "Time", 375 | "targets": [ 376 | { 377 | "expr": "time(node_docker_last_successful_update)", 378 | "format": "table", 379 | "interval": "", 380 | "intervalFactor": 2, 381 | "metric": "node_docker_last_successful_update", 382 | "refId": "A", 383 | "step": 60 384 | } 385 | ], 386 | "thresholds": "", 387 | "title": "Last Successful Docker Update", 388 | "type": "singlestat", 389 | "valueFontSize": "80%", 390 | "valueMaps": [ 391 | { 392 | "op": "=", 393 | "text": "N/A", 394 | "value": "null" 395 | } 396 | ], 397 | "valueName": "current" 398 | } 399 | ], 400 | "repeat": null, 401 | "repeatIteration": null, 402 | "repeatRowId": null, 403 | "showTitle": false, 404 | "title": "Dashboard Row", 405 | "titleSize": "h6" 406 | }, 407 | { 408 | "collapse": false, 409 | "height": 250, 410 | "panels": [ 411 | { 412 | "columns": [], 413 | "fontSize": "100%", 414 | "id": 9, 415 | "links": [], 416 | "pageSize": null, 417 | "scroll": true, 418 | "showHeader": true, 419 | "sort": { 420 | "col": 2, 421 | "desc": true 422 | }, 423 | "span": 12, 424 | "styles": [ 425 | { 426 | "alias": "", 427 | "colorMode": null, 428 | "colors": [ 429 | "rgba(245, 54, 54, 0.9)", 430 | "rgba(237, 129, 40, 0.89)", 431 | "rgba(50, 172, 45, 0.97)" 432 | ], 433 | "decimals": 2, 434 | "pattern": "/.*/", 435 | "thresholds": [], 436 | "type": "number", 437 | "unit": "short" 438 | } 439 | ], 440 | "targets": [ 441 | { 442 | "expr": "node_docker_last_successful_update", 443 | "format": "table", 444 | "intervalFactor": 2, 445 | "metric": "node_docker_last_successful_update_month", 446 | "refId": "A", 447 | "step": 4 448 | } 449 | ], 450 | "title": "Panel Title", 451 | "transform": "table", 452 | "type": "table" 453 | } 454 | ], 455 | "repeat": null, 456 | "repeatIteration": null, 457 | "repeatRowId": null, 458 | "showTitle": false, 459 | "title": "Dashboard Row", 460 | "titleSize": "h6" 461 | } 462 | ], 463 | "schemaVersion": 14, 464 | "style": "dark", 465 | "tags": [], 466 | "templating": { 467 | "list": [] 468 | }, 469 | "time": { 470 | "from": "now-1h", 471 | "to": "now" 472 | }, 473 | "timepicker": { 474 | "refresh_intervals": [ 475 | "5s", 476 | "10s", 477 | "30s", 478 | "1m", 479 | "5m", 480 | "15m", 481 | "30m", 482 | "1h", 483 | "2h", 484 | "1d" 485 | ], 486 | "time_options": [ 487 | "5m", 488 | "15m", 489 | "1h", 490 | "6h", 491 | "12h", 492 | "24h", 493 | "2d", 494 | "7d", 495 | "30d" 496 | ] 497 | }, 498 | 
"timezone": "browser", 499 | "title": "Docker Status", 500 | "version": 12 501 | } -------------------------------------------------------------------------------- /node-exporter/dockerinfo/README.adoc: -------------------------------------------------------------------------------- 1 | == Docker Info 2 | 3 | This directory contains utility scripts to run on every Node with Docker to add information on space in the Docker Volume Group (which is a RAW device that is not mounted and therefore is not available to node exporter). 4 | 5 | It contains of a shell script that collects the available space and writes into a directory on the host. This directory is mounted into the node exporter pods so that the node exporter can report on the contents of any file in that directory. 6 | There is also a cron script that runs every 5 minutes. 7 | 8 | To install run the following command from your OpenShift bastion host (where you have your Ansible hosts file with a list of all nodes): 9 | 10 | [source,bash] 11 | ---- 12 | ansible-playbook -i /etc/ansible/hosts install_dockerinfo.yml 13 | ---- 14 | 15 | There is a Grafana Dashboard (DockerStatus.json) that reads the values from this utility script and displays them in a graphical fashion. Simply import the Dashboard into Grafana once Grafana has been set up and connected to Prometheus. 16 | -------------------------------------------------------------------------------- /node-exporter/dockerinfo/dockerinfo.cron: -------------------------------------------------------------------------------- 1 | */1 * * * * root /usr/local/bin/dockerinfo.sh 2 | -------------------------------------------------------------------------------- /node-exporter/dockerinfo/dockerinfo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # Collect Docker Volume Group Space Information 4 | /sbin/lvs docker-vg --units b |grep docker-pool|awk -c '{print "node_docker_volume_size_bytes " substr($4, 1, length($4)-1)}' >/tmp/docker_info.prom 5 | /sbin/lvs docker-vg --units b |grep docker-pool|awk -c '{print "node_docker_volume_data_percent_full " $5}' >>/tmp/docker_info.prom 6 | /sbin/lvs docker-vg --units b |grep docker-pool|awk -c '{print "node_docker_volume_meta_percent_full " $6}' >>/tmp/docker_info.prom 7 | echo "node_docker_running_containers " $(docker ps -q |wc -l) >>/tmp/docker_info.prom 8 | echo "node_docker_last_successful_update{year=\"`date +%Y`\", month=\"`date +%m`\", day=\"`date +%d`\", hour=\"`date +%H`\", minute=\"`date +%M`\", second=\"`date +%S`\"} " `date +%s` >>/tmp/docker_info.prom 9 | 10 | mv /tmp/docker_info.prom /var/lib/node_exporter/textfile_collector 11 | -------------------------------------------------------------------------------- /node-exporter/dockerinfo/install_dockerinfo.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - hosts: nodes 3 | remote_user: ec2_user 4 | become: yes 5 | become_user: root 6 | tasks: 7 | - name: Create textfile_collector directory 8 | file: 9 | path: /var/lib/node_exporter/textfile_collector 10 | state: directory 11 | owner: root 12 | group: root 13 | mode: 0775 14 | - name: Copy dockerinfo to node 15 | copy: 16 | src: dockerinfo.sh 17 | dest: /usr/local/bin/dockerinfo.sh 18 | owner: root 19 | group: root 20 | mode: 0755 21 | - name: Copy cron.d/docker_info.cron to node 22 | copy: 23 | src: dockerinfo.cron 24 | dest: /etc/cron.d/dockerinfo.cron 25 | owner: root 26 | group: root 27 | mode: 0644 28 | - name: Restart crond 
service on node 29 | systemd: 30 | name: crond 31 | state: restarted 32 | -------------------------------------------------------------------------------- /node-exporter/native/README.adoc: -------------------------------------------------------------------------------- 1 | # Installing node-exporter natively on an OpenShift Cluster 2 | 3 | If you would like to run node-exporter natively on your OpenShift cluster you can use the playbook to deploy it to all the nodes in your cluster. 4 | 5 | Run the provided Ansible playbook on your OpenShift bastion host and point it to your Ansible hosts file for your OpenShift installation (default location: `/etc/ansible/hosts`). 6 | 7 | This will install the `node_exporter` binary in all hosts that are in the `nodes` group in the referenced hosts file. It will also set up a system service to automatically start the node_exporter. 8 | 9 | Finally the playbook configures the firewall on all nodes to allow for inbound traffic on port 9100. 10 | 11 | [source,bash] 12 | ---- 13 | ansible-playbook -i /etc/ansible/hosts install_node-exporter.yml 14 | ---- 15 | -------------------------------------------------------------------------------- /node-exporter/native/install_node-exporter.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # Get latest Node Exporter software 3 | - hosts: localhost 4 | tasks: 5 | - name: Download latest node_exporter binary 6 | get_url: 7 | url: https://github.com/prometheus/node_exporter/releases/download/v0.15.0/node_exporter-0.15.0.linux-amd64.tar.gz 8 | dest: /tmp/node_exporter.tgz 9 | - name: Untar node_exporter binary 10 | unarchive: 11 | src: /tmp/node_exporter.tgz 12 | dest: /tmp 13 | copy: no 14 | 15 | # Now set up all nodes running OpenShift software with the Node Exporter 16 | - hosts: nodes 17 | remote_user: ec2_user 18 | become: yes 19 | become_user: root 20 | tasks: 21 | - name: Copy node-exporter binary to node 22 | copy: 23 | src: /tmp/node_exporter-0.15.0.linux-amd64/node_exporter 24 | dest: /usr/local/bin/node_exporter 25 | owner: root 26 | group: root 27 | mode: 0755 28 | - name: Set up Service for node-exporter 29 | copy: 30 | src: node_exporter.conf 31 | dest: /etc/systemd/system/node_exporter.service 32 | owner: root 33 | group: root 34 | mode: 0644 35 | - name: Start node_exporter service on node 36 | systemd: 37 | name: node_exporter 38 | enabled: yes 39 | state: started 40 | daemon_reload: yes 41 | # Open Firewall Port 9100 for future sessions by adding the rule to 42 | # the iptables file. 43 | - name: Open Firewall port 9100 for future sessions 44 | lineinfile: 45 | dest: /etc/sysconfig/iptables 46 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 47 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT' 48 | state: present 49 | # Open Firewall Port 9100 for current session by adding the rule to the 50 | # current iptables configuration. 
We won't need to restart the iptables 51 | # service - which will ensure all OpenShift rules stay in place.- name: Restart IP Tables service 52 | - name: Open Firewall Port 9100 for current session 53 | iptables: 54 | action: insert 55 | protocol: tcp 56 | destination_port: 9100 57 | state: present 58 | chain: OS_FIREWALL_ALLOW 59 | jump: ACCEPT 60 | 61 | # Clean up on Localhost 62 | # Delete the downloaded and unarchived files 63 | - hosts: localhost 64 | tasks: 65 | - name: Cleanup localhost /tmp 66 | file: 67 | path: /tmp/node_exporter.tgz 68 | state: absent 69 | - name: Cleanup localhost unarchived directory 70 | file: 71 | path: /tmp/node_exporter-0.15.0.linux-amd64 72 | state: absent 73 | -------------------------------------------------------------------------------- /node-exporter/native/node_exporter.conf: -------------------------------------------------------------------------------- 1 | [Unit] 2 | Description=Node Exporter 3 | 4 | [Service] 5 | User=root 6 | ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory /var/lib/node_exporter/textfile_collector 7 | 8 | [Install] 9 | WantedBy=default.target 10 | -------------------------------------------------------------------------------- /node-exporter/node-exporter.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Template 3 | metadata: 4 | name: node-exporter 5 | annotations: 6 | "openshift.io/display-name": Node Exporter 7 | description: | 8 | Node Exporter for use with Prometheus on Red Hat OpenShift - collect and gather metrics from nodes, services, and the infrastructure. 9 | iconClass: icon-cogs 10 | tags: "monitoring,node-exporter,time-series" 11 | parameters: 12 | - description: The location of the prometheus image 13 | name: IMAGE_NODE_EXPORTER 14 | value: wkulhanek/node-exporter:latest 15 | objects: 16 | 17 | - apiVersion: extensions/v1beta1 18 | kind: DaemonSet 19 | metadata: 20 | name: node-exporter 21 | spec: 22 | template: 23 | metadata: 24 | labels: 25 | app: node-exporter 26 | name: node-exporter 27 | annotations: 28 | prometheus.io/scrape: "true" 29 | prometheus.io/port: "9100" 30 | spec: 31 | hostPID: true 32 | hostNetwork: true 33 | containers: 34 | - image: ${IMAGE_NODE_EXPORTER} 35 | imagePullPolicy: IfNotPresent 36 | name: node-exporter 37 | ports: 38 | - containerPort: 9100 39 | hostPort: 9100 40 | protocol: TCP 41 | name: scrape 42 | securityContext: 43 | privileged: true 44 | args: 45 | - --path.procfs 46 | - /host/proc 47 | - --path.sysfs 48 | - /host/sys 49 | - --collector.filesystem.ignored-mount-points 50 | - '"^/(sys|proc|dev|host|etc)($|/)"' 51 | - --collector.textfile.directory 52 | - /textfile_collector 53 | volumeMounts: 54 | - name: dev 55 | mountPath: /host/dev 56 | - name: proc 57 | mountPath: /host/proc 58 | - name: sys 59 | mountPath: /host/sys 60 | - name: rootfs 61 | mountPath: /rootfs 62 | - name: textfile-collector 63 | mountPath: /textfile_collector 64 | volumes: 65 | - name: proc 66 | hostPath: 67 | path: /proc 68 | - name: dev 69 | hostPath: 70 | path: /dev 71 | - name: sys 72 | hostPath: 73 | path: /sys 74 | - name: rootfs 75 | hostPath: 76 | path: / 77 | - name: textfile-collector 78 | hostPath: 79 | path: /var/lib/node_exporter/textfile_collector 80 | -------------------------------------------------------------------------------- /node-exporter/setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # you should set securityContext privileged true 
on container part of node-exporter.yaml file to collect filesystem metrics! 4 | # Change into the Prometheus project 5 | oc project prometheus 6 | 7 | # The Service Account running the node selector pods needs the SCC `hostaccess`: 8 | 9 | # oc adm policy add-scc-to-user privileged system:serviceaccount:: 10 | oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:default 11 | 12 | # Import the template and create the DaemonSet 13 | oc create -f node-exporter-template.yaml 14 | oc new-app --template=node-exporter 15 | 16 | -------------------------------------------------------------------------------- /node-exporter/update_firewall.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - hosts: nodes 3 | remote_user: ec2_user 4 | become: yes 5 | become_user: root 6 | tasks: 7 | # Open Firewall Port 9100 for future sessions by adding the rule to 8 | # the iptables file. 9 | - name: Open Firewall port 9100 for future sessions 10 | lineinfile: 11 | dest: /etc/sysconfig/iptables 12 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 13 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT' 14 | state: present 15 | # Open Firewall Port 9100 for current session by adding the rule to the 16 | # current iptables configuration. We won't need to restart the iptables 17 | # service - which will ensure all OpenShift rules stay in place.- name: Restart IP Tables service 18 | - name: Open Firewall Port 9100 for current session 19 | iptables: 20 | action: insert 21 | protocol: tcp 22 | destination_port: 9100 23 | state: present 24 | chain: OS_FIREWALL_ALLOW 25 | jump: ACCEPT 26 | -------------------------------------------------------------------------------- /prometheus-image/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM docker.io/centos:7 2 | LABEL maintainer="Wolfgang Kulhanek " 3 | 4 | ENV PROMETHEUS_VERSION=2.8.0 5 | 6 | RUN yum -y update && yum -y upgrade && \ 7 | yum -y clean all && \ 8 | rm -rf /var/cache/yum && \ 9 | curl -L -o /tmp/prometheus.tar.gz https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz && \ 10 | tar -xmzf /tmp/prometheus.tar.gz && \ 11 | mv ./prometheus-$PROMETHEUS_VERSION.linux-amd64 /usr/share/prometheus && \ 12 | ln -s /usr/share/prometheus/prometheus /bin/prometheus && \ 13 | ln -s /usr/share/prometheus/promtool /bin/promtool && \ 14 | mkdir -p /etc/prometheus && \ 15 | mkdir -p /prometheus && \ 16 | chmod 777 /prometheus && \ 17 | rm /tmp/prometheus.tar.gz 18 | 19 | COPY config.yml /etc/prometheus/prometheus.yml 20 | 21 | EXPOSE 9090 22 | USER nobody 23 | VOLUME [ "/prometheus" ] 24 | WORKDIR /prometheus 25 | ENTRYPOINT [ "/bin/prometheus" ] 26 | CMD [ "--config.file=/etc/prometheus/prometheus.yml", \ 27 | "--web.listen-address=:9090", \ 28 | "--web.console.templates=/usr/share/prometheus/consoles", \ 29 | "--web.console.libraries=/usr/share/prometheus/console_libraries", \ 30 | "--storage.tsdb.path=/prometheus", \ 31 | "--storage.tsdb.retention=24h", \ 32 | "--storage.tsdb.min-block-duration=15m", \ 33 | "--storage.tsdb.max-block-duration=60m" \ 34 | ] 35 | -------------------------------------------------------------------------------- /prometheus-image/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export VERSION=2.8.0 3 | docker build . 
-t wkulhanek/prometheus:latest 4 | docker tag wkulhanek/prometheus:latest wkulhanek/prometheus:${VERSION} 5 | docker push wkulhanek/prometheus:latest 6 | docker push wkulhanek/prometheus:${VERSION} 7 | -------------------------------------------------------------------------------- /prometheus-image/config.yml: -------------------------------------------------------------------------------- 1 | rule_files: 2 | - 'prometheus.rules' 3 | 4 | # A scrape configuration for running Prometheus on a Kubernetes cluster. 5 | # This uses separate scrape configs for cluster components (i.e. API server, node) 6 | # and services to allow each to use different authentication configs. 7 | # 8 | # Kubernetes labels will be added as Prometheus labels on metrics via the 9 | # `labelmap` relabeling action. 10 | 11 | # Scrape config for API servers. 12 | # 13 | # Kubernetes exposes API servers as endpoints to the default/kubernetes 14 | # service so this uses `endpoints` role and uses relabelling to only keep 15 | # the endpoints associated with the default/kubernetes service using the 16 | # default named port `https`. This works for single API server deployments as 17 | # well as HA API server deployments. 18 | scrape_configs: 19 | - job_name: 'kubernetes-apiservers' 20 | 21 | kubernetes_sd_configs: 22 | - role: endpoints 23 | 24 | scheme: https 25 | tls_config: 26 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 27 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 28 | 29 | # Keep only the default/kubernetes service endpoints for the https port. This 30 | # will add targets for each API server which Kubernetes adds an endpoint to 31 | # the default/kubernetes service. 32 | relabel_configs: 33 | - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 34 | action: keep 35 | regex: default;kubernetes;https 36 | 37 | # Scrape config for nodes. 38 | # 39 | # Each node exposes a /metrics endpoint that contains operational metrics for 40 | # the Kubelet and other components. 41 | - job_name: 'kubernetes-nodes' 42 | 43 | scheme: https 44 | tls_config: 45 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 46 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 47 | 48 | kubernetes_sd_configs: 49 | - role: node 50 | 51 | relabel_configs: 52 | - action: labelmap 53 | regex: __meta_kubernetes_node_label_(.+) 54 | 55 | # Scrape config for controllers. 56 | # 57 | # Each master node exposes a /metrics endpoint on :8444 that contains operational metrics for 58 | # the controllers. 59 | # 60 | # TODO: move this to a pure endpoints based metrics gatherer when controllers are exposed via 61 | # endpoints. 62 | - job_name: 'kubernetes-controllers' 63 | 64 | scheme: https 65 | tls_config: 66 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 67 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 68 | 69 | kubernetes_sd_configs: 70 | - role: endpoints 71 | 72 | # Keep only the default/kubernetes service endpoints for the https port, and then 73 | # set the port to 8444. This is the default configuration for the controllers on OpenShift 74 | # masters. 
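# Illustrative example (192.0.2.10 is a placeholder address): an endpoint
# address of 192.0.2.10:443 is kept by the first rule below and rewritten by
# the second rule to 192.0.2.10:8444, which is where the controllers serve
# /metrics.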
75 | relabel_configs: 76 | - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 77 | action: keep 78 | regex: default;kubernetes;https 79 | - source_labels: [__address__] 80 | action: replace 81 | target_label: __address__ 82 | regex: (.+)(?::\d+) 83 | replacement: $1:8444 84 | 85 | # Scrape config for cAdvisor. 86 | # 87 | # Beginning in Kube 1.7, each node exposes a /metrics/cadvisor endpoint that 88 | # reports container metrics for each running pod. Scrape those by default. 89 | - job_name: 'kubernetes-cadvisor' 90 | 91 | scheme: https 92 | tls_config: 93 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 94 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 95 | 96 | metrics_path: /metrics/cadvisor 97 | 98 | kubernetes_sd_configs: 99 | - role: node 100 | 101 | relabel_configs: 102 | - action: labelmap 103 | regex: __meta_kubernetes_node_label_(.+) 104 | 105 | # Scrape config for service endpoints. 106 | # 107 | # The relabeling allows the actual service scrape endpoint to be configured 108 | # via the following annotations: 109 | # 110 | # * `prometheus.io/scrape`: Only scrape services that have a value of `true` 111 | # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need 112 | # to set this to `https` & most likely set the `tls_config` of the scrape config. 113 | # * `prometheus.io/path`: If the metrics path is not `/metrics` override this. 114 | # * `prometheus.io/port`: If the metrics are exposed on a different port to the 115 | # service then set this appropriately. 116 | - job_name: 'kubernetes-service-endpoints' 117 | 118 | tls_config: 119 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 120 | # TODO: this should be per target 121 | insecure_skip_verify: true 122 | 123 | kubernetes_sd_configs: 124 | - role: endpoints 125 | 126 | 127 | relabel_configs: 128 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] 129 | action: keep 130 | regex: true 131 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] 132 | action: replace 133 | target_label: __scheme__ 134 | regex: (https?) 
135 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] 136 | action: replace 137 | target_label: __metrics_path__ 138 | regex: (.+) 139 | - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] 140 | action: replace 141 | target_label: __address__ 142 | regex: (.+)(?::\d+);(\d+) 143 | replacement: $1:$2 144 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_username] 145 | action: replace 146 | target_label: __basic_auth_username__ 147 | regex: (.+) 148 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_password] 149 | action: replace 150 | target_label: __basic_auth_password__ 151 | regex: (.+) 152 | - action: labelmap 153 | regex: __meta_kubernetes_service_label_(.+) 154 | - source_labels: [__meta_kubernetes_namespace] 155 | action: replace 156 | target_label: kubernetes_namespace 157 | - source_labels: [__meta_kubernetes_service_name] 158 | action: replace 159 | target_label: kubernetes_name 160 | 161 | alerting: 162 | alertmanagers: 163 | - scheme: http 164 | static_configs: 165 | - targets: 166 | - "localhost:9093" -------------------------------------------------------------------------------- /prometheus.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Template 3 | metadata: 4 | name: prometheus 5 | annotations: 6 | "openshift.io/display-name": Prometheus 7 | description: | 8 | A monitoring solution for an OpenShift cluster - collect and gather metrics from nodes, services, and the infrastructure. 9 | iconClass: icon-cogs 10 | tags: "monitoring,prometheus,time-series" 11 | parameters: 12 | - description: The namespace to instantiate prometheus under. Defaults to 'prometheus'. 13 | name: NAMESPACE 14 | value: prometheus 15 | - description: The location of the prometheus image 16 | name: IMAGE_PROMETHEUS 17 | value: wkulhanek/prometheus:latest 18 | - description: The scheme to communicate with the Alertmanager. Defaults to 'http'. 19 | name: ALERT_MANAGER_SCHEME 20 | value: http 21 | - description: Alertmanager Hostname and Port. Defaults to 'alertmanager:9093'. 
22 | name: ALERT_MANAGER_HOST_PORT 23 | value: alertmanager:9093 24 | - description: Router Password (oc set env dc router -n default --list|grep STATS_PASSWORD) 25 | name: ROUTER_PASSWORD 26 | required: true 27 | 28 | objects: 29 | 30 | - apiVersion: v1 31 | kind: ServiceAccount 32 | metadata: 33 | name: prometheus 34 | 35 | - apiVersion: v1 36 | kind: ClusterRoleBinding 37 | metadata: 38 | name: prometheus-cluster-reader 39 | roleRef: 40 | name: cluster-reader 41 | subjects: 42 | - kind: ServiceAccount 43 | name: prometheus 44 | namespace: ${NAMESPACE} 45 | 46 | - apiVersion: v1 47 | kind: Route 48 | metadata: 49 | name: prometheus 50 | spec: 51 | to: 52 | name: prometheus 53 | 54 | - apiVersion: v1 55 | kind: Service 56 | metadata: 57 | annotations: 58 | prometheus.io/scrape: "true" 59 | prometheus.io/scheme: http 60 | labels: 61 | name: prometheus 62 | name: prometheus 63 | spec: 64 | ports: 65 | - name: prometheus 66 | port: 9090 67 | protocol: TCP 68 | targetPort: 9090 69 | selector: 70 | app: prometheus 71 | 72 | - apiVersion: v1 73 | kind: DeploymentConfig 74 | metadata: 75 | name: prometheus 76 | labels: 77 | app: prometheus 78 | spec: 79 | replicas: 1 80 | selector: 81 | app: prometheus 82 | deploymentconfig: prometheus 83 | template: 84 | metadata: 85 | labels: 86 | app: prometheus 87 | deploymentconfig: prometheus 88 | name: prometheus 89 | spec: 90 | serviceAccountName: prometheus 91 | securityContext: 92 | privileged: true 93 | nodeSelector: 94 | prometheus-host: "true" 95 | containers: 96 | - name: prometheus 97 | args: 98 | - --config.file=/etc/prometheus/prometheus.yml 99 | - --web.listen-address=:9090 100 | - --storage.tsdb.retention=6h 101 | - --storage.tsdb.min-block-duration=15m 102 | - --storage.tsdb.max-block-duration=60m 103 | image: ${IMAGE_PROMETHEUS} 104 | imagePullPolicy: IfNotPresent 105 | livenessProbe: 106 | failureThreshold: 3 107 | httpGet: 108 | path: /status 109 | port: 9090 110 | scheme: HTTP 111 | initialDelaySeconds: 2 112 | periodSeconds: 10 113 | successThreshold: 1 114 | timeoutSeconds: 1 115 | readinessProbe: 116 | failureThreshold: 3 117 | httpGet: 118 | path: /status 119 | port: 9090 120 | scheme: HTTP 121 | initialDelaySeconds: 2 122 | periodSeconds: 10 123 | successThreshold: 1 124 | timeoutSeconds: 1 125 | # Set Requests & Limits 126 | # Prometheus uses 2Gi memory by default with 50% headroom 127 | # required. 
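# Note: requests are set equal to limits below, which gives the pod the
# Guaranteed QoS class and makes it the last candidate for eviction under node
# memory pressure. This is useful because the pod stores its data on a host
# path on the infranode.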
128 | resources: 129 | requests: 130 | cpu: 500m 131 | memory: 3Gi 132 | limits: 133 | cpu: 500m 134 | memory: 3Gi 135 | volumeMounts: 136 | - mountPath: /etc/prometheus 137 | name: config-volume 138 | - mountPath: /prometheus 139 | name: data-volume 140 | - mountPath: /etc/prometheus-rules 141 | name: rules-volume 142 | restartPolicy: Always 143 | volumes: 144 | - name: data-volume 145 | hostPath: 146 | path: /var/lib/prometheus-data 147 | type: Directory 148 | - name: config-volume 149 | configMap: 150 | defaultMode: 420 151 | name: prometheus 152 | - name: rules-volume 153 | configMap: 154 | defaultMode: 420 155 | name: prometheus-rules 156 | 157 | - apiVersion: v1 158 | kind: ConfigMap 159 | metadata: 160 | name: prometheus 161 | data: 162 | prometheus.yml: | 163 | global: 164 | scrape_interval: 1m 165 | scrape_timeout: 10s 166 | evaluation_interval: 1m 167 | alerting: 168 | alertmanagers: 169 | - scheme: ${ALERT_MANAGER_SCHEME} 170 | static_configs: 171 | - targets: 172 | - "${ALERT_MANAGER_HOST_PORT}" 173 | 174 | rule_files: 175 | - /etc/prometheus-rules/*.rules 176 | 177 | # A scrape configuration for running Prometheus on a Kubernetes cluster. 178 | # This uses separate scrape configs for cluster components (i.e. API server, node) 179 | # and services to allow each to use different authentication configs. 180 | # 181 | # Kubernetes labels will be added as Prometheus labels on metrics via the 182 | # `labelmap` relabeling action. 183 | 184 | scrape_configs: 185 | # Scrape config for API servers. 186 | # 187 | # Kubernetes exposes API servers as endpoints to the default/kubernetes 188 | # service so this uses `endpoints` role and uses relabelling to only keep 189 | # the endpoints associated with the default/kubernetes service using the 190 | # default named port `https`. This works for single API server deployments as 191 | # well as HA API server deployments. 192 | - job_name: kubernetes-apiservers 193 | scrape_interval: 1m 194 | scrape_timeout: 10s 195 | metrics_path: /metrics 196 | scheme: https 197 | kubernetes_sd_configs: 198 | - api_server: null 199 | role: endpoints 200 | namespaces: 201 | names: [] 202 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 203 | tls_config: 204 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 205 | insecure_skip_verify: false 206 | relabel_configs: 207 | - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 208 | separator: ; 209 | regex: default;kubernetes;https 210 | replacement: $1 211 | action: keep 212 | 213 | # Scrape config for controllers. 214 | # 215 | # Each master node exposes a /metrics endpoint on :8444 that contains operational metrics for 216 | # the controllers. 217 | # 218 | # TODO: move this to a pure endpoints based metrics gatherer when controllers are exposed via 219 | # endpoints. 220 | - job_name: kubernetes-controllers 221 | scrape_interval: 1m 222 | scrape_timeout: 10s 223 | metrics_path: /metrics 224 | scheme: https 225 | kubernetes_sd_configs: 226 | - api_server: null 227 | role: endpoints 228 | namespaces: 229 | names: [] 230 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 231 | tls_config: 232 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 233 | insecure_skip_verify: false 234 | 235 | # Keep only the default/kubernetes service endpoints for the https port, and then 236 | # set the port to 8444. This is the default configuration for the controllers on OpenShift 237 | # masters. 
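# (For a quick manual check, assuming access to a master and a sufficiently
# privileged token, something like
#   curl -k -H "Authorization: Bearer $(oc whoami -t)" https://<master>:8444/metrics
# should return the controller metrics.)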
238 | relabel_configs: 239 | - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 240 | separator: ; 241 | regex: default;kubernetes;https 242 | replacement: $1 243 | action: keep 244 | - source_labels: [__address__] 245 | separator: ; 246 | regex: (.+)(?::\d+) 247 | target_label: __address__ 248 | replacement: $1:8444 249 | action: replace 250 | 251 | # Scrape config for nodes. 252 | # 253 | # Each node exposes a /metrics endpoint that contains operational metrics for 254 | # the Kubelet and other components. 255 | - job_name: kubernetes-nodes 256 | scrape_interval: 1m 257 | scrape_timeout: 10s 258 | metrics_path: /metrics 259 | scheme: https 260 | kubernetes_sd_configs: 261 | - api_server: null 262 | role: node 263 | namespaces: 264 | names: [] 265 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 266 | tls_config: 267 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 268 | insecure_skip_verify: false 269 | relabel_configs: 270 | - separator: ; 271 | regex: __meta_kubernetes_node_label_(.+) 272 | replacement: $1 273 | action: labelmap 274 | 275 | # Scrape config for cAdvisor. 276 | # 277 | # Beginning in Kube 1.7, each node exposes a /metrics/cadvisor endpoint that 278 | # reports container metrics for each running pod. Scrape those by default. 279 | - job_name: kubernetes-cadvisor 280 | scrape_interval: 1m 281 | scrape_timeout: 10s 282 | metrics_path: /metrics/cadvisor 283 | scheme: https 284 | kubernetes_sd_configs: 285 | - api_server: null 286 | role: node 287 | namespaces: 288 | names: [] 289 | bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token 290 | tls_config: 291 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 292 | insecure_skip_verify: false 293 | relabel_configs: 294 | - separator: ; 295 | regex: __meta_kubernetes_node_label_(.+) 296 | replacement: $1 297 | action: labelmap 298 | 299 | # Scrape config for service endpoints. 300 | # 301 | # The relabeling allows the actual service scrape endpoint to be configured 302 | # via the following annotations: 303 | # 304 | # * `prometheus.io/scrape`: Only scrape services that have a value of `true` 305 | # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need 306 | # to set this to `https` & most likely set the `tls_config` of the scrape config. 307 | # * `prometheus.io/path`: If the metrics path is not `/metrics` override this. 308 | # * `prometheus.io/port`: If the metrics are exposed on a different port to the 309 | # service then set this appropriately. 310 | - job_name: kubernetes-service-endpoints 311 | scrape_interval: 1m 312 | scrape_timeout: 10s 313 | metrics_path: /metrics 314 | scheme: http 315 | kubernetes_sd_configs: 316 | - api_server: null 317 | role: endpoints 318 | namespaces: 319 | names: [] 320 | tls_config: 321 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 322 | insecure_skip_verify: true 323 | relabel_configs: 324 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] 325 | separator: ; 326 | regex: "true" 327 | replacement: $1 328 | action: keep 329 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] 330 | separator: ; 331 | regex: (https?) 
332 | target_label: __scheme__ 333 | replacement: $1 334 | action: replace 335 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] 336 | separator: ; 337 | regex: (.+) 338 | target_label: __metrics_path__ 339 | replacement: $1 340 | action: replace 341 | - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] 342 | separator: ; 343 | regex: (.+)(?::\d+);(\d+) 344 | target_label: __address__ 345 | replacement: $1:$2 346 | action: replace 347 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_username] 348 | separator: ; 349 | regex: (.+) 350 | target_label: __basic_auth_username__ 351 | replacement: $1 352 | action: replace 353 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_password] 354 | separator: ; 355 | regex: (.+) 356 | target_label: __basic_auth_password__ 357 | replacement: $1 358 | action: replace 359 | - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] 360 | separator: ; 361 | regex: (.+)(?::\d+);(\d+) 362 | target_label: __address__ 363 | replacement: $1:$2 364 | action: replace 365 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_username] 366 | separator: ; 367 | regex: (.+) 368 | target_label: __basic_auth_username__ 369 | replacement: $1 370 | action: replace 371 | - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_password] 372 | separator: ; 373 | regex: (.+) 374 | target_label: __basic_auth_password__ 375 | replacement: $1 376 | action: replace 377 | - separator: ; 378 | regex: __meta_kubernetes_service_label_(.+) 379 | replacement: $1 380 | action: labelmap 381 | - source_labels: [__meta_kubernetes_namespace] 382 | separator: ; 383 | regex: (.*) 384 | target_label: kubernetes_namespace 385 | replacement: $1 386 | action: replace 387 | - source_labels: [__meta_kubernetes_service_name] 388 | separator: ; 389 | regex: (.*) 390 | target_label: kubernetes_name 391 | replacement: $1 392 | 393 | # Scrape config for node-exporter, which is expected to be running on port 9100.
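# The relabeling in this job rewrites each discovered node's kubelet address to
# the node-exporter port, e.g. a hypothetical node address ip-10-0-1-23:10250
# becomes ip-10-0-1-23:9100.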
394 | - job_name: node-exporters 395 | scrape_interval: 30s 396 | scrape_timeout: 30s 397 | metrics_path: /metrics 398 | scheme: http 399 | kubernetes_sd_configs: 400 | - api_server: null 401 | role: node 402 | namespaces: 403 | names: [] 404 | tls_config: 405 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 406 | insecure_skip_verify: true 407 | relabel_configs: 408 | - separator: ; 409 | regex: __meta_kubernetes_node_label_(.+) 410 | replacement: $1 411 | action: labelmap 412 | - source_labels: [__meta_kubernetes_role] 413 | separator: ; 414 | regex: (.*) 415 | target_label: kubernetes_role 416 | replacement: $1 417 | action: replace 418 | - source_labels: [__address__] 419 | separator: ; 420 | regex: (.*):10250 421 | target_label: __address__ 422 | replacement: ${1}:9100 423 | action: replace 424 | 425 | # Scrape config for the template service broker 426 | - job_name: 'openshift-template-service-broker' 427 | scheme: https 428 | tls_config: 429 | ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt 430 | server_name: apiserver.openshift-template-service-broker.svc 431 | bearer_token_file: /var/run/secrets/kubernetes.io/scraper/token 432 | kubernetes_sd_configs: 433 | - role: endpoints 434 | namespaces: 435 | names: 436 | - openshift-template-service-broker 437 | relabel_configs: 438 | - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] 439 | action: keep 440 | regex: api-server;https 441 | 442 | - job_name: openshift-routers 443 | scrape_interval: 30s 444 | scrape_timeout: 30s 445 | metrics_path: /metrics 446 | scheme: http 447 | static_configs: 448 | - targets: 449 | - router.default.svc.cluster.local:1936 450 | basic_auth: 451 | username: admin 452 | password: ${ROUTER_PASSWORD} 453 | 454 | - apiVersion: v1 455 | kind: ConfigMap 456 | metadata: 457 | name: prometheus-rules 458 | data: 459 | alerting.rules: | 460 | groups: 461 | - name: example-rules 462 | interval: 30s # defaults to global interval 463 | rules: 464 | - alert: Node Down 465 | expr: up{job="kubernetes-nodes"} == 0 466 | annotations: 467 | miqTarget: "ContainerNode" 468 | severity: "HIGH" 469 | message: "{{$labels.instance}} is down" 470 | 471 | recording.rules: | 472 | groups: 473 | - name: aggregate_container_resources 474 | rules: 475 | - record: container_cpu_usage_rate 476 | expr: sum without (cpu) (rate(container_cpu_usage_seconds_total[5m])) 477 | - record: container_memory_rss_by_type 478 | expr: container_memory_rss{id=~"/|/system.slice|/kubepods.slice"} > 0 479 | - record: container_cpu_usage_percent_by_host 480 | expr: sum by (kubernetes_io_hostname,type)(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / on (kubernetes_io_hostname,type) machine_cpu_cores 481 | - record: apiserver_request_count_rate_by_resources 482 | expr: sum without (client,instance,contentType) (rate(apiserver_request_count[5m])) 483 | -------------------------------------------------------------------------------- /setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | oc project prometheus 3 | if [ "$?" 
!= "0" ]; then 4 | oc new-project prometheus --display-name="Prometheus Monitoring" 5 | oc annotate namespace prometheus openshift.io/node-selector="" 6 | fi 7 | oc new-app -f prometheus.yaml --param ROUTER_PASSWORD=$(oc set env dc router -n default --list|grep STATS_PASSWORD|awk -F"=" '{print $2}') 8 | oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:prometheus 9 | -------------------------------------------------------------------------------- /setup_infranodes.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - hosts: localhost 3 | gather_facts: false 4 | tasks: 5 | # Find out which nodes are Infranodes and add them to a new group: infranodes 6 | - name: Add Infranodes to a new infranodes Group 7 | add_host: 8 | name: "{{ item }}" 9 | groups: infranodes 10 | with_items: "{{ groups['nodes'] }}" 11 | when: 12 | - item | match("^infranode.*") 13 | - name: Add Masters to a new masters Group 14 | add_host: 15 | name: "{{ item }}" 16 | groups: masters 17 | with_items: "{{ groups['nodes'] }}" 18 | when: 19 | - item | match("^master.*") 20 | 21 | - hosts: infranodes 22 | remote_user: ec2_user 23 | become: yes 24 | become_user: root 25 | tasks: 26 | # OpenShift Routers expose /metrics on port 1936. Therefore we need to open 27 | # the port for both future and current sessions so that Prometheus can access 28 | # the router metrics. 29 | # Open Firewall Port 1936 for future sessions by adding the rule to 30 | # the iptables file. 31 | - name: Open Firewall port 1936 for future sessions 32 | lineinfile: 33 | dest: /etc/sysconfig/iptables 34 | insertafter: '-A FORWARD -j REJECT --reject-with icmp-host-prohibited' 35 | line: '-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 1936 -j ACCEPT' 36 | state: present 37 | # Open Firewall Port 1936 for current session by adding the rule to the 38 | # current iptables configuration. We won't need to restart the iptables 39 | # service - which will ensure all OpenShift rules stay in place. 40 | - name: Open Firewall Port 1936 for current session 41 | iptables: 42 | action: insert 43 | protocol: tcp 44 | destination_port: 1936 45 | state: present 46 | chain: OS_FIREWALL_ALLOW 47 | jump: ACCEPT 48 | # Create Directory /var/lib/prometheus-data with correct permissions 49 | # Make sure the directory has SELinux Type svirt_sandbox_file_t otherwise 50 | # there is a permissions problem trying to mount it into the pod. 51 | - name: Create directory /var/lib/prometheus-data 52 | file: 53 | path: /var/lib/prometheus-data 54 | state: directory 55 | group: root 56 | owner: root 57 | mode: 0777 58 | setype: svirt_sandbox_file_t 59 | 60 | # Add label "prometheus-host=true" to our infranodes 61 | # Do that via the oc label command from the first master host 62 | - hosts: masters[0] 63 | remote_user: ec2_user 64 | become: yes 65 | become_user: root 66 | tasks: 67 | - name: Label Infranodes with prometheus-host=true 68 | shell: oc label node {{ item }} prometheus-host=true --overwrite 69 | with_items: "{{ groups['infranodes'] }}" 70 | --------------------------------------------------------------------------------