├── .gitignore
├── README.md
├── SUMMARY.md
├── alertmanager
    ├── README.md
    ├── config.md
    ├── email.md
    ├── others.md
    ├── slack.md
    ├── webhooks.md
    ├── wechat.md
    ├── wechat
    │   ├── wechat01.png
    │   ├── wechat03.png
    │   └── wechat04.png
    └── what.md
├── application
    ├── README.md
    ├── django.md
    ├── laravel.md
    ├── phoenix.md
    ├── rails.md
    └── spring.md
├── arrangement
    ├── README.md
    ├── k8s.md
    └── swarm.md
├── concepts
    ├── README.md
    ├── data-model.md
    ├── jobs-and-instances.md
    └── metric-types.md
├── configuration
    ├── README.md
    ├── alerting.md
    ├── demo.md
    ├── global.md
    ├── remote_read.md
    ├── remote_write.md
    ├── rule_files.md
    ├── scrape_configs.md
    └── server_discovery.md
├── container
    ├── README.md
    ├── docker.md
    └── k8s.md
├── demo
    ├── README.md
    ├── alertmanager.md
    ├── grafana.md
    ├── rule.md
    └── target.md
├── devops
    ├── README.md
    ├── exporter.md
    ├── product.md
    └── receiver.md
├── exporter
    ├── README.md
    ├── nodeexporter.md
    ├── nodeexporter_grafana_template.md
    ├── nodeexporter_query.md
    ├── other.md
    ├── sample.md
    └── text.md
├── ha
    ├── README.md
    ├── alertmanger.md
    ├── img1.png
    ├── img2.png
    ├── img3.png
    ├── img4.png
    ├── img5.png
    └── prometheus.md
├── how-to-contribute.md
├── images
    ├── alertmanager
    │   ├── slack-alert1.png
    │   ├── slack-alert2.png
    │   ├── slack-alert3.png
    │   ├── slack-alert4.png
    │   ├── slack-alert5.png
    │   └── slack-alert6.png
    ├── cadvisor
    │   ├── cadvisor-01.png
    │   ├── cadvisor-02.png
    │   ├── cadvisor-03.png
    │   ├── cadvisor-04.png
    │   ├── cadvisor-05.png
    │   ├── prometheus01.png
    │   ├── prometheus02.png
    │   ├── prometheus03.png
    │   ├── prometheus04.png
    │   ├── prometheus05.png
    │   ├── prometheus06.png
    │   └── prometheus07.png
    ├── exporter
    │   ├── sample_exporter_data.png
    │   └── simple_exporter.png
    ├── install
    │   ├── prometheus-console.png
    │   └── prometheus-graph.png
    ├── qa
    │   └── reload_success.png
    └── visualiztion
    │   ├── grafana-add-graph.png
    │   ├── grafana-added-panel.png
    │   ├── grafana-datasource.png
    │   ├── grafana-default-dashbord.png
    │   ├── grafana-edit-panel.png
    │   ├── grafana-hide-controls.png
    │   ├── grafana-into-manage-dashboard.png
    │   ├── grafana-login.png
    │   ├── grafana-prometheus-data-source.png
    │   ├── prometheus-web-console.png
    │   └── prometheus-web-graph.png
├── install
    ├── README.md
    ├── binary.md
    └── docker.md
├── introduction
    ├── README.md
    ├── what.md
    └── why.md
├── optimize
    ├── README.md
    ├── config.md
    ├── logger.md
    └── status.md
├── password.html
├── promql
    ├── README.md
    ├── sql.md
    └── summary.md
├── pushgateway
    ├── README.md
    ├── how.md
    └── why.md
├── qa
    ├── README.md
    ├── auth.md
    ├── hotreload.md
    ├── jvm.md
    ├── losedata.md
    ├── memory.md
    └── pushgateway.md
├── revision-record.md
├── rule
    ├── README.md
    ├── config.md
    └── what.md
├── service
    ├── README.md
    ├── memcached.md
    ├── mongodb.md
    ├── mysql.md
    ├── nginx.md
    └── redis.md
├── store
    ├── README.md
    ├── local.md
    └── remote.md
├── tools
    ├── README.md
    ├── client.md
    └── promu.md
├── v2.0
    ├── README.md
    ├── feature.md
    ├── rule.md
    └── store.md
└── visualiztion
    ├── README.md
    ├── console.md
    └── grafana.md


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Node rules:
 2 | ## Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
 3 | .grunt
 4 | 
 5 | ## Dependency directory
 6 | ## Commenting this out is preferred by some people, see
 7 | ## https://docs.npmjs.com/misc/faq#should-i-check-my-node_modules-folder-into-git
 8 | node_modules
 9 | 
10 | # Book build output
11 | _book
12 | 
13 | # eBook build output
14 | *.epub
15 | *.mobi
16 | *.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
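 1 | # Prometheus in Practice
 2 | 
 3 | v0.1.0
 4 | 
 5 | Over the past year or so, we have used Prometheus to build infrastructure and business monitoring for several data centers, which has greatly improved our service quality and on-call practice. We are especially grateful for excellent open-source software like Prometheus.
 6 | 
 7 | Choosing Prometheus was no accident:
 8 | 
 9 | * Prometheus is built on the principles of Google's SRE practice, which makes it both practical and forward-looking.
10 | 
11 | * The Prometheus community is very active and ships roughly one release per month. From v1.01, when we first started using it in 2016, through v1.8.2 and the latest v2.1, Prometheus has kept improving and optimizing.
12 | 
13 | * It is written in Go, performs well, is simple to install and deploy, and runs well across platforms.
14 | 
15 | * It has a rich set of data collection clients, and the project officially provides exporters for many common systems.
16 | 
17 | * It has rich and powerful query capabilities.
18 | 
19 | As a relative newcomer to monitoring, Prometheus still has rough edges, but that does not stop us from using and loving it. In our long-term experience it satisfies most scenarios; as with any new tool, it simply takes more effort to get the most out of it.
20 | 
21 | This book is mainly a summary of our own experience over the past year or more. It covers Prometheus basics, advanced topics, hands-on practice, and a list of common questions. We hope it helps.
22 | 
23 | This open-source book is suitable for ops beginners with basic Linux knowledge and also serves as a reference for advanced users who want to understand Prometheus's principles and implementation details. We also hope the practical cases in the book help you when you deploy monitoring for real.
24 | 
25 | Are you ready? Let's begin this journey together!
26 | 
27 | ## Community
28 | 
29 | You are welcome to join the Prometheus QQ group to share Prometheus resources and discuss Prometheus technology.
30 | 
31 | * QQ group: 465362780 (full). When applying to join, please add the note: prometheus 实战
32 | * Join our Knowledge Planet (知识星球), where you can exchange ideas and learn with the most professional Prometheus practitioners in China: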
33 | 
34 |   ![image_228845211141_3](https://user-images.githubusercontent.com/1459834/41642276-7101ce02-749a-11e8-8da0-6cf702af0870.JPG)
35 | 
36 | ## About the Authors
37 | 
38 | * small_fish__
39 | 
40 |   * [Weibo](https://weibo.com/songjiayang1)
41 |   * [github](https://github.com/songjiayang)
42 |   * Personal WeChat official account
43 |   
44 |   ![第二范式](https://git.io/vAQvJ)
45 |  
46 | * 薛锦
47 | 
48 |   * [Weibo](https://weibo.com/1660913012/profile?topnav=1&wvr=6)
49 |   * [github](https://github.com/csxuejin)
50 |   * Personal WeChat official account
51 | 
52 |   ![GitHub Logo](https://songjiayang.gitbooks.io/go-basic-courses/content/pics/easy-hacking.jpg)
53 |   
54 | ## GitHub Stargazers over Time
55 | 
56 | [![Stargazers over time](https://starcharts.herokuapp.com/songjiayang/prometheus_practice.svg)](https://starcharts.herokuapp.com/songjiayang/prometheus_practice)
57 |   
58 | 


--------------------------------------------------------------------------------
/SUMMARY.md:
--------------------------------------------------------------------------------
  1 | # Summary
  2 | 
  3 | * [Preface](README.md)
  4 | * [Revision History](revision-record.md)
  5 | * [How to Contribute](how-to-contribute.md)
  6 | * [Introduction to Prometheus](introduction/README.md)
  7 |     * [What Is Prometheus](introduction/what.md)
  8 |     * [Why Choose Prometheus](introduction/why.md)
  9 | * [Installing Prometheus](install/README.md)
 10 |   * [Installing from Binary Packages](install/binary.md)
 11 |   * [Installing with Docker](install/docker.md)
 12 | * [Basic Concepts](concepts/README.md)
 13 |   * [Data Model](concepts/data-model.md)
 14 |   * [Metric types](concepts/metric-types.md)
 15 |   * [Jobs and Instances](concepts/jobs-and-instances.md)
 16 | * [PromQL](promql/README.md)
 17 |   * [PromQL Basics](promql/summary.md)
 18 |   * [Comparison with SQL](promql/sql.md)
 19 | * [Data Visualization](visualiztion/README.md)
 20 |   * [Web Console](visualiztion/console.md)
 21 |   * [Grafana](visualiztion/grafana.md)
 22 | * [Prometheus Configuration](configuration/README.md)
 23 |   * [Global Configuration](configuration/global.md)
 24 |   * [Alerting Configuration](configuration/alerting.md)
 25 |   * [Rule Configuration](configuration/rule_files.md)
 26 |   * [Scrape Configuration](configuration/scrape_configs.md)
 27 |   * [Remote Write Storage](configuration/remote_write.md)
 28 |   * [Remote Read Storage](configuration/remote_read.md)
 29 |   * [Service Discovery](configuration/server_discovery.md)
 30 |   * [Configuration Example](configuration/demo.md)
 31 | * [Exporter](exporter/README.md)
 32 |   * [Text Format](exporter/text.md)
 33 |   * [Sample Exporter](exporter/sample.md)
 34 |   * [Installing and Using Node Exporter](exporter/nodeexporter.md)
 35 |   * [Common Node Exporter Queries](exporter/nodeexporter_query.md)
 36 |   * [Node Exporter Grafana Templates](exporter/nodeexporter_grafana_template.md)
 37 |   * [Other Exporters](exporter/other.md)
 38 | * [Pushgateway](pushgateway/README.md)
 39 |     * [What Is Pushgateway](pushgateway/why.md)
 40 |     * [How to Use Pushgateway?](pushgateway/how.md)
 41 | * [Data Storage](store/README.md)
 42 |     * [Local Store](store/local.md)
 43 |     * [Remote Store](store/remote.md)
 44 | * [Alerting Rules](rule/README.md)
 45 |     * [How to Configure](rule/config.md)
 46 |     * [Trigger Logic](rule/what.md)
 47 | * [Alertmanager](alertmanager/README.md)
 48 |     * [What Is Alertmanager](alertmanager/what.md)
 49 |     * [Configuration Details](alertmanager/config.md)
 50 |     * [Receiving Alerts via Email](alertmanager/email.md)
 51 |     * [Receiving Alerts via WeChat Work](alertmanager/wechat.md)
 52 |     * [Receiving Alerts via Slack](alertmanager/slack.md)
 53 |     * [Receiving Alerts via Webhook](alertmanager/webhooks.md)
 54 |     * [Other Alert Receivers](alertmanager/others.md)
 55 | * [Complete Host Monitoring Example](demo/README.md)
 56 |     * [NodeExporter](demo/target.md)
 57 |     * [Configuring Alerting Rules](demo/rule.md)
 58 |     * [Grafana Integration](demo/grafana.md)
 59 |     * [Alerting via Alertmanager](demo/alertmanager.md)
 60 | * [Prometheus Tools](tools/README.md)
 61 |     * [Introducing and Using Promu](tools/promu.md)
 62 |     * [Client SDK](tools/client.md)
 63 | * [Prometheus Performance Tuning](optimize/README.md)
 64 |     * [Checking Prometheus Status via Metrics](optimize/status.md)
 65 |     * [Analyzing Prometheus Status via Logs](optimize/logger.md)
 66 |     * [Startup Flags Explained](optimize/config.md)
 67 | * [Prometheus and Containers](container/README.md)
 68 |     * [Docker](container/docker.md)
 69 |     * [Kubernetes](container/k8s.md)
 70 | * [Monitoring Common Services with Prometheus](service/README.md)
 71 |     * [Nginx](service/nginx.md)
 72 |     * [Memcached](service/memcached.md)
 73 |     * [MongoDB](service/mongodb.md)
 74 |     * [MySQL](service/mysql.md)
 75 |     * [Redis](service/redis.md)
 76 | * [Monitoring Common Applications with Prometheus](application/README.md)
 77 |     * [Spring boot](application/spring.md)
 78 |     * [Rails](application/rails.md)
 79 |     * [Django](application/django.md)
 80 |     * [Laravel](application/laravel.md)
 81 |     * [Phoenix](application/phoenix.md)
 82 | * [Prometheus and DevOps](devops/README.md)
 83 |     * [Developing an Exporter from Scratch](devops/exporter.md)
 84 |     * [Developing an Alert Receiver with Webhooks](devops/receiver.md)
 85 | * [High Availability Discussion](ha/README.md)
 86 |     * [High Reliability for Prometheus Server](ha/prometheus.md)
 87 |     * [High Reliability for AlertManager](ha/alertmanger.md)
 88 | * [v2.x Migration Notes](v2.0/README.md)
 89 |     * [New Features](v2.0/feature.md)
 90 |     * [New Storage Architecture](v2.0/store.md)
 91 |     * [New Rule Configuration](v2.0/rule.md)
 92 | * [FAQ](qa/README.md)
 93 |     * [How to Hot-Reload a New Configuration](qa/hotreload.md)
 94 |     * [Why Can't Data Be Queried After Restarting Prometheus](qa/hotreload.md)
 95 |     * [How to Delete Data from Pushgateway](qa/pushgateway.md)
 96 |     * [Why Is Memory Usage So High](qa/memory.md)
 97 |     * [Why Is Data Missing](qa/losedata.md)
 98 |     * [How to Scrape Data Behind Authentication](qa/auth.md)
 99 |     * [Monitoring the JVM](qa/jvm.md)
100 | 


--------------------------------------------------------------------------------
/alertmanager/README.md:
--------------------------------------------------------------------------------
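 1 | # Alertmanager
 2 | 
 3 | Alerting in Prometheus is split into two parts:
 4 | 
 5 | - The Prometheus server sends alerts to Alertmanager according to the configured alerting rules.
 6 | - Alertmanager processes the alerts it receives: it deduplicates them, reduces noise, groups them, and routes notifications by policy.
 7 | 
 8 | The main steps for using the alerting service are:
 9 | 
10 | - Download and configure Alertmanager.
11 | - Set `-alertmanager.url` so that the Prometheus server can communicate with Alertmanager (in Prometheus 2.x this is done in the `alerting` section of the configuration file instead).
12 | - Configure alerting rules in the Prometheus server.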
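
To make deduplication and grouping more concrete, here is a minimal Python sketch. It is not Alertmanager's implementation: the `fingerprint` and `group_alerts` helpers and the sample alerts are hypothetical, and it assumes that two alerts with identical label sets count as duplicates and that groups are keyed by the values of the `group_by` labels.

```python
from collections import defaultdict


def fingerprint(labels):
    """Identify an alert by its full label set (hypothetical helper)."""
    return tuple(sorted(labels.items()))


def group_alerts(alerts, group_by):
    """Drop duplicate alerts, then bucket the rest by their group_by label values."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fp = fingerprint(alert["labels"])
        if fp in seen:  # same label set already seen: treat as a duplicate
            continue
        seen.add(fp)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Illustrative alerts; the label values are made up for this sketch.
alerts = [
    {"labels": {"alertname": "InstanceDown", "instance": "a:9100"}},
    {"labels": {"alertname": "InstanceDown", "instance": "a:9100"}},  # duplicate
    {"labels": {"alertname": "InstanceDown", "instance": "b:9100"}},
    {"labels": {"alertname": "HighMemory", "instance": "a:9100"}},
]
groups = group_alerts(alerts, ["alertname"])
for key, members in groups.items():
    print(key, len(members))
```

Each group would then produce one notification instead of one per alert, which is the point of grouping.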


--------------------------------------------------------------------------------
/alertmanager/config.md:
--------------------------------------------------------------------------------
 1 | # Configuration Details
 2 | 
 3 | ### Global configuration
 4 | 
 5 | ### Grouping
 6 | 
 7 | ### Deduplication
 8 | 
 9 | ### Noise reduction
10 | 
11 | ### Alert routing configuration


--------------------------------------------------------------------------------
/alertmanager/email.md:
--------------------------------------------------------------------------------
 1 | # Receiving Alerts via Email
 2 | 
 3 | This chapter walks through a simple experiment that shows how to receive alerts by email.
 4 | 
 5 | Environment:
 6 | - Prometheus version: prometheus-1.7.1.darwin-amd64
 7 | - Alertmanager version: alertmanager-0.8.0.darwin-amd64
 8 | - Mailbox used to send alert emails: QQ Mail
 9 | 
10 | The experiment is assumed to run on the local machine, with Prometheus on its default port 9090 and Alertmanager on its default port 9093.
11 | 
12 | ### Edit the Alertmanager configuration file
13 | 
14 | The key settings are:
15 | 
16 | ```
17 | global:
18 |   smtp_smarthost: 'smtp.qq.com:587'
19 |   smtp_from: 'xxx@qq.com'
20 |   smtp_auth_username: 'xxx@qq.com'
21 |   smtp_auth_password: 'your_email_password'
22 | 
23 | route:
24 |   # If an alert has successfully been sent, wait 'repeat_interval' to resend them.
25 |   repeat_interval: 10s
26 |   #  A default receiver
27 |   receiver: team-X-mails
28 | 
29 | receivers:
30 |   - name: 'team-X-mails'
31 |     email_configs:
32 |     - to: 'team-X+alerts@example.org'
33 | ```
34 | 
35 | ### Add an alert.rules file under prometheus
36 | 
37 | Write the following simple rule into the file as an example.
38 | ```
39 | ALERT memory_high
40 |   IF prometheus_local_storage_memory_series >= 0
41 |   FOR 15s
42 |   ANNOTATIONS {
43 |     summary = "Prometheus using more memory than it should {{ $labels.instance }}",
44 |     description = "{{ $labels.instance }} has lots of memory man (current value: {{ $value }}s)",
45 |   }
46 | ```
47 | 
48 | ### Edit the prometheus.yml file
49 | 
50 | Add the following rule file entry:
51 | ```
52 | rule_files:
53 |   - "alert.rules"
54 | ```
55 | 
56 | ### Start the Alertmanager service
57 | ```
58 | ./alertmanager -config.file=simple.yml
59 | ```
60 | 
61 | ### Start the Prometheus service
62 | ```
63 | ./prometheus -alertmanager.url=http://localhost:9093
64 | ```
65 | 
66 | With the steps above in place, "team-X+alerts@example.org" should now receive the alert emails sent from "xxx@qq.com".
67 | 


--------------------------------------------------------------------------------
/alertmanager/others.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/alertmanager/others.md


--------------------------------------------------------------------------------
/alertmanager/slack.md:
--------------------------------------------------------------------------------
  1 | # Receiving Alerts via Slack
  2 | 
  3 | [Slack](https://slack.com) is a simple, easy-to-use workplace IM tool that is especially popular outside China. This chapter shows how to use it to receive Prometheus alerts and make our operations work look more polished.
  4 | 
  5 | ### Goals
  6 | 
  7 | - Receive messages in Slack.
  8 | - Messages carry a URL that jumps to the corresponding Prometheus graph query page.
  9 | - The color can be customized.
 10 | - A specific person can be @-mentioned.
 11 | 
 12 | ### Preparation
 13 | 
 14 | You have registered a Slack account and created a #test channel.
 15 | 
 16 | ### Configuration steps
 17 | 
 18 | Step1: Create an incoming webhooks app for the #test channel.
 19 | 
 20 | - Click the channel title and choose `Add an app or integration`
 21 | 
 22 | ![Add an app or integration](/images/alertmanager/slack-alert1.png)
 23 | 
 24 | - Search for `incoming webhooks` in the app store and pick the first result
 25 | 
 26 | ![incoming webhooks](/images/alertmanager/slack-alert2.png)
 27 | 
 28 | Once the app is created, copy its webhook URL; it is needed later.
 29 | 
 30 | Step2: Edit the Prometheus rules and add specific fields to ANNOTATIONS.
 31 | 
 32 | v1.x rule syntax:
 33 | 
 34 | ```
 35 | ALERT InstanceStatus
 36 |  IF up{job="node"} == 0
 37 |  FOR 15s
 38 |  LABELS {
 39 |    instance = "",
 40 |  }
 41 |  ANNOTATIONS {
 42 |    summary = "Server status",
 43 |    description = "Server has been down for longer than the threshold",
 44 |    link="http://xxx",
 45 |    color="#xxx",
 46 |    username="@sjy"
 47 |  }
 48 | ```
 49 | 
 50 | v2.x alert rule syntax:
 51 | 
 52 | ```
 53 | - alert: InstanceStatus
 54 |   expr: up{job="node"} == 0
 55 |   for: 15s
 56 |   labels:
 57 |     instance: ""
 58 |   annotations:
 59 |     summary: "Server status"
 60 |     description: "Server has been down for longer than the threshold"
 61 |     link: "http://xxx"
 62 |     color: "#xxx"
 63 |     username: "@sjy"
 64 | ```
 65 | 
 66 | Note: we added three fields, link, color, and username, to the rule's ANNOTATIONS. They represent the `message link URL`, the `message color`, and the person to `@`.
 67 | 
 68 | Step3: Edit the Alertmanager configuration.
 69 | 
 70 | Use `slack_configs` to configure Slack as the alert receiving channel:
 71 | 
 72 | ```
 73 | 
 74 | receivers:
 75 |   - name: 'slack'
 76 |     slack_configs:
 77 |       - api_url: "xxx"
 78 |         channel: "#test"
 79 |         text: "{{ range .Alerts }} {{ .Annotations.description}}\n {{end}} {{ .CommonAnnotations.username}} <{{.CommonAnnotations.link}}| click here>"
 80 |         title: "{{.CommonAnnotations.summary}}"
 81 |         title_link: "{{.CommonAnnotations.link}}"
 82 |         color: "{{.CommonAnnotations.color}}"
 83 | 
 84 | ```
 85 | 
 86 | Configuration notes:
 87 | 
 88 | - Alerts are grouped by alertname (set under `route`, which is not shown in the snippet above).
 89 | - The slack_configs settings use template statements and look up fields through CommonAnnotations.
 90 | - Besides title_link, external links can also be inserted with Slack's link markup syntax, `<http://xxx|Click here>`.
 91 | 
 92 | For more Slack settings, see [incoming-webhooks](https://api.slack.com/incoming-webhooks).
 93 | 
 94 | With the configuration above, the message we receive looks like this:
 95 | 
 96 | ![slack.alert.demo](/images/alertmanager/slack-alert5.png)
 97 | 
 98 | Click the title or Click here to jump to the Prometheus graph page:
 99 | 
100 | ![prometheus graph](/images/alertmanager/slack-alert6.png)
101 | 
102 | This is very convenient: with multiple Prometheus nodes, there is no more hassle of switching between them to run queries.
103 | 


--------------------------------------------------------------------------------
/alertmanager/webhooks.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/alertmanager/webhooks.md


--------------------------------------------------------------------------------
/alertmanager/wechat.md:
--------------------------------------------------------------------------------
 1 | # Receiving Alerts via WeChat Work
 2 | 
 3 | Alertmanager has supported WeChat Work (企业微信) out of the box since v0.12. Let's try it out.
 4 | 
 5 | ### Preparation
 6 | 
 7 | step 1: Visit the [website](https://work.weixin.qq.com/) to register a WeChat Work account (enterprise verification is not required).
 8 | 
 9 | step 2: Visit [apps](https://work.weixin.qq.com/wework_admin/frame#apps) to create a third-party application: click the `Create application` button, then fill in the application details:
10 | 
11 | ![Click to create a third-party application](wechat/wechat01.png)
12 | 
13 | ### Versions used
14 | 
15 | - prometheus: 2.0.darwin-amd64
16 | - node_exporter: 0.15.0.darwin-amd64
17 | - alertmanager: 0.14.darwin-amd64
18 | 
19 | ### Detailed configuration
20 | 
21 | #### Prometheus configuration
22 | 
23 | ```
24 | # Alertmanager configuration
25 | alerting:
26 |   alertmanagers:
27 |   - static_configs:
28 |     - targets:
29 |       - localhost:9093
30 | 
31 | rule_files:
32 |   - "rules.yml"
33 | 
34 | scrape_configs:
35 |   - job_name: 'node'
36 |     static_configs:
37 |       - targets: ['localhost:9100']
38 | ```
39 | 
40 | rules.yml:
41 | 
42 | ```
43 | groups:
44 | - name: node
45 |   rules:
46 |   - alert: server_status
47 |     expr: up{job="node"} == 0
48 |     for: 15s
49 |     annotations:
50 |       summary: "Instance {{ $labels.instance }} is down"
51 | ```
52 | 
53 | #### Alertmanager configuration
54 | 
55 | ```
56 | route:
57 |   group_by: ['alertname']
58 |   receiver: 'wechat'
59 | 
60 | receivers:
61 | - name: 'wechat'
62 |   wechat_configs:
63 |   - corp_id: 'xxx'
64 |     to_party: '1'
65 |     agent_id: '1000002'
66 |     api_secret: 'xxxx'
67 | ```
68 | 
69 | Parameters:
70 | 
71 | - corp_id: the unique ID of the WeChat Work account, shown under `My Company` (我的企业).
72 | - to_party: the group to send to.
73 | - agent_id: the ID of the third-party enterprise application, shown on the detail page of the application you created.
74 | - api_secret: the secret of the third-party enterprise application, shown on the detail page of the application you created.
75 | 
76 | For details, see the [documentation](https://work.weixin.qq.com/api/doc#10167/%E6%96%87%E6%9C%AC%E6%B6%88%E6%81%AF).
77 | 
78 | ### Verification
79 | 
80 | When we stop node_exporter, we receive the following alert:
81 | 
82 | ![wechat03.png](wechat/wechat03.png)
83 | 
84 | When we start node_exporter again, we receive the following message:
85 | 
86 | ![wechat04.png](wechat/wechat04.png)
87 | 
88 | ### Conclusion
89 | 
90 | From registration through Alertmanager configuration, WeChat Work has no real pitfalls, and its notifications are timely and rarely lost. Give it a try.
91 | 


--------------------------------------------------------------------------------
/alertmanager/wechat/wechat01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/alertmanager/wechat/wechat01.png


--------------------------------------------------------------------------------
/alertmanager/wechat/wechat03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/alertmanager/wechat/wechat03.png


--------------------------------------------------------------------------------
/alertmanager/wechat/wechat04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/alertmanager/wechat/wechat04.png


--------------------------------------------------------------------------------
/alertmanager/what.md:
--------------------------------------------------------------------------------
 1 | # What Is Alertmanager?
 2 | 
 3 | ![Architecture diagram](https://raw.githubusercontent.com/prometheus/alertmanager/4e6695682acd2580773a904e4aa2e3b927ee27b7/doc/arch.jpg)
 4 | 
 5 | Alertmanager mainly receives the alerts sent by Prometheus. It supports a rich set of notification channels and makes it easy to deduplicate alerts, reduce noise, group them, and route them by policy. It is a modern alert notification system.
 6 | 
 7 | ### Installation
 8 | 
 9 | Download the installation package with `wget`:
10 | 
11 | ```
12 | cd ~/Download
13 | wget https://github.com/prometheus/alertmanager/releases/download/v0.14.0/alertmanager-0.14.0.linux-amd64.tar.gz
14 | cd Prometheus
15 | ```
16 | 
17 | Extract alertmanager-0.14.0.linux-amd64.tar.gz with `tar`:
18 | 
19 | ```
20 | tar -xvzf ~/Download/alertmanager-0.14.0.linux-amd64.tar.gz
21 | cd alertmanager-0.14.0.linux-amd64
22 | ```
23 | 
24 | After extraction, run `./alertmanager --version` to check that the installation succeeded:
25 | 
26 | ```
27 | alertmanager, version 0.14.0 (branch: HEAD, revision: 30af4d051b37ce817ea7e35b56c57a0e2ec9dbb0)
28 |   build user:       root@37b6a49ebba9
29 |   build date:       20180213-08:16:42
30 |   go version:       go1.9.2
31 | ```
32 | 
33 | ### Basic configuration
34 | 
35 | Run `mv simple.yml alertmanager.yml`, then edit `alertmanager.yml`:
36 | 
37 | ```
38 | global:
39 |   resolve_timeout: 2h
40 | 
41 | route:
42 |   group_by: ['alertname']
43 |   group_wait: 5s
44 |   group_interval: 10s
45 |   repeat_interval: 1h
46 |   receiver: 'webhook'
47 | 
48 | receivers:
49 | - name: 'webhook'
50 |   webhook_configs:
51 |   - url: 'http://example.com/xxxx'
52 |     send_resolved: true
53 | ```
54 | 
55 | Note: here we use Alertmanager's `webhook_configs` option to receive messages. When a new alert arrives, Alertmanager forwards it to the configured `url`.
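
As a rough illustration of what might sit behind such a `url`, the following Python sketch accepts the JSON that Alertmanager POSTs and prints one line per alert. The field names (`alerts`, `status`, `labels`, `annotations`) follow the webhook payload as we understand it and should be checked against the Alertmanager version in use; the port, the handler name, and the example payload are assumptions made up for this sketch.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(payload):
    """Return one human-readable line per alert in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {}: {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "<no alertname>"),
            annotations.get("summary", ""),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver: prints each alert and answers 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize_alerts(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    """Listen for webhook POSTs; the port is an arbitrary choice."""
    HTTPServer(("", port), AlertHandler).serve_forever()


# A made-up payload shaped like the assumed webhook body.
example = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "InstanceDown", "instance": "a:9100"},
        "annotations": {"summary": "node a is down"},
    }],
}
print(summarize_alerts(example))
```

Calling `serve()` starts the listener; the `url` in `webhook_configs` would then point at that host and port.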
56 | 


--------------------------------------------------------------------------------
/application/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/application/README.md


--------------------------------------------------------------------------------
/application/django.md:
--------------------------------------------------------------------------------
1 | # Prometheus and Django


--------------------------------------------------------------------------------
/application/laravel.md:
--------------------------------------------------------------------------------
1 | # Prometheus and Laravel


--------------------------------------------------------------------------------
/application/phoenix.md:
--------------------------------------------------------------------------------
1 | # Prometheus and Phoenix


--------------------------------------------------------------------------------
/application/rails.md:
--------------------------------------------------------------------------------
1 | # Prometheus and Rails


--------------------------------------------------------------------------------
/application/spring.md:
--------------------------------------------------------------------------------
1 | # Prometheus and Spring boot


--------------------------------------------------------------------------------
/arrangement/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/arrangement/README.md


--------------------------------------------------------------------------------
/arrangement/k8s.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/arrangement/k8s.md


--------------------------------------------------------------------------------
/arrangement/swarm.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/arrangement/swarm.md


--------------------------------------------------------------------------------
/concepts/README.md:
--------------------------------------------------------------------------------
1 | # Basic Concepts
2 | 
3 | This chapter introduces some basic Prometheus concepts:
4 | - The data model
5 | - The four metric types
6 | - Jobs and instances
7 | 


--------------------------------------------------------------------------------
/concepts/data-model.md:
--------------------------------------------------------------------------------
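 1 | # Data Model
 2 | 
 3 | Prometheus stores [time series](https://en.wikipedia.org/wiki/Time_series) data: for each series (identified by the same name and labels), a sequence of values ordered by time.
 4 | 
 5 | ## Time series index
 6 | 
 7 | A time series is defined by a name (the metric) and a set of key/value labels. Samples with the same name and the same labels belong to the same series.
 8 | 
 9 | A metric name consists of ASCII letters, digits, underscores, and colons, and must match the regular expression `[a-zA-Z_:][a-zA-Z0-9_:]*`. The name should be meaningful and usually denotes something measurable; for example, `http_requests_total` can represent the total number of HTTP requests.
10 | 
11 | Labels enrich Prometheus data and distinguish specific instances; for example, `http_requests_total{method="POST"}` represents all HTTP requests that use POST.
12 | 
13 | Label names consist of ASCII letters, digits, and underscores. Names beginning with `__` are reserved by Prometheus. Label values may contain any Unicode characters, including Chinese.
14 | 
15 | ## Samples
16 | 
17 | The data collected over time for a series are called samples. Each sample consists of:
18 | 
19 | * a float64 value
20 | * a millisecond-precision Unix timestamp
21 | 
22 | ## Notation
23 | 
24 | The notation for a Prometheus time series is similar to that of [OpenTSDB](http://opentsdb.net/):
25 | 
26 | ```
27 | <metric name>{<label name>=<label value>, ...}
28 | ```
29 | 
30 | It consists of the metric name and the series' labels.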
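
The naming rules and the notation can be checked mechanically. The Python sketch below is only an illustration: the helper names are hypothetical, the label-name pattern is written from the rule stated in this chapter (letters, digits, underscores, no leading digit), and label values are not escaped.

```python
import re

# Patterns taken from the naming rules described in this chapter.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name):
    # Names starting with "__" are reserved, so user labels may not use them.
    return LABEL_NAME_RE.fullmatch(name) is not None and not name.startswith("__")


def format_series(metric, labels):
    """Render a series as <metric name>{<label name>="<label value>", ...}."""
    if not is_valid_metric_name(metric):
        raise ValueError("invalid metric name: %r" % metric)
    for name in labels:
        if not is_valid_label_name(name):
            raise ValueError("invalid label name: %r" % name)
    if not labels:
        return metric
    pairs = ", ".join('{}="{}"'.format(k, v) for k, v in sorted(labels.items()))
    return metric + "{" + pairs + "}"


print(format_series("http_requests_total", {"method": "POST"}))
print(is_valid_metric_name("2xx_total"))  # starts with a digit
```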
31 | 


--------------------------------------------------------------------------------
/concepts/jobs-and-instances.md:
--------------------------------------------------------------------------------
 1 | # Jobs and Instances
 2 | 
 3 | In Prometheus, any individual data source (target) is called an instance. A collection of instances of the same type is called a job.
 4 | Below is a job with four replicated instances:
 5 | 
 6 | ```
 7 | - job: api-server
 8 |     - instance 1: 1.2.3.4:5670
 9 |     - instance 2: 1.2.3.4:5671
10 |     - instance 3: 5.6.7.8:5670
11 |     - instance 4: 5.6.7.8:5671
12 | ```
13 | 
14 | ## Automatically generated labels and time series
15 | 
16 | When Prometheus scrapes data, it automatically attaches labels to each series to identify the data source (target):
17 | ```
18 | job: The configured job name that the target belongs to.
19 | instance: The <host>:<port> part of the target's URL that was scraped.
20 | ```
21 | 
22 | If either label is already present in the scraped data, the `honor_labels` setting determines how the conflict is resolved. See the official [scrape configuration documentation](https://prometheus.io/docs/operating/configuration/#%3Cscrape_config%3E).
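
A small Python sketch of that conflict handling, under the behavior the official documentation describes: with `honor_labels` off, a conflicting scraped label is kept under an `exported_` prefix and the target's value wins; with it on, the scraped value is kept. The function name and sample labels are hypothetical.

```python
def attach_target_labels(scraped, job, instance, honor_labels=False):
    """Merge the target's job/instance labels into one scraped label set."""
    target = {"job": job, "instance": instance}
    result = dict(scraped)
    for name, value in target.items():
        if name in scraped:
            if honor_labels:
                continue  # keep the value that came from the scraped data
            # Preserve the scraped value under an "exported_" prefix.
            result["exported_" + name] = scraped[name]
        result[name] = value
    return result


# No conflict: the target labels are simply added.
print(attach_target_labels({"method": "GET"}, "api-server", "1.2.3.4:5670"))
# Conflict on "job": the outcome depends on honor_labels.
print(attach_target_labels({"job": "batch"}, "pushgateway", "pg:9091"))
print(attach_target_labels({"job": "batch"}, "pushgateway", "pg:9091", honor_labels=True))
```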
23 | 
24 | For each instance, Prometheus stores samples in the following time series:
25 | 
26 | ```
27 | up{job="<job-name>", instance="<instance-id>"}: 1 means the instance is working normally
28 | up{job="<job-name>", instance="<instance-id>"}: 0 means the instance is down
29 | 
30 | scrape_duration_seconds{job="<job-name>", instance="<instance-id>"} is the duration of the scrape
31 | 
32 | scrape_samples_post_metric_relabeling{job="<job-name>", instance="<instance-id>"} is the number of samples remaining after metric relabeling is applied
33 | 
34 | scrape_samples_scraped{job="<job-name>", instance="<instance-id>"} is the number of samples obtained from the target
35 | ```
36 | 
37 | The `up` series is especially useful for monitoring whether an instance is working.
38 | 
39 | 
40 | 


--------------------------------------------------------------------------------
/concepts/metric-types.md:
--------------------------------------------------------------------------------
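 1 | # The Four Metric Types
 2 | 
 3 | Prometheus metrics fall into four types: [Counter](https://prometheus.io/docs/concepts/metric_types/#counter), [Gauge](https://prometheus.io/docs/concepts/metric_types/#gauge), [Histogram](https://prometheus.io/docs/concepts/metric_types/#histogram), and [Summary](https://prometheus.io/docs/concepts/metric_types/#summary).
 4 | 
 5 | 
 6 | ## Counter
 7 | 
 8 | A Counter is a cumulative value that only ever increases; it goes back to zero only when the process restarts. It is typically used to record totals such as the number of requests served or the number of errors.
 9 | 
10 | For example, `http_requests_total` in the Prometheus server is the total number of HTTP requests Prometheus has handled. With `rate` or `increase` we can easily obtain the growth over any time range; this is covered in detail in the PromQL chapter.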
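
The idea behind computing a counter's growth over a window can be sketched in a few lines of Python. This is a simplified illustration, not PromQL's implementation: it assumes any drop in value means the counter restarted from zero, and it does not extrapolate to the window boundaries as PromQL's `increase` does. The function name and sample values are made up.

```python
def counter_increase(samples):
    """Total increase across consecutive samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # the counter restarted, so it has grown from 0 to cur
    return total


# The drop from 160 to 5 is treated as a process restart.
print(counter_increase([100, 130, 160, 5, 25]))  # 30 + 30 + 5 + 20 = 85.0
```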
11 | 
12 | ## Gauge
13 | 
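14 | A Gauge is an instantaneous value that can go up and down arbitrarily. It is typically used to record values such as memory usage or disk usage.
15 | 
16 | For example, `go_goroutines` in the Prometheus server is the current number of goroutines in Prometheus.
17 | 
18 | ## Histogram
19 | 
20 | A Histogram consists of `<basename>_bucket{le="<upper inclusive bound>"}`, `<basename>_bucket{le="+Inf"}`, `<basename>_sum`, and `<basename>_count`. It samples observations over a period of time (typically request durations or response sizes) and counts them both per configured bucket and in total. Its data is usually displayed as a histogram.
21 | 
22 | For example, `prometheus_local_storage_series_chunks_persisted` in the Prometheus server is the number of chunks that need to be persisted for each series; we can use it to compute quantiles of the data awaiting persistence.
23 | 
24 | ## Summary
25 | 
26 | A Summary is similar to a Histogram and consists of `<basename>{quantile="<φ>"}`, `<basename>_sum`, and `<basename>_count`. It also represents observations sampled over a period of time (typically request durations or response sizes), but it stores the quantile values directly instead of computing them from buckets.
27 | 
28 | For example, `prometheus_target_interval_length_seconds` in the Prometheus server.
29 | 
30 | ## Histogram vs Summary
31 | 
32 | - Both include `<basename>_sum` and `<basename>_count`.
33 | - A Histogram computes quantiles from `<basename>_bucket`, whereas a Summary stores the quantile values directly.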
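
To show what computing a quantile from `<basename>_bucket` involves, here is a Python sketch of the usual approach: find the bucket that contains the target rank and interpolate linearly inside it. It is an approximation written for illustration, assuming cumulative bucket counts, a lower bound of 0 for the first bucket, and a final `+Inf` bucket; the function name and bucket values are hypothetical.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, with the last bound being float("inf").
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # The +Inf bucket has no width, so fall back to the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up request-duration buckets: le=0.1, le=0.5, le=1.0, le=+Inf.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.9, buckets))
print(histogram_quantile(0.95, buckets))
```

The estimate is only as precise as the bucket layout, which is the trade-off against a Summary's precomputed quantiles.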
34 | 


--------------------------------------------------------------------------------
/configuration/README.md:
--------------------------------------------------------------------------------
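 1 | # Configuration
 2 | 
 3 | When Prometheus starts, the `-config.file` flag specifies the configuration file to load; the default is `prometheus.yml`.
 4 | 
 5 | In the configuration file we can set properties such as global, alerting, rule_files, scrape_configs, remote_write, and remote_read.
 6 | 
 7 | The corresponding Go struct is defined as: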
 8 | 
 9 | ```
10 | // Config is the top-level configuration for Prometheus's config files.
11 | type Config struct {
12 | 	GlobalConfig   GlobalConfig    `yaml:"global"`
13 | 	AlertingConfig AlertingConfig  `yaml:"alerting,omitempty"`
14 | 	RuleFiles      []string        `yaml:"rule_files,omitempty"`
15 | 	ScrapeConfigs  []*ScrapeConfig `yaml:"scrape_configs,omitempty"`
16 | 
17 | 	RemoteWriteConfigs []*RemoteWriteConfig `yaml:"remote_write,omitempty"`
18 | 	RemoteReadConfigs  []*RemoteReadConfig  `yaml:"remote_read,omitempty"`
19 | 
20 | 	// Catches all undefined fields and must be empty after parsing.
21 | 	XXX map[string]interface{} `yaml:",inline"`
22 | 
23 | 	// original is the input from which the config was parsed.
24 | 	original string
25 | }
26 | ```
27 | 
28 | The configuration file is structured roughly as follows:
29 | 
30 | ```
31 | global:
32 |   # How frequently to scrape targets by default.
33 |   [ scrape_interval: <duration> | default = 1m ]
34 | 
35 |   # How long until a scrape request times out.
36 |   [ scrape_timeout: <duration> | default = 10s ]
37 | 
38 |   # How frequently to evaluate rules.
39 |   [ evaluation_interval: <duration> | default = 1m ]
40 | 
41 |   # The labels to add to any time series or alerts when communicating with
42 |   # external systems (federation, remote storage, Alertmanager).
43 |   external_labels:
44 |     [ <labelname>: <labelvalue> ... ]
45 | 
46 | # Rule files specifies a list of globs. Rules and alerts are read from
47 | # all matching files.
48 | rule_files:
49 |   [ - <filepath_glob> ... ]
50 | 
51 | # A list of scrape configurations.
52 | scrape_configs:
53 |   [ - <scrape_config> ... ]
54 | 
55 | # Alerting specifies settings related to the Alertmanager.
56 | alerting:
57 |   alert_relabel_configs:
58 |     [ - <relabel_config> ... ]
59 |   alertmanagers:
60 |     [ - <alertmanager_config> ... ]
61 | 
62 | # Settings related to the experimental remote write feature.
63 | remote_write:
64 |   [ - <remote_write> ... ]
65 | 
66 | # Settings related to the experimental remote read feature.
67 | remote_read:
68 |   [ - <remote_read> ... ]
69 | ```
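
The `<duration>` placeholders and their defaults can be made concrete with a short Python sketch. It assumes durations are a whole number followed by a single unit (`ms`, `s`, `m`, `h`, `d`, `w`, `y`), which matches the simple values used in this book; the helper names are hypothetical and this is not how Prometheus itself parses its configuration.

```python
import re

# Seconds per unit; a year is taken as 365 days in this sketch.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
_DURATION_RE = re.compile(r"([0-9]+)(ms|[smhdwy])")

# Defaults as listed in the global section above.
GLOBAL_DEFAULTS = {
    "scrape_interval": "1m",
    "scrape_timeout": "10s",
    "evaluation_interval": "1m",
}


def parse_duration(text):
    """Convert a duration string such as '15s' or '1m' into seconds."""
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a valid duration: %r" % text)
    return int(match.group(1)) * _UNITS[match.group(2)]


def effective_global(user_global):
    """Apply the defaults, then return each duration setting in seconds."""
    merged = {key: user_global.get(key, default) for key, default in GLOBAL_DEFAULTS.items()}
    return {key: parse_duration(value) for key, value in merged.items()}


print(effective_global({"scrape_interval": "15s"}))
```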
70 | 


--------------------------------------------------------------------------------
/configuration/alerting.md:
--------------------------------------------------------------------------------
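  1 | # Alerting Configuration
  2 | 
  3 | Alertmanager can usually be configured with the `-alertmanager.xxx` command-line flags, but that approach is not flexible: the settings cannot be reloaded dynamically, and alert attributes cannot be defined dynamically.
  4 | 
  5 | The `alerting` configuration addresses this and manages Alertmanager more effectively. It has two main parameters:
  6 | 
  7 | - alert_relabel_configs: rules for dynamically rewriting alert attributes.
  8 | - alertmanagers: configuration for dynamically discovering Alertmanager instances.
  9 | 
 10 | The corresponding Go struct is defined as: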
 11 | 
 12 | ```
 13 | // AlertingConfig configures alerting and alertmanager related configs.
 14 | type AlertingConfig struct {
 15 | 	AlertRelabelConfigs []*RelabelConfig      `yaml:"alert_relabel_configs,omitempty"`
 16 | 	AlertmanagerConfigs []*AlertmanagerConfig `yaml:"alertmanagers,omitempty"`
 17 | 
 18 | 	// Catches all undefined fields and must be empty after parsing.
 19 | 	XXX map[string]interface{} `yaml:",inline"`
 20 | }
 21 | ```
 22 | 
 23 | The configuration file structure looks roughly like this:
 24 | 
 25 | ```
 26 | # Alerting specifies settings related to the Alertmanager.
 27 | alerting:
 28 |   alert_relabel_configs:
 29 |     [ - <relabel_config> ... ]
 30 |   alertmanagers:
 31 |     [ - <alertmanager_config> ... ]
 32 | ```
 33 | 
 34 | Here `alertmanagers` is a list of `alertmanager_config` entries, and the Go struct for `alertmanager_config` is:
 35 | 
 36 | ```
 37 | // AlertmanagerConfig configures how Alertmanagers can be discovered and communicated with.
 38 | type AlertmanagerConfig struct {
 39 | 	// We cannot do proper Go type embedding below as the parser will then parse
 40 | 	// values arbitrarily into the overflow maps of further-down types.
 41 | 
 42 | 	ServiceDiscoveryConfig ServiceDiscoveryConfig `yaml:",inline"`
 43 | 	HTTPClientConfig       HTTPClientConfig       `yaml:",inline"`
 44 | 
 45 | 	// The URL scheme to use when talking to Alertmanagers.
 46 | 	Scheme string `yaml:"scheme,omitempty"`
 47 | 	// Path prefix to add in front of the push endpoint path.
 48 | 	PathPrefix string `yaml:"path_prefix,omitempty"`
 49 | 	// The timeout used when sending alerts.
 50 | 	Timeout time.Duration `yaml:"timeout,omitempty"`
 51 | 
 52 | 	// List of Alertmanager relabel configurations.
 53 | 	RelabelConfigs []*RelabelConfig `yaml:"relabel_configs,omitempty"`
 54 | 
 55 | 	// Catches all undefined fields and must be empty after parsing.
 56 | 	XXX map[string]interface{} `yaml:",inline"`
 57 | }
 58 | ```
 59 | 
 60 | The configuration file structure looks roughly like this:
 61 | 
 62 | ```
 63 | # Per-target Alertmanager timeout when pushing alerts.
 64 | [ timeout: <duration> | default = 10s ]
 65 | 
 66 | # Prefix for the HTTP path alerts are pushed to.
 67 | [ path_prefix: <path> | default = / ]
 68 | 
 69 | # Configures the protocol scheme used for requests.
 70 | [ scheme: <scheme> | default = http ]
 71 | 
 72 | # Sets the `Authorization` header on every request with the
 73 | # configured username and password.
 74 | basic_auth:
 75 |   [ username: <string> ]
 76 |   [ password: <string> ]
 77 | 
 78 | # Sets the `Authorization` header on every request with
 79 | # the configured bearer token. It is mutually exclusive with `bearer_token_file`.
 80 | [ bearer_token: <string> ]
 81 | 
 82 | # Sets the `Authorization` header on every request with the bearer token
 83 | # read from the configured file. It is mutually exclusive with `bearer_token`.
 84 | [ bearer_token_file: /path/to/bearer/token/file ]
 85 | 
 86 | # Configures the scrape request's TLS settings.
 87 | tls_config:
 88 |   [ <tls_config> ]
 89 | 
 90 | # Optional proxy URL.
 91 | [ proxy_url: <string> ]
 92 | 
 93 | # List of Azure service discovery configurations.
 94 | azure_sd_configs:
 95 |   [ - <azure_sd_config> ... ]
 96 | 
 97 | # List of Consul service discovery configurations.
 98 | consul_sd_configs:
 99 |   [ - <consul_sd_config> ... ]
100 | 
101 | # List of DNS service discovery configurations.
102 | dns_sd_configs:
103 |   [ - <dns_sd_config> ... ]
104 | 
105 | # List of EC2 service discovery configurations.
106 | ec2_sd_configs:
107 |   [ - <ec2_sd_config> ... ]
108 | 
109 | # List of file service discovery configurations.
110 | file_sd_configs:
111 |   [ - <file_sd_config> ... ]
112 | 
113 | # List of GCE service discovery configurations.
114 | gce_sd_configs:
115 |   [ - <gce_sd_config> ... ]
116 | 
117 | # List of Kubernetes service discovery configurations.
118 | kubernetes_sd_configs:
119 |   [ - <kubernetes_sd_config> ... ]
120 | 
121 | # List of Marathon service discovery configurations.
122 | marathon_sd_configs:
123 |   [ - <marathon_sd_config> ... ]
124 | 
125 | # List of AirBnB's Nerve service discovery configurations.
126 | nerve_sd_configs:
127 |   [ - <nerve_sd_config> ... ]
128 | 
129 | # List of Zookeeper Serverset service discovery configurations.
130 | serverset_sd_configs:
131 |   [ - <serverset_sd_config> ... ]
132 | 
133 | # List of Triton service discovery configurations.
134 | triton_sd_configs:
135 |   [ - <triton_sd_config> ... ]
136 | 
137 | # List of labeled statically configured Alertmanagers.
138 | static_configs:
139 |   [ - <static_config> ... ]
140 | 
141 | # List of Alertmanager relabel configurations.
142 | relabel_configs:
143 |   [ - <relabel_config> ... ]
144 | ```
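
The template above can be filled in as follows. This is only a minimal sketch: the Alertmanager address `127.0.0.1:9093` and the `replica` label are assumptions made for illustration, not values taken from this book.

```yaml
# prometheus.yml (fragment)
alerting:
  alert_relabel_configs:
    # Remove the hypothetical "replica" label from every alert before it is sent.
    - action: labeldrop
      regex: replica
  alertmanagers:
    - scheme: http          # same as the default
      path_prefix: /        # same as the default
      timeout: 10s          # same as the default
      static_configs:
        # Assumed address of a locally running Alertmanager.
        - targets: ['127.0.0.1:9093']
```

Any of the `*_sd_configs` blocks listed above could replace `static_configs` here if the Alertmanager instances should be discovered dynamically.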
145 | 


--------------------------------------------------------------------------------
/configuration/demo.md:
--------------------------------------------------------------------------------
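 1 | # Configuration Example
 2 | 
 3 | Prometheus has many configuration parameters, but the ones I use most are `global`, `rule_files`, `scrape_configs`, `static_configs`, and `relabel_configs`.
 4 | 
 5 | The configuration file I normally use looks roughly like this: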
 6 | 
 7 | ```
 8 | global:
 9 |   scrape_interval:     15s # By default, scrape targets every 15 seconds.
10 |   evaluation_interval: 15s # Evaluate rules every 15 seconds.
11 | 
12 | rule_files:
13 |   - "rules/node.rules"
14 | 
15 | scrape_configs:
16 |   - job_name: 'prometheus'
17 |     scrape_interval: 5s
18 |     static_configs:
19 |       - targets: ['localhost:9090']
20 | 
21 |   - job_name: 'node'
22 |     scrape_interval: 8s
23 |     static_configs:
24 |       - targets: ['127.0.0.1:9100', '127.0.0.12:9100']
25 | 
26 |   - job_name: 'mysqld'
27 |     static_configs:
28 |       - targets: ['127.0.0.1:9104']
29 |   - job_name: 'memcached'
30 |     static_configs:
31 |       - targets: ['127.0.0.1:9150']
32 | ```
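
The sample above does not use relabeling. As a rough sketch of how `relabel_configs` might be added to one of these jobs, the fragment below copies the host part of each target address into a separate label. The `env` and `host` label names are hypothetical and chosen only for this example.

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          env: 'dev'            # hypothetical static label
    relabel_configs:
      # __address__ holds "host:port"; keep only the host part in a "host" label.
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: host
        replacement: '$1'
```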
33 | 


--------------------------------------------------------------------------------
/configuration/global.md:
--------------------------------------------------------------------------------
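 1 | # Global Configuration
 2 | 
 3 | `global` holds the global defaults. It has four main attributes:
 4 | 
 5 | - scrape_interval: the default interval between scrapes of targets.
 6 | - scrape_timeout: the timeout for scraping a single target.
 7 | - evaluation_interval: the interval between rule evaluations.
 8 | - external_labels: extra labels attached to time series and alerts when communicating with external systems (federation, remote storage, Alertmanager).
 9 | 
10 | 
11 | Its Go struct is defined as: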
12 | 
13 | ```
14 | // GlobalConfig configures values that are used across other configuration
15 | // objects.
16 | type GlobalConfig struct {
17 | 	// How frequently to scrape targets by default.
18 | 	ScrapeInterval model.Duration `yaml:"scrape_interval,omitempty"`
19 | 	// The default timeout when scraping targets.
20 | 	ScrapeTimeout model.Duration `yaml:"scrape_timeout,omitempty"`
21 | 	// How frequently to evaluate rules by default.
22 | 	EvaluationInterval model.Duration `yaml:"evaluation_interval,omitempty"`
23 | 	// The labels to add to any timeseries that this Prometheus instance scrapes.
24 | 	ExternalLabels model.LabelSet `yaml:"external_labels,omitempty"`
25 | 
26 | 	// Catches all undefined fields and must be empty after parsing.
27 | 	XXX map[string]interface{} `yaml:",inline"`
28 | }
29 | ```
30 | 
31 | The configuration file structure looks roughly like this:
32 | 
33 | ```
34 | global:
35 |   scrape_interval:     15s # By default, scrape targets every 15 seconds.
36 |   evaluation_interval: 15s # Evaluate rules every 15 seconds.
37 |   scrape_timeout: 10s # Time out each scrape after 10 seconds (same as the built-in default).
38 | 
39 |   # Attach these labels to any time series or alerts when communicating with
40 |   # external systems (federation, remote storage, Alertmanager).
41 |   external_labels:
42 |     monitor: 'codelab-monitor'
43 | ```
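
Because these values are only defaults, an individual job can override them. The sketch below assumes two hypothetical jobs: one inherits the global values, the other sets its own. Note that a job's `scrape_timeout` must not be longer than its `scrape_interval`.

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s

scrape_configs:
  # Inherits scrape_interval=15s and scrape_timeout=10s from "global".
  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100']

  # Hypothetical slow exporter: scraped less often and given more time to answer.
  - job_name: 'slow-exporter'
    scrape_interval: 1m
    scrape_timeout: 30s
    static_configs:
      - targets: ['127.0.0.1:9200']   # assumed address
```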
44 | 


--------------------------------------------------------------------------------
/configuration/remote_read.md:
--------------------------------------------------------------------------------
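 1 | # Remote Read Storage
 2 | 
 3 | `remote_read` configures remote storage that Prometheus can read from. Its main parameters are:
 4 | 
 5 | - url: the endpoint address
 6 | - remote_timeout: the request timeout
 7 | 
 8 | Its Go struct is: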
 9 | 
10 | ```
11 | // RemoteReadConfig is the configuration for reading from remote storage.
12 | type RemoteReadConfig struct {
13 | 	URL           *URL           `yaml:"url,omitempty"`
14 | 	RemoteTimeout model.Duration `yaml:"remote_timeout,omitempty"`
15 | 
16 | 	// We cannot do proper Go type embedding below as the parser will then parse
17 | 	// values arbitrarily into the overflow maps of further-down types.
18 | 	HTTPClientConfig HTTPClientConfig `yaml:",inline"`
19 | 
20 | 	// Catches all undefined fields and must be empty after parsing.
21 | 	XXX map[string]interface{} `yaml:",inline"`
22 | }
23 | ```
24 | 
25 | A complete configuration looks roughly like this:
26 | 
27 | ```
28 | # The URL of the endpoint to query from.
29 | url: <string>
30 | 
31 | # Timeout for requests to the remote read endpoint.
32 | [ remote_timeout: <duration> | default = 30s ]
33 | 
34 | # Sets the `Authorization` header on every remote read request with the
35 | # configured username and password.
36 | basic_auth:
37 |   [ username: <string> ]
38 |   [ password: <string> ]
39 | 
40 | # Sets the `Authorization` header on every remote read request with
41 | # the configured bearer token. It is mutually exclusive with `bearer_token_file`.
42 | [ bearer_token: <string> ]
43 | 
44 | # Sets the `Authorization` header on every remote read request with the bearer token
45 | # read from the configured file. It is mutually exclusive with `bearer_token`.
46 | [ bearer_token_file: /path/to/bearer/token/file ]
47 | 
48 | # Configures the remote read request's TLS settings.
49 | tls_config:
50 |   [ <tls_config> ]
51 | 
52 | # Optional proxy URL.
53 | [ proxy_url: <string> ]
54 | ```
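
As an illustration, a filled-in entry might look like the sketch below. The endpoint URL and the credentials are placeholders invented for this example; the actual read path depends on the remote storage adapter in use.

```yaml
# prometheus.yml (fragment)
remote_read:
  - url: 'http://remote-storage.example.com/api/v1/read'   # hypothetical endpoint
    remote_timeout: 30s                                     # same as the default
    basic_auth:
      username: 'prometheus'   # placeholder
      password: 'change-me'    # placeholder
```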
55 | 
56 | Note: `remote_read` is experimental. Use it with caution, because it may change in future releases.
57 | 


--------------------------------------------------------------------------------
/configuration/remote_write.md:
--------------------------------------------------------------------------------
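 1 | # Remote Write Storage
 2 | 
 3 | `remote_write` configures remote storage that Prometheus can write to. Its main parameters are:
 4 | 
 5 | - url: the endpoint address
 6 | - remote_timeout: the request timeout
 7 | - write_relabel_configs: relabeling rules; scraped samples are relabeled before they are sent to the remote storage
 8 | 
 9 | Its Go struct is: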
10 | 
11 | ```
12 | // RemoteWriteConfig is the configuration for writing to remote storage.
13 | type RemoteWriteConfig struct {
14 | 	URL                 *URL             `yaml:"url,omitempty"`
15 | 	RemoteTimeout       model.Duration   `yaml:"remote_timeout,omitempty"`
16 | 	WriteRelabelConfigs []*RelabelConfig `yaml:"write_relabel_configs,omitempty"`
17 | 
18 | 	// We cannot do proper Go type embedding below as the parser will then parse
19 | 	// values arbitrarily into the overflow maps of further-down types.
20 | 	HTTPClientConfig HTTPClientConfig `yaml:",inline"`
21 | 
22 | 	// Catches all undefined fields and must be empty after parsing.
23 | 	XXX map[string]interface{} `yaml:",inline"`
24 | }
25 | ```
26 | 
27 | A complete configuration looks roughly like this:
28 | 
29 | ```
30 | # The URL of the endpoint to send samples to.
31 | url: <string>
32 | 
33 | # Timeout for requests to the remote write endpoint.
34 | [ remote_timeout: <duration> | default = 30s ]
35 | 
36 | # List of remote write relabel configurations.
37 | write_relabel_configs:
38 |   [ - <relabel_config> ... ]
39 | 
40 | # Sets the `Authorization` header on every remote write request with the
41 | # configured username and password.
42 | basic_auth:
43 |   [ username: <string> ]
44 |   [ password: <string> ]
45 | 
46 | # Sets the `Authorization` header on every remote write request with
47 | # the configured bearer token. It is mutually exclusive with `bearer_token_file`.
48 | [ bearer_token: <string> ]
49 | 
50 | # Sets the `Authorization` header on every remote write request with the bearer token
51 | # read from the configured file. It is mutually exclusive with `bearer_token`.
52 | [ bearer_token_file: /path/to/bearer/token/file ]
53 | 
54 | # Configures the remote write request's TLS settings.
55 | tls_config:
56 |   [ <tls_config> ]
57 | 
58 | # Optional proxy URL.
59 | [ proxy_url: <string> ]
60 | ```
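
A filled-in entry could look like the following sketch, which uses `write_relabel_configs` to keep Go runtime metrics out of the remote storage. The endpoint URL is a placeholder, and the choice of dropping `go_*` metrics is only an example.

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: 'http://remote-storage.example.com/api/v1/write'   # hypothetical endpoint
    remote_timeout: 30s                                      # same as the default
    write_relabel_configs:
      # Drop every sample whose metric name starts with "go_" before sending.
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
```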
61 | 
62 | Note: `remote_write` is experimental. Use it with caution, because it may change in future releases.
63 | 


--------------------------------------------------------------------------------
/configuration/rule_files.md:
--------------------------------------------------------------------------------
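 1 | # Rule Files
 2 | 
 3 | `rule_files` lists the rule files to load. It accepts multiple entries, and each entry may be a glob pattern that matches several files in a directory.
 4 | 
 5 | Its definition in code is: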
 6 | 
 7 | ```
 8 | RuleFiles      []string        `yaml:"rule_files,omitempty"`
 9 | ```
10 | 
11 | The configuration file structure looks roughly like this:
12 | 
13 | ```
14 | rule_files:
15 |   - "rules/node.rules"
16 |   - "rules2/*.rules"
17 | ```
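
For orientation, here is a sketch of what one of the referenced rule files might contain. It assumes the YAML rule format introduced in Prometheus 2.0 (Prometheus 1.x used a different text syntax), and the group name, alert name, and severity label are illustrative only.

```yaml
# rules/node.rules (assumed Prometheus 2.x YAML rule format)
groups:
  - name: node
    rules:
      # Fire when a scrape target has been unreachable for 5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Instance {{ $labels.instance }} is down'
```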
18 | 


--------------------------------------------------------------------------------
/configuration/scrape_configs.md:
--------------------------------------------------------------------------------
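  1 | # Scrape Configuration
  2 | 
  3 | `scrape_configs` defines the targets to scrape. Each scrape configuration has the following main parameters:
  4 | 
  5 | - job_name: the job name
  6 | - honor_labels: resolves label conflicts in scraped data; when set to `true` the labels in the scraped data win, otherwise the server-side configuration wins
  7 | - params: query parameters sent with each scrape request
  8 | - scrape_interval: the scrape interval
  9 | - scrape_timeout: the scrape timeout
 10 | - metrics_path: the metrics path on the target
 11 | - scheme: the protocol used for scraping
 12 | - sample_limit: limit on the number of samples accepted per scrape; if it is exceeded after metric relabeling, the whole scrape is treated as failed and nothing is stored. The default of 0 means no limit
 13 | - relabel_configs: relabeling rules applied to targets
 14 | - metric_relabel_configs: relabeling rules applied to scraped metrics
 15 | 
 16 | Its Go struct is defined as: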
 17 | 
 18 | ```
 19 | // ScrapeConfig configures a scraping unit for Prometheus.
 20 | type ScrapeConfig struct {
 21 | 	// The job name to which the job label is set by default.
 22 | 	JobName string `yaml:"job_name"`
 23 | 	// Indicator whether the scraped metrics should remain unmodified.
 24 | 	HonorLabels bool `yaml:"honor_labels,omitempty"`
 25 | 	// A set of query parameters with which the target is scraped.
 26 | 	Params url.Values `yaml:"params,omitempty"`
 27 | 	// How frequently to scrape the targets of this scrape config.
 28 | 	ScrapeInterval model.Duration `yaml:"scrape_interval,omitempty"`
 29 | 	// The timeout for scraping targets of this config.
 30 | 	ScrapeTimeout model.Duration `yaml:"scrape_timeout,omitempty"`
 31 | 	// The HTTP resource path on which to fetch metrics from targets.
 32 | 	MetricsPath string `yaml:"metrics_path,omitempty"`
 33 | 	// The URL scheme with which to fetch metrics from targets.
 34 | 	Scheme string `yaml:"scheme,omitempty"`
 35 | 	// More than this many samples post metric-relabelling will cause the scrape to fail.
 36 | 	SampleLimit uint `yaml:"sample_limit,omitempty"`
 37 | 
 38 | 	// We cannot do proper Go type embedding below as the parser will then parse
 39 | 	// values arbitrarily into the overflow maps of further-down types.
 40 | 
 41 | 	ServiceDiscoveryConfig ServiceDiscoveryConfig `yaml:",inline"`
 42 | 	HTTPClientConfig       HTTPClientConfig       `yaml:",inline"`
 43 | 
 44 | 	// List of target relabel configurations.
 45 | 	RelabelConfigs []*RelabelConfig `yaml:"relabel_configs,omitempty"`
 46 | 	// List of metric relabel configurations.
 47 | 	MetricRelabelConfigs []*RelabelConfig `yaml:"metric_relabel_configs,omitempty"`
 48 | 
 49 | 	// Catches all undefined fields and must be empty after parsing.
 50 | 	XXX map[string]interface{} `yaml:",inline"`
 51 | }
 52 | ```
 53 | 
 54 | The definition above also embeds `ServiceDiscoveryConfig`, which is defined as:
 55 | 
 56 | ```
 57 | // ServiceDiscoveryConfig configures lists of different service discovery mechanisms.
 58 | type ServiceDiscoveryConfig struct {
 59 | 	// List of labeled target groups for this job.
 60 | 	StaticConfigs []*TargetGroup `yaml:"static_configs,omitempty"`
 61 | 	// List of DNS service discovery configurations.
 62 | 	DNSSDConfigs []*DNSSDConfig `yaml:"dns_sd_configs,omitempty"`
 63 | 	// List of file service discovery configurations.
 64 | 	FileSDConfigs []*FileSDConfig `yaml:"file_sd_configs,omitempty"`
 65 | 	// List of Consul service discovery configurations.
 66 | 	ConsulSDConfigs []*ConsulSDConfig `yaml:"consul_sd_configs,omitempty"`
 67 | 	// List of Serverset service discovery configurations.
 68 | 	ServersetSDConfigs []*ServersetSDConfig `yaml:"serverset_sd_configs,omitempty"`
 69 | 	// NerveSDConfigs is a list of Nerve service discovery configurations.
 70 | 	NerveSDConfigs []*NerveSDConfig `yaml:"nerve_sd_configs,omitempty"`
 71 | 	// MarathonSDConfigs is a list of Marathon service discovery configurations.
 72 | 	MarathonSDConfigs []*MarathonSDConfig `yaml:"marathon_sd_configs,omitempty"`
 73 | 	// List of Kubernetes service discovery configurations.
 74 | 	KubernetesSDConfigs []*KubernetesSDConfig `yaml:"kubernetes_sd_configs,omitempty"`
 75 | 	// List of GCE service discovery configurations.
 76 | 	GCESDConfigs []*GCESDConfig `yaml:"gce_sd_configs,omitempty"`
 77 | 	// List of EC2 service discovery configurations.
 78 | 	EC2SDConfigs []*EC2SDConfig `yaml:"ec2_sd_configs,omitempty"`
 79 | 	// List of OpenStack service discovery configurations.
 80 | 	OpenstackSDConfigs []*OpenstackSDConfig `yaml:"openstack_sd_configs,omitempty"`
 81 | 	// List of Azure service discovery configurations.
 82 | 	AzureSDConfigs []*AzureSDConfig `yaml:"azure_sd_configs,omitempty"`
 83 | 	// List of Triton service discovery configurations.
 84 | 	TritonSDConfigs []*TritonSDConfig `yaml:"triton_sd_configs,omitempty"`
 85 | 
 86 | 	// Catches all undefined fields and must be empty after parsing.
 87 | 	XXX map[string]interface{} `yaml:",inline"`
 88 | }
 89 | ```
 90 | 
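 91 | `ServiceDiscoveryConfig` handles target discovery, which falls into two broad categories: static configuration and dynamic discovery.
 92 | 
 93 | A complete `scrape_configs` entry therefore looks roughly like this: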
 94 | 
 95 | ```
 96 | # The job name assigned to scraped metrics by default.
 97 | job_name: <job_name>
 98 | 
 99 | # How frequently to scrape targets from this job.
100 | [ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
101 | 
102 | # Per-scrape timeout when scraping this job.
103 | [ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
104 | 
105 | # The HTTP resource path on which to fetch metrics from targets.
106 | [ metrics_path: <path> | default = /metrics ]
107 | 
108 | # honor_labels controls how Prometheus handles conflicts between labels that are
109 | # already present in scraped data and labels that Prometheus would attach
110 | # server-side ("job" and "instance" labels, manually configured target
111 | # labels, and labels generated by service discovery implementations).
112 | #
113 | # If honor_labels is set to "true", label conflicts are resolved by keeping label
114 | # values from the scraped data and ignoring the conflicting server-side labels.
115 | #
116 | # If honor_labels is set to "false", label conflicts are resolved by renaming
117 | # conflicting labels in the scraped data to "exported_<original-label>" (for
118 | # example "exported_instance", "exported_job") and then attaching server-side
119 | # labels. Setting honor_labels to "true" is useful for use cases such as
120 | # federation, where all labels specified in the target should be preserved.
121 | #
122 | # Note that any globally configured "external_labels" are unaffected by this
123 | # setting. In communication with external systems, they are always applied only
124 | # when a time series does not have a given label yet and are ignored otherwise.
125 | [ honor_labels: <boolean> | default = false ]
126 | 
127 | # Configures the protocol scheme used for requests.
128 | [ scheme: <scheme> | default = http ]
129 | 
130 | # Optional HTTP URL parameters.
131 | params:
132 |   [ <string>: [<string>, ...] ]
133 | 
134 | # Sets the `Authorization` header on every scrape request with the
135 | # configured username and password.
136 | basic_auth:
137 |   [ username: <string> ]
138 |   [ password: <string> ]
139 | 
140 | # Sets the `Authorization` header on every scrape request with
141 | # the configured bearer token. It is mutually exclusive with `bearer_token_file`.
142 | [ bearer_token: <string> ]
143 | 
144 | # Sets the `Authorization` header on every scrape request with the bearer token
145 | # read from the configured file. It is mutually exclusive with `bearer_token`.
146 | [ bearer_token_file: /path/to/bearer/token/file ]
147 | 
148 | # Configures the scrape request's TLS settings.
149 | tls_config:
150 |   [ <tls_config> ]
151 | 
152 | # Optional proxy URL.
153 | [ proxy_url: <string> ]
154 | 
155 | # List of Azure service discovery configurations.
156 | azure_sd_configs:
157 |   [ - <azure_sd_config> ... ]
158 | 
159 | # List of Consul service discovery configurations.
160 | consul_sd_configs:
161 |   [ - <consul_sd_config> ... ]
162 | 
163 | # List of DNS service discovery configurations.
164 | dns_sd_configs:
165 |   [ - <dns_sd_config> ... ]
166 | 
167 | # List of EC2 service discovery configurations.
168 | ec2_sd_configs:
169 |   [ - <ec2_sd_config> ... ]
170 | 
171 | # List of OpenStack service discovery configurations.
172 | openstack_sd_configs:
173 |   [ - <openstack_sd_config> ... ]
174 | 
175 | # List of file service discovery configurations.
176 | file_sd_configs:
177 |   [ - <file_sd_config> ... ]
178 | 
179 | # List of GCE service discovery configurations.
180 | gce_sd_configs:
181 |   [ - <gce_sd_config> ... ]
182 | 
183 | # List of Kubernetes service discovery configurations.
184 | kubernetes_sd_configs:
185 |   [ - <kubernetes_sd_config> ... ]
186 | 
187 | # List of Marathon service discovery configurations.
188 | marathon_sd_configs:
189 |   [ - <marathon_sd_config> ... ]
190 | 
191 | # List of AirBnB's Nerve service discovery configurations.
192 | nerve_sd_configs:
193 |   [ - <nerve_sd_config> ... ]
194 | 
195 | # List of Zookeeper Serverset service discovery configurations.
196 | serverset_sd_configs:
197 |   [ - <serverset_sd_config> ... ]
198 | 
199 | # List of Triton service discovery configurations.
200 | triton_sd_configs:
201 |   [ - <triton_sd_config> ... ]
202 | 
203 | # List of labeled statically configured targets for this job.
204 | static_configs:
205 |   [ - <static_config> ... ]
206 | 
207 | # List of target relabel configurations.
208 | relabel_configs:
209 |   [ - <relabel_config> ... ]
210 | 
211 | # List of metric relabel configurations.
212 | metric_relabel_configs:
213 |   [ - <relabel_config> ... ]
214 | 
215 | # Per-scrape limit on number of scraped samples that will be accepted.
216 | # If more than this number of samples are present after metric relabelling
217 | # the entire scrape will be treated as failed. 0 means no limit.
218 | [ sample_limit: <int> | default = 0 ]
219 | ```
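
To show several of these parameters working together, the sketch below configures a federation job: it scrapes another Prometheus server's `/federate` endpoint, passes a `match[]` selector through `params`, and sets `honor_labels: true` so that the original `job` and `instance` labels are kept. The upstream address, the selector, and the `sample_limit` value are assumptions chosen for illustration.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true        # keep labels exactly as the upstream server exposes them
    metrics_path: /federate   # federation endpoint instead of the default /metrics
    params:
      'match[]':
        - '{job="node"}'      # only pull series belonging to the "node" job
    static_configs:
      - targets: ['prometheus-upstream.example.com:9090']   # hypothetical upstream server
    sample_limit: 10000       # fail the scrape if more samples than this are returned
```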
220 | 


--------------------------------------------------------------------------------
/configuration/server_discovery.md:
--------------------------------------------------------------------------------
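 1 | # Service Discovery
 2 | 
 3 | One of the most important concepts in Prometheus configuration is the target, that is, the data source. Targets are configured either statically or through dynamic discovery, using the following mechanisms: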
 4 | 
 5 | - static_configs: static service discovery
 6 | - dns_sd_configs: DNS service discovery
 7 | - file_sd_configs: file-based service discovery
 8 | - consul_sd_configs: Consul service discovery
 9 | - serverset_sd_configs: Serverset service discovery
10 | - nerve_sd_configs: Nerve service discovery
11 | - marathon_sd_configs: Marathon service discovery
12 | - kubernetes_sd_configs: Kubernetes service discovery
13 | - gce_sd_configs: GCE service discovery
14 | - ec2_sd_configs: EC2 service discovery
15 | - openstack_sd_configs: OpenStack service discovery
16 | - azure_sd_configs: Azure service discovery
17 | - triton_sd_configs: Triton service discovery
18 | 
19 | 
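20 | For details on how to use each of them, along with configuration templates, see the service discovery [configuration templates](https://prometheus.io/docs/operating/configuration/#<tls_config>).
21 | 
22 | The most important and most widely used of these is probably `static_configs`. The dynamic types can be viewed as wrappers that produce static target lists for particular common platforms.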
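
The difference between the two categories can be sketched with one static job and one file-based job. The job names, the `env` label, and the `targets/node-*.json` path are hypothetical; with `file_sd_configs`, Prometheus re-reads the matching files, so targets can change without editing `prometheus.yml`.

```yaml
scrape_configs:
  # Static: the target list is written directly into the configuration.
  - job_name: 'node-static'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          env: 'dev'                  # hypothetical label

  # Dynamic: the target list is read from files that another tool can rewrite.
  - job_name: 'node-file'
    file_sd_configs:
      - files:
          - 'targets/node-*.json'     # hypothetical path, glob on the file name
        refresh_interval: 5m          # how often the files are re-read
```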
23 | 


--------------------------------------------------------------------------------
/container/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/container/README.md


--------------------------------------------------------------------------------
/container/docker.md:
--------------------------------------------------------------------------------
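  1 | # Docker Monitoring
  2 | 
  3 | Most of you probably already run plenty of containers in production. So how do you monitor them (CPU, memory, network traffic)?
  4 | 
  5 | ### docker stats vs. cadvisor
  6 | 
  7 | As is well known, `docker stats` shows the runtime status of running Docker containers, for example: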
  8 | 
  9 | ```
 10 | CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
 11 | a25dd77a5237        cadvisor            0.91%               14.8MiB / 1.952GiB   0.74%               749kB / 11.5MB      18.9MB / 0B         11
 12 | ```
 13 | 
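 14 | This approach is rather primitive: the data cannot be fetched over HTTP, there is no UI, and visualizing the data takes a lot of extra work.
 15 | 
 16 | Because of these shortcomings in `docker stats`, `cadvisor` was created. `cadvisor` not only collects information about every container running on a machine, it also provides a basic query UI and an HTTP interface, which makes it easy for Prometheus to scrape the data.
 17 | 
 18 | Because `cadvisor` integrates so well with Prometheus, it has become a first choice for container monitoring.
 19 | 
 20 | ### Installing cadvisor
 21 | 
 22 | Step 1: Use `docker pull` to download the latest version of `cadvisor`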
 23 | 
 24 | ```
 25 | $ docker pull google/cadvisor:latest
 26 | ```
 27 | 
 28 | Step 2: Use `docker images` to check the downloaded version
 29 | 
 30 | ```
 31 | $ docker images  
 32 | 
 33 | google/cadvisor      latest              75f88e3ec333        4 months ago        62.2MB
 34 | ```
 35 | 
 36 | Step 3: Start it with `docker run`
 37 | 
 38 | ```
 39 | sudo docker run \
 40 |   --volume=/:/rootfs:ro \
 41 |   --volume=/var/run:/var/run:rw \
 42 |   --volume=/sys:/sys:ro \
 43 |   --volume=/var/lib/docker/:/var/lib/docker:ro \
 44 |   --volume=/dev/disk/:/dev/disk:ro \
 45 |   --publish=8080:8080 \
 46 |   --detach=true \
 47 |   --name=cadvisor \
 48 |   google/cadvisor:latest
 49 | ```
 50 | 
 51 | Once it has started successfully, `docker ps` shows:
 52 | 
 53 | ```
 54 | CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                    NAMES
 55 | 742a464fa631        google/cadvisor:latest   "/usr/bin/cadvisor -…"   1 second ago        Up 1 second         0.0.0.0:8080->8080/tcp   cadvisor
 56 | ```
 57 | 
 58 | Step 4: Visit `http://localhost:8080` and you will see:
 59 | 
 60 | ![/images/cadvisor/cadvisor-01.png](/images/cadvisor/cadvisor-01.png)
 61 | 
 62 | This shows that `cadvisor` is running successfully.
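
If you prefer a declarative setup, the `docker run` command from Step 3 can be expressed as a Docker Compose file. This is a sketch that assumes Docker Compose is installed; it is not part of the original walkthrough, and the `version` key may be unnecessary with newer Compose releases.

```yaml
# docker-compose.yml (assumed file name)
version: '3'
services:
  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    ports:
      - '8080:8080'                            # same as --publish=8080:8080
    volumes:                                   # same mounts as the --volume flags
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
```

Starting it in the background with Compose then plays the role of `--detach=true`.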
 63 | 
 64 | ### A closer look at cadvisor
 65 | 
 66 | Tip 1: Visit `http://localhost:8080/docker` to see all running Docker containers:
 67 | 
 68 | ![/images/cadvisor/cadvisor-02.png](/images/cadvisor/cadvisor-02.png)
 69 | 
 70 | Tip 2: Select any container to see detailed information about its runtime status:
 71 | 
 72 | ![/images/cadvisor/cadvisor-03.png](/images/cadvisor/cadvisor-03.png)
 73 | 
 74 | ![/images/cadvisor/cadvisor-04.png](/images/cadvisor/cadvisor-04.png)
 75 | 
 76 | Tip 3: Visit `http://localhost:8080/metrics` to see all the data it exposes to Prometheus:
 77 | 
 78 | ![/images/cadvisor/cadvisor-05.png](/images/cadvisor/cadvisor-05.png)
 79 | 
 80 | ### Integrating cadvisor with Prometheus
 81 | 
 82 | Step 1: Edit the Prometheus configuration and add the cadvisor address:
 83 | 
 84 | ```
 85 | #prometheus.yml
 86 | 
 87 | scrape_configs:
 88 |   - job_name: 'node'
 89 |     static_configs:
 90 |       - targets: ['127.0.0.1:9100']
 91 | 
 92 |   - job_name: 'container'
 93 |     static_configs:
 94 |       - targets: ['127.0.0.1:8080']  # local cadvisor address
 95 | ```
 96 | 
 97 | Step 2: Reload the configuration and visit `http://localhost:9090/targets`. You will see that the newly added `cadvisor` target is active.
 98 | 
 99 | ![/images/cadvisor/prometheus01.png](/images/cadvisor/prometheus01.png)
100 | 
101 | Step 3: Now open the Prometheus graph page at `http://localhost:9090/graph` and search for `container` to see the container-related data.
102 | 
103 | ![/images/cadvisor/prometheus02.png](/images/cadvisor/prometheus02.png)
104 | 
105 | ### Viewing container CPU, memory, network traffic, and other data in Prometheus
106 | 
107 | CPU usage query:
108 | 
109 | ```
110 | sum by (name) (rate(container_cpu_usage_seconds_total{image!=""}[1m])) / scalar(count(node_cpu{mode="user"})) * 100
111 | ```
112 | 
113 | ![/images/cadvisor/prometheus03.png](/images/cadvisor/prometheus03.png)
114 | 
115 | Memory usage:
116 | 
117 | ```
118 | sum by (name)(container_memory_usage_bytes{image!=""})
119 | ```
120 | 
121 | ![/images/cadvisor/prometheus04.png](/images/cadvisor/prometheus04.png)
122 | 
123 | Inbound network traffic:
124 | 
125 | ```
126 | sum by (name) (rate(container_network_receive_bytes_total{image!=""}[1m]))
127 | ```
128 | 
129 | ![/images/cadvisor/prometheus06.png](/images/cadvisor/prometheus06.png)
130 | 
131 | Outbound network traffic:
132 | 
133 | ```
134 | sum by (name) (rate(container_network_transmit_bytes_total{image!=""}[1m]))
135 | ```
136 | 
137 | ![/images/cadvisor/prometheus05.png](/images/cadvisor/prometheus05.png)
138 | 
139 | Disk usage:
140 | 
141 | ```
142 | sum by (name) (container_fs_usage_bytes{image!=""})
143 | ```
144 | 
145 | ![/images/cadvisor/prometheus07.png](/images/cadvisor/prometheus07.png)
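
If these queries are used often, they could be precomputed as recording rules. The sketch below assumes the Prometheus 2.x YAML rule format and a rule file referenced from `rule_files`; the group name and the recorded series names are hypothetical.

```yaml
# rules/container.rules (hypothetical file, Prometheus 2.x rule format)
groups:
  - name: container
    rules:
      # Memory usage per container.
      - record: container:memory_usage_bytes:sum
        expr: sum by (name) (container_memory_usage_bytes{image!=""})
      # Inbound network traffic per container, bytes per second over 1 minute.
      - record: container:network_receive_bytes:rate1m
        expr: sum by (name) (rate(container_network_receive_bytes_total{image!=""}[1m]))
      # Outbound network traffic per container, bytes per second over 1 minute.
      - record: container:network_transmit_bytes:rate1m
        expr: sum by (name) (rate(container_network_transmit_bytes_total{image!=""}[1m]))
```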


--------------------------------------------------------------------------------
/container/k8s.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/container/k8s.md


--------------------------------------------------------------------------------
/demo/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/demo/README.md


--------------------------------------------------------------------------------
/demo/alertmanager.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/demo/alertmanager.md


--------------------------------------------------------------------------------
/demo/grafana.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/demo/grafana.md


--------------------------------------------------------------------------------
/demo/rule.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/demo/rule.md


--------------------------------------------------------------------------------
/demo/target.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/demo/target.md


--------------------------------------------------------------------------------
/devops/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/devops/README.md


--------------------------------------------------------------------------------
/devops/exporter.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/devops/exporter.md


--------------------------------------------------------------------------------
/devops/product.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/devops/product.md


--------------------------------------------------------------------------------
/devops/receiver.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/devops/receiver.md


--------------------------------------------------------------------------------
/exporter/README.md:
--------------------------------------------------------------------------------
1 | # Exporter
2 | 
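3 | In Prometheus, programs that report data are collectively called exporters, and each exporter covers a different domain.
4 | They follow a common naming convention, `xx_exporter`. For example, `node_exporter` collects host information.
5 | 
6 | The Prometheus community already provides many exporters. For details, see [here](https://prometheus.io/docs/instrumenting/exporters/#exporters-and-integrations).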
7 | 


--------------------------------------------------------------------------------
/exporter/nodeexporter.md:
--------------------------------------------------------------------------------
  1 | # Node Exporter
  2 | 
  3 | [node_exporter](https://github.com/prometheus/node_exporter) is mainly used to monitor *NIX systems
  4 | and is written in Go.
  5 | 
  6 | ## Collector reference
  7 | 
  8 | ### Enabled by default
  9 | 
 10 | | Name | Description | OS |
 11 | | ------| ------ | ------ |
 12 | | arp | Collects ARP statistics from `/proc/net/arp` | Linux |
 13 | | conntrack | Collects conntrack statistics from `/proc/sys/net/netfilter/` | Linux |
 14 | | cpu | Collects CPU statistics | Darwin, Dragonfly, FreeBSD, Linux |
 15 | | diskstats | Collects disk I/O statistics from `/proc/diskstats` | Linux |
 16 | | edac | Error detection and correction statistics | Linux |
 17 | | entropy | Available kernel entropy | Linux |
 18 | | exec | Execution statistics | Dragonfly, FreeBSD |
 19 | | filefd | Collects file descriptor statistics from `/proc/sys/fs/file-nr` | Linux |
 20 | | filesystem | Filesystem statistics, such as disk space used | Darwin, Dragonfly, FreeBSD, Linux, OpenBSD |
 21 | | hwmon | Collects hardware monitoring and sensor data from `/sys/class/hwmon/` | Linux |
 22 | | infiniband | Collects network statistics from InfiniBand configurations | Linux |
 23 | | loadavg | Collects system load averages | Darwin, Dragonfly, FreeBSD, Linux, NetBSD, OpenBSD, Solaris |
 24 | | mdadm | Collects device statistics from `/proc/mdstat` | Linux |
 25 | | meminfo | Memory statistics | Darwin, Dragonfly, FreeBSD, Linux |
 26 | | netdev | Network interface traffic statistics, in bytes | Darwin, Dragonfly, FreeBSD, Linux, OpenBSD |
 27 | | netstat | Collects network statistics from `/proc/net/netstat`, equivalent to `netstat -s` | Linux |
 28 | | sockstat | Collects socket statistics from `/proc/net/sockstat` | Linux |
 29 | | stat | Collects various statistics from `/proc/stat`, including boot time, forks, and interrupts | Linux |
 30 | | textfile | Collects metrics from text files in the local directory given by `--collector.textfile.directory` | any |
 31 | | time | Current system time | any |
 32 | | uname | System information obtained through the `uname` system call | any |
 33 | | vmstat | Collects statistics from `/proc/vmstat` | Linux |
 34 | | wifi | Collects statistics about WiFi devices | Linux |
 35 | | xfs | Collects XFS runtime statistics | Linux (kernel 4.4+) |
 36 | | zfs | Collects ZFS performance statistics | Linux |
 37 | 
 38 | ### Collectors disabled by default
 39 | 
 40 | | Name | Description | OS |
 41 | | ------| ------ | ------ |
 42 | | bonding | Exposes the number of configured and active bonded network interfaces | Linux |
 43 | | buddyinfo | Exposes memory fragmentation statistics from `/proc/buddyinfo` | Linux |
 44 | | devstat | Exposes device statistics | Dragonfly, FreeBSD |
 45 | | drbd | Exposes Distributed Replicated Block Device (DRBD) statistics | Linux |
 46 | | interrupts | Exposes detailed interrupt statistics | Linux, OpenBSD |
 47 | | ipvs | Exposes IPVS status from `/proc/net/ip_vs` and statistics from `/proc/net/ip_vs_stats` | Linux |
 48 | | ksmd | Exposes kernel and system statistics from `/sys/kernel/mm/ksm` | Linux |
 49 | | logind | Exposes session statistics from `logind` | Linux |
 50 | | meminfo_numa | Exposes memory statistics from `/proc/meminfo_numa` | Linux |
 51 | | mountstats | Exposes filesystem statistics from `/proc/self/mountstats`, including NFS client statistics | Linux |
 52 | | nfs | Exposes NFS statistics from `/proc/net/rpc/nfs`, the same information as `nfsstat -c` | Linux |
 53 | | qdisc | Exposes queuing discipline statistics | Linux |
 54 | | runit | Exposes runit service status | any |
 55 | | supervisord | Exposes supervisord service status | any |
 56 | | systemd | Exposes service and system status from `systemd` | Linux |
 57 | | tcpstat | Exposes TCP connection status from `/proc/net/tcp` and `/proc/net/tcp6` | Linux |
 58 | 
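 59 | ### Deprecated collectors
 60 | 
 61 | | Name | Description | OS |
 62 | | ------| ------ | ------ |
 63 | | gmond | Exposes Ganglia statistics | any |
 64 | | megacli | Exposes RAID statistics from MegaCLI | Linux |
 65 | | ntp | Exposes clock information obtained from an NTP server | any |
 66 | 
 67 | Note: the `--collectors.enabled` runtime flag selects which collectors node_exporter runs; if it is not given, the default set is used.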
 68 | 
 69 | ## Installation and startup
 70 | 
 71 | ### Binary installation
 72 | 
 73 | Pick the matching binary package from the [download page](https://prometheus.io/download/). The steps below use version `0.14.0` as an example.
 74 | 
 75 | - Download Node Exporter with wget
 76 | 
 77 | ```
 78 | cd ~/Download
 79 | wget https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz
 80 | ```
 81 | 
 82 | - Extract node_exporter-0.14.0.linux-amd64.tar.gz with tar
 83 | 
 84 | ```
 85 | cd ~/Prometheus
 86 | tar -xvzf ~/Download/node_exporter-0.14.0.linux-amd64.tar.gz
 87 | cd node_exporter-0.14.0.linux-amd64
 88 | ```
 89 | 
 90 | - Start Node Exporter
 91 | 
 92 | Run `./node_exporter -h` to list the available options and `./node_exporter` to start Node Exporter. Output similar to the following means it started successfully.
 93 | 
 94 | ```
 95 | INFO[0000] Starting node_exporter (version=0.14.0, branch=master, revision=840ba5dcc71a084a3bc63cb6063003c1f94435a6)  source="node_exporter.go:140"
 96 | INFO[0000] Build context (go=go1.7.5, user=root@bb6d0678e7f3, date=20170321-12:13:32)  source="node_exporter.go:141"
 97 | INFO[0000] No directory specified, see --collector.textfile.directory  source="textfile.go:57"
 98 | INFO[0000] Enabled collectors:                           source="node_exporter.go:160"
 99 | .....
100 | INFO[0000] Listening on :9100                            source="node_exporter.go:186"
101 | ```
102 | 
103 | ### Docker installation
104 | 
105 | Node Exporter can also be run from its [Docker image](https://quay.io/repository/prometheus/node-exporter) with the following command:
106 | 
107 | ```
108 | docker run -d -p 9100:9100 \
109 |   -v "/proc:/host/proc:ro" \
110 |   -v "/sys:/host/sys:ro" \
111 |   -v "/:/rootfs:ro" \
112 |   --net="host" \
113 |   quay.io/prometheus/node-exporter \
114 |     -collector.procfs /host/proc \
115 |     -collector.sysfs /host/sys \
116 |     -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"
117 | ```
118 | 
119 | Once Node Exporter is running, open http://IP:9100/metrics in a browser; you should see output similar to the following.
120 | 
121 | ```
122 | # HELP go_gc_duration_seconds A summary of the GC invocation durations.
123 | # TYPE go_gc_duration_seconds summary
124 | go_gc_duration_seconds{quantile="0"} 0
125 | go_gc_duration_seconds{quantile="0.25"} 0
126 | go_gc_duration_seconds{quantile="0.5"} 0
127 | . . .
128 | ```
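
The same check can be scripted instead of done in a browser. The Go sketch below fetches the metrics page and prints only the sample lines whose metric name starts with a given prefix. The address `http://127.0.0.1:9100/metrics`, the `node_load` prefix, and the helper name `filterSamples` are assumptions chosen for illustration; they are not part of node_exporter.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// filterSamples returns the sample lines (not comments, not blank lines)
// of an exposition text that start with prefix. Hypothetical helper.
func filterSamples(text, prefix string) []string {
	var out []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE/comment lines
		}
		if strings.HasPrefix(line, prefix) {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Assumed address: a node_exporter listening locally on its default port.
	resp, err := http.Get("http://127.0.0.1:9100/metrics")
	if err != nil {
		fmt.Fprintln(os.Stderr, "node_exporter not reachable:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading response failed:", err)
		return
	}
	for _, line := range filterSamples(string(body), "node_load") {
		fmt.Println(line)
	}
}
```

If the exporter is not running, the program only reports that the address is unreachable, which is itself a quick way to tell that the startup step did not succeed.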
129 | 
130 | ## Storing the data
131 | 
132 | Prometheus can scrape node_exporter through `static_configs`.
133 | 
134 | Open prometheus.yml and add the following under scrape_configs:
135 | 
136 | ```
137 | - job_name: "node"
138 |   static_configs:
139 |     - targets: ["127.0.0.1:9100"]
140 | ```
141 | 
142 | Restart Prometheus or reload its configuration, then query in the Prometheus console; the node_exporter data should appear.
143 | 


--------------------------------------------------------------------------------
/exporter/nodeexporter_grafana_template.md:
--------------------------------------------------------------------------------
1 | # Node Exporter Grafana Template
2 | 


--------------------------------------------------------------------------------
/exporter/nodeexporter_query.md:
--------------------------------------------------------------------------------
 1 | 
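 2 | # Common Node Exporter Queries
 3 | 
 4 | Once node_exporter data is being collected, PromQL can be used for operational queries and monitoring. Some of the more common queries follow.
 5 | 
 6 | Note: every query below targets a single node; to see all nodes, remove the `instance="xxx"` matcher.
 7 | 
 8 | ## CPU usage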
 9 | 
10 | ```
11 | 100 - (avg by (instance) (irate(node_cpu{instance="xxx", mode="idle"}[5m])) * 100)
12 | ```
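
To make the arithmetic behind this query concrete, here is a small Go sketch of the same calculation. `node_cpu{mode="idle"}` is a per-core counter of idle seconds, `irate` takes the per-second increase between the last two samples in the range, `avg by (instance)` averages over cores, and the result is subtracted from 100. The `sample` type, the function names, and the numbers are made up for illustration, and counter resets are ignored.

```go
package main

import "fmt"

// sample is one scraped value of an idle-seconds counter for one CPU core.
type sample struct {
	t     float64 // scrape time in seconds
	value float64 // cumulative idle seconds reported by the counter
}

// idleRate mirrors irate() on the last two samples: idle seconds per second.
// Counter resets are not handled in this sketch.
func idleRate(prev, last sample) float64 {
	return (last.value - prev.value) / (last.t - prev.t)
}

// cpuUsagePercent averages the idle rate over all cores and converts it
// into a busy percentage, like 100 - avg(...) * 100 in the query.
func cpuUsagePercent(cores [][2]sample) float64 {
	sum := 0.0
	for _, c := range cores {
		sum += idleRate(c[0], c[1])
	}
	return 100 - sum/float64(len(cores))*100
}

func main() {
	// Hypothetical data: two cores scraped 15 seconds apart.
	cores := [][2]sample{
		{{t: 0, value: 1000}, {t: 15, value: 1012}}, // idle for 12 of 15 seconds
		{{t: 0, value: 2000}, {t: 15, value: 2009}}, // idle for 9 of 15 seconds
	}
	fmt.Printf("%.1f%%\n", cpuUsagePercent(cores)) // prints 30.0%
}
```

The two cores are idle 80% and 60% of the time, so the average idle share is 70% and the reported usage is 30%.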
13 | 
14 | ## CPU time share per mode
15 | 
16 | ```
17 | avg by (instance, mode) (irate(node_cpu{instance="xxx"}[5m])) * 100
18 | ```
19 | 
20 | ## Load average
21 | 
22 | ```
23 | node_load1{instance="xxx"} # 1-minute load
24 | node_load5{instance="xxx"} # 5-minute load
25 | node_load15{instance="xxx"} # 15-minute load
26 | ```
27 | 
28 | ## Memory usage
29 | ```
30 | 100 - ((node_memory_MemFree{instance="xxx"}+node_memory_Cached{instance="xxx"}+node_memory_Buffers{instance="xxx"})/node_memory_MemTotal) * 100
31 | ```
32 | 
33 | ## Disk usage
34 | 
35 | ```
36 | 100 - node_filesystem_free{instance="xxx",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} / node_filesystem_size{instance="xxx",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} * 100
37 | ```
38 | 
39 | Alternatively, use a `{fstype="xxx"}` matcher to select the filesystem you want to inspect.
40 | 
41 | ## Network I/O
42 | 
43 | ```
44 | # Inbound (receive) bandwidth in Kibit/s: bytes/s * 8 / 1024
45 | sum by (instance) (irate(node_network_receive_bytes{instance="xxx",device!~"bond.*?|lo"}[5m])/128)
46 | 
47 | # Outbound (transmit) bandwidth in Kibit/s: bytes/s * 8 / 1024
48 | sum by (instance) (irate(node_network_transmit_bytes{instance="xxx",device!~"bond.*?|lo"}[5m])/128)
49 | ```
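
The division by 128 in these queries is a unit conversion: the counters are in bytes, a byte is 8 bits, and 1 Kibit is 1024 bits, so bytes/s × 8 ÷ 1024 equals bytes/s ÷ 128. The Go sketch below only illustrates that arithmetic; the function name and the sample rate are hypothetical.

```go
package main

import "fmt"

// bytesPerSecToKibit converts a rate in bytes per second to Kibit per second.
// Multiplying by 8 (bits per byte) and dividing by 1024 equals dividing by 128.
func bytesPerSecToKibit(bytesPerSec float64) float64 {
	return bytesPerSec * 8 / 1024
}

func main() {
	rate := 1310720.0 // hypothetical interface rate: 1.25 MiB/s
	fmt.Println(bytesPerSecToKibit(rate)) // 10240
	fmt.Println(rate / 128)               // 10240, the form used in the query
}
```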
50 | 
51 | ## NIC packets in/out
52 | 
53 | ```
54 | # Packets received per second
55 | sum by (instance) (rate(node_network_receive_packets{instance="xxx",device!="lo"}[5m]))
56 | 
57 | # Packets transmitted per second
58 | sum by (instance) (rate(node_network_transmit_packets{instance="xxx",device!="lo"}[5m]))
59 | ```
60 | 


--------------------------------------------------------------------------------
/exporter/other.md:
--------------------------------------------------------------------------------
 1 | # Other Exporters
 2 | 
 3 | Besides node_exporter, you will typically install other exporters that fit your workload, or write your own. Commonly used exporters include:
 4 | 
 5 | - [Memcached exporter](https://github.com/prometheus/memcached_exporter) collects Memcached metrics
 6 | - [MySQL server exporter](https://github.com/prometheus/mysqld_exporter) collects MySQL Server metrics
 7 | - [MongoDB exporter](https://github.com/dcu/mongodb_exporter) collects MongoDB metrics
 8 | - [InfluxDB exporter](https://github.com/prometheus/influxdb_exporter) collects InfluxDB metrics
 9 | - [JMX exporter](https://github.com/prometheus/jmx_exporter) collects Java virtual machine metrics
10 | 
11 | For more exporters, see the [exporter list](https://prometheus.io/docs/instrumenting/exporters/).
12 | 


--------------------------------------------------------------------------------
/exporter/sample.md:
--------------------------------------------------------------------------------
 1 | # Sample Exporter
 2 | 
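 3 | Since an exporter only has to convert the data it collects into the text format and serve it over HTTP, writing one yourself is easy.
 4 | 
 5 | ## A simple exporter
 6 | 
 7 | Below is a simple `sample_exporter` written in Go. The code looks roughly like this: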
 8 | 
 9 | ```golang
10 | 
11 | package main
12 | 
13 | import (
14 | 	"fmt"
15 | 	"net/http"
16 | )
17 | 
18 | func handler(w http.ResponseWriter, r *http.Request) {
19 | 	fmt.Fprint(w, exportData) // Fprint, not Fprintf: exportData is payload, not a format string
20 | }
21 | 
22 | func main() {
23 | 	http.HandleFunc("/", handler)
24 | 	http.ListenAndServe(":8080", nil)
25 | }
26 | 
27 | var exportData string = `# HELP sample_http_requests_total The total number of HTTP requests.
28 | # TYPE sample_http_requests_total counter
29 | sample_http_requests_total{method="post",code="200"} 1027 1395066363000
30 | sample_http_requests_total{method="post",code="400"}    3 1395066363000
31 | 
32 | # Escaping in label values:
33 | sample_msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
34 | 
35 | # Minimalistic line:
36 | sample_metric_without_timestamp_and_labels 12.47
37 | 
38 | # A histogram, which has a pretty complex representation in the text format:
39 | # HELP sample_http_request_duration_seconds A histogram of the request duration.
40 | # TYPE sample_http_request_duration_seconds histogram
41 | sample_http_request_duration_seconds_bucket{le="0.05"} 24054
42 | sample_http_request_duration_seconds_bucket{le="0.1"} 33444
43 | sample_http_request_duration_seconds_bucket{le="0.2"} 100392
44 | sample_http_request_duration_seconds_bucket{le="0.5"} 129389
45 | sample_http_request_duration_seconds_bucket{le="1"} 133988
46 | sample_http_request_duration_seconds_bucket{le="+Inf"} 144320
47 | sample_http_request_duration_seconds_sum 53423
48 | sample_http_request_duration_seconds_count 144320
49 | 
50 | # Finally a summary, which has a complex representation, too:
51 | # HELP sample_rpc_duration_seconds A summary of the RPC duration in seconds.
52 | # TYPE sample_rpc_duration_seconds summary
53 | sample_rpc_duration_seconds{quantile="0.01"} 3102
54 | sample_rpc_duration_seconds{quantile="0.05"} 3272
55 | sample_rpc_duration_seconds{quantile="0.5"} 4773
56 | sample_rpc_duration_seconds{quantile="0.9"} 9001
57 | sample_rpc_duration_seconds{quantile="0.99"} 76656
58 | sample_rpc_duration_seconds_sum 1.7560473e+07
59 | sample_rpc_duration_seconds_count 2693
60 | `
61 | ```
62 | 
63 | With the program running, open `http://localhost:8080/metrics` and you will see a page like this:
64 | 
65 | ![simple exporter data](/images/exporter/sample_exporter_data.png)
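
The exporter above always returns the same fixed text. As a hedged variation, the sketch below renders the text at request time from a live counter, which is closer to what a real exporter does. The metric name `sample_scrapes_total` and the helper `render` are invented for this example. To keep the sketch self-terminating, `main` drives the handler once through `net/http/httptest` instead of listening on a port; a long-running exporter would register the handler and call `http.ListenAndServe` as in the program above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// scrapes counts how many times the metrics page has been served.
var scrapes uint64

// render builds the exposition text for the given counter value.
// The metric name is hypothetical.
func render(n uint64) string {
	return "# HELP sample_scrapes_total Number of times this page was served.\n" +
		"# TYPE sample_scrapes_total counter\n" +
		fmt.Sprintf("sample_scrapes_total %d\n", n)
}

// handler increments the counter and writes the current value.
func handler(w http.ResponseWriter, r *http.Request) {
	n := atomic.AddUint64(&scrapes, 1)
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, render(n))
}

func main() {
	// Exercise the handler in memory, without opening a network port.
	rec := httptest.NewRecorder()
	handler(rec, httptest.NewRequest("GET", "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

Because the value is read when the page is requested, every scrape by Prometheus sees the counter's current state rather than a snapshot baked into the binary.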
66 | 
67 | ## Integrating with Prometheus
68 | 
69 | Prometheus can collect data from `sample_exporter` through `static_configs`.
70 | 
71 | Open `prometheus.yml` and add the following under `scrape_configs`:
72 | 
73 | ```
74 | - job_name: "sample"
75 |   static_configs:
76 |     - targets: ["127.0.0.1:8080"]
77 | ```
78 | 
79 | Restart Prometheus or reload its configuration, then query in the Prometheus console; the `sample_exporter` data should appear.
80 | 
81 | ![simple exporter](/images/exporter/simple_exporter.png)
82 | 


--------------------------------------------------------------------------------
/exporter/text.md:
--------------------------------------------------------------------------------
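 1 | # Text Format
 2 | 
 3 | Before discussing exporters, it is worth introducing the Prometheus text exposition format, because an exporter essentially converts the data it collects into this text format and serves it over HTTP.
 4 | 
 5 | The text produced by an exporter is line oriented, with lines separated by `\n`. Empty lines are ignored, and the last line must end with a line feed.
 6 | 
 7 | ## Comments
 8 | 
 9 | A line that starts with `#` is normally a comment.
10 | 
11 | - A line starting with `# HELP` gives the help text for a metric.
12 | - A line starting with `# TYPE ` declares the metric type, which is one of `counter`, `gauge`, `histogram`, `summary`, or `untyped`.
13 | - Any other `#` line is a plain comment for human readers and is ignored by Prometheus.
14 | 
15 | ## Samples
16 | 
17 | A line that does not start with `#` is a sample. Sample lines normally follow directly after the type declaration and have this format: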
18 | 
19 | ```
20 | metric_name [
21 |   "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
22 | ] value [ timestamp ]
23 | ```
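
As an illustration of this grammar, the following Go sketch renders one sample line from a metric name, a label map, and a value, applying the escapes the format defines for label values (backslash, double quote, and line feed). The function `formatSample` is a hypothetical helper, not part of any Prometheus library; it omits the optional timestamp and sorts label names only to make its output deterministic, since the format does not require an order.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// labelEscaper applies the escapes defined for label values:
// backslash, line feed, and double quote.
var labelEscaper = strings.NewReplacer(`\`, `\\`, "\n", `\n`, `"`, `\"`)

// formatSample renders `metric_name{label="value",...} value`.
// Hypothetical helper; the optional timestamp is left out.
func formatSample(name string, labels map[string]string, value float64) string {
	var b strings.Builder
	b.WriteString(name)
	if len(labels) > 0 {
		keys := make([]string, 0, len(labels))
		for k := range labels {
			keys = append(keys, k)
		}
		sort.Strings(keys) // stable output only; not required by the format
		b.WriteByte('{')
		for i, k := range keys {
			if i > 0 {
				b.WriteByte(',')
			}
			b.WriteString(k + `="` + labelEscaper.Replace(labels[k]) + `"`)
		}
		b.WriteByte('}')
	}
	b.WriteByte(' ')
	b.WriteString(strconv.FormatFloat(value, 'g', -1, 64))
	return b.String()
}

func main() {
	fmt.Println(formatSample("http_requests_total",
		map[string]string{"method": "post", "code": "200"}, 1027))
	fmt.Println(formatSample("msdos_file_access_time_seconds",
		map[string]string{"path": `C:\DIR\FILE.TXT`}, 12.47))
	fmt.Println(formatSample("metric_without_timestamp_and_labels", nil, 12.47))
}
```

A metric without labels produces no braces at all, which corresponds to the optional `[ ... ]` part of the grammar.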
24 | 
25 | Here is a complete example:
26 | 
27 | ```
28 | # HELP http_requests_total The total number of HTTP requests.
29 | # TYPE http_requests_total counter
30 | http_requests_total{method="post",code="200"} 1027 1395066363000
31 | http_requests_total{method="post",code="400"}    3 1395066363000
32 | 
33 | # Escaping in label values:
34 | msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
35 | 
36 | # Minimalistic line:
37 | metric_without_timestamp_and_labels 12.47
38 | 
39 | # A weird metric from before the epoch:
40 | something_weird{problem="division by zero"} +Inf -3982045
41 | 
42 | # A histogram, which has a pretty complex representation in the text format:
43 | # HELP http_request_duration_seconds A histogram of the request duration.
44 | # TYPE http_request_duration_seconds histogram
45 | http_request_duration_seconds_bucket{le="0.05"} 24054
46 | http_request_duration_seconds_bucket{le="0.1"} 33444
47 | http_request_duration_seconds_bucket{le="0.2"} 100392
48 | http_request_duration_seconds_bucket{le="0.5"} 129389
49 | http_request_duration_seconds_bucket{le="1"} 133988
50 | http_request_duration_seconds_bucket{le="+Inf"} 144320
51 | http_request_duration_seconds_sum 53423
52 | http_request_duration_seconds_count 144320
53 | 
54 | # Finally a summary, which has a complex representation, too:
55 | # HELP rpc_duration_seconds A summary of the RPC duration in seconds.
56 | # TYPE rpc_duration_seconds summary
57 | rpc_duration_seconds{quantile="0.01"} 3102
58 | rpc_duration_seconds{quantile="0.05"} 3272
59 | rpc_duration_seconds{quantile="0.5"} 4773
60 | rpc_duration_seconds{quantile="0.9"} 9001
61 | rpc_duration_seconds{quantile="0.99"} 76656
62 | rpc_duration_seconds_sum 1.7560473e+07
63 | rpc_duration_seconds_count 2693
64 | ```
65 | 
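66 | Note in particular that if a metric named `x` is a `histogram` or a `summary`, it must satisfy the following conditions:
67 | 
68 | - The sum of all observed values is exposed as `x_sum`.
69 | - The number of observations is exposed as `x_count`.
70 | - Each quantile of a `summary` is exposed as `x{quantile="y"}`.
71 | - Each bucket count of a `histogram` is exposed as `x_bucket{le="y"}`.
72 | - A `histogram` must include `x_bucket{le="+Inf"}`, whose value equals the value of `x_count`.
73 | - In a `summary` and a `histogram`, the `quantile` and `le` values must appear in increasing order.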
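
The histogram rules can be checked mechanically. The Go sketch below is a minimal validator under stated assumptions: the `bucket` type and the `checkHistogram` function are hypothetical, and in addition to the rules listed above it checks that bucket counts never decrease, which follows from Prometheus histogram buckets being cumulative.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// inf is the upper bound used for the le="+Inf" bucket.
var inf = math.Inf(1)

// bucket is one x_bucket{le="..."} sample.
type bucket struct {
	le    float64 // upper bound of the bucket
	count uint64  // cumulative number of observations <= le
}

// checkHistogram validates buckets (in exposition order) against x_count.
func checkHistogram(buckets []bucket, count uint64) error {
	if len(buckets) == 0 {
		return errors.New("histogram has no buckets")
	}
	for i := 1; i < len(buckets); i++ {
		if buckets[i].le <= buckets[i-1].le {
			return fmt.Errorf("le values are not increasing at index %d", i)
		}
		if buckets[i].count < buckets[i-1].count {
			return fmt.Errorf("bucket counts are not cumulative at index %d", i)
		}
	}
	last := buckets[len(buckets)-1]
	if !math.IsInf(last.le, 1) {
		return errors.New("the +Inf bucket is missing")
	}
	if last.count != count {
		return fmt.Errorf("+Inf bucket is %d but x_count is %d", last.count, count)
	}
	return nil
}

func main() {
	// Values taken from the http_request_duration_seconds example above.
	buckets := []bucket{
		{0.05, 24054}, {0.1, 33444}, {0.2, 100392},
		{0.5, 129389}, {1, 133988}, {inf, 144320},
	}
	fmt.Println(checkHistogram(buckets, 144320)) // <nil>
}
```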
74 | 


--------------------------------------------------------------------------------
/ha/README.md:
--------------------------------------------------------------------------------
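1 | # High Availability
2 | 
3 | High availability for a Prometheus monitoring and alerting system has two parts:
4 | 
5 | - High availability of Prometheus Server, so that there is no single point of failure.
6 | - High availability of Alertmanager, so that alerts are neither lost nor duplicated.
7 | 
8 | The following sections discuss each of these in turn.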
9 | 


--------------------------------------------------------------------------------
/ha/alertmanger.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/ha/alertmanger.md


--------------------------------------------------------------------------------
/ha/img1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/ha/img1.png


--------------------------------------------------------------------------------
/ha/img2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/ha/img2.png


--------------------------------------------------------------------------------
/ha/img3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/ha/img3.png


--------------------------------------------------------------------------------
/ha/img4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/ha/img4.png


--------------------------------------------------------------------------------
/ha/img5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/ha/img5.png


--------------------------------------------------------------------------------
/ha/prometheus.md:
--------------------------------------------------------------------------------
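  1 | # High Availability for Prometheus Server
  2 | 
  3 | Approach: use remote_read to build a cluster in which Prometheus reads and writes are separated, and thereby achieve high availability. The details follow.
  4 | 
  5 | ### Introduction to remote_read
  6 | 
  7 | Prometheus 1.8 added a configuration section called remote_read. Its options are: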
  8 | 
  9 | ```
 10 | # The URL of the endpoint to query from.
 11 | url: <string>
 12 | 
 13 | # Timeout for requests to the remote read endpoint.
 14 | [ remote_timeout: <duration> | default = 30s ]
 15 | 
 16 | # Sets the `Authorization` header on every remote read request with the
 17 | # configured username and password.
 18 | basic_auth:
 19 |   [ username: <string> ]
 20 |   [ password: <string> ]
 21 | 
 22 | # Sets the `Authorization` header on every remote read request with
 23 | # the configured bearer token. It is mutually exclusive with `bearer_token_file`.
 24 | [ bearer_token: <string> ]
 25 | 
 26 | # Sets the `Authorization` header on every remote read request with the bearer token
 27 | # read from the configured file. It is mutually exclusive with `bearer_token`.
 28 | [ bearer_token_file: /path/to/bearer/token/file ]
 29 | 
 30 | # Configures the remote read request's TLS settings.
 31 | tls_config:
 32 |   [ <tls_config> ]
 33 | 
 34 | # Optional proxy URL.
 35 | [ proxy_url: <string> ]
 36 | ```
 37 | 
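 38 | The `remote_read` setting reads data from a remote endpoint over HTTP.
 39 | 
 40 | The data source being read must implement the remote storage reader interface. This design decouples components in the new storage architecture, which makes the following easy:
 41 | 
 42 | - Prometheus reads and writes can happen on different Prometheus servers; that is, one Prometheus can read data from other Prometheus servers.
 43 | - Prometheus reads and writes can go through other storage engines; for example, you could use InfluxDB as the database that stores the data.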
 44 | 
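 45 | ### Deployment architecture
 46 | 
 47 | ![Deployment architecture diagram](img1.png)
 48 | 
 49 | Notes on the architecture:
 50 | 
 51 | - Server `A'` is a mirror of Server A, with the same function and data (likewise for B and C).
 52 | - All data flows use the pull model.
 53 | 
 54 | The Prometheus servers in this architecture fall into two groups: those used for data collection (such as B) and those used for data queries (such as A).
 55 | 
 56 | A query Prometheus reads data from the nodes that collected it. Note that it only performs live queries and in-memory computation; it does not store data.
 57 | 
 58 | With this architecture, data collection and data querying are cleanly separated across the monitoring system, which also makes high availability easier to achieve.
 59 | 
 60 | ### Configuration
 61 | 
 62 | The local experiment below demonstrates how Prometheus servers read data from one another with `remote_read`.
 63 | 
 64 | Software versions:
 65 | 
 66 | - Prometheus: prometheus-1.8.2.darwin-amd64
 67 | - Node Exporter: node_exporter-0.12.0.darwin-amd64
 68 | 
 69 | Experiment:
 70 | 
 71 | Three Prometheus servers run locally on ports `9090`, `9091`, and `9092`. `9091` and `9092` collect node_exporter data, and `9090` reads the data collected by `9091` and `9092`.
 72 | 
 73 | The configuration is as follows: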
 74 | 
 75 | ```
 76 | # Configuration of 9090, the instance that reads data
 77 | remote_read:
 78 |   - url: 'http://localhost:9091/api/v1/read'
 79 |     remote_timeout: 8s
 80 |   - url: 'http://localhost:9092/api/v1/read'
 81 |     remote_timeout: 8s
 82 | ```
 83 | 
 84 | ```
 85 | # Configuration of 9091, a data-collecting instance
 86 | - job_name: 'node'
 87 |   static_configs:
 88 |   - targets: ["localhost:9100"]
 89 | ```
 90 | 
 91 | ```
 92 | # Configuration of 9092, a data-collecting instance
 93 | - job_name: 'node'
 94 |   static_configs:
 95 |   - targets: ["localhost:9100"]
 96 | ```
 97 | 
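 98 | Once the configuration is complete and all three instances are running, verify the result in each instance's built-in UI.
 99 | 
100 | ![Data collected by 9091](img2.png)
101 | 
102 | ![Data collected by 9092](img3.png)
103 | 
104 | ![Data read remotely by 9090](img4.png)
105 | 
106 | Aggregation queries are supported as well:
107 | 
108 | ![Aggregation query](img5.png)
109 | 
110 | At this point we have successfully read data from different Prometheus servers through the `remote_read` configuration, which shows that the deployment architecture discussed earlier is feasible.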
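
The same verification can also be done from the command line through the HTTP API rather than the web UI. The Go sketch below only builds the instant-query URLs for the three local instances, so that they can be passed to `curl` or `http.Get`. The ports follow the experiment above; the helper name `queryURL` and the example expression `node_load1{job="node"}` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// queryURL builds an instant-query URL (/api/v1/query) for a Prometheus
// server listening on the given local port. Hypothetical helper.
func queryURL(port int, expr string) string {
	q := url.Values{}
	q.Set("query", expr) // Encode() percent-escapes braces, quotes, and '='
	return fmt.Sprintf("http://localhost:%d/api/v1/query?%s", port, q.Encode())
}

func main() {
	// 9091 and 9092 collect the data; 9090 should return both series via remote_read.
	for _, port := range []int{9090, 9091, 9092} {
		fmt.Println(queryURL(port, `node_load1{job="node"}`))
	}
}
```

If the setup works as described, the response from `9090` should contain the series that `9091` and `9092` each return on their own.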
111 | 


--------------------------------------------------------------------------------
/how-to-contribute.md:
--------------------------------------------------------------------------------
 1 | ## How to Contribute {#如何贡献项目}
 2 | 
 3 | * `fork` the project on GitHub into your own account, for example `xxx/prometheus_practice`, then `clone` it locally and set your user information.
 4 | 
 5 |   ```
 6 |   $ git clone git@github.com:xxx/prometheus_practice.git
 7 | 
 8 |   $ cd prometheus_practice
 9 | 
10 |   $ git config user.name "yourname"
11 | 
12 |   $ git config user.email "your email"
13 |   ```
14 | * Commit your changes and push them to your own repository.
15 | 
16 |   ```
17 |   $ #do some change on the content
18 | 
19 |   $ git commit -a -m "Fix issue #1: change helo to hello"
20 | 
21 |   $ git push
22 |   ```
23 | 
24 | * Open a pull request on GitHub.
25 | 
26 | * Periodically update your own repository from the project repository.
27 | 
28 |   ```
29 |   $ git remote add upstream https://github.com/songjiayang/prometheus_practice
30 | 
31 |   $ git fetch upstream
32 | 
33 |   $ git checkout master
34 | 
35 |   $ git rebase upstream/master
36 | 
37 |   $ git push -f origin master
38 |   ```
39 | 


--------------------------------------------------------------------------------
/images/alertmanager/slack-alert1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/alertmanager/slack-alert1.png


--------------------------------------------------------------------------------
/images/alertmanager/slack-alert2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/alertmanager/slack-alert2.png


--------------------------------------------------------------------------------
/images/alertmanager/slack-alert3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/alertmanager/slack-alert3.png


--------------------------------------------------------------------------------
/images/alertmanager/slack-alert4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/alertmanager/slack-alert4.png


--------------------------------------------------------------------------------
/images/alertmanager/slack-alert5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/alertmanager/slack-alert5.png


--------------------------------------------------------------------------------
/images/alertmanager/slack-alert6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/alertmanager/slack-alert6.png


--------------------------------------------------------------------------------
/images/cadvisor/cadvisor-01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/cadvisor-01.png


--------------------------------------------------------------------------------
/images/cadvisor/cadvisor-02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/cadvisor-02.png


--------------------------------------------------------------------------------
/images/cadvisor/cadvisor-03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/cadvisor-03.png


--------------------------------------------------------------------------------
/images/cadvisor/cadvisor-04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/cadvisor-04.png


--------------------------------------------------------------------------------
/images/cadvisor/cadvisor-05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/cadvisor-05.png


--------------------------------------------------------------------------------
/images/cadvisor/prometheus01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/prometheus01.png


--------------------------------------------------------------------------------
/images/cadvisor/prometheus02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/prometheus02.png


--------------------------------------------------------------------------------
/images/cadvisor/prometheus03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/prometheus03.png


--------------------------------------------------------------------------------
/images/cadvisor/prometheus04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/prometheus04.png


--------------------------------------------------------------------------------
/images/cadvisor/prometheus05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/prometheus05.png


--------------------------------------------------------------------------------
/images/cadvisor/prometheus06.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/prometheus06.png


--------------------------------------------------------------------------------
/images/cadvisor/prometheus07.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/cadvisor/prometheus07.png


--------------------------------------------------------------------------------
/images/exporter/sample_exporter_data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/exporter/sample_exporter_data.png


--------------------------------------------------------------------------------
/images/exporter/simple_exporter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/exporter/simple_exporter.png


--------------------------------------------------------------------------------
/images/install/prometheus-console.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/install/prometheus-console.png


--------------------------------------------------------------------------------
/images/install/prometheus-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/install/prometheus-graph.png


--------------------------------------------------------------------------------
/images/qa/reload_success.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/qa/reload_success.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-add-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-add-graph.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-added-panel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-added-panel.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-datasource.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-datasource.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-default-dashbord.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-default-dashbord.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-edit-panel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-edit-panel.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-hide-controls.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-hide-controls.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-into-manage-dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-into-manage-dashboard.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-login.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-login.png


--------------------------------------------------------------------------------
/images/visualiztion/grafana-prometheus-data-source.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/grafana-prometheus-data-source.png


--------------------------------------------------------------------------------
/images/visualiztion/prometheus-web-console.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/prometheus-web-console.png


--------------------------------------------------------------------------------
/images/visualiztion/prometheus-web-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/images/visualiztion/prometheus-web-graph.png


--------------------------------------------------------------------------------
/install/README.md:
--------------------------------------------------------------------------------
1 | # Installation
2 | 
3 | This chapter covers two ways to install Prometheus: from the traditional binary package and with Docker.
4 | 


--------------------------------------------------------------------------------
/install/binary.md:
--------------------------------------------------------------------------------
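 1 | # Binary Installation
 2 | 
 3 | Download the package for your operating system from the Prometheus [download page](https://prometheus.io/download/). The steps below use Ubuntu Server as the example.
 4 | 
 5 | ## Prerequisites
 6 | 
 7 | * linux amd64 (ubuntu server)
 8 | * prometheus 1.6.2
 9 | 
10 | ## Download Prometheus Server
11 | 
12 | Create a download directory so that it is easy to clean up after installation: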
13 | 
14 | ```
15 | mkdir ~/Download
16 | cd ~/Download
17 | ```
18 | 
19 | Use wget to download the Prometheus package:
20 | 
21 | ```
22 | wget https://github.com/prometheus/prometheus/releases/download/v1.6.2/prometheus-1.6.2.linux-amd64.tar.gz
23 | ```
24 | 
25 | Create a Prometheus directory to hold all Prometheus-related services:
26 | 
27 | ```
28 | mkdir ~/Prometheus
29 | cd ~/Prometheus
30 | ```
31 | 
32 | Extract prometheus-1.6.2.linux-amd64.tar.gz with tar:
33 | 
34 | ```
35 | tar -xvzf ~/Download/prometheus-1.6.2.linux-amd64.tar.gz
36 | cd prometheus-1.6.2.linux-amd64
37 | ```
38 | 
39 | Once extraction succeeds, run version to check that the binary works in your environment:
40 | 
41 | ```
42 | ./prometheus version
43 | ```
44 | 
45 | Output similar to the following means the installation succeeded:
46 | 
47 | ```
48 | prometheus, version 1.6.2 (branch: master, revision: xxxx)
49 |   build user:       xxxx
50 |   build date:       xxxx
51 |   go version:       go1.8.1
52 | ```
53 | 
54 | ## Start Prometheus Server
55 | 
56 | ```
57 | ./prometheus
58 | ```
59 | 
60 | If Prometheus starts normally, you will see output like this:
61 | 
62 | ```
63 | INFO[0000] Starting prometheus (version=1.6.2, branch=master, revision=b38e977fd8cc2a0d13f47e7f0e17b82d1a908a9a)  source=main.go:88
64 | INFO[0000] Build context (go=go1.8.1, user=root@c99d9d650cf4, date=20170511-13:03:00)  source=main.go:89
65 | INFO[0000] Loading configuration file prometheus.yml     source=main.go:251
66 | INFO[0000] Loading series map and head chunks...         source=storage.go:421
67 | INFO[0000] 0 series loaded.                              source=storage.go:432
68 | INFO[0000] Starting target manager...                    source=targetmanager.go:61
69 | INFO[0000] Listening on :9090                            source=web.go:259
70 | ```
71 | 
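72 | The startup log shows that Prometheus Server listens on port 9090 by default.
73 | 
74 | Once Prometheus is running, open `http://IP:9090` in a browser and you will see the following page:
75 | 
76 | ![prometheus-graph.png](/images/install/prometheus-graph.png)
77 | 
78 | The default configuration already scrapes the Prometheus Server itself, so you can query its metrics right away with `PromQL` (Prometheus Query Language), for example: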
79 | 
80 | ![prometheus-console.png](/images/install/prometheus-console.png)
81 | 
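82 | ## Summary
83 | 
84 | 1. Installing Prometheus from the binary package is straightforward: it has no external dependencies and ships with a built-in web UI for queries.
85 | 2. In production, register Prometheus with the init system or run it under supervisord so that it starts automatically as a service.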
86 | 


--------------------------------------------------------------------------------
/install/docker.md:
--------------------------------------------------------------------------------
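 1 | # Docker Installation
 2 | 
 3 | First make sure a recent version of Docker is installed. If it is not, see the [installation guide](https://docs.docker.com/engine/installation/).
 4 | 
 5 | The steps below use Docker for Mac as the example.
 6 | 
 7 | ## Installation
 8 | 
 9 | The Docker image is hosted on [Quay.io](https://quay.io/repository/prometheus/prometheus).
10 | 
11 | Run the following command to install: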
12 | 
13 | ```
14 | $ docker run --name prometheus -d -p 127.0.0.1:9090:9090 quay.io/prometheus/prometheus
15 | ```
16 | 
17 | If the installation succeeded, open `127.0.0.1:9090` to see this page:
18 | 
19 | ![prometheus-graph.png](/images/install/prometheus-graph.png)
20 | 
21 | ## Managing Prometheus with Docker
22 | 
23 | Run `docker ps` to list the running containers:
24 | 
25 | ```
26 | CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS              PORTS                      NAMES
27 | e9ebc2435387        quay.io/prometheus/prometheus   "/bin/prometheus -..."   26 minutes ago      Up 26 minutes       127.0.0.1:9090->9090/tcp   prometheus
28 | ```
29 | 
30 | Run `docker start prometheus` to start the service.
31 | 
32 | Run `docker stats prometheus` to view the resource usage of the prometheus container.
33 | 
34 | Run `docker stop prometheus` to stop the service.
35 | 


--------------------------------------------------------------------------------
/introduction/README.md:
--------------------------------------------------------------------------------
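 1 | # Introduction
 2 | 
 3 | This chapter introduces you to the world of Prometheus.
 4 | 
 5 | What is Prometheus?
 6 | 
 7 | What benefits does it bring, why choose it, and how does it differ from other monitoring solutions?
 8 | 
 9 | Let's begin the journey with these questions in mind.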
10 | 


--------------------------------------------------------------------------------
/introduction/what.md:
--------------------------------------------------------------------------------
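 1 | # What Is Prometheus
 2 | 
 3 | [Prometheus](https://prometheus.io) is an open source monitoring and alerting solution from SoundCloud. Development began in 2012, and since it was open-sourced on GitHub in 2015 it has attracted more than 9k stars and been adopted by many large companies. In 2016 Prometheus became the second project to join the CNCF \([Cloud Native Computing Foundation](https://cncf.io/)\), after k8s.
 4 | 
 5 | As a new-generation open source solution, many of its ideas coincide with Google's SRE practices.
 6 | 
 7 | ## Main Features
 8 | 
 9 | * A multi-dimensional [data model](https://prometheus.io/docs/concepts/data_model/) (a time series is identified by a metric name and a set of key/value labels).
10 | * A flexible query language ([PromQL](https://prometheus.io/docs/querying/basics/)).
11 | * Storage with no external dependencies, supporting both local and remote models.
12 | * Data collection over HTTP using a pull model, which is simple and easy to understand.
13 | * Monitoring targets found through service discovery or static configuration.
14 | * Support for multiple statistical data models, with friendly graphing.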
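
To make the data model in the first bullet concrete, here is a minimal Go sketch of how a metric name plus a label set identifies one time series. The `seriesKey` helper is hypothetical and written only for illustration; it is not how Prometheus stores series internally.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey renders a metric name and its labels in the familiar
// name{label="value",...} notation. Labels are sorted by name so that the
// same label set always produces the same key. (Hypothetical helper.)
func seriesKey(name string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for k := range labels {
		names = append(names, k)
	}
	sort.Strings(names)

	pairs := make([]string, 0, len(names))
	for _, k := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	a := seriesKey("http_requests_total", map[string]string{"method": "get", "code": "200"})
	b := seriesKey("http_requests_total", map[string]string{"method": "get", "code": "400"})
	fmt.Println(a)
	fmt.Println(b)
	// The two keys differ, so these are two separate time series.
	fmt.Println("same series:", a == b)
}
```

Changing any label value yields a different key, which is why each distinct label combination is stored and queried as its own series.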
15 | 
16 | ## Core Components
17 | 
18 | * [Prometheus Server](https://github.com/prometheus/prometheus), which scrapes and stores time series data, and also provides querying and alert rule management.
19 | * [client libraries](https://prometheus.io/docs/instrumenting/clientlibs/), used to instrument application code so that it exposes metrics for the Prometheus Server to scrape.
20 | * [push gateway](https://github.com/prometheus/pushgateway), an intermediate collection point for batch and short-lived jobs, mainly used for reporting business metrics.
21 | * Various [exporters](https://prometheus.io/docs/instrumenting/exporters/) that report data, such as node\_exporter for machine metrics and the [MongoDB exporter](https://github.com/dcu/mongodb_exporter) for MongoDB metrics.
22 | * [alertmanager](https://github.com/prometheus/alertmanager), which manages alert notifications.
23 | 
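24 | ## Architecture
25 | 
26 | A picture is worth a thousand words, so here is the official architecture diagram:
27 | 
28 | ![](https://prometheus.io/assets/architecture.svg)
29 | 
30 | The diagram shows that the main parts of Prometheus include the Server, Exporters, Pushgateway, PromQL, Alertmanager, and the Web UI.
31 | 
32 | The overall workflow is roughly as follows:
33 | 
34 | 1. The Prometheus server periodically pulls data from targets that are either statically configured or found through service discovery.
35 | 2. When newly pulled data exceeds the configured in-memory buffer, Prometheus persists it to disk (or to remote storage, if one is configured).
36 | 3. Prometheus can be configured with rules that it evaluates on a schedule; when a rule's condition is met, it pushes an alert to the configured Alertmanager.
37 | 4. When Alertmanager receives alerts, it groups them, deduplicates them, and reduces noise according to its configuration, and finally sends out notifications.
38 | 5. Data can be queried and aggregated through the API, the Prometheus Console, or Grafana.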
39 | 
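40 | ## Caveats
41 | 
42 | * Prometheus stores time series of float64 values. If your data has other value types, Prometheus cannot represent them.
43 | * Prometheus is not suited to auditing or billing. It samples data at intervals and focuses on the instantaneous state and trends of a running system, so losing a small amount of data is tolerable. Auditing and billing need every request recorded and the data kept long term, which Prometheus cannot provide; a dedicated audit system is likely needed for that.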
44 | 


--------------------------------------------------------------------------------
/introduction/why.md:
--------------------------------------------------------------------------------
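 1 | # Why Choose Prometheus
 2 | 
 3 | The preface briefly covered why we chose Prometheus and the benefits it brought us.
 4 | 
 5 | This section compares it with other monitoring solutions to give a clearer picture of Prometheus.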
 6 | 
 7 | ## Prometheus vs Zabbix
 8 | 
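 9 | * Zabbix is written in C and PHP, while Prometheus is written in Golang; overall Prometheus runs somewhat faster.
10 | * Zabbix is a traditional host monitoring system, used mainly for physical hosts, switches, and networks. Prometheus handles host monitoring and is also suited to Cloud, SaaS, Openstack, and Container monitoring.
11 | * Zabbix has a richer set of plugins for traditional host monitoring.
12 | * Zabbix lets you configure many things in its web GUI, whereas Prometheus requires editing configuration files by hand.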
13 | 
14 | ## Prometheus vs Graphite
15 | 
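16 | * [Graphite](http://graphite.readthedocs.io/en/latest/overview.html) has a narrower scope. It focuses on two things, storing time series data and
17 | visualizing it, and anything else requires additional plugins. Prometheus is a one-stop solution that provides common features such as alerting and trend analysis, along with more powerful data storage and querying.
18 | * Graphite does better on horizontal scaling and on data retention periods.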
19 | 
20 | ## Prometheus vs InfluxDB
21 | 
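22 | * [InfluxDB](https://www.influxdata.com/) is an open source time series database used mainly for storing data. Building a monitoring and alerting system on it
23 | requires other systems as well.
24 | * InfluxDB does better on horizontal storage scaling and high availability; a database is, after all, its core.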
25 | 
26 | ## Prometheus vs OpenTSDB
27 | 
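28 | * [OpenTSDB](http://opentsdb.net/) is a distributed time series database that depends on Hadoop and HBase and can retain data for longer.
29 | If you already run Hadoop and HBase, it is a good choice.
30 | * Building a monitoring and alerting system on OpenTSDB requires other systems as well.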
31 | 
32 | ## Prometheus vs Nagios
33 | 
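34 | * [Nagios](https://www.nagios.org/) does not support custom labels on its data or querying, its alerts support neither noise reduction nor grouping, and it has no data storage; querying historical state requires a plugin.
35 | * Nagios is a monitoring system from the 1990s that suits small clusters or static systems. It shows its age and lacks many features, and Prometheus compares very favorably.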
36 | 
37 | ## Prometheus vs Sensu
38 | 
39 | * [Sensu](https://sensuapp.org/) is, broadly speaking, an upgraded Nagios that fixes many of Nagios's problems. If you know Nagios well, Sensu is a good choice.
40 | * Sensu depends on RabbitMQ and Redis, and its data storage scales better.
41 | 
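42 | ## Summary
43 | 
44 | * Prometheus is a one-stop monitoring and alerting platform with few dependencies and a full feature set.
45 | * Prometheus supports monitoring of cloud and container environments, whereas the other systems mainly target hosts.
46 | * The Prometheus query language is more expressive and has more powerful built-in statistical functions.
47 | * Prometheus is weaker than InfluxDB, OpenTSDB, and Sensu in storage scalability and durability.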
48 | 


--------------------------------------------------------------------------------
/optimize/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/optimize/README.md


--------------------------------------------------------------------------------
/optimize/config.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/optimize/config.md


--------------------------------------------------------------------------------
/optimize/logger.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/optimize/logger.md


--------------------------------------------------------------------------------
/optimize/status.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/optimize/status.md


--------------------------------------------------------------------------------
/password.html:
--------------------------------------------------------------------------------
 1 | ---
 2 | layout: application
 3 | ---
 4 | 
 5 | <div class="topnav-container">
 6 |   <div class="container">
 7 |     <div class="topnav__icon"></div>
 8 |     <div class="navs">
 9 |       <a class="topnav__item" href="/company">Application Overview</a>
10 |       <a class="topnav__item active" href="/company/settings/station.html">System Settings</a>
11 |       <a class="topnav__item" href="/company">Contact Us</a>
12 |     </div>
13 |   </div>
14 | </div>
15 | 
16 | <div class="container">
17 |   <div class="row">
18 |     <div class="col-md-2">
19 |       <ul class="app-nav">
20 | 
21 |         <li class="app-nav__item active">
22 |           <span class="app-nav__item-icon icon6"></span>
23 |           <a href="/company/settings/station.html" class="gray-color">Site Management</a>
24 |         </li>
25 | 
26 |         <li class="app-nav__item active">
27 |           <span class="app-nav__item-icon icon6"></span>
28 |           <a href="/company/settings/password.html" class="blue-color">Change Password</a>
29 |         </li>
30 |       </ul>
31 |     </div>
32 | 
33 |     <div class="col-md-10 nav-content">
34 |       <header class="nav-content__header">
35 |         <label>Change Password</label>
36 |       </header>
37 |       <div class="nav-content__main">
38 |         <form class="password-form jsAppModule" method="post" data-module="PasswordForm">
39 |           <div class="form-group">
40 |             <label for="">Current Password</label>
41 |             <input type="password" class="form-control" name="password" autocomplete="off">
42 |           </div>
43 | 
44 |           <div class="form-group">
45 |             <label for="">New Password</label>
46 |             <input type="password" class="form-control" name="new_password" autocomplete="off">
47 |           </div>
48 | 
49 |           <div class="form-group">
50 |             <label for="">Confirm New Password</label>
51 |             <input type="password" class="form-control" name="new_password_confirm" autocomplete="off">
52 |           </div>
53 | 
54 |           <div class="form-group">
55 |             <button type="submit" class="btn btn-primary">Save</button>
56 |           </div>
57 |       </form>
58 |       </div>
59 |     </div>
60 |   </div>
61 | </div>
62 | 


--------------------------------------------------------------------------------
/promql/README.md:
--------------------------------------------------------------------------------
1 | # PromQL
2 | 
3 | This chapter covers basic PromQL usage with examples and compares PromQL with SQL.
4 | 


--------------------------------------------------------------------------------
/promql/sql.md:
--------------------------------------------------------------------------------
  1 | # Comparison with SQL
  2 | 
  3 | The comparison below uses the `http_requests_total` time series collected by the Prometheus server as its example.
  4 | 
  5 | ## Preparing the MySQL Data
  6 | 
  7 | ```
  8 | mysql>
  9 | # Create the database
 10 | create database prometheus_practice;
 11 | use prometheus_practice;
 12 | 
 13 | # Create the http_requests_total table
 14 | CREATE TABLE http_requests_total (
 15 |     code VARCHAR(256),
 16 |     handler VARCHAR(256),
 17 |     instance VARCHAR(256),
 18 |     job VARCHAR(256),
 19 |     method VARCHAR(256),
 20 |     created_at DOUBLE NOT NULL,
 21 |     value DOUBLE NOT NULL) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 22 | 
 23 | ALTER TABLE http_requests_total ADD INDEX created_at_index (created_at);
 24 | 
 25 | # Seed the data
 26 | # time at 2017/5/22 14:45:27
 27 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "query_range", "localhost:9090", "prometheus", "get", 1495435527, 3);
 28 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("400", "query_range", "localhost:9090", "prometheus", "get", 1495435527, 5);
 29 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "prometheus", "localhost:9090", "prometheus", "get", 1495435527, 6418);
 30 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "static", "localhost:9090", "prometheus", "get", 1495435527, 9);
 31 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("304", "static", "localhost:9090", "prometheus", "get", 1495435527, 19);
 32 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "query", "localhost:9090", "prometheus", "get", 1495435527, 87);
 33 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("400", "query", "localhost:9090", "prometheus", "get", 1495435527, 26);
 34 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "graph", "localhost:9090", "prometheus", "get", 1495435527, 7);
 35 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "label_values", "localhost:9090", "prometheus", "get", 1495435527, 7);
 36 | 
 37 | # time at 2017/5/22 14:48:27
 38 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "query_range", "localhost:9090", "prometheus", "get", 1495435707, 3);
 39 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("400", "query_range", "localhost:9090", "prometheus", "get", 1495435707, 5);
 40 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "prometheus", "localhost:9090", "prometheus", "get", 1495435707, 6418);
 41 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "static", "localhost:9090", "prometheus", "get", 1495435707, 9);
 42 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("304", "static", "localhost:9090", "prometheus", "get", 1495435707, 19);
 43 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "query", "localhost:9090", "prometheus", "get", 1495435707, 87);
 44 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("400", "query", "localhost:9090", "prometheus", "get", 1495435707, 26);
 45 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "graph", "localhost:9090", "prometheus", "get", 1495435707, 7);
 46 | INSERT INTO http_requests_total (code, handler, instance, job, method, created_at, value) values ("200", "label_values", "localhost:9090", "prometheus", "get", 1495435707, 7);
 47 | ```
 48 | 
 49 | After the data is loaded, a query returns the following rows:
 50 | 
 51 | ```
 52 | mysql>
 53 | mysql> select * from http_requests_total;
 54 | +------+--------------+----------------+------------+--------+------------+-------+
 55 | | code | handler      | instance       | job        | method | created_at | value |
 56 | +------+--------------+----------------+------------+--------+------------+-------+
 57 | | 200  | query_range  | localhost:9090 | prometheus | get    | 1495435527 |     3 |
 58 | | 400  | query_range  | localhost:9090 | prometheus | get    | 1495435527 |     5 |
 59 | | 200  | prometheus   | localhost:9090 | prometheus | get    | 1495435527 |  6418 |
 60 | | 200  | static       | localhost:9090 | prometheus | get    | 1495435527 |     9 |
 61 | | 304  | static       | localhost:9090 | prometheus | get    | 1495435527 |    19 |
 62 | | 200  | query        | localhost:9090 | prometheus | get    | 1495435527 |    87 |
 63 | | 400  | query        | localhost:9090 | prometheus | get    | 1495435527 |    26 |
 64 | | 200  | graph        | localhost:9090 | prometheus | get    | 1495435527 |     7 |
 65 | | 200  | label_values | localhost:9090 | prometheus | get    | 1495435527 |     7 |
 66 | | 200  | query_range  | localhost:9090 | prometheus | get    | 1495435707 |     3 |
 67 | | 400  | query_range  | localhost:9090 | prometheus | get    | 1495435707 |     5 |
 68 | | 200  | prometheus   | localhost:9090 | prometheus | get    | 1495435707 |  6418 |
 69 | | 200  | static       | localhost:9090 | prometheus | get    | 1495435707 |     9 |
 70 | | 304  | static       | localhost:9090 | prometheus | get    | 1495435707 |    19 |
 71 | | 200  | query        | localhost:9090 | prometheus | get    | 1495435707 |    87 |
 72 | | 400  | query        | localhost:9090 | prometheus | get    | 1495435707 |    26 |
 73 | | 200  | graph        | localhost:9090 | prometheus | get    | 1495435707 |     7 |
 74 | | 200  | label_values | localhost:9090 | prometheus | get    | 1495435707 |     7 |
 75 | +------+--------------+----------------+------------+--------+------------+-------+
 76 | 18 rows in set (0.00 sec)
 77 | ```
 78 | 
 79 | ## Basic Queries
 80 | 
 81 | Assume the current time is 2017/5/22 14:48:30.
 82 | 
 83 | * Query all current data
 84 | 
 85 | ```
 86 | // PromQL
 87 | http_requests_total
 88 | 
 89 | // MySQL
 90 | SELECT * from http_requests_total WHERE created_at BETWEEN 1495435700 AND 1495435710;
 91 | ```
 92 | 
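 93 | When querying MySQL, we have to reach back from the current time by some interval, such as the 10s used here (the Prometheus scrape interval), to be sure of hitting the latest rows. PromQL handles this automatically: an instant query returns the most recent sample of each series within a lookback window.
 94 | 
 95 | * Conditional query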
 96 | 
 97 | ```
 98 | // PromQL
 99 | http_requests_total{code="200", handler="query"}
100 | 
101 | // MySQL
102 | SELECT * from http_requests_total WHERE code="200" AND handler="query" AND created_at BETWEEN 1495435700 AND 1495435710;
103 | ```
104 | 
105 | * Pattern query: rows whose code is 2xx
106 | 
107 | ```
108 | // PromQL
109 | http_requests_total{code=~"2.."}
110 | 
111 | // MySQL
112 | SELECT * from http_requests_total WHERE code LIKE "2__" AND created_at BETWEEN 1495435700 AND 1495435710;
113 | ```
114 | 
115 | * Comparison query: rows whose value is greater than 100
116 | 
117 | ```
118 | // PromQL
119 | http_requests_total > 100
120 | 
121 | // MySQL
122 | SELECT * from http_requests_total WHERE value > 100 AND created_at BETWEEN 1495435700 AND 1495435710;
123 | ```
124 | 
125 | * Range query: data from the past 5 minutes
126 | 
127 | ```
128 | // PromQL
129 | http_requests_total[5m]
130 | 
131 | // MySQL
132 | SELECT * from http_requests_total WHERE created_at BETWEEN 1495435410 AND 1495435710;
133 | ```
134 | 
135 | ## Aggregation and Statistical Queries
136 | 
137 | * count query: count the current records
138 | 
139 | ```
140 | // PromQL
141 | count(http_requests_total)
142 | 
143 | // MySQL
144 | SELECT COUNT(*) from http_requests_total WHERE created_at BETWEEN 1495435700 AND 1495435710;
145 | ```
146 | 
147 | * sum query: total of the current values
148 | 
149 | ```
150 | // PromQL
151 | sum(http_requests_total)
152 | 
153 | // MySQL
154 | SELECT SUM(value) from http_requests_total WHERE created_at BETWEEN 1495435700 AND 1495435710;
155 | ```
156 | 
157 | * avg query: average of the current values
158 | 
159 | ```
160 | // PromQL
161 | avg(http_requests_total)
162 | 
163 | // MySQL
164 | SELECT AVG(value) from http_requests_total WHERE created_at BETWEEN 1495435700 AND 1495435710;
165 | ```
166 | 
167 | * top query: the 3 highest values
168 | 
169 | ```
170 | // PromQL
171 | topk(3, http_requests_total)
172 | 
173 | // MySQL
174 | SELECT * from http_requests_total WHERE created_at BETWEEN 1495435700 AND 1495435710 ORDER BY value DESC LIMIT 3;
175 | ```
176 | 
177 | * rate query: average per-second increase over the past 5 minutes
178 | 
179 | ```
180 | // PromQL
181 | rate(http_requests_total[5m])
182 | 
183 | // MySQL (approximation: assumes the counter does not reset within the window)
184 | SELECT code, handler, instance, job, method, (MAX(value) - MIN(value))/300 AS value from http_requests_total WHERE created_at BETWEEN 1495435410 AND 1495435710 GROUP BY code, handler, instance, job, method;
185 | ```
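
Because `http_requests_total` is a counter, a per-second rate comes from the difference between samples, not from their sum. The Go sketch below contrasts an average rate over a whole window with an instantaneous rate over the last two samples. It is a simplification: the sample values are made up, and the counter-reset handling and extrapolation that PromQL performs are left out.

```go
package main

import "fmt"

// Sample is one (timestamp, value) pair of a counter series.
type Sample struct {
	T int64   // Unix time in seconds
	V float64 // counter value at that time
}

// avgRate returns the per-second increase between the first and the last
// sample. It reports false when fewer than two samples are available.
// Simplified: no counter-reset handling and no extrapolation.
func avgRate(s []Sample) (float64, bool) {
	if len(s) < 2 {
		return 0, false
	}
	first, last := s[0], s[len(s)-1]
	if last.T == first.T {
		return 0, false
	}
	return (last.V - first.V) / float64(last.T-first.T), true
}

// instantRate uses only the last two samples, so it reacts faster to
// recent changes than avgRate does.
func instantRate(s []Sample) (float64, bool) {
	if len(s) < 2 {
		return 0, false
	}
	return avgRate(s[len(s)-2:])
}

func main() {
	// Hypothetical counter samples taken 100 seconds apart.
	samples := []Sample{{T: 0, V: 100}, {T: 100, V: 200}, {T: 200, V: 500}}

	avg, _ := avgRate(samples)
	inst, _ := instantRate(samples)
	fmt.Printf("average: %.2f/s, instant: %.2f/s\n", avg, inst)
	// prints "average: 2.00/s, instant: 3.00/s"
}
```

The averaged form smooths out bursts, while the two-sample form follows the most recent change, which is the general distinction between `rate` and `irate`.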
186 | 
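187 | ## Summary
188 | 
189 | As these examples show, for common queries and statistics PromQL is considerably simpler and richer than the equivalent MySQL, and its query performance on this kind of time series data is also considerably better.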
190 | 


--------------------------------------------------------------------------------
/promql/summary.md:
--------------------------------------------------------------------------------
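 1 | # PromQL Basics
 2 | 
 3 | PromQL (Prometheus Query Language) is the query DSL developed for Prometheus. It is highly expressive, has many built-in functions, and is used both for everyday data visualization and in alerting rules.
 4 | 
 5 | Enter a query such as the following on the `http://localhost:9090/graph` page to see its result: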
 6 | 
 7 | ```
 8 | http_requests_total{code="200"}
 9 | ```
10 | 
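11 | ## Strings and Numbers
12 | 
13 | **Strings**: In a query, strings mostly appear as label values in matchers. They follow the same escaping rules as Go strings and may be written with `""`, `''`, or ` `` `, for example: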
14 | 
15 | ```
16 | "this is a string"
17 | 'these are unescaped: \n \\ \t'
18 | `these are not unescaped: \n ' " \t`
19 | ```
20 | 
21 | **Integers and floats**: Expressions may contain integer or floating-point literals, for example:
22 | 
23 | ```
24 | 3
25 | -2.4
26 | ```
27 | 
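28 | ## Query Result Types
29 | 
30 | A PromQL query result has one of three main types:
31 | 
32 | * Instant vector: a set of time series, each with a single sample, for example `http_requests_total`
33 | * Range vector: a set of time series, each with multiple samples over a time range, for example `http_requests_total[5m]`
34 | * Scalar: a single numeric value with no time series attached, for example `scalar(count(http_requests_total))`; note that `count(http_requests_total)` on its own returns a one-element instant vector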
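
The difference between an instant vector and a range vector can be sketched in Go for a single series. This is an illustration under stated assumptions, not the Prometheus implementation: the `lookback` and `window` parameters and both helper names are hypothetical, and both bounds are treated as inclusive for simplicity. The sample data reuses the two `handler="query"`, `code="200"` rows from the SQL comparison chapter.

```go
package main

import "fmt"

// Sample is one (timestamp, value) pair of a series.
type Sample struct {
	T int64   // Unix time in seconds
	V float64 // sample value
}

// instantValue returns the most recent sample at or before t that is no
// older than lookback seconds: one point per series, as in an instant
// vector. samples must be sorted by time in ascending order.
func instantValue(samples []Sample, t, lookback int64) (Sample, bool) {
	var found Sample
	ok := false
	for _, s := range samples {
		if s.T <= t && s.T >= t-lookback {
			found, ok = s, true
		}
	}
	return found, ok
}

// rangeValues returns every sample in [t-window, t]: several points per
// series, as in a range vector.
func rangeValues(samples []Sample, t, window int64) []Sample {
	var out []Sample
	for _, s := range samples {
		if s.T >= t-window && s.T <= t {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	samples := []Sample{{T: 1495435527, V: 87}, {T: 1495435707, V: 87}}
	now := int64(1495435710) // 2017/5/22 14:48:30

	latest, ok := instantValue(samples, now, 300)
	fmt.Println("instant:", latest, ok)
	fmt.Println("range [5m]:", rangeValues(samples, now, 300))
}
```

With these inputs the instant selection keeps only the sample at 1495435707, while the five-minute range keeps both samples.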
35 | 
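36 | ## Query Conditions
37 | 
38 | Prometheus stores time series, and each series is identified by a name and a set of labels. The name can itself be written as a label: `http_requests_total` is equivalent to `{__name__="http_requests_total"}`.
39 | 
40 | A simple query is essentially a filter on labels, for example: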
41 | 
42 | ```
43 | http_requests_total{code="200"} # selects series named http_requests_total whose code is "200"
44 | ```
45 | 
46 | Matchers also support negation and regular expressions, for example:
47 | 
48 | ```
49 | http_requests_total{code!="200"}  # series whose code is not "200"
50 | http_requests_total{code=~"2.."} # series whose code is "2xx"
51 | http_requests_total{code!~"2.."} # series whose code is not "2xx"
52 | ```
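
The four matcher operators can be sketched in Go as follows. The `Matcher` type here is hypothetical and exists only for illustration. It models two behaviors worth knowing: regular expressions are anchored, so they must match the whole label value, and a missing label behaves like an empty value.

```go
package main

import (
	"fmt"
	"regexp"
)

// MatchOp enumerates the four label matcher operators.
type MatchOp int

const (
	Equal        MatchOp = iota // =
	NotEqual                    // !=
	RegexMatch                  // =~
	RegexNoMatch                // !~
)

// Matcher is a hypothetical, simplified label matcher.
type Matcher struct {
	Label string
	Op    MatchOp
	Value string
}

// Matches reports whether the label set satisfies the matcher.
func (m Matcher) Matches(labels map[string]string) (bool, error) {
	got := labels[m.Label] // a missing label yields the empty string
	switch m.Op {
	case Equal:
		return got == m.Value, nil
	case NotEqual:
		return got != m.Value, nil
	case RegexMatch, RegexNoMatch:
		// Anchor the pattern so it must match the entire value.
		re, err := regexp.Compile("^(?:" + m.Value + ")$")
		if err != nil {
			return false, err
		}
		matched := re.MatchString(got)
		if m.Op == RegexNoMatch {
			return !matched, nil
		}
		return matched, nil
	}
	return false, fmt.Errorf("unknown match operator %d", m.Op)
}

func main() {
	labels := map[string]string{"code": "200", "handler": "query"}
	for _, m := range []Matcher{
		{Label: "code", Op: RegexMatch, Value: "2.."},
		{Label: "code", Op: RegexMatch, Value: "2"}, // anchored, so no match
		{Label: "code", Op: RegexNoMatch, Value: "2.."},
		{Label: "code", Op: NotEqual, Value: "200"},
	} {
		ok, err := m.Matches(labels)
		fmt.Println(m.Value, ok, err)
	}
}
```

Anchoring explains why `code=~"2"` does not select `code="200"`, whereas `code=~"2.."` does.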
53 | 
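54 | ## Operators
55 | 
56 | PromQL supports the usual expression operators, for example:
57 | 
58 | **Arithmetic operators**:
59 | 
60 | The arithmetic operators are `+`, `-`, `*`, `/`, `%`, and `^`. For example, `http_requests_total * 2` doubles every value of http_requests_total.
61 | 
62 | **Comparison operators**:
63 | 
64 | The comparison operators are `==`, `!=`, `>`, `<`, `>=`, and `<=`. For example, `http_requests_total > 100` keeps only the http_requests_total results greater than 100.
65 | 
66 | **Logical operators**:
67 | 
68 | The logical operators are `and`, `or`, and `unless`. For example, `http_requests_total == 5 or http_requests_total == 2` returns the http_requests_total results equal to 5 or 2.
69 | 
70 | **Aggregation operators**:
71 | 
72 | The aggregation operators are `sum`, `min`, `max`, `avg`, `stddev`, `stdvar`, `count`, `count_values`, `bottomk`, `topk`, and `quantile`. For example, `max(http_requests_total)` returns the largest value among the http_requests_total results.
73 | 
74 | Note that, as in ordinary arithmetic, PromQL operators have precedence. From highest to lowest: (`^`) > (`*`, `/`, `%`) > (`+`, `-`) > (`==`, `!=`, `<=`, `<`, `>=`, `>`) > (`and`, `unless`) > (`or`).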
75 | 
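76 | ## Built-in Functions
77 | 
78 | Prometheus has many built-in functions for querying and formatting data, such as floor and ceil, which round a floating-point result to an integer: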
79 | 
80 | ```
81 | floor(avg(http_requests_total{code="200"}))
82 | ceil(avg(http_requests_total{code="200"}))
83 | ```
84 | 
85 | Per-second average rate of http_requests_total over the last 5 minutes:
86 | 
87 | ```
88 | rate(http_requests_total[5m])
89 | ```
90 | 
91 | See the [function reference](https://prometheus.io/docs/querying/functions/) for more.
92 | 


--------------------------------------------------------------------------------
/pushgateway/README.md:
--------------------------------------------------------------------------------
1 | # Pushgateway
2 | 
3 | This chapter shares practical experience with pushgateway and points to watch out for.
4 | 


--------------------------------------------------------------------------------
/pushgateway/how.md:
--------------------------------------------------------------------------------
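 1 | # Installing and Using Pushgateway
 2 | 
 3 | ## Installation
 4 | 
 5 | ### Binary Installation
 6 | 
 7 | Pick the version you need from the [download page](https://github.com/prometheus/pushgateway/releases/tag/v0.4.0).
 8 | The steps below use v0.4.0 as the example.
 9 | 
10 | * Download pushgateway with wget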
11 | 
12 | ```
13 | cd ~/Download
14 | wget https://github.com/prometheus/pushgateway/releases/download/v0.4.0/pushgateway-0.4.0.linux-amd64.tar.gz
15 | ```
16 | 
17 | * Extract pushgateway-0.4.0.linux-amd64.tar.gz with tar
18 | 
19 | ```
20 | cd ~/Prometheus
21 | tar -xvzf ~/Download/pushgateway-0.4.0.linux-amd64.tar.gz
22 | cd pushgateway-0.4.0.linux-amd64
23 | ```
24 | 
25 | * Start pushgateway
26 | 
27 | Run `./pushgateway -h` to list the available options and `./pushgateway` to start it. Output similar to the following means it started successfully:
28 | 
29 | ```
30 | INFO[0000] Starting pushgateway (version=0.4.0, branch=master, revision=6ceb4a19fa85ac2d6c2d386c144566fb1ede1f6c)  source=main.go:57
31 | INFO[0000] Build context (go=go1.8.3, user=root@87741d1b66a9, date=20170609-12:26:14)  source=main.go:58
32 | INFO[0000] Listening on :9091.                           source=main.go:102
33 | ```
34 | 
35 | ### Docker Installation
36 | 
37 | You can use the [prom/pushgateway](https://registry.hub.docker.com/u/prom/pushgateway/) Docker image:
38 | 
39 | ```
40 | docker pull prom/pushgateway
41 | 
42 | docker run -d -p 9091:9091 prom/pushgateway
43 | ```
44 | 
45 | ## Managing Data
46 | 
47 | Normally you push data to pushgateway through a client SDK, but you can also manage it through the API, for example:
48 | 
49 | * Add a single sample to `{job="some_job"}`:
50 | 
51 | ```
52 | echo "some_metric 3.14" | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job
53 | ```
54 | 
55 | * Add more, and more complex, data. Pushed data usually carries an instance label that identifies where it came from:
56 | 
57 | ```
58 | cat <<EOF | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job/instance/some_instance
59 | # TYPE some_metric counter
60 | some_metric{label="val1"} 42
61 | # TYPE another_metric gauge
62 | # HELP another_metric Just an example.
63 | another_metric 2398.283
64 | EOF
65 | ```
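
The same push can be made from code. The Go sketch below is a minimal illustration that uses only the standard library rather than the official client SDK. The `pushURL` and `push` helpers are hypothetical, job and instance names are assumed to be URL-safe, and a local test server stands in for a real pushgateway so the example runs on its own.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// pushURL builds <base>/metrics/job/<job>[/instance/<instance>].
// Assumption: job and instance contain only URL-safe characters.
func pushURL(base, job, instance string) string {
	u := strings.TrimRight(base, "/") + "/metrics/job/" + job
	if instance != "" {
		u += "/instance/" + instance
	}
	return u
}

// push sends metrics in the text exposition format with an HTTP POST,
// mirroring the curl --data-binary call.
func push(client *http.Client, target, body string) error {
	resp, err := client.Post(target, "text/plain", strings.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("push failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Stand-in for a pushgateway: it echoes what it receives.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		payload, _ := io.ReadAll(r.Body)
		fmt.Printf("%s %s\n%s", r.Method, r.URL.Path, payload)
		w.WriteHeader(http.StatusAccepted)
	}))
	defer srv.Close()

	body := "# TYPE some_metric counter\nsome_metric{label=\"val1\"} 42\n"
	target := pushURL(srv.URL, "some_job", "some_instance")
	if err := push(srv.Client(), target, body); err != nil {
		fmt.Println("error:", err)
	}
}
```

To target a real pushgateway, replace `srv.URL` with its address, for example `http://pushgateway.example.org:9091`.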
66 | 
67 | * Delete all data of one instance within a group:
68 | 
69 | ```
70 | curl -X DELETE http://pushgateway.example.org:9091/metrics/job/some_job/instance/some_instance
71 | ```
72 | 
73 | * Delete all data within a group:
74 | 
75 | ```
76 | curl -X DELETE http://pushgateway.example.org:9091/metrics/job/some_job
77 | ```
78 | 
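79 | As these examples show, data in pushgateway is grouped by `job` and `instance`. The `job` is always required in the URL, and adding `instance` is recommended so that pushes from different sources do not overwrite one another.
80 | 
81 | When Prometheus is configured to scrape pushgateway, the scrape itself also attaches a job and an instance, but those only describe the pushgateway instance and say nothing about where the data originated. The pushgateway scrape config therefore needs `honor_labels: true`,
82 | so that the `job` and `instance` labels carried by the pushed data are not overwritten.
83 | 
84 | Note: to avoid losing data when pushgateway restarts or crashes, use the `-persistence.file` and `-persistence.interval` flags to persist the data to disk.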
85 | 


--------------------------------------------------------------------------------
/pushgateway/why.md:
--------------------------------------------------------------------------------
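 1 | # Introduction to Pushgateway
 2 | 
 3 | [Pushgateway](https://github.com/prometheus/pushgateway) is an important tool in the Prometheus ecosystem. The main reasons to use it are:
 4 | 
 5 | - Prometheus uses a pull model, and it may be unable to pull from targets directly because they sit in a different subnet or behind a firewall.
 6 | - When monitoring business metrics, data from different sources needs to be gathered in one place for Prometheus to collect.
 7 | 
 8 | These situations leave no choice but to use pushgateway, but it is worth understanding its drawbacks first:
 9 | 
10 | -  Data from many nodes is funneled through a single pushgateway, so a pushgateway failure has a wider impact than the failure of any single target.
11 | -  The `up` status that Prometheus records for the scrape reflects only pushgateway itself, not the health of each node behind it.
12 | -  Pushgateway keeps all the metrics pushed to it.
13 | 
14 | As a result, Prometheus keeps scraping the old metrics even after the monitored job has gone offline, so stale data must be deleted from pushgateway manually.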
15 | 


--------------------------------------------------------------------------------
/qa/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/qa/README.md


--------------------------------------------------------------------------------
/qa/auth.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/qa/auth.md


--------------------------------------------------------------------------------
/qa/hotreload.md:
--------------------------------------------------------------------------------
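 1 | # How Do I Hot-Reload the Configuration?
 2 | 
 3 | When a Prometheus configuration file changes, the hot-reload mechanisms that Prometheus provides can load the new configuration without stopping the service.
 4 | 
 5 | There are two ways to trigger a hot reload: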
 6 | 
 7 | 1. kill -HUP pid
 8 | 2. curl -X POST http://IP/-/reload
 9 | 
10 | When a reload triggered by either method succeeds, the Prometheus log shows the following:
11 | 
12 | ![hotreload.png](/images/qa/reload_success.png)
13 | 
14 | If the reload fails because the configuration is invalid, you will see a message like this:
15 | 
16 | ```
17 | ERRO[0161] Error reloading config: couldn't load configuration (-config.file=prometheus.yml): unknown fields in scrape_config: job_nae  source=main.go:146
18 | ```
19 | 
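20 | Tips:
21 | 
22 | 1. I personally prefer the curl -X POST approach, because the kill approach requires looking up the current process ID first.
23 | 2. Since 2.0, the `/-/reload` HTTP endpoint is disabled by default. To enable it, start Prometheus with the `--web.enable-lifecycle` flag.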
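
For automation, the HTTP reload can also be triggered from code. The Go sketch below assumes only what is stated above, namely that a POST to `/-/reload` triggers the reload. The `reloadURL` and `triggerReload` helpers are hypothetical, and a local test server stands in for Prometheus so the example runs without one.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// reloadURL appends the reload path to a Prometheus base URL.
func reloadURL(base string) string {
	return strings.TrimRight(base, "/") + "/-/reload"
}

// triggerReload sends an empty POST, like `curl -X POST`, and treats any
// status other than 200 OK as a failure.
func triggerReload(client *http.Client, base string) error {
	resp, err := client.Post(reloadURL(base), "text/plain", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Stand-in server that accepts only POST /-/reload.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/-/reload" {
			http.Error(w, "unexpected request", http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	if err := triggerReload(srv.Client(), srv.URL); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reload requested")
}
```

Against a real server, pass its base URL, for example `http://localhost:9090`, instead of `srv.URL`.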
24 | 
25 | 
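26 | Now let's look at how the two methods are implemented internally.
27 | 
28 | Method 1: the HUP (hang up) signal sent with the kill command:
29 | 
30 | In `cmd/prometheus/main.go`, Prometheus listens for process signals. When it receives syscall.SIGHUP, it calls the reloadConfig function.
31 | 
32 | The code looks roughly like this: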
33 | 
34 | ```golang
35 | hup := make(chan os.Signal)
36 | signal.Notify(hup, syscall.SIGHUP)
37 | go func() {
38 |   for {
39 |     select {
40 |     case <-hup:
41 |       if err := reloadConfig(cfg.configFile, reloadables...); err != nil {
42 |         log.Errorf("Error reloading config: %s", err)
43 |       }
44 |     }
45 |   }
46 | }()
47 | ```
48 | 
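49 | Method 2: the `/-/reload` request handled by the web module:
50 | 
51 | 1. In the web module (web/web.go), Prometheus registers a POST route `/-/reload` whose handler is the `web.reload` function. That function sends a `chan error` into the `web.reloadCh` channel and waits on it for the result.
52 | 2. In `cmd/prometheus/main.go`, a dedicated goroutine listens on `web.reloadCh`. When it receives a value, it calls reloadConfig and sends the result back on the received channel.
53 | 
54 | The code looks roughly like this: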
55 | 
56 | ```golang
57 | hupReady := make(chan bool)
58 | 
59 | go func() {
60 | 	<-hupReady
61 | 	for {
62 | 		select {
63 | 		case rc := <-webHandler.Reload():
64 | 			if err := reloadConfig(cfg.configFile, reloadables...); err != nil {
65 | 				log.Errorf("Error reloading config: %s", err)
66 | 				rc <- err
67 | 			} else {
68 | 				rc <- nil
69 | 			}
70 | 		}
71 | 	}
72 | }()
73 | ```
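
The request/reply handshake between the web handler and the reload goroutine can be tried in isolation. The Go sketch below reproduces only the channel-of-channels pattern. The names `serveReloads` and `requestReload` are hypothetical, and the reload function is a stand-in that fails on its second call to show how the error travels back.

```go
package main

import (
	"errors"
	"fmt"
)

// serveReloads plays the role of the goroutine in main.go: for every
// reply channel it receives, it runs reload and sends back the result.
func serveReloads(reloadCh <-chan chan error, reload func() error) {
	for rc := range reloadCh {
		rc <- reload()
	}
}

// requestReload plays the role of the web handler: it hands over a fresh
// reply channel and blocks until the reload result arrives on it.
func requestReload(reloadCh chan<- chan error) error {
	rc := make(chan error)
	reloadCh <- rc
	return <-rc
}

func main() {
	reloadCh := make(chan chan error)

	calls := 0 // touched only by the serveReloads goroutine
	go serveReloads(reloadCh, func() error {
		calls++
		if calls == 2 {
			return errors.New("unknown fields in scrape_config: job_nae")
		}
		return nil
	})

	fmt.Println("first reload:", requestReload(reloadCh))
	fmt.Println("second reload:", requestReload(reloadCh))
}
```

Giving each request its own reply channel lets the HTTP handler report success or failure for exactly the reload it asked for.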
74 | 
75 | Prometheus ships with a mature hot-reload mechanism, which makes it much easier to change and reload configuration files. Many exporters in the Prometheus ecosystem follow the same convention.


--------------------------------------------------------------------------------
/qa/jvm.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/qa/jvm.md


--------------------------------------------------------------------------------
/qa/losedata.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/qa/losedata.md


--------------------------------------------------------------------------------
/qa/memory.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/qa/memory.md


--------------------------------------------------------------------------------
/qa/pushgateway.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/qa/pushgateway.md


--------------------------------------------------------------------------------
/revision-record.md:
--------------------------------------------------------------------------------
# Major Revision History

* 0.1.0: 2017-05-12
  * Added the basic content.



--------------------------------------------------------------------------------
/rule/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/rule/README.md


--------------------------------------------------------------------------------
/rule/config.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/rule/config.md


--------------------------------------------------------------------------------
/rule/what.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/rule/what.md


--------------------------------------------------------------------------------
/service/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/service/README.md


--------------------------------------------------------------------------------
/service/memcached.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/service/memcached.md


--------------------------------------------------------------------------------
/service/mongodb.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/service/mongodb.md


--------------------------------------------------------------------------------
/service/mysql.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/service/mysql.md


--------------------------------------------------------------------------------
/service/nginx.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/service/nginx.md


--------------------------------------------------------------------------------
/service/redis.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/service/redis.md


--------------------------------------------------------------------------------
/store/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/store/README.md


--------------------------------------------------------------------------------
/store/local.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/store/local.md


--------------------------------------------------------------------------------
/store/remote.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/store/remote.md


--------------------------------------------------------------------------------
/tools/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/tools/README.md


--------------------------------------------------------------------------------
/tools/client.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/tools/client.md


--------------------------------------------------------------------------------
/tools/promu.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/tools/promu.md


--------------------------------------------------------------------------------
/v2.0/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/v2.0/README.md


--------------------------------------------------------------------------------
/v2.0/feature.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/v2.0/feature.md


--------------------------------------------------------------------------------
/v2.0/rule.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/v2.0/rule.md


--------------------------------------------------------------------------------
/v2.0/store.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cookieY/prometheus_practice/4b3c46bd5ff3a0f586ff3dab4079f333905ff524/v2.0/store.md


--------------------------------------------------------------------------------
/visualiztion/README.md:
--------------------------------------------------------------------------------
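# Data Visualization

Collecting data is only the first step. Without good visualization, problems can be hard to spot.

This chapter covers querying and displaying data with Prometheus's built-in web console and with Grafana.
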


--------------------------------------------------------------------------------
/visualiztion/console.md:
--------------------------------------------------------------------------------
# Prometheus Web

Prometheus ships with a Web Console. After installation you can visit `http://localhost:9090/graph`, where you can run and debug any PromQL query, which is very convenient. For example:

![prometheus web console ](/images/visualiztion/prometheus-web-console.png)

![prometheus web graph ](/images/visualiztion/prometheus-web-graph.png)

As the screenshots above show, Prometheus's built-in Web UI is fairly simple, because its purpose is to query data quickly and make PromQL debugging easy.

It is not like a typical Admin Dashboard that shows as much data as possible on one page. If you need that, give Grafana a try.



--------------------------------------------------------------------------------
/visualiztion/grafana.md:
--------------------------------------------------------------------------------
# Using Grafana

Grafana is an open-source analytics and monitoring platform that supports data sources such as Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, and CloudWatch. Its UI is attractive and highly customizable.

This is something the Prometheus web console lacks; I explained the reasons for choosing Grafana in the previous section.

## Version

* Grafana 4.3.2 on macOS

## Installing and Running

Here I install it with brew:

```
brew update
brew install grafana
```

Once it is installed, you can start the server with the default configuration:

```
grafana-server -homepath /usr/local/Cellar/grafana/4.3.2/share/grafana/
```

If all goes well, you will see logs like the following:

```
INFO[06-11|15:20:14] Starting Grafana                         logger=main version=4.3.2 commit=unknown-dev compiled=2017-06-01T05:47:48+0800
INFO[06-11|15:20:14] Config loaded from                       logger=settings file=/usr/local/Cellar/grafana/4.3.2/share/grafana/conf/defaults.ini
INFO[06-11|15:20:14] Path Home                                logger=settings path=/usr/local/Cellar/grafana/4.3.2/share/grafana/
INFO[06-11|15:20:14] Path Data                                logger=settings path=/usr/local/Cellar/grafana/4.3.2/share/grafana/data
INFO[06-11|15:20:14] Path Logs                                logger=settings path=/usr/local/Cellar/grafana/4.3.2/share/grafana/data/log
INFO[06-11|15:20:14] Path Plugins                             logger=settings path=/usr/local/Cellar/grafana/4.3.2/share/grafana/data/plugins
INFO[06-11|15:20:14] Initializing DB                          logger=sqlstore dbtype=sqlite3
INFO[06-11|15:20:14] Starting DB migration                    logger=migrator
INFO[06-11|15:20:14] Executing migration                      logger=migrator id="copy data account to org"
INFO[06-11|15:20:14] Skipping migration condition not fulfilled logger=migrator id="copy data account to org"
INFO[06-11|15:20:14] Executing migration                      logger=migrator id="copy data account_user to org_user"
INFO[06-11|15:20:14] Skipping migration condition not fulfilled logger=migrator id="copy data account_user to org_user"
INFO[06-11|15:20:14] Starting plugin search                   logger=plugins
INFO[06-11|15:20:14] Initializing Alerting                    logger=alerting.engine
INFO[06-11|15:20:14] Initializing CleanUpService              logger=cleanup
INFO[06-11|15:20:14] Initializing Stream Manager
INFO[06-11|15:20:14] Initializing HTTP Server                 logger=http.server address=0.0.0.0:3000 protocol=http subUrl= socket=
```

You can now open `http://localhost:3000` to access Grafana's web UI.

For other platforms, see [more installation options](https://grafana.com/grafana/download).

## Logging In and Configuring the Prometheus Data Source

Grafana supports Prometheus as a data source out of the box, so no additional plugin is needed.

Log in to Grafana with the default account admin/admin.

![grafana-login](/images/visualiztion/grafana-login.png)

On the Dashboard home page, click to add a data source.

![grafana-datasource](/images/visualiztion/grafana-datasource.png)

Configure the Prometheus data source.

![grafana-prometheus-data-source](/images/visualiztion/grafana-prometheus-data-source.png)

At this point Grafana is connected to Prometheus, and you will see a Dashboard like this:

![grafana-default-dashbord](/images/visualiztion/grafana-default-dashbord.png)

## Customizing the Monitoring Dashboard

From `Manage dashboard` -> `Settings` at the top, open the management page.

![grafana-into-manage-dashboard](/images/visualiztion/grafana-into-manage-dashboard.png)

On the management page, uncheck `Hide Controls`.

![grafana-hide-controls](/images/visualiztion/grafana-hide-controls.png)

Click the `+ ADD ROW` button at the bottom of the page and choose the `Graph` type.

![grafana-add-graph](/images/visualiztion/grafana-add-graph.png)

Click `Panel Title` -> `Edit` to open the Panel editor, then under `Metrics` choose `go_goroutines` in `Metric lookup`.

![grafana-edit-panel](/images/visualiztion/grafana-edit-panel.png)

You can also type a Prometheus query directly in the management interface and change the query's step value.
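
After modifying the Dashboard, remember to click the `Save dashboard` button at the top, or press `CTRL+S`, to save it.

Our custom Panel has now been added:

![grafana-added-panel](/images/visualiztion/grafana-added-panel.png)

We can adjust a panel's position and size by dragging and resizing it. The aim is to fit as much information as possible on one screen.

## Summary

Grafana is an attractive and powerful monitoring and analytics platform with built-in support for Prometheus as a data source, so Grafana + Prometheus is a good choice for data and monitoring visualization.
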


--------------------------------------------------------------------------------