├── .gitattributes
├── .gitignore
├── README.md
├── __init__.py
├── all_prome_query
├── config.yaml
├── consul_delete.py
├── images
│   ├── arch.jpg
│   └── heavy_query_diff.png
├── init.sh
├── libs.py
├── nginx.conf
├── ngx_prome_redirect.conf
├── parse_prome_query_log.py
├── prome_heavy_expr_parse.yaml
├── prome_redirect.lua
├── re_work.py
├── recovery_by_local_yaml.py
├── recovery_heavy_metrics.sh
├── requirements.txt
├── to_del_record_key_file
└── 部署.md

/.gitattributes:
--------------------------------------------------------------------------------
1 | *.html linguist-language=python
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | bin/
2 | bin/*
3 | out/
4 | *.swp
5 | *.swo
6 | *.tar.gz
7 | docs/_site
8 | .idea/
9 | package_cache_tmp/
10 | /.idea
11 | *DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # k8s source-code reading courses (3 courses combined into one big course)
2 | - [k8s internals and source code explained: essentials](https://ke.qq.com/course/4093533)
3 | - [k8s internals and source code explained: advanced](https://ke.qq.com/course/4236389)
4 | - [k8s pure source-code reading course, helping you become a k8s expert](https://ke.qq.com/course/4697341)
5 | 
6 | 
7 | # k8s advanced operations and tuning courses
8 | - [k8s operations master course](https://ke.qq.com/course/5586848)
9 | - [tekton end-to-end pipeline practice and pipeline internals source reading](https://ke.qq.com/course/5467720)
10 | 
11 | # k8s secondary-development courses
12 | - [k8s secondary development: a scheduler based on real load](https://ke.qq.com/course/5814034)
13 | - [k8s operator and CRD hands-on development, helping you become a k8s expert](https://ke.qq.com/course/5458555)
14 | 
15 | # Prometheus full-component courses
16 | - [01 Prometheus components: configuration and usage, internals, high-availability practice](https://ke.qq.com/course/3549215)
17 | - [02 prometheus-thanos usage and source-code reading](https://ke.qq.com/course/3883439)
18 | - [03 kube-prometheus and prometheus-operator in practice and how they work](https://ke.qq.com/course/3912017)
19 | - [04 Prometheus source-code walkthrough and secondary development](https://ke.qq.com/course/4236995)
20 | 
21 | # Go language courses
22 | - [golang basics](https://ke.qq.com/course/4334898)
23 | - [golang ops platform in practice: service tree, log monitoring, job execution, distributed probing](https://ke.qq.com/course/4334675)
24 | - [golang ops development in practice: a k8s inspection platform](https://ke.qq.com/course/5818923)
25 | 
26 | # Live Q&A on SRE career development
27 | - [k8s/prometheus course Q&A and ops-development career planning](https://ke.qq.com/course/5506477)
28 | 
29 | 
30 | # On freeloading vs. paying
31 | - Freeloading is perfectly fine; I have already contributed plenty of articles and open-source projects, plus free videos
32 | - Objectively though, only if you are very capable can you freeload forever: you can read the source code and solve any problem yourself
33 | - A lot of material looks free, but most of it is scraps; the core content is never free, and no expert will be there to answer your questions
34 | - If you have a solid grip on the Prometheus source code, plus some k8s knowledge, thanos and kube-prometheus will not feel hard
35 | 
36 | 
37 | # Architecture diagrams
38 | ![image](https://github.com/ning1875/pre_query/blob/master/images/arch.jpg)
39 | ![image](https://github.com/ning1875/pre_query/blob/master/images/heavy_query_diff.png)
40 | 
41 | # Why heavy_query happens
42 | ## Resource cost
43 | - Every TSDB compresses datapoints, e.g. with delta-of-delta and XOR encoding
44 | - So at query time the data inevitably has to be decompressed, which amplifies resource usage
45 | - A normal uncompressed datapoint is about 16 bytes
46 | - A heavy_query that loads 10,000 series over a 24-hour range at one point per 30s needs roughly 10,000 × 2,880 points × 16 bytes ≈ 439MB of memory, so a few concurrent heavy queries can blow up Prometheus memory; Prometheus ships a pile of query-limit flags to guard against exactly this
47 | - Besides the queryPreparation phase mentioned above, query execution also spends time in sort, eval, and so on
48 | 
49 | ## Prometheus does not natively support downsampling
50 | - Another reason is that Prometheus has no native downsampling, so no matter how the Grafana step changes with the time range, every query still decompresses the selected blocks and then slices the raw points by step
51 | - So the larger the query time range, the more CPU and memory it burns; the raw points are also partly wasted, because Grafana's step grows as the time range grows
52 | ## Real-time query/aggregation VS pre-query/pre-aggregation
53 | Prometheus queries are all real-time query/aggregation
54 | **The advantage of real-time queries is obvious**
55 | - Query/aggregation conditions can be combined freely, e.g. rate, then sum, then a histogram_quantile on top
56 | 
57 | **The drawback of real-time queries is just as obvious**
58 | - They are slow, or rather resource-hungry
59 | **The pros and cons of pre-query/pre-aggregation are exactly the reverse**
60 | For a pre-aggregation example, see the falcon component I wrote: [Monitoring aggregator series: open-falcon's new aggregator polymetric](https://segmentfault.com/a/1190000023092934)
61 | - All aggregation rules are defined in advance and their results are computed periodically
62 | - At query time there is no aggregation at all, you simply read the precomputed result
63 | - For example, where real-time aggregation loads 100,000 series per query, pre-aggregation only needs to read a few result series
64 | **So, does Prometheus have pre-query/pre-aggregation?**
65 | Yes, it does
66 | ## Prometheus pre-query/pre-aggregation
67 | [prometheus record](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
68 | 
69 | Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. Querying the precomputed result will then often be much faster than executing the original expression every time it is needed. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh.
70 | 
71 | # About this project
72 | ## Solution overview
73 | - To users, a heavy_query shows up as a slow query
74 | - On the server side it causes excessive resource usage and can even take down the backend storage
75 | - If a query matches a heavy_query rule (currently: response time over 2 seconds), it is replaced with a lightweight query against the precomputed result; both forms return the same data
76 | - Queries that do not match are served as the original query
77 | - After the rewrite the metric name becomes `hke:heavy_expr:xxxx` while its tags stay unchanged. Most panels already define a legend for their series, so the display looks identical
78 | - heavy_query rules are now updated incrementally every night at 23:30. Most existing dashboards are unaffected (the existing heavy_query records have already been running for 7+ days); newly added rules start showing data from the moment they take effect, which still guarantees 10+ hours of data by the daytime query peak
79 | 
80 | ## Code architecture
81 | - The parse component analyzes the Prometheus query log to find heavy_query records
82 | - Each record is hashed and written incrementally into consul and the redis cluster
83 | - Each Prometheus instance uses confd to pull its own shard of the consul data and generate record.yml
84 | - Prometheus evaluates those recording rules (pre-query aggregation) and writes the results into its TSDB
85 | - The lua layer in front of the query path hashes the expr that Grafana sends
86 | - The hash is matched against the records in redis; a hit means this query is a heavy_query
87 | - In that case its expr is replaced before the query is forwarded to the backend
88 | 
89 | 
90 | ## Usage guide
91 | 
92 | [Installation & deployment](./部署.md)
93 | 
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ning1875/pre_query/07387e462e1d3e7f2efb97d9a74a4a1541ae1762/__init__.py
--------------------------------------------------------------------------------
/all_prome_query:
--------------------------------------------------------------------------------
1 | 172.20.70.205
2 | 172.20.70.215
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | prome_query_log:
2 |   prome_log_path: /opt/logs/prometheus_query.log # path of the prometheus query log file; keep it consistent with the prometheus query-log setting
3 |   heavy_query_threhold: 0.0001 # heavy_query threshold in seconds; adjust to your situation
4 |   py_name: parse_prome_query_log.py # main script name, do not change
5 |   local_work_dir: all_prome_query_log # where the parser stores the fetched query logs, default ./all_prome_query_log
6 |   local_record_yml_dir: local_record_yml_dir # where the local record yaml results are written, default ./local_record_yml_dir
7 |   check_heavy_query_api: http://localhost:9090 # a prometheus query endpoint used to double-check that a record really is heavy and avoid false additions; disabled by default
8 | 
9 | redis:
10 |   host: localhost # redis address
11 |   port: 6379
12 |   redis_set_key: hke:heavy_query_set
13 |   redis_one_key_prefix: hke:heavy_expr # heavy_query key prefix
14 | consul:
15 |   host: localhost # consul address
16 |   port: 8500
17 |   consul_record_key_prefix: prometheus/records # heavy_query key prefix
18 | 
19 | heavy_blacklist_metrics: # blacklisted metric names
20 |   - kafka_log_log_logendoffset
21 |   - requests_latency_bucket
22 |   - count(node_cpu_seconds_total)
23 |   - '{__name__=~".+"}'
24 |   - '{__name__=~".*"}'
--------------------------------------------------------------------------------
/consul_delete.py:
--------------------------------------------------------------------------------
1 | import json
2 | 
3 | import consul
4 | import time
5 | 
6 | import redis
7 | 
8 | 
9 | class Consul(object):
10 |     def __init__(self, host, port):
11 |         '''Initialize a connection to the consul server'''
12 |         self._consul = consul.Consul(host, port)
13 | 
14 |     def RegisterService(self,
name, service_id, host, port, check_url, tags=None): 15 | tags = tags or [] 16 | # 注册服务 17 | self._consul.agent.service.register( 18 | name, 19 | service_id, 20 | host, 21 | port, 22 | tags, 23 | # 健康检查ip端口,检查时间:5,超时时间:30,注销时间:30s 24 | check=consul.Check().http(check_url, "5s", "5s")) 25 | 26 | def GetService(self, name): 27 | res = self._consul.health.service(name, passing=True) 28 | print(res) 29 | # services = self._consul.agent.services() 30 | # print(services) 31 | # service = services.get(name) 32 | # 33 | # if not service: 34 | # return None, None 35 | # addr = "{0}:{1}".format(service['Address'], service['Port']) 36 | # return service, addr 37 | 38 | def delete_key(self, key='prometheus/records'): 39 | res = self._consul.kv.delete(key, recurse=True) 40 | return res 41 | 42 | def get_key_by_record(self, key='prometheus/records', record=""): 43 | res = self._consul.kv.get(key, recurse=True) 44 | data = res[1] 45 | if not data: 46 | return None 47 | 48 | for i in data: 49 | key = i.get("Key") 50 | 51 | v = json.loads(i.get('Value').decode("utf-8")) 52 | if record in v.get('record'): 53 | print(key, v.get('record'), v.get('expr')) 54 | return key 55 | return None 56 | 57 | 58 | def redis_conn(): 59 | redis_host = "localhost" 60 | 61 | redis_port = 6379 62 | conn = redis.Redis(host=redis_host, port=redis_port) 63 | return conn 64 | 65 | 66 | def delete_key(): 67 | to_del_keys = [] 68 | with open('to_del_record_key_file') as f: 69 | to_del_keys = [x.strip() for x in f.readlines()] 70 | print(to_del_keys) 71 | 72 | host = 'localhost' 73 | port = 8500 74 | 75 | consul_record_key_prefix = 'prometheus/records' 76 | consul_client = Consul(host, port) 77 | redis_c = redis_conn() 78 | 79 | for key in to_del_keys: 80 | if not key: 81 | continue 82 | to_del_key = consul_client.get_key_by_record(record=key) 83 | print(to_del_key) 84 | if to_del_key: 85 | consul_client.delete_key(to_del_key) 86 | redis_key = "hke:heavy_expr:{}".format(key) 87 | delete_res = redis_c.delete(redis_key) 88 | print(delete_res) 89 | 90 | 91 | def run_register(): 92 | host = 'localhost' 93 | port = 8500 94 | consul_client = Consul(host, port) 95 | 96 | s_name = "pushgateway_a" 97 | s_hosts = [ 98 | 'localhost', 99 | ] 100 | 101 | s_port = 9091 102 | 103 | for h in s_hosts: 104 | s_check_url = 'http://{}:{}/-/healthy'.format(h, s_port) 105 | # consul_client.RegisterService(s_name, h, h, s_port, s_check_url) 106 | # check = consul.Check().http(s_check_url, "5s", "5s", "5s") 107 | # print(check) 108 | 109 | res = consul_client._consul.agent.service.deregister(s_name) 110 | print(res) 111 | res = consul_client.GetService(s_name) 112 | # print(res[0]) 113 | 114 | 115 | def run_query(): 116 | host = 'localhost' 117 | port = 8500 118 | consul_record_key_prefix = 'prometheus/records' 119 | consul_client = Consul(host, port) 120 | 121 | 122 | if __name__ == '__main__': 123 | # run_register() 124 | # run_query() 125 | delete_key() 126 | -------------------------------------------------------------------------------- /images/arch.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ning1875/pre_query/07387e462e1d3e7f2efb97d9a74a4a1541ae1762/images/arch.jpg -------------------------------------------------------------------------------- /images/heavy_query_diff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ning1875/pre_query/07387e462e1d3e7f2efb97d9a74a4a1541ae1762/images/heavy_query_diff.png 
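A brief usage note on consul_delete.py above (recovery_heavy_metrics.sh further down drives it the same way): put the md5 suffixes of the records you want to drop into to_del_record_key_file, one per line, and run the script; for each hash it looks up the matching prometheus/records/* key in consul, deletes it, and then deletes the hke:heavy_expr:<md5> key from redis. A minimal sketch, assuming the localhost consul/redis addresses hard-coded in the script:

```shell script
# Hashes can be taken from `redis-cli keys "hke:heavy_expr:*"`; the one below is the
# first entry of the sample to_del_record_key_file shipped in this repo.
cat > to_del_record_key_file <<'EOF'
9133202933f1394e368971d59f3c9d67
EOF

python3 consul_delete.py
```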
-------------------------------------------------------------------------------- /init.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # 安装指南 4 | # parse组件 在python3.6+中运行 5 | #1.安装依赖 6 | # pip3 install -r requirements.txt 7 | 8 | #2. 修改config.yaml中各个配置 9 | #3. 准备真实prometheus地址写入all_prome_query 10 | #4. 运行添加crontab 每晚11:30定时运行一次即可 11 | ansible-playbook -i all_prome_query prome_heavy_expr_parse.yaml 12 | 13 | 14 | # prometheus 和confd组件 15 | # 1.安装prometheus 和confd 16 | # 将confd下的配置文件放置好,启动服务 17 | # prometheus开启query_log 18 | ``` 19 | global: 20 | query_log_file: /App/logs/prometheus_query.log 21 | ``` 22 | 23 | # openresty组件 24 | #1. 安装openresty ,准备lua环境 25 | yum install yum-utils -y 26 | yum-config-manager --add-repo https://openresty.org/package/centos/openresty.repo 27 | yum install openresty openresty-resty -y 28 | 29 | #2. 30 | # 修改lua文件中的redis地址为你自己的 31 | # 修改ngx_prome_redirect.conf文件中 真实real_prometheus后端,使用前请修改 32 | 33 | mkdir -pv /usr/local/openresty/nginx/conf/conf.d/ 34 | mkdir -pv /usr/local/openresty/nginx/lua_files/ 35 | 36 | #3. 37 | # 将nginx配置和lua文件放到指定目录 38 | /bin/cp -f ngx_prome_redirect.conf /usr/local/openresty/nginx/conf/conf.d/ 39 | /bin/cp -f nginx.conf /usr/local/openresty/nginx/conf/ 40 | /bin/cp -f prome_redirect.lua /usr/local/openresty/nginx/lua_files/ 41 | 42 | 43 | #4. 44 | # 启动openresty 45 | systemctl enable openresty 46 | systemctl start openresty 47 | 48 | #5. 49 | # 修改grafana数据源,将原来的指向真实prometheus地址改为指向openresty的9992端口 50 | 51 | 52 | # 运维操作 53 | # 查看redis中的heavy_query记录 54 | redis-cli -h $redis_host keys hke:heavy_expr* 55 | # 查看consul中的heavy_query记录 56 | curl http://$consul_addr:8500/v1/kv/prometheus/record?recurse= |python -m json.tool 57 | # 根据一个heavy_record文件恢复记录 58 | python3 recovery_by_local_yaml.py local_record_yml/record_to_keep.yml 59 | # 根据一个metric_name前缀删除record记录 60 | bash -x recovery_heavy_metrics.sh $metric_name 61 | 62 | 63 | 64 | 65 | 66 | 67 | -------------------------------------------------------------------------------- /libs.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import hashlib 3 | 4 | import yaml 5 | 6 | def now_date_str(): 7 | # return datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") 8 | return datetime.datetime.now().strftime("%Y-%m-%d") 9 | 10 | 11 | def get_str_md5(input_str): 12 | m = hashlib.md5() 13 | m.update(input_str) 14 | return m.hexdigest() 15 | 16 | 17 | def load_base_config(yaml_path): 18 | with open(yaml_path) as f: 19 | config = yaml.load(f,Loader=yaml.FullLoader) 20 | return config 21 | -------------------------------------------------------------------------------- /nginx.conf: -------------------------------------------------------------------------------- 1 | #user nobody; 2 | worker_processes auto; 3 | 4 | 5 | #error_log logs/error.log; 6 | #error_log logs/error.log notice; 7 | #error_log logs/error.log info; 8 | 9 | #pid logs/nginx.pid; 10 | worker_rlimit_nofile 60000; 11 | 12 | events 13 | { 14 | use epoll; 15 | worker_connections 60000; 16 | } 17 | 18 | 19 | 20 | http { 21 | include mime.types; 22 | default_type text/html; 23 | 24 | charset UTF-8; 25 | server_names_hash_bucket_size 128; 26 | client_header_buffer_size 4k; 27 | large_client_header_buffers 4 32k; 28 | 29 | 30 | #log_format main '$remote_addr - $remote_user [$time_local] "$request" ' 31 | # '$status $body_bytes_sent "$http_referer" ' 32 | # '"$http_user_agent" "$http_x_forwarded_for"'; 33 | 34 | #access_log 
logs/access.log main; 35 | 36 | sendfile on; 37 | #tcp_nopush on; 38 | 39 | #keepalive_timeout 0; 40 | keepalive_timeout 65; 41 | 42 | #gzip on; 43 | include /usr/local/openresty/nginx/conf/conf.d/*.conf; 44 | 45 | } 46 | -------------------------------------------------------------------------------- /ngx_prome_redirect.conf: -------------------------------------------------------------------------------- 1 | # 真实prometheus后端,使用前请修改 2 | upstream real_prometheus { 3 | 4 | server 172.20.70.205:9090; 5 | server 172.20.70.215:9090; 6 | 7 | } 8 | 9 | 10 | 11 | server{ 12 | listen 9992; 13 | server_name _; 14 | location / { 15 | proxy_set_header Host $host:$server_port; 16 | proxy_pass http://real_prometheus; 17 | } 18 | location /api/v1/query_range { 19 | access_by_lua_file /usr/local/openresty/nginx/lua_files/prome_redirect.lua; 20 | proxy_pass http://real_prometheus; 21 | } 22 | 23 | 24 | } 25 | 26 | -------------------------------------------------------------------------------- /parse_prome_query_log.py: -------------------------------------------------------------------------------- 1 | import base64 2 | import glob 3 | import json 4 | import logging 5 | import os 6 | import re 7 | import sys 8 | import time 9 | from datetime import datetime 10 | from multiprocessing.pool import ThreadPool 11 | 12 | import consul 13 | import redis 14 | import requests 15 | import yaml 16 | 17 | from libs import get_str_md5, load_base_config, now_date_str 18 | 19 | logging.basicConfig( 20 | # TODO console 日志,上线时删掉 21 | # filename=LOG_PATH, 22 | format='%(asctime)s %(levelname)s %(filename)s %(funcName)s [line:%(lineno)d]:%(message)s', 23 | datefmt="%Y-%m-%d %H:%M:%S", 24 | level="INFO" 25 | ) 26 | G_VAR_YAML = "config.yaml" 27 | 28 | 29 | class Consul(object): 30 | def __init__(self, host, port): 31 | '''初始化,连接consul服务器''' 32 | self._consul = consul.Consul(host, port) 33 | 34 | def RegisterService(self, name, host, port, tags=None): 35 | tags = tags or [] 36 | # 注册服务 37 | self._consul.agent.service.register( 38 | name, 39 | name, 40 | host, 41 | port, 42 | tags, 43 | # 健康检查ip端口,检查时间:5,超时时间:30,注销时间:30s 44 | check=consul.Check().tcp(host, port, "5s", "30s", "30s")) 45 | 46 | def GetService(self, name): 47 | services = self._consul.agent.services() 48 | service = services.get(name) 49 | if not service: 50 | return None, None 51 | addr = "{0}:{1}".format(service['Address'], service['Port']) 52 | return service, addr 53 | 54 | def delete_key(self, key='prometheus/records'): 55 | res = self._consul.kv.delete(key, recurse=True) 56 | return res 57 | 58 | def get_list(self, key='prometheus/records'): 59 | res = self._consul.kv.get(key, recurse=True) 60 | 61 | data = res[1] 62 | if not data: 63 | return {} 64 | pre_record_d = {} 65 | 66 | for i in data: 67 | v = json.loads(i.get('Value').decode("utf-8")) 68 | pre_record_d[v.get('record')] = v.get('expr') 69 | return pre_record_d 70 | 71 | def set_data(self, key, value): 72 | ''' 73 | self._consul.kv.put('prometheus/records/1', 74 | 75 | json.dumps( 76 | { 77 | 78 | "record": "nyy_record_test_a", 79 | "expr": 'sum(kafka_log_log_size{project=~"metis - main1 - sg2"}) by (topic)' 80 | } 81 | ) 82 | ) 83 | ''' 84 | self._consul.kv.put(key, value) 85 | 86 | def get_b64encode(self, message): 87 | message_bytes = message.encode('ascii') 88 | base64_bytes = base64.b64encode(message_bytes) 89 | return base64_bytes.decode("utf8") 90 | 91 | def txn_mset(self, record_expr_list): 92 | lens = len(record_expr_list) 93 | logging.info("top_lens:{}".format(lens)) 94 | max_txn_once = 
64 95 | yu_d = lens // max_txn_once 96 | yu = lens / max_txn_once 97 | 98 | if lens <= max_txn_once: 99 | pass 100 | else: 101 | max = yu_d 102 | 103 | if yu > yu_d: 104 | max += 1 105 | 106 | for i in range(0, max): 107 | sli = record_expr_list[i * max_txn_once:(i + 1) * max_txn_once] 108 | self.txn_mset(sli) 109 | return True 110 | ''' 111 | { 112 | "KV": { 113 | "Verb": "", 114 | "Key": "", 115 | "Value": "", 116 | "Flags": 0, 117 | "Index": 0, 118 | "Session": "" 119 | } 120 | } 121 | 122 | :return: 123 | ''' 124 | 125 | txn_data = [] 126 | logging.info("middle_lens:{}".format(len(record_expr_list))) 127 | for index, data in record_expr_list: 128 | txn_data.append( 129 | { 130 | "KV": { 131 | "Key": "{}/{}".format(CONSUL_RECORD_KEY_PREFIX, index), 132 | "Verb": "set", 133 | "Value": self.get_b64encode(json.dumps( 134 | data 135 | )), 136 | 137 | } 138 | } 139 | ) 140 | # TODO local test 141 | # print(txn_data) 142 | # return True 143 | res = self._consul.txn.put(txn_data) 144 | if not res: 145 | logging.error("txn_mset_error") 146 | return False 147 | if res.get("Errors"): 148 | logging.error("txn_mset_error:{}".format(str(res.get("Errors")))) 149 | return False 150 | return True 151 | 152 | 153 | def batch_delete_redis_key(conn, prefix): 154 | CHUNK_SIZE = 5000 155 | """ 156 | Clears a namespace 157 | :param ns: str, namespace i.e your:prefix 158 | :return: int, cleared keys 159 | """ 160 | cursor = '0' 161 | ns_keys = prefix + '*' 162 | while cursor != 0: 163 | cursor, keys = conn.scan(cursor=cursor, match=ns_keys, count=CHUNK_SIZE) 164 | if keys: 165 | conn.delete(*keys) 166 | 167 | return cursor 168 | 169 | 170 | def redis_conn(): 171 | redis_host = ONLINE_REDIS_HOST 172 | redis_port = ONLINE_REDIS_PORT 173 | conn = redis.Redis(host=redis_host, port=redis_port) 174 | return conn 175 | 176 | 177 | def parse_log_file(log_f): 178 | ''' 179 | { 180 | "httpRequest":{ 181 | "clientIP":"1.1.1.1", 182 | "method":"GET", 183 | "path":"/api/v1/query_range" 184 | }, 185 | "params":{ 186 | "end":"2020-04-09T06:20:00.000Z", 187 | "query":"api_request_counter{job="kubernetes-pods",kubernetes_namespace="sprs",app="model-server"}/60", 188 | "start":"2020-04-02T06:20:00.000Z", 189 | "step":1200 190 | }, 191 | "stats":{ 192 | "timings":{ 193 | "evalTotalTime":0.467329174, 194 | "resultSortTime":0.000476303, 195 | "queryPreparationTime":0.373947928, 196 | "innerEvalTime":0.092889708, 197 | "execQueueTime":0.000008911, 198 | "execTotalTime":0.467345411 199 | } 200 | }, 201 | "ts":"2020-04-09T06:20:28.353Z" 202 | } 203 | :param log_f: 204 | :return: 205 | ''' 206 | heavy_expr_set = set() 207 | heavy_expr_dict = dict() 208 | record_expr_dict = dict() 209 | 210 | with open(log_f) as f: 211 | for x in f.readlines(): 212 | x = json.loads(x.strip()) 213 | if not isinstance(x, dict): 214 | continue 215 | httpRequest = x.get("httpRequest") 216 | if not httpRequest: 217 | continue 218 | path = httpRequest.get("path") 219 | # 只处理path为query_range的 220 | if path != "/api/v1/query_range": 221 | continue 222 | params = x.get("params") 223 | if not params: 224 | continue 225 | start_time = params.get("start") 226 | end_time = params.get("end") 227 | stats = x.get("stats") 228 | if not stats: 229 | continue 230 | timings = stats.get("timings") 231 | if not timings: 232 | continue 233 | evalTotalTime = timings.get("evalTotalTime") 234 | execTotalTime = timings.get("execTotalTime") 235 | queryPreparationTime = timings.get("queryPreparationTime") 236 | execQueueTime = timings.get("execQueueTime") 237 | innerEvalTime = 
timings.get("innerEvalTime") 238 | 239 | # 如果查询事件段大于6小时则不认为是heavy-query 240 | if not start_time or not end_time: 241 | continue 242 | start_time = datetime.strptime(start_time, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp() 243 | end_time = datetime.strptime(end_time, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp() 244 | if end_time - start_time > 3600 * 6: 245 | continue 246 | 247 | # 如果两个时间都小于阈值则不为heavy-query 248 | c = (queryPreparationTime < HEAVY_QUERY_THREHOLD) and (innerEvalTime < HEAVY_QUERY_THREHOLD) 249 | if c: 250 | continue 251 | 252 | if queryPreparationTime > 40: 253 | continue 254 | if execQueueTime > 40: 255 | continue 256 | if innerEvalTime > 40: 257 | continue 258 | if evalTotalTime > 40: 259 | continue 260 | if execTotalTime > 40: 261 | continue 262 | query = params.get("query").strip() 263 | if not query: 264 | continue 265 | is_bl = False 266 | for bl in HEAVY_BLACKLIST_METRICS: 267 | if not isinstance(bl, str): 268 | continue 269 | if bl in query: 270 | is_bl = True 271 | break 272 | if is_bl: 273 | continue 274 | # avoid multi heavy query 275 | if REDIS_ONE_KEY_PREFIX in query: 276 | continue 277 | # \r\n should not in query ,replace it 278 | if "\r\n" in query: 279 | query = query.replace("\r\n", "", -1) 280 | # \n should not in query ,replace it 281 | if "\n" in query: 282 | query = query.replace("\n", "", -1) 283 | 284 | # - startwith for grafana network out 285 | 286 | if query.startswith("-"): 287 | query = query.replace("-", "", 1) 288 | md5_str = get_str_md5(query.encode("utf-8")) 289 | 290 | record_name = "{}:{}".format(REDIS_ONE_KEY_PREFIX, md5_str) 291 | record_expr_dict[record_name] = query 292 | heavy_expr_set.add(query) 293 | last_time = heavy_expr_dict.get(query) 294 | this_time = evalTotalTime 295 | if last_time and last_time > this_time: 296 | this_time = last_time 297 | 298 | heavy_expr_dict[query] = this_time 299 | logging.info("log_file:{} get :{} heavy expr".format(log_f, len(record_expr_dict))) 300 | return record_expr_dict 301 | 302 | 303 | # 解析一个日志文件 304 | def run_log_parse_local_test(log_path): 305 | res = parse_log_file(log_path) 306 | print(res) 307 | 308 | 309 | def mset_record_to_redis(res_dic): 310 | if not res_dic: 311 | logging.fatal("record_expr_list empty") 312 | rc = redis_conn() 313 | if not rc: 314 | logging.fatal("failed to connect to redis-server") 315 | mset_res = rc.mset(res_dic) 316 | logging.info("mset_res:{} len:{}".format(str(mset_res), format(len(res_dic)))) 317 | sadd_res = rc.sadd(REDIS_SET_KEY, *res_dic.keys()) 318 | logging.info("sadd_res:{}".format(str(sadd_res))) 319 | smems = rc.smembers(REDIS_SET_KEY) 320 | logging.info("smember_res_len:{}".format(len(smems))) 321 | 322 | 323 | def write_record_yaml_file(record_expr_list): 324 | ''' 325 | data = { 326 | "groups": [ 327 | { 328 | "name": "example", 329 | "rules": [ 330 | { 331 | "record": "nyy_record_test_a", 332 | "expr": "sum(kafka_log_log_size{project=~"metis-main1-sg2"}) by (topic)" 333 | }, 334 | ], 335 | }, 336 | ] 337 | 338 | } 339 | ''' 340 | data = { 341 | "groups": [ 342 | { 343 | "name": "heavy_expr_record", 344 | "rules": record_expr_list, 345 | }, 346 | ] 347 | 348 | } 349 | file_name = "{}/record_{}_{}.yml".format(PROME_RECORD_FILE, len(record_expr_list), now_date_str()) 350 | with open(file_name, 'w') as f: 351 | yaml.dump(data, f, default_flow_style=False, sort_keys=False) 352 | if not os.path.isfile("./promtool"): 353 | logging.error("promtool not exist skip rule check ") 354 | return [] 355 | 356 | cmd = "./promtool check rules {}".format(file_name) 357 | r = 
os.popen(cmd) 358 | out = r.read() 359 | r.close() 360 | 361 | record_name_re = re.compile('.*?\"(%s:.*?)\".*?' % REDIS_ONE_KEY_PREFIX) 362 | invalid_keys = [] 363 | for line in out.strip().split("\n"): 364 | 365 | record_name = re.findall(record_name_re, line) 366 | logging.info("[record_name:{}]".format(record_name)) 367 | if len(record_name) == 1: 368 | invalid_keys.append(record_name[0]) 369 | return invalid_keys 370 | 371 | 372 | def recovery_concurrent_log_parse(res_dic): 373 | if not res_dic: 374 | logging.fatal("get empty result exit ....") 375 | # print(res_dic) 376 | # return 377 | 378 | consul_client = Consul(CONSUL_HOST, CONSUL_PORT) 379 | if not consul_client: 380 | logging.fatal("connect_to_consul_error") 381 | 382 | # purge consul 383 | purge_consul_res = consul_client.delete_key(key=CONSUL_RECORD_KEY_PREFIX) 384 | logging.info("[purge consul] res:{}".format(str(purge_consul_res))) 385 | # purge redis 386 | rc = redis_conn() 387 | if not rc: 388 | logging.fatal("failed to connect to redis-server") 389 | 390 | rc_delete_res = batch_delete_redis_key(rc, "hke:heavy_expr*") 391 | logging.info("[purge redis heavy_key] res:{}".format(str(rc_delete_res))) 392 | rc_delete_res = rc.delete("hke:heavy_query_set") 393 | logging.info("[purge redis heavy_query_set] res:{}".format(str(rc_delete_res))) 394 | 395 | record_expr_list = [] 396 | for k in sorted(res_dic.keys()): 397 | record_expr_list.append({"record": k, "expr": res_dic.get(k)}) 398 | logging.info("get_all_record_heavy_query:{} ".format(len(record_expr_list))) 399 | 400 | # write to local prome record yml 401 | write_record_yaml_file(record_expr_list) 402 | 403 | # write to consul 404 | 405 | new_record_expr_list = [] 406 | for index, data in enumerate(record_expr_list): 407 | new_record_expr_list.append((index, data)) 408 | 409 | consul_w_res = consul_client.txn_mset(new_record_expr_list) 410 | if not consul_w_res: 411 | logging.fatal("write_to_consul_error") 412 | 413 | # write to redis 414 | mset_record_to_redis(res_dic) 415 | 416 | 417 | def query_range_judge_heavy(host, expr, record): 418 | ''' 419 | 420 | :param host: 421 | :param expr: 422 | 调用举例: 获取group=ugc的project 423 | 424 | query_range(inf, 425 | 'avg(100 - (avg by (instance,name) (rate(node_cpu_seconds_total{region=~"ap-southeast-3",account=~"HW-SHAREit",group=~"UGC",project=~"cassandra-client", mode="idle"}[5m])) * 100))') 426 | 427 | 428 | :return: 429 | { 430 | "status":"success", 431 | "data":{ 432 | "resultType":"matrix", 433 | "result":[ 434 | { 435 | "metric":{ 436 | 437 | }, 438 | "values":[ 439 | [ 440 | 1588149960, 441 | "0.1999999996688473" 442 | ], 443 | [ 444 | 1588150020, 445 | "0.20000000035872745" 446 | ], 447 | [ 448 | 1588150080, 449 | "0.19629629604793308" 450 | ], 451 | [ 452 | 1588150140, 453 | "0.19629629673781324" 454 | ], 455 | [ 456 | 1588150200, 457 | "0.1999999996688473" 458 | ], 459 | [ 460 | 1588150260, 461 | "0.2074074076005843" 462 | ] 463 | ] 464 | } 465 | ] 466 | } 467 | } 468 | ''' 469 | # logging.info("host:{} expr:{}".format(host, expr)) 470 | uri = '{}/api/v1/query_range'.format(host) 471 | 472 | end = int(time.time()) 473 | q_start = time.time() 474 | start = end - 60 * 60 475 | 476 | G_PARMS = { 477 | "query": expr, 478 | "start": start, 479 | "end": end, 480 | "step": 30 481 | } 482 | res = requests.get(uri, G_PARMS) 483 | data = res.json() 484 | status = data.get("status") 485 | if status != "success": 486 | return (expr, record, False) 487 | 488 | res = data.get("data").get("result") 489 | if not res: 490 | return (expr, 
record, False) 491 | if len(res) == 0: 492 | return (expr, record, False) 493 | took = time.time() - q_start 494 | logging.info("key:{} expr:{} time_took:{}".format( 495 | record, 496 | expr, 497 | took 498 | )) 499 | if took > 3.0: 500 | return (expr, record, True) 501 | return (expr, record, False) 502 | 503 | 504 | def concurrent_log_parse(log_dir): 505 | # 步骤1 解析日志 506 | t_num = 500 507 | pool = ThreadPool(t_num) 508 | 509 | log_file_s = glob.glob("{}/*.log".format(log_dir)) 510 | 511 | results = pool.map(parse_log_file, log_file_s) 512 | 513 | pool.close() 514 | pool.join() 515 | res_dic = {} 516 | for x in results: 517 | res_dic.update(x) 518 | logging.info("[before_heavy_query_check_num:{}]".format(len(res_dic))) 519 | # 1 end 520 | 521 | # 步骤2 拿解析结果去查询一下,做double-check,可以禁止 522 | # pool = ThreadPool(t_num) 523 | # 524 | # parms = [] 525 | # for k, v in res_dic.items(): 526 | # expr = res_dic 527 | # record = k 528 | # parms.append([CHECK_HEAVY_QUERY_API, expr, record]) 529 | # results = pool.starmap(query_range_judge_heavy, parms) 530 | # 531 | # pool.close() 532 | # pool.join() 533 | # 534 | # res_dic = {} 535 | # for x in results: 536 | # expr, record, real_heavy = x[0], x[1], x[2] 537 | # if real_heavy: 538 | # res_dic[record] = expr 539 | # logging.info("[after_heavy_query_check_num:{}]".format(len(res_dic))) 540 | # 2 end 541 | 542 | if not res_dic: 543 | logging.fatal("get empty result exit ....") 544 | 545 | # 步骤3 增量更新consul数据 546 | consul_client = Consul(CONSUL_HOST, CONSUL_PORT) 547 | if not consul_client: 548 | logging.fatal("connect_to_consul_error") 549 | 550 | ## get pre data from consul 551 | pre_dic = consul_client.get_list(key=CONSUL_RECORD_KEY_PREFIX) 552 | old_len = len(pre_dic) + 1 553 | # res_dic.update(pre_dic) 554 | ## 增量更新 555 | old_key_set = set(pre_dic.keys()) 556 | this_key_set = set(res_dic.keys()) 557 | ## 更新的keys 558 | new_dic = {} 559 | today_all_dic = {} 560 | new_key_set = this_key_set - old_key_set 561 | logging.info("new_key_set:{} ".format(len(new_key_set))) 562 | for k in new_key_set: 563 | new_dic[k] = res_dic[k] 564 | 565 | # 构造record 记录 566 | record_list_new = [] 567 | for k in sorted(new_dic.keys()): 568 | one_expr_list = {"record": k, "expr": new_dic.get(k)} 569 | 570 | record_list_new.append(one_expr_list) 571 | 572 | # 写入本地record文件检查rules 573 | invalid_keys = write_record_yaml_file(record_list_new) 574 | logging.info("invalid_keys: num {} details:{}".format(len(invalid_keys), str(invalid_keys))) 575 | for del_key in invalid_keys: 576 | new_dic.pop(del_key) 577 | f_record_list_new = [] 578 | for k in sorted(new_dic.keys()): 579 | one_expr_list = {"record": k, "expr": new_dic.get(k)} 580 | 581 | f_record_list_new.append(one_expr_list) 582 | 583 | today_all_dic.update(pre_dic) 584 | today_all_dic.update(new_dic) 585 | local_record_expr_list = [] 586 | 587 | for k in sorted(today_all_dic.keys()): 588 | local_record_expr_list.append({"record": k, "expr": today_all_dic.get(k)}) 589 | logging.info("get_all_record_heavy_query:{} ".format(len(local_record_expr_list))) 590 | 591 | # 写入本地yaml 592 | write_record_yaml_file(local_record_expr_list) 593 | 594 | # 写入consul中 595 | new_record_expr_list = [] 596 | # 给record记录添加索引,为confd分片做准备 597 | for index, data in enumerate(f_record_list_new): 598 | new_record_expr_list.append((index + old_len, data)) 599 | if new_record_expr_list: 600 | consul_w_res = consul_client.txn_mset(new_record_expr_list) 601 | if not consul_w_res: 602 | logging.fatal("write_to_consul_error") 603 | else: 604 | 
logging.info("zero_new_heavy_record:{}") 605 | 606 | # 写入redis中 607 | mset_record_to_redis(today_all_dic) 608 | 609 | 610 | def run(): 611 | ''' 612 | 1.all prome query_log need to be scpped here 613 | 2.parse log 614 | 3.txn_mput to consul 615 | 4.merge result and meset to redis 616 | 5.generate record yaml file 617 | 618 | :return: 619 | ''' 620 | concurrent_log_parse(PROME_QUERY_LOG_DIR) 621 | 622 | 623 | yaml_path = G_VAR_YAML 624 | 625 | config = load_base_config(yaml_path) 626 | # path 627 | HEAVY_QUERY_THREHOLD = config.get("prome_query_log").get("heavy_query_threhold") 628 | PROME_QUERY_LOG_DIR = config.get("prome_query_log").get("local_work_dir") 629 | PROME_RECORD_FILE = config.get("prome_query_log").get("local_record_yml_dir") 630 | CHECK_HEAVY_QUERY_API = config.get("prome_query_log").get("check_heavy_query_api") 631 | # redis 632 | ONLINE_REDIS_HOST = config.get("redis").get("host") 633 | ONLINE_REDIS_PORT = int(config.get("redis").get("port")) 634 | REDIS_SET_KEY = config.get("redis").get("redis_set_key") 635 | REDIS_ONE_KEY_PREFIX = config.get("redis").get("redis_one_key_prefix") 636 | # consul 637 | CONSUL_RECORD_KEY_PREFIX = config.get("consul").get("consul_record_key_prefix") 638 | CONSUL_HOST = config.get("consul").get("host") 639 | CONSUL_PORT = config.get("consul").get("port") 640 | # heavy 641 | 642 | HEAVY_BLACKLIST_METRICS = config.get("heavy_blacklist_metrics") 643 | 644 | # print(HEAVY_BLACKLIST_METRICS) 645 | 646 | if __name__ == '__main__': 647 | if len(sys.argv) == 3 and sys.argv[1] == "run_log_parse_local_test": 648 | run_log_parse_local_test(sys.argv[2]) 649 | sys.exit(0) 650 | 651 | try: 652 | run() 653 | except Exception as e: 654 | logging.error("got_error:{}".format(e)) 655 | -------------------------------------------------------------------------------- /prome_heavy_expr_parse.yaml: -------------------------------------------------------------------------------- 1 | - name: fetch log and push expr to cache 2 | hosts: all 3 | user: root 4 | gather_facts: false 5 | vars_files: 6 | - config.yaml 7 | 8 | tasks: 9 | 10 | - name: fetch query log 11 | fetch: src={{ prome_query_log.prome_log_path }} dest={{ prome_query_log.local_work_dir }}/{{ inventory_hostname }}_query.log flat=yes validate_checksum=no 12 | register: result 13 | 14 | - name: Show debug info 15 | debug: var=result verbosity=0 16 | 17 | 18 | - name: localhost 19 | hosts: localhost 20 | user: root 21 | gather_facts: false 22 | vars_files: 23 | - config.yaml 24 | tasks: 25 | 26 | - name: merge result 27 | shell: /usr/bin/python3 {{ prome_query_log.py_name }} 28 | connection: local 29 | run_once: true 30 | 31 | register: result 32 | - name: Show debug info 33 | debug: var=result verbosity=0 34 | # useage : ansible-playbook -i all_prome_query prome_heavy_expr_parse.yaml 35 | 36 | -------------------------------------------------------------------------------- /prome_redirect.lua: -------------------------------------------------------------------------------- 1 | function get_str_md5(input_s) 2 | local resty_md5 = require "resty.md5" 3 | local md5 = resty_md5:new() 4 | if not md5 then 5 | ngx.log(ngx.ERR, "failed to create md5 object") 6 | return 7 | end 8 | 9 | local ok = md5:update(input_s) 10 | if not ok then 11 | ngx.log(ngx.ERR, "failed to add data") 12 | return 13 | end 14 | local digest = md5:final() 15 | 16 | local str = require "resty.string" 17 | local md5_str = str.to_hex(digest) 18 | return md5_str 19 | end 20 | 21 | function redis_get(key) 22 | -- start of redis 23 | 24 | local redis = 
require "resty.redis" 25 | local red = redis:new() 26 | --red:set_timeouts(1000, 1000, 1000) 27 | local ok, conn_err = red:connect("localhost", 6379) 28 | if not ok then 29 | ngx.log(ngx.ERR, "[redis]failed to connect redis server:", conn_err) 30 | return false 31 | end 32 | 33 | local res, get_err = red:get(key) 34 | if get_err then 35 | ngx.log(ngx.ERR, "[redis]failed to get value by key: ", key, "err:", get_err) 36 | return false 37 | end 38 | 39 | red:set_keepalive(30000, 1000) 40 | if res ~= ngx.null then 41 | ngx.log(ngx.INFO, "[redis]success get value by key: ", key, "value: ", res) 42 | return true 43 | else 44 | return false 45 | end 46 | 47 | -- end of redis 48 | end 49 | 50 | function replace_work() 51 | --Nginx服务器中使用lua获取get或post参数 52 | 53 | local request_method = ngx.var.request_method; 54 | local args = {} 55 | --获取参数的值 56 | 57 | if "GET" == request_method then 58 | args = ngx.req.get_uri_args(); 59 | elseif "POST" == request_method then 60 | ngx.req.read_body(); 61 | args = ngx.req.get_post_args(); 62 | end 63 | 64 | local q_query = args["query"]; 65 | local q_start = args["start"]; 66 | local q_end = args["end"]; 67 | local q_step = args["step"]; 68 | 69 | local md5_str = get_str_md5(q_query) 70 | 71 | if md5_str == null then 72 | return 73 | end 74 | local redis_query_key = "hke:heavy_expr:" .. md5_str 75 | --ngx.log(ngx.ERR, "redis_query_key: ",redis_query_key) 76 | local redis_get_res = redis_get(redis_query_key) 77 | if redis_get_res == true then 78 | q_query = redis_query_key 79 | end 80 | 81 | local new_args = {} 82 | new_args["query"] = q_query 83 | new_args["start"] = q_start 84 | new_args["end"] = q_end 85 | new_args["step"] = q_step 86 | 87 | ngx.req.set_uri_args(new_args) 88 | --ngx.req.set_uri_args("end=" .. q_end) 89 | --local arg = ngx.req.get_uri_args() 90 | --for k, v in pairs(arg) do 91 | -- ngx.say("[GET ] key:", k, " v:", v) 92 | --end 93 | 94 | end 95 | 96 | return replace_work(); -------------------------------------------------------------------------------- /re_work.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | prefix = "hke:heavy_expr" 4 | 5 | record_name_re = re.compile('.*?\"(%s:.*?)\".*?' 
% prefix) 6 | s = ' local_record_yml_dir/record_28_2021-09-13_14-41-16.yml: 80:11: group "heavy_expr_record", rule 27, "hke:heavy_expr:ed28d1000288d2c806827acfc2cfb48b": could not parse expression: 1:90: parse error: unexpected identifier "ormax_over_time"' 7 | record_name = re.findall(record_name_re, s) 8 | print(record_name) 9 | -------------------------------------------------------------------------------- /recovery_by_local_yaml.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | import requests 4 | import yaml 5 | import logging 6 | from multiprocessing.pool import ThreadPool 7 | from itertools import repeat 8 | from parse_prome_query_log import recovery_concurrent_log_parse 9 | 10 | logging.basicConfig( 11 | # TODO console 日志,上线时删掉 12 | # filename=LOG_PATH, 13 | format='%(asctime)s %(levelname)s %(filename)s %(funcName)s [line:%(lineno)d]:%(message)s', 14 | datefmt="%Y-%m-%d %H:%M:%S", 15 | level="INFO" 16 | ) 17 | 18 | 19 | def load_yaml(yal_path=r'C:\Users\Administrator\Desktop\record_2202_2020-04-29_23-30-24.yml'): 20 | f = open(yal_path) 21 | y = yaml.load(f) 22 | 23 | all_heavy = y.get("groups")[0].get("rules") 24 | msg = "get {} heavy_record".format(len(all_heavy)) 25 | logging.info(msg) 26 | return all_heavy 27 | 28 | 29 | def query_range(host, expr, key): 30 | ''' 31 | 32 | :param host: 33 | :param expr: 34 | 调用举例: 获取group=ugc的project 35 | 36 | query_range(inf, 37 | 'avg(100 - (avg by (instance,name) (rate(node_cpu_seconds_total{region=~"ap-southeast-3",account=~"HW-SHAREit",group=~"UGC",project=~"cassandra-client",name=~"UGC-cassandra-client-prod", mode="idle"}[5m])) * 100))') 38 | 39 | 40 | :return: 41 | { 42 | "status":"success", 43 | "data":{ 44 | "resultType":"matrix", 45 | "result":[ 46 | { 47 | "metric":{ 48 | 49 | }, 50 | "values":[ 51 | [ 52 | 1588149960, 53 | "0.1999999996688473" 54 | ], 55 | [ 56 | 1588150020, 57 | "0.20000000035872745" 58 | ], 59 | [ 60 | 1588150080, 61 | "0.19629629604793308" 62 | ], 63 | [ 64 | 1588150140, 65 | "0.19629629673781324" 66 | ], 67 | [ 68 | 1588150200, 69 | "0.1999999996688473" 70 | ], 71 | [ 72 | 1588150260, 73 | "0.2074074076005843" 74 | ] 75 | ] 76 | } 77 | ] 78 | } 79 | } 80 | ''' 81 | # logging.info("host:{} expr:{}".format(host, expr)) 82 | uri = '{}/api/v1/query_range'.format(host) 83 | 84 | end = int(time.time()) 85 | start = end - 60 * 60 86 | 87 | G_PARMS = { 88 | "query": expr, 89 | "start": start, 90 | "end": end, 91 | "step": 30 92 | } 93 | res = requests.get(uri, G_PARMS) 94 | data = res.json() 95 | now = int(time.time()) 96 | took = now - end 97 | if took > 4: 98 | return (key, True) 99 | return (key, False) 100 | 101 | 102 | def concurrency_query(): 103 | t_num = 20 104 | pool = ThreadPool(t_num) 105 | yaml_data = load_yaml() 106 | all_expr = [x.get("expr") for x in yaml_data][:100] 107 | all_key = [x.get("record") for x in yaml_data][:100] 108 | host = "http://localhost:9999" 109 | 110 | pars = zip(repeat(host), all_expr, all_key) 111 | results = pool.starmap(query_range, pars) 112 | 113 | pool.close() 114 | pool.join() 115 | for x in results: 116 | print(x) 117 | 118 | 119 | def recovery(yaml_path): 120 | yaml_data = load_yaml(yal_path=yaml_path) 121 | res_dic = {} 122 | 123 | for x in yaml_data: 124 | res_dic[x.get("record")] = x.get("expr") 125 | 126 | recovery_concurrent_log_parse(res_dic) 127 | # purge consul 128 | 129 | # purge redis 130 | 131 | 132 | if __name__ == '__main__': 133 | import sys 134 | 135 | yaml_path = sys.argv[1] 136 | 
recovery(yaml_path) 137 | -------------------------------------------------------------------------------- /recovery_heavy_metrics.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | 4 | last_file=`ls /App/tgzs/conf_dir/prome_heavy_expr_parse/local_record_yml/*yml -rt |tail -1` 5 | metrics=$1 6 | grep $1 ${last_file} ${last_file} -B 1 |grep "record: hke:heavy_expr" |sort |awk -F ":" '{print $NF}' |sort |uniq > to_del_record_key_file 7 | wc -l to_del_record_key_file 8 | 9 | python3 consul_delete.py 10 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | python-consul 2 | redis 3 | PyYaml 4 | ansible 5 | -------------------------------------------------------------------------------- /to_del_record_key_file: -------------------------------------------------------------------------------- 1 | 9133202933f1394e368971d59f3c9d67 2 | deb1781d21403b68571452323cc1d142 3 | 4ff1fc99d6ed14c62c30df3dbe192da7 4 | 627df33956d3fda20ddb2ef80fa7c46e 5 | -------------------------------------------------------------------------------- /部署.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | ## 01 在prometheus record机器上 安装confd 6 | 7 | - 下载 带分片功能的confd二进制 8 | ```shell script 9 | wget https://github.com/ning1875/confd/releases/download/v0.16.0/confd_shard-0.16.0-linux-amd64.tar.gz 10 | ``` 11 | 12 | 13 | - 创建目录 14 | 15 | ```shell script 16 | mkdir -p /etc/confd/{conf.d,templates} 17 | ``` 18 | 19 | - 主配置文件/etc/confd/conf.d/records.yml.toml ,注意dest要和你的prometheus目录一致 20 | 21 | ```shell script 22 | cat <<-"EOF" > /etc/confd/conf.d/records.yml.toml 23 | [template] 24 | prefix = "/prometheus" 25 | src = "records.yml.tmpl" 26 | dest = "/opt/app/prometheus/confd_record.yml" 27 | #shards=3 28 | #num=0 29 | keys = [ 30 | "/records" 31 | ] 32 | reload_cmd = "curl -X POST http://localhost:9090/-/reload" 33 | 34 | 35 | EOF 36 | ``` 37 | 38 | - shards代表分片总数,num代表第几个分片 39 | - record模板文件 /etc/confd/templates/records.yml.tmpl 40 | > 每个record单独的group分组,好处是互相不影响,缺点是group过多 41 | ```shell script 42 | cat <<-"EOF" > /etc/confd/templates/records.yml.tmpl 43 | groups: 44 | {{range gets "/records/*"}}{{$item := json .Value}} 45 | - name: {{$item.record}} 46 | rules: 47 | - record: {{$item.record}} 48 | expr: {{$item.expr}} 49 | {{end}} 50 | EOF 51 | ``` 52 | 53 | > 使用相同分组,需要按顺序执行record 54 | ```shell script 55 | cat <<-"EOF" > /etc/confd/templates/records.yml.tmpl 56 | groups: 57 | - name: confd_record 58 | interval: 30s 59 | rules:{{range gets "/records/*"}}{{$item := json .Value}} 60 | - record: {{$item.record}} 61 | expr: {{$item.expr}}{{end}} 62 | EOF 63 | 64 | ``` 65 | 66 | 67 | ### 指定consul backend 启动confd 68 | 69 | - onetime代表运行一次 70 | 71 | ```shell script 72 | confd -onetime --backend consul --node localhost:8500 --log-level debug 73 | ``` 74 | 75 | ```shell script 76 | cat < /etc/systemd/system/confd.service 77 | [Unit] 78 | Description=confd server 79 | Wants=network-online.target 80 | After=network-online.target 81 | 82 | [Service] 83 | ExecStart=/usr/bin/confd --backend consul --node 172.20.70.205:8500 --log-level debug -interval=30 84 | StandardOutput=syslog 85 | StandardError=syslog 86 | SyslogIdentifier=confd 87 | [Install] 88 | WantedBy=default.target 89 | EOF 90 | 91 | # 启动服务 92 | systemctl daemon-reload && systemctl start confd 93 | 94 | systemctl status confd 95 | 96 | ``` 
97 | 98 | ## 02 中控机上部署consul redis ansible 99 | 100 | 101 | 102 | ### consul 安装 103 | 104 | #### 准备工作 105 | 106 | ```shell 107 | 108 | # 下载consul 109 | wget -O /opt/tgzs/consul_1.9.4_linux_amd64.zip https://releases.hashicorp.com/consul/1.9.4/consul_1.9.4_linux_amd64.zip 110 | 111 | cd /opt/tgzs/ 112 | unzip consul_1.9.4_linux_amd64.zip 113 | 114 | /bin/cp -f consul /usr/bin/ 115 | 116 | 117 | ``` 118 | 119 | #### 启动单机版consul 120 | 121 | ```shell 122 | 123 | # 124 | mkdir /opt/app/consul 125 | 126 | # 准备配置文件 127 | cat < /opt/app/consul/single_server.json 128 | { 129 | "datacenter": "dc1", 130 | "node_name": "consul-svr-01", 131 | "server": true, 132 | "bootstrap_expect": 1, 133 | "data_dir": "/opt/app/consul/", 134 | "log_level": "INFO", 135 | "log_file": "/opt/logs/", 136 | "ui": true, 137 | "bind_addr": "0.0.0.0", 138 | "client_addr": "0.0.0.0", 139 | "retry_interval": "10s", 140 | "raft_protocol": 3, 141 | "enable_debug": false, 142 | "rejoin_after_leave": true, 143 | "enable_syslog": false 144 | } 145 | EOF 146 | 147 | # 多个ip地址时,将bind_addr 改为一个内网的ip 148 | 149 | # 写入service文件 150 | cat < /etc/systemd/system/consul.service 151 | [Unit] 152 | Description=consul server 153 | Wants=network-online.target 154 | After=network-online.target 155 | 156 | [Service] 157 | ExecStart=/usr/bin/consul agent -config-file=/opt/app/consul/single_server.json 158 | StandardOutput=syslog 159 | StandardError=syslog 160 | SyslogIdentifier=consul 161 | [Install] 162 | WantedBy=default.target 163 | EOF 164 | 165 | # 启动服务 166 | systemctl daemon-reload && systemctl start consul 167 | 168 | systemctl status consul 169 | 170 | 171 | ``` 172 | 173 | #### 验证访问 174 | 175 | - http://localhost:8500/ 176 | 177 | ## 03 将pre_query 放到中控机上 178 | - all_prome_query 中的prometheus query ip改为自己的 179 | - prometheus query 开启query log 180 | ```yaml 181 | global: 182 | query_log_file: /App/logs/prometheus_query.log 183 | ``` 184 | - config.yaml 填写相关配置项 185 | 186 | ## 04 执行pre_query中的分析record命令 187 | > 将promtool 复制到当前目录用作 record promql的check 188 | - /bin/cp -f /opt/app/prometheus/promtool 189 | 190 | 191 | > pre_query目录下执行ansible命令 192 | ```shell script 193 | ansible-playbook -i all_prome_query prome_heavy_expr_parse.yaml 194 | ``` 195 | 196 | > 检查本地record yaml 197 | ```shell script 198 | [root@k8s-master01 pre_query]# ll local_record_yml_dir/ 199 | total 12 200 | -rw-r--r-- 1 root root 551 Sep 13 15:53 record_2_2021-09-13.yml 201 | -rw-r--r-- 1 root root 5455 Sep 13 15:53 record_26_2021-09-13.yml 202 | [root@k8s-master01 pre_query]# head local_record_yml_dir/record_26_2021-09-13.yml 203 | groups: 204 | - name: heavy_expr_record 205 | rules: 206 | - record: hke:heavy_expr:082a631dfddb7cf65ddd0fb4923ab17e 207 | expr: rate(mysql_global_status_sort_scan{instance=~"172.20.70.205:9104"}[5s]) 208 | or irate(mysql_global_status_sort_scan{instance=~"172.20.70.205:9104"}[5m]) 209 | - record: hke:heavy_expr:1416fc3de389e2a5c36aa5c8c376391f 210 | expr: mysql_global_status_threads_cached{instance=~"172.20.70.205:9104"} 211 | - record: hke:heavy_expr:14e8a540527123cc11ad96c5faa03f43 212 | expr: irate(mysql_slave_status_relay_log_pos{instance=~"172.20.70.205:9104"}[5m]) 213 | ``` 214 | 215 | > 检查consul中的记录 216 | ```shell script 217 | curl http://localhost:8500/v1/kv/prometheus/record?recurse= |python -m json.tool 218 | { 219 | "CreateIndex": 585468, 220 | "Flags": 0, 221 | "Key": "prometheus/records/6", 222 | "LockIndex": 0, 223 | "ModifyIndex": 585468, 224 | "Value": 
"eyJyZWNvcmQiOiAiaGtlOmhlYXZ5X2V4cHI6MjY1YzUwMzMxZjRiNzk4MzRjMzc1MDY2ZTY2NWQ4NDYiLCAiZXhwciI6ICJyYXRlKG15c3FsX2dsb2JhbF9zdGF0dXNfY3JlYXRlZF90bXBfdGFibGVze2luc3RhbmNlPX5cIjE3Mi4yMC43MC4yMDU6OTEwNFwifVs1c10pIG9yIGlyYXRlKG15c3FsX2dsb2JhbF9zdGF0dXNfY3JlYXRlZF90bXBfdGFibGVze2luc3RhbmNlPX5cIjE3Mi4yMC43MC4yMDU6OTEwNFwifVs1bV0pIn0=" 225 | }, 226 | { 227 | "CreateIndex": 585468, 228 | "Flags": 0, 229 | "Key": "prometheus/records/7", 230 | "LockIndex": 0, 231 | "ModifyIndex": 585468, 232 | "Value": "eyJyZWNvcmQiOiAiaGtlOmhlYXZ5X2V4cHI6MjZkODYwNzY4NzcxOTUyOTc3ZGNiZjUzYzU3ZWZhNTUiLCAiZXhwciI6ICJyYXRlKG15c3FsX2dsb2JhbF9zdGF0dXNfcXVlcmllc3tpbnN0YW5jZT1+XCIxNzIuMjAuNzAuMjA1OjkxMDRcIn1bNXNdKSBvciBpcmF0ZShteXNxbF9nbG9iYWxfc3RhdHVzX3F1ZXJpZXN7aW5zdGFuY2U9flwiMTcyLjIwLjcwLjIwNTo5MTA0XCJ9WzVtXSkifQ==" 233 | }, 234 | ``` 235 | 236 | 237 | 238 | > 检测部署了confd的 prometheus record 上的record文件内容 239 | ```shell script 240 | [root@k8s-master01 pre_query]# cat /opt/app/prometheus/confd_record.yml |head 241 | groups: 242 | 243 | - name: hke:heavy_expr:082a631dfddb7cf65ddd0fb4923ab17e 244 | rules: 245 | - record: hke:heavy_expr:082a631dfddb7cf65ddd0fb4923ab17e 246 | expr: rate(mysql_global_status_sort_scan{instance=~"172.20.70.205:9104"}[5s]) or irate(mysql_global_status_sort_scan{instance=~"172.20.70.205:9104"}[5m]) 247 | 248 | - name: hke:heavy_expr:4b93ce0bd3db2848e1b6d330a03272f7 249 | rules: 250 | - record: hke:heavy_expr:4b93ce0bd3db2848e1b6d330a03272f7 251 | ``` 252 | 253 | > prometheus record页面上检查 聚合规则并查询数据 254 | - 截图 255 | 256 | > 检查redis中的key 257 | ```shell script 258 | [root@k8s-master01 pre_query]# redis-cli keys "hke:heavy_expr*" 259 | 1) "hke:heavy_expr:bc7775bb5e33bf84afa9a1d4c0c45a9a" 260 | 2) "hke:heavy_expr:de2548ae6a00a90b1c2f85f8d6d9f13b" 261 | 3) "hke:heavy_expr:d86e3aa799b6a84790e133aa8a306e96" 262 | 4) "hke:heavy_expr:4fe8ee091e7823b66b475ba05b5fd030" 263 | 5) "hke:heavy_expr:b96a96befac765f6c00743a82ffae053" 264 | 6) "hke:heavy_expr:513ddfbf6f83d1ba1dd9b0b4a21a43bf" 265 | 7) "hke:heavy_expr:2998d2677fc1873a0e46802cbdd1bfee" 266 | 8) "hke:heavy_expr:22ccf0a71b6651763d1b7c16f5c05365" 267 | 9) "hke:heavy_expr:0d8c4be4ea8dccb9f06389246a02c6b3" 268 | 10) "hke:heavy_expr:f30b7b481bb0fdee0466902b9abb3b35" 269 | 11) "hke:heavy_expr:298afe40c3479e217b0b0b3666bd6904" 270 | 12) "hke:heavy_expr:bebca671decc9d5954af35628a05baa2" 271 | 13) "hke:heavy_expr:db9f0c1be81f91c95d9eb617ab70da36" 272 | 14) "hke:heavy_expr:45d5dc64bef02cf3f515481747cccd80" 273 | 15) "hke:heavy_expr:d797f93ad8ec0f7c80a5617eb5e4f3d8" 274 | 16) "hke:heavy_expr:eb1637bfe8f1388e99659d4621a79367" 275 | 17) "hke:heavy_expr:26d860768771952977dcbf53c57efa55" 276 | 18) "hke:heavy_expr:25bc18bd90a1a69d950802d937d337a0" 277 | 19) "hke:heavy_expr:d8aaf244a86fcfae8e51aeeb6935a5a5" 278 | 20) "hke:heavy_expr:189831b5aaa2d688c49a9c717fbf8b3d" 279 | ``` 280 | 281 | 282 | ## 05 confd分片功能演示 283 | > 默认不开启分片 ,shards 和num注释掉就可以 284 | - confd配置文件 /etc/confd/conf.d/records.yml.toml 285 | ```yaml 286 | [template] 287 | prefix = "/prometheus" 288 | src = "records.yml.tmpl" 289 | dest = "/opt/app/prometheus/confd_record.yml" 290 | #shards=2 291 | #num=0 292 | keys = [ 293 | "/records" 294 | ] 295 | reload_cmd = "curl -X POST http://localhost:9090/-/reload" 296 | 297 | 298 | ``` 299 | - prometheus record 通过的结果 46个 300 | ```shell script 301 | [root@k8s-master01 conf.d]# confd -onetime --backend consul --node localhost:8500 302 | 2021-09-13T16:45:15+08:00 k8s-master01 confd[30010]: INFO Backend set to consul 303 | 2021-09-13T16:45:15+08:00 k8s-master01 confd[30010]: 
INFO Starting confd 304 | 2021-09-13T16:45:15+08:00 k8s-master01 confd[30010]: INFO Backend source(s) set to localhost:8500 305 | 2021-09-13T16:45:15+08:00 k8s-master01 confd[30010]: INFO t.shards:0,t.nums:0 306 | [root@k8s-master01 conf.d]# /opt/app/prometheus/promtool check rules /opt/app/prometheus/confd_record.yml 307 | Checking /opt/app/prometheus/confd_record.yml 308 | SUCCESS: 46 rules found 309 | 310 | ``` 311 | 312 | > 开启分片 配置 shards=2 num=0 代表 2分片中的第一个 313 | - confd配置文件 /etc/confd/conf.d/records.yml.toml 314 | ```yaml 315 | [template] 316 | prefix = "/prometheus" 317 | src = "records.yml.tmpl" 318 | dest = "/opt/app/prometheus/confd_record.yml" 319 | shards=2 320 | num=0 321 | keys = [ 322 | "/records" 323 | ] 324 | reload_cmd = "curl -X POST http://localhost:9090/-/reload" 325 | ``` 326 | 327 | 328 | - prometheus record 通过的结果 23个 329 | ```shell script 330 | [root@k8s-master01 conf.d]# confd -onetime --backend consul --node localhost:8500 331 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Backend set to consul 332 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Starting confd 333 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Backend source(s) set to localhost:8500 334 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO t.shards:2,t.nums:0 335 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO /opt/app/prometheus/confd_record.yml has md5sum a0c39c7a73d741ec911b64a6eb5d1b8c should be 50ad6045ba32557c64037702bbc2613c 336 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Target config /opt/app/prometheus/confd_record.yml out of sync 337 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Target config /opt/app/prometheus/confd_record.yml has been updated 338 | [root@k8s-master01 conf.d]# /opt/app/prometheus/promtool check rules /opt/app/prometheus/confd_record.yml 339 | Checking /opt/app/prometheus/confd_record.yml 340 | SUCCESS: 23 rules found 341 | 342 | [root@k8s-master01 conf.d 343 | 344 | ``` 345 | 346 | 347 | 348 | ## 06 openresty和lua组件,新增grafana数据源 349 | 350 | > 安装openresty ,准备lua环境 351 | ```shell script 352 | yum install yum-utils -y 353 | yum-config-manager --add-repo https://openresty.org/package/centos/openresty.repo 354 | yum install openresty openresty-resty -y 355 | ``` 356 | 357 | 358 | 359 | > 修改信息 360 | - 修改prome_redirect.lua 文件中的 27 行 localhost redis地址为你自己的 361 | - 修改ngx_prome_redirect.conf文件中 真实real_prometheus后端,使用前请修改 362 | 363 | > 将nginx配置和lua文件放到指定目录 364 | ```shell script 365 | 366 | mkdir -pv /usr/local/openresty/nginx/conf/conf.d/ 367 | mkdir -pv /usr/local/openresty/nginx/lua_files/ 368 | /bin/cp -f ngx_prome_redirect.conf /usr/local/openresty/nginx/conf/conf.d/ 369 | /bin/cp -f nginx.conf /usr/local/openresty/nginx/conf/ 370 | /bin/cp -f prome_redirect.lua /usr/local/openresty/nginx/lua_files/ 371 | 372 | ``` 373 | 374 | > 启动openresty 375 | ```shell script 376 | systemctl enable openresty 377 | systemctl start openresty 378 | ``` 379 | 380 | > 请求OpenResty 9992端口 ,出现/graph则正常 381 | ```shell script 382 | [root@k8s-master01 pre_query]# curl localhost:9992/ 383 | Found. 
384 | ``` 385 | 386 | > openresty查看日志 387 | ```shell script 388 | tail -f /usr/local/openresty/nginx/logs/access.log 389 | ``` 390 | 391 | > 修改grafana数据源,将原来的指向真实prometheus地址改为指向openresty的9992端口 392 | - 截图 393 | 394 | 395 | > 之前查询慢的大盘导出一份,再导入,选择新的9992数据源 查看对比 396 | - 截图 397 | 398 | 399 | 400 | ## 运维指南 401 | ``` 402 | # 查看redis中的heavy_query记录 403 | redis-cli -h $redis_host keys hke:heavy_expr* 404 | # 查看consul中的heavy_query记录 405 | curl http://$consul_addr:8500/v1/kv/prometheus/record?recurse= |python -m json.tool 406 | # 根据一个heavy_record文件恢复记录 407 | python3 recovery_by_local_yaml.py local_record_yml/record_to_keep.yml 408 | # 根据一个metric_name前缀删除record记录 409 | bash -x recovery_heavy_metrics.sh $metric_name 410 | ``` 411 | 412 | 413 | ## 总结 414 | - 使用OpenResty的数据源 不会影响未配置预聚合的图 415 | - 因为只是nginx代理了一下,如果redis中没有要替换的expr就会以原查询ql查询 416 | --------------------------------------------------------------------------------
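To make the summary above concrete: the whole rewrite hinges on the md5 convention shared by parse_prome_query_log.py (which names every record hke:heavy_expr:<md5-of-expr>) and prome_redirect.lua (which hashes the incoming query parameter and looks that key up in redis). The commands below are only a manual sanity check, assuming redis, consul and OpenResty on localhost and using the sample expr from the code's docstrings; the key will only exist if that exact expr was actually captured as a heavy query on your setup.

```shell script
# md5 of the raw query string == the suffix of the corresponding hke:heavy_expr:<md5> key
echo -n 'sum(kafka_log_log_size{project=~"metis-main1-sg2"}) by (topic)' | md5sum

# list what the parser has published to redis and consul
redis-cli keys "hke:heavy_expr:*"
curl -s http://localhost:8500/v1/kv/prometheus/records?recurse= | python -m json.tool

# watch requests arriving at OpenResty while comparing a dashboard on the 9992 data source;
# on a redis hit the lua rewrites query=<expr> to query=hke:heavy_expr:<md5>, otherwise the
# request is proxied to real_prometheus unchanged
tail -f /usr/local/openresty/nginx/logs/access.log
```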