├── .gitattributes
├── .gitignore
├── README.md
├── __init__.py
├── all_prome_query
├── config.yaml
├── consul_delete.py
├── images
│   ├── arch.jpg
│   └── heavy_query_diff.png
├── init.sh
├── libs.py
├── nginx.conf
├── ngx_prome_redirect.conf
├── parse_prome_query_log.py
├── prome_heavy_expr_parse.yaml
├── prome_redirect.lua
├── re_work.py
├── recovery_by_local_yaml.py
├── recovery_heavy_metrics.sh
├── requirements.txt
├── to_del_record_key_file
└── 部署.md

/.gitattributes:
--------------------------------------------------------------------------------
1 | *.html linguist-language=python
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | bin/
2 | bin/*
3 | out/
4 | *.swp
5 | *.swo
6 | *.tar.gz
7 | docs/_site
8 | .idea/
9 | package_cache_tmp/
10 | /.idea
11 | *DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # k8s source-code reading courses (3 courses combined into one big course)
2 | - [k8s internals and source code explained: essentials](https://ke.qq.com/course/4093533)
3 | - [k8s internals and source code explained: advanced](https://ke.qq.com/course/4236389)
4 | - [k8s pure source-code reading course, helping you become a k8s expert](https://ke.qq.com/course/4697341)
5 | 
6 | 
7 | # k8s advanced operations and tuning courses
8 | - [k8s operations master course](https://ke.qq.com/course/5586848)
9 | - [tekton end-to-end pipeline practice and pipeline internals source reading](https://ke.qq.com/course/5467720)
10 | 
11 | # k8s secondary-development courses
12 | - [k8s secondary development: a scheduler based on real load](https://ke.qq.com/course/5814034)
13 | - [k8s operator and CRD hands-on development, helping you become a k8s expert](https://ke.qq.com/course/5458555)
14 | 
15 | # Prometheus full-component courses
16 | - [01 Prometheus components: configuration and usage, internals, high-availability practice](https://ke.qq.com/course/3549215)
17 | - [02 prometheus-thanos usage and source-code reading](https://ke.qq.com/course/3883439)
18 | - [03 kube-prometheus and prometheus-operator in practice and how they work](https://ke.qq.com/course/3912017)
19 | - [04 Prometheus source-code walkthrough and secondary development](https://ke.qq.com/course/4236995)
20 | 
21 | # Go language courses
22 | - [golang basics](https://ke.qq.com/course/4334898)
23 | - [golang ops platform in practice: service tree, log monitoring, job execution, distributed probing](https://ke.qq.com/course/4334675)
24 | - [golang ops development in practice: a k8s inspection platform](https://ke.qq.com/course/5818923)
25 | 
26 | # Live Q&A on SRE career development
27 | - [k8s/prometheus course Q&A and ops-development career planning](https://ke.qq.com/course/5506477)
28 | 
29 | 
30 | # On freeloading vs. paying
31 | - Freeloading is perfectly fine; I have already contributed plenty of articles and open-source projects, plus free videos
32 | - Objectively though, only if you are very capable can you freeload forever: you can read the source code and solve any problem yourself
33 | - A lot of material looks free, but most of it is scraps; the core content is never free, and no expert will be there to answer your questions
34 | - If you have a solid grip on the Prometheus source code, plus some k8s knowledge, thanos and kube-prometheus will not feel hard
35 | 
36 | 
37 | # Architecture diagrams
38 | ![image](https://github.com/ning1875/pre_query/blob/master/images/arch.jpg)
39 | ![image](https://github.com/ning1875/pre_query/blob/master/images/heavy_query_diff.png)
40 | 
41 | # Why heavy_query happens
42 | ## Resource cost
43 | - Every TSDB compresses datapoints, e.g. with delta-of-delta and XOR encoding
44 | - So at query time the data inevitably has to be decompressed, which amplifies resource usage
45 | - A normal uncompressed datapoint is about 16 bytes
46 | - A heavy_query that loads 10,000 series over a 24-hour range at one point per 30s needs roughly 10,000 × 2,880 points × 16 bytes ≈ 439MB of memory, so a few concurrent heavy queries can blow up Prometheus memory; Prometheus ships a pile of query-limit flags to guard against exactly this
47 | - Besides the queryPreparation phase mentioned above, query execution also spends time in sort, eval, and so on
48 | 
49 | ## Prometheus does not natively support downsampling
50 | - Another reason is that Prometheus has no native downsampling, so no matter how the Grafana step changes with the time range, every query still decompresses the selected blocks and then slices the raw points by step
51 | - So the larger the query time range, the more CPU and memory it burns; the raw points are also partly wasted, because Grafana's step grows as the time range grows
52 | ## Real-time query/aggregation VS pre-query/pre-aggregation
53 | Prometheus queries are all real-time query/aggregation
54 | **The advantage of real-time queries is obvious**
55 | - Query/aggregation conditions can be combined freely, e.g. rate, then sum, then a histogram_quantile on top
56 | 
57 | **The drawback of real-time queries is just as obvious**
58 | - They are slow, or rather resource-hungry
59 | **The pros and cons of pre-query/pre-aggregation are exactly the reverse**
60 | For a pre-aggregation example, see the falcon component I wrote: [Monitoring aggregator series: open-falcon's new aggregator polymetric](https://segmentfault.com/a/1190000023092934)
61 | - All aggregation rules are defined in advance and their results are computed periodically
62 | - At query time there is no aggregation at all, you simply read the precomputed result
63 | - For example, where real-time aggregation loads 100,000 series per query, pre-aggregation only needs to read a few result series
64 | **So, does Prometheus have pre-query/pre-aggregation?**
65 | Yes, it does
66 | ## Prometheus pre-query/pre-aggregation
67 | [prometheus record](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
68 | 
69 | Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. Querying the precomputed result will then often be much faster than executing the original expression every time it is needed. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh.
70 | 
71 | # About this project
72 | ## Solution overview
73 | - To users, a heavy_query shows up as a slow query
74 | - On the server side it causes excessive resource usage and can even take down the backend storage
75 | - If a query matches a heavy_query rule (currently: response time over 2 seconds), it is replaced with a lightweight query against the precomputed result; both forms return the same data
76 | - Queries that do not match are served as the original query
77 | - After the rewrite the metric name becomes `hke:heavy_expr:xxxx` while its tags stay unchanged. Most panels already define a legend for their series, so the display looks identical
78 | - heavy_query rules are now updated incrementally every night at 23:30. Most existing dashboards are unaffected (the existing heavy_query records have already been running for 7+ days); newly added rules start showing data from the moment they take effect, which still guarantees 10+ hours of data by the daytime query peak
79 | 
80 | ## Code architecture
81 | - The parse component analyzes the Prometheus query log to find heavy_query records
82 | - Each record is hashed and written incrementally into consul and the redis cluster
83 | - Each Prometheus instance uses confd to pull its own shard of the consul data and generate record.yml
84 | - Prometheus evaluates those recording rules (pre-query aggregation) and writes the results into its TSDB
85 | - The lua layer in front of the query path hashes the expr that Grafana sends
86 | - The hash is matched against the records in redis; a hit means this query is a heavy_query
87 | - In that case its expr is replaced before the query is forwarded to the backend
88 | 
89 | 
90 | ## Usage guide
91 | 
92 | [Installation & deployment](./部署.md)
93 | 
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ning1875/pre_query/07387e462e1d3e7f2efb97d9a74a4a1541ae1762/__init__.py
--------------------------------------------------------------------------------
/all_prome_query:
--------------------------------------------------------------------------------
1 | 172.20.70.205
2 | 172.20.70.215
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | prome_query_log:
2 |   prome_log_path: /opt/logs/prometheus_query.log # path of the prometheus query log file; keep it consistent with the prometheus query-log setting
3 |   heavy_query_threhold: 0.0001 # heavy_query threshold in seconds; adjust to your situation
4 |   py_name: parse_prome_query_log.py # main script name, do not change
5 |   local_work_dir: all_prome_query_log # where the parser stores the fetched query logs, default ./all_prome_query_log
6 |   local_record_yml_dir: local_record_yml_dir # where the local record yaml results are written, default ./local_record_yml_dir
7 |   check_heavy_query_api: http://localhost:9090 # a prometheus query endpoint used to double-check that a record really is heavy and avoid false additions; disabled by default
8 | 
9 | redis:
10 |   host: localhost # redis address
11 |   port: 6379
12 |   redis_set_key: hke:heavy_query_set
13 |   redis_one_key_prefix: hke:heavy_expr # heavy_query key prefix
14 | consul:
15 |   host: localhost # consul address
16 |   port: 8500
17 |   consul_record_key_prefix: prometheus/records # heavy_query key prefix
18 | 
19 | heavy_blacklist_metrics: # blacklisted metric names
20 |   - kafka_log_log_logendoffset
21 |   - requests_latency_bucket
22 |   - count(node_cpu_seconds_total)
23 |   - '{__name__=~".+"}'
24 |   - '{__name__=~".*"}'
--------------------------------------------------------------------------------
/consul_delete.py:
--------------------------------------------------------------------------------
1 | import json
2 | 
3 | import consul
4 | import time
5 | 
6 | import redis
7 | 
8 | 
9 | class Consul(object):
10 |     def __init__(self, host, port):
11 |         '''Initialize a connection to the consul server'''
12 |         self._consul = consul.Consul(host, port)
13 | 
14 |     def RegisterService(self,
name, service_id, host, port, check_url, tags=None): 15 | tags = tags or [] 16 | # 注册服务 17 | self._consul.agent.service.register( 18 | name, 19 | service_id, 20 | host, 21 | port, 22 | tags, 23 | # 健康检查ip端口,检查时间:5,超时时间:30,注销时间:30s 24 | check=consul.Check().http(check_url, "5s", "5s")) 25 | 26 | def GetService(self, name): 27 | res = self._consul.health.service(name, passing=True) 28 | print(res) 29 | # services = self._consul.agent.services() 30 | # print(services) 31 | # service = services.get(name) 32 | # 33 | # if not service: 34 | # return None, None 35 | # addr = "{0}:{1}".format(service['Address'], service['Port']) 36 | # return service, addr 37 | 38 | def delete_key(self, key='prometheus/records'): 39 | res = self._consul.kv.delete(key, recurse=True) 40 | return res 41 | 42 | def get_key_by_record(self, key='prometheus/records', record=""): 43 | res = self._consul.kv.get(key, recurse=True) 44 | data = res[1] 45 | if not data: 46 | return None 47 | 48 | for i in data: 49 | key = i.get("Key") 50 | 51 | v = json.loads(i.get('Value').decode("utf-8")) 52 | if record in v.get('record'): 53 | print(key, v.get('record'), v.get('expr')) 54 | return key 55 | return None 56 | 57 | 58 | def redis_conn(): 59 | redis_host = "localhost" 60 | 61 | redis_port = 6379 62 | conn = redis.Redis(host=redis_host, port=redis_port) 63 | return conn 64 | 65 | 66 | def delete_key(): 67 | to_del_keys = [] 68 | with open('to_del_record_key_file') as f: 69 | to_del_keys = [x.strip() for x in f.readlines()] 70 | print(to_del_keys) 71 | 72 | host = 'localhost' 73 | port = 8500 74 | 75 | consul_record_key_prefix = 'prometheus/records' 76 | consul_client = Consul(host, port) 77 | redis_c = redis_conn() 78 | 79 | for key in to_del_keys: 80 | if not key: 81 | continue 82 | to_del_key = consul_client.get_key_by_record(record=key) 83 | print(to_del_key) 84 | if to_del_key: 85 | consul_client.delete_key(to_del_key) 86 | redis_key = "hke:heavy_expr:{}".format(key) 87 | delete_res = redis_c.delete(redis_key) 88 | print(delete_res) 89 | 90 | 91 | def run_register(): 92 | host = 'localhost' 93 | port = 8500 94 | consul_client = Consul(host, port) 95 | 96 | s_name = "pushgateway_a" 97 | s_hosts = [ 98 | 'localhost', 99 | ] 100 | 101 | s_port = 9091 102 | 103 | for h in s_hosts: 104 | s_check_url = 'http://{}:{}/-/healthy'.format(h, s_port) 105 | # consul_client.RegisterService(s_name, h, h, s_port, s_check_url) 106 | # check = consul.Check().http(s_check_url, "5s", "5s", "5s") 107 | # print(check) 108 | 109 | res = consul_client._consul.agent.service.deregister(s_name) 110 | print(res) 111 | res = consul_client.GetService(s_name) 112 | # print(res[0]) 113 | 114 | 115 | def run_query(): 116 | host = 'localhost' 117 | port = 8500 118 | consul_record_key_prefix = 'prometheus/records' 119 | consul_client = Consul(host, port) 120 | 121 | 122 | if __name__ == '__main__': 123 | # run_register() 124 | # run_query() 125 | delete_key() 126 | -------------------------------------------------------------------------------- /images/arch.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ning1875/pre_query/07387e462e1d3e7f2efb97d9a74a4a1541ae1762/images/arch.jpg -------------------------------------------------------------------------------- /images/heavy_query_diff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ning1875/pre_query/07387e462e1d3e7f2efb97d9a74a4a1541ae1762/images/heavy_query_diff.png 
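A brief usage note on consul_delete.py above (recovery_heavy_metrics.sh further down drives it the same way): put the md5 suffixes of the records you want to drop into to_del_record_key_file, one per line, and run the script; for each hash it looks up the matching prometheus/records/* key in consul, deletes it, and then deletes the hke:heavy_expr:<md5> key from redis. A minimal sketch, assuming the localhost consul/redis addresses hard-coded in the script:

```shell script
# Hashes can be taken from `redis-cli keys "hke:heavy_expr:*"`; the one below is the
# first entry of the sample to_del_record_key_file shipped in this repo.
cat > to_del_record_key_file <<'EOF'
9133202933f1394e368971d59f3c9d67
EOF

python3 consul_delete.py
```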
-------------------------------------------------------------------------------- /init.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # 安装指南 4 | # parse组件 在python3.6+中运行 5 | #1.安装依赖 6 | # pip3 install -r requirements.txt 7 | 8 | #2. 修改config.yaml中各个配置 9 | #3. 准备真实prometheus地址写入all_prome_query 10 | #4. 运行添加crontab 每晚11:30定时运行一次即可 11 | ansible-playbook -i all_prome_query prome_heavy_expr_parse.yaml 12 | 13 | 14 | # prometheus 和confd组件 15 | # 1.安装prometheus 和confd 16 | # 将confd下的配置文件放置好,启动服务 17 | # prometheus开启query_log 18 | ``` 19 | global: 20 | query_log_file: /App/logs/prometheus_query.log 21 | ``` 22 | 23 | # openresty组件 24 | #1. 安装openresty ,准备lua环境 25 | yum install yum-utils -y 26 | yum-config-manager --add-repo https://openresty.org/package/centos/openresty.repo 27 | yum install openresty openresty-resty -y 28 | 29 | #2. 30 | # 修改lua文件中的redis地址为你自己的 31 | # 修改ngx_prome_redirect.conf文件中 真实real_prometheus后端,使用前请修改 32 | 33 | mkdir -pv /usr/local/openresty/nginx/conf/conf.d/ 34 | mkdir -pv /usr/local/openresty/nginx/lua_files/ 35 | 36 | #3. 37 | # 将nginx配置和lua文件放到指定目录 38 | /bin/cp -f ngx_prome_redirect.conf /usr/local/openresty/nginx/conf/conf.d/ 39 | /bin/cp -f nginx.conf /usr/local/openresty/nginx/conf/ 40 | /bin/cp -f prome_redirect.lua /usr/local/openresty/nginx/lua_files/ 41 | 42 | 43 | #4. 44 | # 启动openresty 45 | systemctl enable openresty 46 | systemctl start openresty 47 | 48 | #5. 49 | # 修改grafana数据源,将原来的指向真实prometheus地址改为指向openresty的9992端口 50 | 51 | 52 | # 运维操作 53 | # 查看redis中的heavy_query记录 54 | redis-cli -h $redis_host keys hke:heavy_expr* 55 | # 查看consul中的heavy_query记录 56 | curl http://$consul_addr:8500/v1/kv/prometheus/record?recurse= |python -m json.tool 57 | # 根据一个heavy_record文件恢复记录 58 | python3 recovery_by_local_yaml.py local_record_yml/record_to_keep.yml 59 | # 根据一个metric_name前缀删除record记录 60 | bash -x recovery_heavy_metrics.sh $metric_name 61 | 62 | 63 | 64 | 65 | 66 | 67 | -------------------------------------------------------------------------------- /libs.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import hashlib 3 | 4 | import yaml 5 | 6 | def now_date_str(): 7 | # return datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") 8 | return datetime.datetime.now().strftime("%Y-%m-%d") 9 | 10 | 11 | def get_str_md5(input_str): 12 | m = hashlib.md5() 13 | m.update(input_str) 14 | return m.hexdigest() 15 | 16 | 17 | def load_base_config(yaml_path): 18 | with open(yaml_path) as f: 19 | config = yaml.load(f,Loader=yaml.FullLoader) 20 | return config 21 | -------------------------------------------------------------------------------- /nginx.conf: -------------------------------------------------------------------------------- 1 | #user nobody; 2 | worker_processes auto; 3 | 4 | 5 | #error_log logs/error.log; 6 | #error_log logs/error.log notice; 7 | #error_log logs/error.log info; 8 | 9 | #pid logs/nginx.pid; 10 | worker_rlimit_nofile 60000; 11 | 12 | events 13 | { 14 | use epoll; 15 | worker_connections 60000; 16 | } 17 | 18 | 19 | 20 | http { 21 | include mime.types; 22 | default_type text/html; 23 | 24 | charset UTF-8; 25 | server_names_hash_bucket_size 128; 26 | client_header_buffer_size 4k; 27 | large_client_header_buffers 4 32k; 28 | 29 | 30 | #log_format main '$remote_addr - $remote_user [$time_local] "$request" ' 31 | # '$status $body_bytes_sent "$http_referer" ' 32 | # '"$http_user_agent" "$http_x_forwarded_for"'; 33 | 34 | #access_log 
logs/access.log main; 35 | 36 | sendfile on; 37 | #tcp_nopush on; 38 | 39 | #keepalive_timeout 0; 40 | keepalive_timeout 65; 41 | 42 | #gzip on; 43 | include /usr/local/openresty/nginx/conf/conf.d/*.conf; 44 | 45 | } 46 | -------------------------------------------------------------------------------- /ngx_prome_redirect.conf: -------------------------------------------------------------------------------- 1 | # 真实prometheus后端,使用前请修改 2 | upstream real_prometheus { 3 | 4 | server 172.20.70.205:9090; 5 | server 172.20.70.215:9090; 6 | 7 | } 8 | 9 | 10 | 11 | server{ 12 | listen 9992; 13 | server_name _; 14 | location / { 15 | proxy_set_header Host $host:$server_port; 16 | proxy_pass http://real_prometheus; 17 | } 18 | location /api/v1/query_range { 19 | access_by_lua_file /usr/local/openresty/nginx/lua_files/prome_redirect.lua; 20 | proxy_pass http://real_prometheus; 21 | } 22 | 23 | 24 | } 25 | 26 | -------------------------------------------------------------------------------- /parse_prome_query_log.py: -------------------------------------------------------------------------------- 1 | import base64 2 | import glob 3 | import json 4 | import logging 5 | import os 6 | import re 7 | import sys 8 | import time 9 | from datetime import datetime 10 | from multiprocessing.pool import ThreadPool 11 | 12 | import consul 13 | import redis 14 | import requests 15 | import yaml 16 | 17 | from libs import get_str_md5, load_base_config, now_date_str 18 | 19 | logging.basicConfig( 20 | # TODO console 日志,上线时删掉 21 | # filename=LOG_PATH, 22 | format='%(asctime)s %(levelname)s %(filename)s %(funcName)s [line:%(lineno)d]:%(message)s', 23 | datefmt="%Y-%m-%d %H:%M:%S", 24 | level="INFO" 25 | ) 26 | G_VAR_YAML = "config.yaml" 27 | 28 | 29 | class Consul(object): 30 | def __init__(self, host, port): 31 | '''初始化,连接consul服务器''' 32 | self._consul = consul.Consul(host, port) 33 | 34 | def RegisterService(self, name, host, port, tags=None): 35 | tags = tags or [] 36 | # 注册服务 37 | self._consul.agent.service.register( 38 | name, 39 | name, 40 | host, 41 | port, 42 | tags, 43 | # 健康检查ip端口,检查时间:5,超时时间:30,注销时间:30s 44 | check=consul.Check().tcp(host, port, "5s", "30s", "30s")) 45 | 46 | def GetService(self, name): 47 | services = self._consul.agent.services() 48 | service = services.get(name) 49 | if not service: 50 | return None, None 51 | addr = "{0}:{1}".format(service['Address'], service['Port']) 52 | return service, addr 53 | 54 | def delete_key(self, key='prometheus/records'): 55 | res = self._consul.kv.delete(key, recurse=True) 56 | return res 57 | 58 | def get_list(self, key='prometheus/records'): 59 | res = self._consul.kv.get(key, recurse=True) 60 | 61 | data = res[1] 62 | if not data: 63 | return {} 64 | pre_record_d = {} 65 | 66 | for i in data: 67 | v = json.loads(i.get('Value').decode("utf-8")) 68 | pre_record_d[v.get('record')] = v.get('expr') 69 | return pre_record_d 70 | 71 | def set_data(self, key, value): 72 | ''' 73 | self._consul.kv.put('prometheus/records/1', 74 | 75 | json.dumps( 76 | { 77 | 78 | "record": "nyy_record_test_a", 79 | "expr": 'sum(kafka_log_log_size{project=~"metis - main1 - sg2"}) by (topic)' 80 | } 81 | ) 82 | ) 83 | ''' 84 | self._consul.kv.put(key, value) 85 | 86 | def get_b64encode(self, message): 87 | message_bytes = message.encode('ascii') 88 | base64_bytes = base64.b64encode(message_bytes) 89 | return base64_bytes.decode("utf8") 90 | 91 | def txn_mset(self, record_expr_list): 92 | lens = len(record_expr_list) 93 | logging.info("top_lens:{}".format(lens)) 94 | max_txn_once = 
64 95 | yu_d = lens // max_txn_once 96 | yu = lens / max_txn_once 97 | 98 | if lens <= max_txn_once: 99 | pass 100 | else: 101 | max = yu_d 102 | 103 | if yu > yu_d: 104 | max += 1 105 | 106 | for i in range(0, max): 107 | sli = record_expr_list[i * max_txn_once:(i + 1) * max_txn_once] 108 | self.txn_mset(sli) 109 | return True 110 | ''' 111 | { 112 | "KV": { 113 | "Verb": "", 114 | "Key": "", 115 | "Value": "", 116 | "Flags": 0, 117 | "Index": 0, 118 | "Session": "" 119 | } 120 | } 121 | 122 | :return: 123 | ''' 124 | 125 | txn_data = [] 126 | logging.info("middle_lens:{}".format(len(record_expr_list))) 127 | for index, data in record_expr_list: 128 | txn_data.append( 129 | { 130 | "KV": { 131 | "Key": "{}/{}".format(CONSUL_RECORD_KEY_PREFIX, index), 132 | "Verb": "set", 133 | "Value": self.get_b64encode(json.dumps( 134 | data 135 | )), 136 | 137 | } 138 | } 139 | ) 140 | # TODO local test 141 | # print(txn_data) 142 | # return True 143 | res = self._consul.txn.put(txn_data) 144 | if not res: 145 | logging.error("txn_mset_error") 146 | return False 147 | if res.get("Errors"): 148 | logging.error("txn_mset_error:{}".format(str(res.get("Errors")))) 149 | return False 150 | return True 151 | 152 | 153 | def batch_delete_redis_key(conn, prefix): 154 | CHUNK_SIZE = 5000 155 | """ 156 | Clears a namespace 157 | :param ns: str, namespace i.e your:prefix 158 | :return: int, cleared keys 159 | """ 160 | cursor = '0' 161 | ns_keys = prefix + '*' 162 | while cursor != 0: 163 | cursor, keys = conn.scan(cursor=cursor, match=ns_keys, count=CHUNK_SIZE) 164 | if keys: 165 | conn.delete(*keys) 166 | 167 | return cursor 168 | 169 | 170 | def redis_conn(): 171 | redis_host = ONLINE_REDIS_HOST 172 | redis_port = ONLINE_REDIS_PORT 173 | conn = redis.Redis(host=redis_host, port=redis_port) 174 | return conn 175 | 176 | 177 | def parse_log_file(log_f): 178 | ''' 179 | { 180 | "httpRequest":{ 181 | "clientIP":"1.1.1.1", 182 | "method":"GET", 183 | "path":"/api/v1/query_range" 184 | }, 185 | "params":{ 186 | "end":"2020-04-09T06:20:00.000Z", 187 | "query":"api_request_counter{job="kubernetes-pods",kubernetes_namespace="sprs",app="model-server"}/60", 188 | "start":"2020-04-02T06:20:00.000Z", 189 | "step":1200 190 | }, 191 | "stats":{ 192 | "timings":{ 193 | "evalTotalTime":0.467329174, 194 | "resultSortTime":0.000476303, 195 | "queryPreparationTime":0.373947928, 196 | "innerEvalTime":0.092889708, 197 | "execQueueTime":0.000008911, 198 | "execTotalTime":0.467345411 199 | } 200 | }, 201 | "ts":"2020-04-09T06:20:28.353Z" 202 | } 203 | :param log_f: 204 | :return: 205 | ''' 206 | heavy_expr_set = set() 207 | heavy_expr_dict = dict() 208 | record_expr_dict = dict() 209 | 210 | with open(log_f) as f: 211 | for x in f.readlines(): 212 | x = json.loads(x.strip()) 213 | if not isinstance(x, dict): 214 | continue 215 | httpRequest = x.get("httpRequest") 216 | if not httpRequest: 217 | continue 218 | path = httpRequest.get("path") 219 | # 只处理path为query_range的 220 | if path != "/api/v1/query_range": 221 | continue 222 | params = x.get("params") 223 | if not params: 224 | continue 225 | start_time = params.get("start") 226 | end_time = params.get("end") 227 | stats = x.get("stats") 228 | if not stats: 229 | continue 230 | timings = stats.get("timings") 231 | if not timings: 232 | continue 233 | evalTotalTime = timings.get("evalTotalTime") 234 | execTotalTime = timings.get("execTotalTime") 235 | queryPreparationTime = timings.get("queryPreparationTime") 236 | execQueueTime = timings.get("execQueueTime") 237 | innerEvalTime = 
timings.get("innerEvalTime") 238 | 239 | # 如果查询事件段大于6小时则不认为是heavy-query 240 | if not start_time or not end_time: 241 | continue 242 | start_time = datetime.strptime(start_time, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp() 243 | end_time = datetime.strptime(end_time, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp() 244 | if end_time - start_time > 3600 * 6: 245 | continue 246 | 247 | # 如果两个时间都小于阈值则不为heavy-query 248 | c = (queryPreparationTime < HEAVY_QUERY_THREHOLD) and (innerEvalTime < HEAVY_QUERY_THREHOLD) 249 | if c: 250 | continue 251 | 252 | if queryPreparationTime > 40: 253 | continue 254 | if execQueueTime > 40: 255 | continue 256 | if innerEvalTime > 40: 257 | continue 258 | if evalTotalTime > 40: 259 | continue 260 | if execTotalTime > 40: 261 | continue 262 | query = params.get("query").strip() 263 | if not query: 264 | continue 265 | is_bl = False 266 | for bl in HEAVY_BLACKLIST_METRICS: 267 | if not isinstance(bl, str): 268 | continue 269 | if bl in query: 270 | is_bl = True 271 | break 272 | if is_bl: 273 | continue 274 | # avoid multi heavy query 275 | if REDIS_ONE_KEY_PREFIX in query: 276 | continue 277 | # \r\n should not in query ,replace it 278 | if "\r\n" in query: 279 | query = query.replace("\r\n", "", -1) 280 | # \n should not in query ,replace it 281 | if "\n" in query: 282 | query = query.replace("\n", "", -1) 283 | 284 | # - startwith for grafana network out 285 | 286 | if query.startswith("-"): 287 | query = query.replace("-", "", 1) 288 | md5_str = get_str_md5(query.encode("utf-8")) 289 | 290 | record_name = "{}:{}".format(REDIS_ONE_KEY_PREFIX, md5_str) 291 | record_expr_dict[record_name] = query 292 | heavy_expr_set.add(query) 293 | last_time = heavy_expr_dict.get(query) 294 | this_time = evalTotalTime 295 | if last_time and last_time > this_time: 296 | this_time = last_time 297 | 298 | heavy_expr_dict[query] = this_time 299 | logging.info("log_file:{} get :{} heavy expr".format(log_f, len(record_expr_dict))) 300 | return record_expr_dict 301 | 302 | 303 | # 解析一个日志文件 304 | def run_log_parse_local_test(log_path): 305 | res = parse_log_file(log_path) 306 | print(res) 307 | 308 | 309 | def mset_record_to_redis(res_dic): 310 | if not res_dic: 311 | logging.fatal("record_expr_list empty") 312 | rc = redis_conn() 313 | if not rc: 314 | logging.fatal("failed to connect to redis-server") 315 | mset_res = rc.mset(res_dic) 316 | logging.info("mset_res:{} len:{}".format(str(mset_res), format(len(res_dic)))) 317 | sadd_res = rc.sadd(REDIS_SET_KEY, *res_dic.keys()) 318 | logging.info("sadd_res:{}".format(str(sadd_res))) 319 | smems = rc.smembers(REDIS_SET_KEY) 320 | logging.info("smember_res_len:{}".format(len(smems))) 321 | 322 | 323 | def write_record_yaml_file(record_expr_list): 324 | ''' 325 | data = { 326 | "groups": [ 327 | { 328 | "name": "example", 329 | "rules": [ 330 | { 331 | "record": "nyy_record_test_a", 332 | "expr": "sum(kafka_log_log_size{project=~"metis-main1-sg2"}) by (topic)" 333 | }, 334 | ], 335 | }, 336 | ] 337 | 338 | } 339 | ''' 340 | data = { 341 | "groups": [ 342 | { 343 | "name": "heavy_expr_record", 344 | "rules": record_expr_list, 345 | }, 346 | ] 347 | 348 | } 349 | file_name = "{}/record_{}_{}.yml".format(PROME_RECORD_FILE, len(record_expr_list), now_date_str()) 350 | with open(file_name, 'w') as f: 351 | yaml.dump(data, f, default_flow_style=False, sort_keys=False) 352 | if not os.path.isfile("./promtool"): 353 | logging.error("promtool not exist skip rule check ") 354 | return [] 355 | 356 | cmd = "./promtool check rules {}".format(file_name) 357 | r = 
os.popen(cmd) 358 | out = r.read() 359 | r.close() 360 | 361 | record_name_re = re.compile('.*?\"(%s:.*?)\".*?' % REDIS_ONE_KEY_PREFIX) 362 | invalid_keys = [] 363 | for line in out.strip().split("\n"): 364 | 365 | record_name = re.findall(record_name_re, line) 366 | logging.info("[record_name:{}]".format(record_name)) 367 | if len(record_name) == 1: 368 | invalid_keys.append(record_name[0]) 369 | return invalid_keys 370 | 371 | 372 | def recovery_concurrent_log_parse(res_dic): 373 | if not res_dic: 374 | logging.fatal("get empty result exit ....") 375 | # print(res_dic) 376 | # return 377 | 378 | consul_client = Consul(CONSUL_HOST, CONSUL_PORT) 379 | if not consul_client: 380 | logging.fatal("connect_to_consul_error") 381 | 382 | # purge consul 383 | purge_consul_res = consul_client.delete_key(key=CONSUL_RECORD_KEY_PREFIX) 384 | logging.info("[purge consul] res:{}".format(str(purge_consul_res))) 385 | # purge redis 386 | rc = redis_conn() 387 | if not rc: 388 | logging.fatal("failed to connect to redis-server") 389 | 390 | rc_delete_res = batch_delete_redis_key(rc, "hke:heavy_expr*") 391 | logging.info("[purge redis heavy_key] res:{}".format(str(rc_delete_res))) 392 | rc_delete_res = rc.delete("hke:heavy_query_set") 393 | logging.info("[purge redis heavy_query_set] res:{}".format(str(rc_delete_res))) 394 | 395 | record_expr_list = [] 396 | for k in sorted(res_dic.keys()): 397 | record_expr_list.append({"record": k, "expr": res_dic.get(k)}) 398 | logging.info("get_all_record_heavy_query:{} ".format(len(record_expr_list))) 399 | 400 | # write to local prome record yml 401 | write_record_yaml_file(record_expr_list) 402 | 403 | # write to consul 404 | 405 | new_record_expr_list = [] 406 | for index, data in enumerate(record_expr_list): 407 | new_record_expr_list.append((index, data)) 408 | 409 | consul_w_res = consul_client.txn_mset(new_record_expr_list) 410 | if not consul_w_res: 411 | logging.fatal("write_to_consul_error") 412 | 413 | # write to redis 414 | mset_record_to_redis(res_dic) 415 | 416 | 417 | def query_range_judge_heavy(host, expr, record): 418 | ''' 419 | 420 | :param host: 421 | :param expr: 422 | 调用举例: 获取group=ugc的project 423 | 424 | query_range(inf, 425 | 'avg(100 - (avg by (instance,name) (rate(node_cpu_seconds_total{region=~"ap-southeast-3",account=~"HW-SHAREit",group=~"UGC",project=~"cassandra-client", mode="idle"}[5m])) * 100))') 426 | 427 | 428 | :return: 429 | { 430 | "status":"success", 431 | "data":{ 432 | "resultType":"matrix", 433 | "result":[ 434 | { 435 | "metric":{ 436 | 437 | }, 438 | "values":[ 439 | [ 440 | 1588149960, 441 | "0.1999999996688473" 442 | ], 443 | [ 444 | 1588150020, 445 | "0.20000000035872745" 446 | ], 447 | [ 448 | 1588150080, 449 | "0.19629629604793308" 450 | ], 451 | [ 452 | 1588150140, 453 | "0.19629629673781324" 454 | ], 455 | [ 456 | 1588150200, 457 | "0.1999999996688473" 458 | ], 459 | [ 460 | 1588150260, 461 | "0.2074074076005843" 462 | ] 463 | ] 464 | } 465 | ] 466 | } 467 | } 468 | ''' 469 | # logging.info("host:{} expr:{}".format(host, expr)) 470 | uri = '{}/api/v1/query_range'.format(host) 471 | 472 | end = int(time.time()) 473 | q_start = time.time() 474 | start = end - 60 * 60 475 | 476 | G_PARMS = { 477 | "query": expr, 478 | "start": start, 479 | "end": end, 480 | "step": 30 481 | } 482 | res = requests.get(uri, G_PARMS) 483 | data = res.json() 484 | status = data.get("status") 485 | if status != "success": 486 | return (expr, record, False) 487 | 488 | res = data.get("data").get("result") 489 | if not res: 490 | return (expr, 
record, False) 491 | if len(res) == 0: 492 | return (expr, record, False) 493 | took = time.time() - q_start 494 | logging.info("key:{} expr:{} time_took:{}".format( 495 | record, 496 | expr, 497 | took 498 | )) 499 | if took > 3.0: 500 | return (expr, record, True) 501 | return (expr, record, False) 502 | 503 | 504 | def concurrent_log_parse(log_dir): 505 | # 步骤1 解析日志 506 | t_num = 500 507 | pool = ThreadPool(t_num) 508 | 509 | log_file_s = glob.glob("{}/*.log".format(log_dir)) 510 | 511 | results = pool.map(parse_log_file, log_file_s) 512 | 513 | pool.close() 514 | pool.join() 515 | res_dic = {} 516 | for x in results: 517 | res_dic.update(x) 518 | logging.info("[before_heavy_query_check_num:{}]".format(len(res_dic))) 519 | # 1 end 520 | 521 | # 步骤2 拿解析结果去查询一下,做double-check,可以禁止 522 | # pool = ThreadPool(t_num) 523 | # 524 | # parms = [] 525 | # for k, v in res_dic.items(): 526 | # expr = res_dic 527 | # record = k 528 | # parms.append([CHECK_HEAVY_QUERY_API, expr, record]) 529 | # results = pool.starmap(query_range_judge_heavy, parms) 530 | # 531 | # pool.close() 532 | # pool.join() 533 | # 534 | # res_dic = {} 535 | # for x in results: 536 | # expr, record, real_heavy = x[0], x[1], x[2] 537 | # if real_heavy: 538 | # res_dic[record] = expr 539 | # logging.info("[after_heavy_query_check_num:{}]".format(len(res_dic))) 540 | # 2 end 541 | 542 | if not res_dic: 543 | logging.fatal("get empty result exit ....") 544 | 545 | # 步骤3 增量更新consul数据 546 | consul_client = Consul(CONSUL_HOST, CONSUL_PORT) 547 | if not consul_client: 548 | logging.fatal("connect_to_consul_error") 549 | 550 | ## get pre data from consul 551 | pre_dic = consul_client.get_list(key=CONSUL_RECORD_KEY_PREFIX) 552 | old_len = len(pre_dic) + 1 553 | # res_dic.update(pre_dic) 554 | ## 增量更新 555 | old_key_set = set(pre_dic.keys()) 556 | this_key_set = set(res_dic.keys()) 557 | ## 更新的keys 558 | new_dic = {} 559 | today_all_dic = {} 560 | new_key_set = this_key_set - old_key_set 561 | logging.info("new_key_set:{} ".format(len(new_key_set))) 562 | for k in new_key_set: 563 | new_dic[k] = res_dic[k] 564 | 565 | # 构造record 记录 566 | record_list_new = [] 567 | for k in sorted(new_dic.keys()): 568 | one_expr_list = {"record": k, "expr": new_dic.get(k)} 569 | 570 | record_list_new.append(one_expr_list) 571 | 572 | # 写入本地record文件检查rules 573 | invalid_keys = write_record_yaml_file(record_list_new) 574 | logging.info("invalid_keys: num {} details:{}".format(len(invalid_keys), str(invalid_keys))) 575 | for del_key in invalid_keys: 576 | new_dic.pop(del_key) 577 | f_record_list_new = [] 578 | for k in sorted(new_dic.keys()): 579 | one_expr_list = {"record": k, "expr": new_dic.get(k)} 580 | 581 | f_record_list_new.append(one_expr_list) 582 | 583 | today_all_dic.update(pre_dic) 584 | today_all_dic.update(new_dic) 585 | local_record_expr_list = [] 586 | 587 | for k in sorted(today_all_dic.keys()): 588 | local_record_expr_list.append({"record": k, "expr": today_all_dic.get(k)}) 589 | logging.info("get_all_record_heavy_query:{} ".format(len(local_record_expr_list))) 590 | 591 | # 写入本地yaml 592 | write_record_yaml_file(local_record_expr_list) 593 | 594 | # 写入consul中 595 | new_record_expr_list = [] 596 | # 给record记录添加索引,为confd分片做准备 597 | for index, data in enumerate(f_record_list_new): 598 | new_record_expr_list.append((index + old_len, data)) 599 | if new_record_expr_list: 600 | consul_w_res = consul_client.txn_mset(new_record_expr_list) 601 | if not consul_w_res: 602 | logging.fatal("write_to_consul_error") 603 | else: 604 | 
logging.info("zero_new_heavy_record:{}") 605 | 606 | # 写入redis中 607 | mset_record_to_redis(today_all_dic) 608 | 609 | 610 | def run(): 611 | ''' 612 | 1.all prome query_log need to be scpped here 613 | 2.parse log 614 | 3.txn_mput to consul 615 | 4.merge result and meset to redis 616 | 5.generate record yaml file 617 | 618 | :return: 619 | ''' 620 | concurrent_log_parse(PROME_QUERY_LOG_DIR) 621 | 622 | 623 | yaml_path = G_VAR_YAML 624 | 625 | config = load_base_config(yaml_path) 626 | # path 627 | HEAVY_QUERY_THREHOLD = config.get("prome_query_log").get("heavy_query_threhold") 628 | PROME_QUERY_LOG_DIR = config.get("prome_query_log").get("local_work_dir") 629 | PROME_RECORD_FILE = config.get("prome_query_log").get("local_record_yml_dir") 630 | CHECK_HEAVY_QUERY_API = config.get("prome_query_log").get("check_heavy_query_api") 631 | # redis 632 | ONLINE_REDIS_HOST = config.get("redis").get("host") 633 | ONLINE_REDIS_PORT = int(config.get("redis").get("port")) 634 | REDIS_SET_KEY = config.get("redis").get("redis_set_key") 635 | REDIS_ONE_KEY_PREFIX = config.get("redis").get("redis_one_key_prefix") 636 | # consul 637 | CONSUL_RECORD_KEY_PREFIX = config.get("consul").get("consul_record_key_prefix") 638 | CONSUL_HOST = config.get("consul").get("host") 639 | CONSUL_PORT = config.get("consul").get("port") 640 | # heavy 641 | 642 | HEAVY_BLACKLIST_METRICS = config.get("heavy_blacklist_metrics") 643 | 644 | # print(HEAVY_BLACKLIST_METRICS) 645 | 646 | if __name__ == '__main__': 647 | if len(sys.argv) == 3 and sys.argv[1] == "run_log_parse_local_test": 648 | run_log_parse_local_test(sys.argv[2]) 649 | sys.exit(0) 650 | 651 | try: 652 | run() 653 | except Exception as e: 654 | logging.error("got_error:{}".format(e)) 655 | -------------------------------------------------------------------------------- /prome_heavy_expr_parse.yaml: -------------------------------------------------------------------------------- 1 | - name: fetch log and push expr to cache 2 | hosts: all 3 | user: root 4 | gather_facts: false 5 | vars_files: 6 | - config.yaml 7 | 8 | tasks: 9 | 10 | - name: fetch query log 11 | fetch: src={{ prome_query_log.prome_log_path }} dest={{ prome_query_log.local_work_dir }}/{{ inventory_hostname }}_query.log flat=yes validate_checksum=no 12 | register: result 13 | 14 | - name: Show debug info 15 | debug: var=result verbosity=0 16 | 17 | 18 | - name: localhost 19 | hosts: localhost 20 | user: root 21 | gather_facts: false 22 | vars_files: 23 | - config.yaml 24 | tasks: 25 | 26 | - name: merge result 27 | shell: /usr/bin/python3 {{ prome_query_log.py_name }} 28 | connection: local 29 | run_once: true 30 | 31 | register: result 32 | - name: Show debug info 33 | debug: var=result verbosity=0 34 | # useage : ansible-playbook -i all_prome_query prome_heavy_expr_parse.yaml 35 | 36 | -------------------------------------------------------------------------------- /prome_redirect.lua: -------------------------------------------------------------------------------- 1 | function get_str_md5(input_s) 2 | local resty_md5 = require "resty.md5" 3 | local md5 = resty_md5:new() 4 | if not md5 then 5 | ngx.log(ngx.ERR, "failed to create md5 object") 6 | return 7 | end 8 | 9 | local ok = md5:update(input_s) 10 | if not ok then 11 | ngx.log(ngx.ERR, "failed to add data") 12 | return 13 | end 14 | local digest = md5:final() 15 | 16 | local str = require "resty.string" 17 | local md5_str = str.to_hex(digest) 18 | return md5_str 19 | end 20 | 21 | function redis_get(key) 22 | -- start of redis 23 | 24 | local redis = 
require "resty.redis" 25 | local red = redis:new() 26 | --red:set_timeouts(1000, 1000, 1000) 27 | local ok, conn_err = red:connect("localhost", 6379) 28 | if not ok then 29 | ngx.log(ngx.ERR, "[redis]failed to connect redis server:", conn_err) 30 | return false 31 | end 32 | 33 | local res, get_err = red:get(key) 34 | if get_err then 35 | ngx.log(ngx.ERR, "[redis]failed to get value by key: ", key, "err:", get_err) 36 | return false 37 | end 38 | 39 | red:set_keepalive(30000, 1000) 40 | if res ~= ngx.null then 41 | ngx.log(ngx.INFO, "[redis]success get value by key: ", key, "value: ", res) 42 | return true 43 | else 44 | return false 45 | end 46 | 47 | -- end of redis 48 | end 49 | 50 | function replace_work() 51 | --Nginx服务器中使用lua获取get或post参数 52 | 53 | local request_method = ngx.var.request_method; 54 | local args = {} 55 | --获取参数的值 56 | 57 | if "GET" == request_method then 58 | args = ngx.req.get_uri_args(); 59 | elseif "POST" == request_method then 60 | ngx.req.read_body(); 61 | args = ngx.req.get_post_args(); 62 | end 63 | 64 | local q_query = args["query"]; 65 | local q_start = args["start"]; 66 | local q_end = args["end"]; 67 | local q_step = args["step"]; 68 | 69 | local md5_str = get_str_md5(q_query) 70 | 71 | if md5_str == null then 72 | return 73 | end 74 | local redis_query_key = "hke:heavy_expr:" .. md5_str 75 | --ngx.log(ngx.ERR, "redis_query_key: ",redis_query_key) 76 | local redis_get_res = redis_get(redis_query_key) 77 | if redis_get_res == true then 78 | q_query = redis_query_key 79 | end 80 | 81 | local new_args = {} 82 | new_args["query"] = q_query 83 | new_args["start"] = q_start 84 | new_args["end"] = q_end 85 | new_args["step"] = q_step 86 | 87 | ngx.req.set_uri_args(new_args) 88 | --ngx.req.set_uri_args("end=" .. q_end) 89 | --local arg = ngx.req.get_uri_args() 90 | --for k, v in pairs(arg) do 91 | -- ngx.say("[GET ] key:", k, " v:", v) 92 | --end 93 | 94 | end 95 | 96 | return replace_work(); -------------------------------------------------------------------------------- /re_work.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | prefix = "hke:heavy_expr" 4 | 5 | record_name_re = re.compile('.*?\"(%s:.*?)\".*?' 
% prefix) 6 | s = ' local_record_yml_dir/record_28_2021-09-13_14-41-16.yml: 80:11: group "heavy_expr_record", rule 27, "hke:heavy_expr:ed28d1000288d2c806827acfc2cfb48b": could not parse expression: 1:90: parse error: unexpected identifier "ormax_over_time"' 7 | record_name = re.findall(record_name_re, s) 8 | print(record_name) 9 | -------------------------------------------------------------------------------- /recovery_by_local_yaml.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | import requests 4 | import yaml 5 | import logging 6 | from multiprocessing.pool import ThreadPool 7 | from itertools import repeat 8 | from parse_prome_query_log import recovery_concurrent_log_parse 9 | 10 | logging.basicConfig( 11 | # TODO console 日志,上线时删掉 12 | # filename=LOG_PATH, 13 | format='%(asctime)s %(levelname)s %(filename)s %(funcName)s [line:%(lineno)d]:%(message)s', 14 | datefmt="%Y-%m-%d %H:%M:%S", 15 | level="INFO" 16 | ) 17 | 18 | 19 | def load_yaml(yal_path=r'C:\Users\Administrator\Desktop\record_2202_2020-04-29_23-30-24.yml'): 20 | f = open(yal_path) 21 | y = yaml.load(f) 22 | 23 | all_heavy = y.get("groups")[0].get("rules") 24 | msg = "get {} heavy_record".format(len(all_heavy)) 25 | logging.info(msg) 26 | return all_heavy 27 | 28 | 29 | def query_range(host, expr, key): 30 | ''' 31 | 32 | :param host: 33 | :param expr: 34 | 调用举例: 获取group=ugc的project 35 | 36 | query_range(inf, 37 | 'avg(100 - (avg by (instance,name) (rate(node_cpu_seconds_total{region=~"ap-southeast-3",account=~"HW-SHAREit",group=~"UGC",project=~"cassandra-client",name=~"UGC-cassandra-client-prod", mode="idle"}[5m])) * 100))') 38 | 39 | 40 | :return: 41 | { 42 | "status":"success", 43 | "data":{ 44 | "resultType":"matrix", 45 | "result":[ 46 | { 47 | "metric":{ 48 | 49 | }, 50 | "values":[ 51 | [ 52 | 1588149960, 53 | "0.1999999996688473" 54 | ], 55 | [ 56 | 1588150020, 57 | "0.20000000035872745" 58 | ], 59 | [ 60 | 1588150080, 61 | "0.19629629604793308" 62 | ], 63 | [ 64 | 1588150140, 65 | "0.19629629673781324" 66 | ], 67 | [ 68 | 1588150200, 69 | "0.1999999996688473" 70 | ], 71 | [ 72 | 1588150260, 73 | "0.2074074076005843" 74 | ] 75 | ] 76 | } 77 | ] 78 | } 79 | } 80 | ''' 81 | # logging.info("host:{} expr:{}".format(host, expr)) 82 | uri = '{}/api/v1/query_range'.format(host) 83 | 84 | end = int(time.time()) 85 | start = end - 60 * 60 86 | 87 | G_PARMS = { 88 | "query": expr, 89 | "start": start, 90 | "end": end, 91 | "step": 30 92 | } 93 | res = requests.get(uri, G_PARMS) 94 | data = res.json() 95 | now = int(time.time()) 96 | took = now - end 97 | if took > 4: 98 | return (key, True) 99 | return (key, False) 100 | 101 | 102 | def concurrency_query(): 103 | t_num = 20 104 | pool = ThreadPool(t_num) 105 | yaml_data = load_yaml() 106 | all_expr = [x.get("expr") for x in yaml_data][:100] 107 | all_key = [x.get("record") for x in yaml_data][:100] 108 | host = "http://localhost:9999" 109 | 110 | pars = zip(repeat(host), all_expr, all_key) 111 | results = pool.starmap(query_range, pars) 112 | 113 | pool.close() 114 | pool.join() 115 | for x in results: 116 | print(x) 117 | 118 | 119 | def recovery(yaml_path): 120 | yaml_data = load_yaml(yal_path=yaml_path) 121 | res_dic = {} 122 | 123 | for x in yaml_data: 124 | res_dic[x.get("record")] = x.get("expr") 125 | 126 | recovery_concurrent_log_parse(res_dic) 127 | # purge consul 128 | 129 | # purge redis 130 | 131 | 132 | if __name__ == '__main__': 133 | import sys 134 | 135 | yaml_path = sys.argv[1] 136 | 
recovery(yaml_path) 137 | -------------------------------------------------------------------------------- /recovery_heavy_metrics.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | 4 | last_file=`ls /App/tgzs/conf_dir/prome_heavy_expr_parse/local_record_yml/*yml -rt |tail -1` 5 | metrics=$1 6 | grep $1 ${last_file} ${last_file} -B 1 |grep "record: hke:heavy_expr" |sort |awk -F ":" '{print $NF}' |sort |uniq > to_del_record_key_file 7 | wc -l to_del_record_key_file 8 | 9 | python3 consul_delete.py 10 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | python-consul 2 | redis 3 | PyYaml 4 | ansible 5 | -------------------------------------------------------------------------------- /to_del_record_key_file: -------------------------------------------------------------------------------- 1 | 9133202933f1394e368971d59f3c9d67 2 | deb1781d21403b68571452323cc1d142 3 | 4ff1fc99d6ed14c62c30df3dbe192da7 4 | 627df33956d3fda20ddb2ef80fa7c46e 5 | -------------------------------------------------------------------------------- /部署.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | ## 01 在prometheus record机器上 安装confd 6 | 7 | - 下载 带分片功能的confd二进制 8 | ```shell script 9 | wget https://github.com/ning1875/confd/releases/download/v0.16.0/confd_shard-0.16.0-linux-amd64.tar.gz 10 | ``` 11 | 12 | 13 | - 创建目录 14 | 15 | ```shell script 16 | mkdir -p /etc/confd/{conf.d,templates} 17 | ``` 18 | 19 | - 主配置文件/etc/confd/conf.d/records.yml.toml ,注意dest要和你的prometheus目录一致 20 | 21 | ```shell script 22 | cat <<-"EOF" > /etc/confd/conf.d/records.yml.toml 23 | [template] 24 | prefix = "/prometheus" 25 | src = "records.yml.tmpl" 26 | dest = "/opt/app/prometheus/confd_record.yml" 27 | #shards=3 28 | #num=0 29 | keys = [ 30 | "/records" 31 | ] 32 | reload_cmd = "curl -X POST http://localhost:9090/-/reload" 33 | 34 | 35 | EOF 36 | ``` 37 | 38 | - shards代表分片总数,num代表第几个分片 39 | - record模板文件 /etc/confd/templates/records.yml.tmpl 40 | > 每个record单独的group分组,好处是互相不影响,缺点是group过多 41 | ```shell script 42 | cat <<-"EOF" > /etc/confd/templates/records.yml.tmpl 43 | groups: 44 | {{range gets "/records/*"}}{{$item := json .Value}} 45 | - name: {{$item.record}} 46 | rules: 47 | - record: {{$item.record}} 48 | expr: {{$item.expr}} 49 | {{end}} 50 | EOF 51 | ``` 52 | 53 | > 使用相同分组,需要按顺序执行record 54 | ```shell script 55 | cat <<-"EOF" > /etc/confd/templates/records.yml.tmpl 56 | groups: 57 | - name: confd_record 58 | interval: 30s 59 | rules:{{range gets "/records/*"}}{{$item := json .Value}} 60 | - record: {{$item.record}} 61 | expr: {{$item.expr}}{{end}} 62 | EOF 63 | 64 | ``` 65 | 66 | 67 | ### 指定consul backend 启动confd 68 | 69 | - onetime代表运行一次 70 | 71 | ```shell script 72 | confd -onetime --backend consul --node localhost:8500 --log-level debug 73 | ``` 74 | 75 | ```shell script 76 | cat < /etc/systemd/system/confd.service 77 | [Unit] 78 | Description=confd server 79 | Wants=network-online.target 80 | After=network-online.target 81 | 82 | [Service] 83 | ExecStart=/usr/bin/confd --backend consul --node 172.20.70.205:8500 --log-level debug -interval=30 84 | StandardOutput=syslog 85 | StandardError=syslog 86 | SyslogIdentifier=confd 87 | [Install] 88 | WantedBy=default.target 89 | EOF 90 | 91 | # 启动服务 92 | systemctl daemon-reload && systemctl start confd 93 | 94 | systemctl status confd 95 | 96 | ``` 
97 | 98 | ## 02 中控机上部署consul redis ansible 99 | 100 | 101 | 102 | ### consul 安装 103 | 104 | #### 准备工作 105 | 106 | ```shell 107 | 108 | # 下载consul 109 | wget -O /opt/tgzs/consul_1.9.4_linux_amd64.zip https://releases.hashicorp.com/consul/1.9.4/consul_1.9.4_linux_amd64.zip 110 | 111 | cd /opt/tgzs/ 112 | unzip consul_1.9.4_linux_amd64.zip 113 | 114 | /bin/cp -f consul /usr/bin/ 115 | 116 | 117 | ``` 118 | 119 | #### 启动单机版consul 120 | 121 | ```shell 122 | 123 | # 124 | mkdir /opt/app/consul 125 | 126 | # 准备配置文件 127 | cat < /opt/app/consul/single_server.json 128 | { 129 | "datacenter": "dc1", 130 | "node_name": "consul-svr-01", 131 | "server": true, 132 | "bootstrap_expect": 1, 133 | "data_dir": "/opt/app/consul/", 134 | "log_level": "INFO", 135 | "log_file": "/opt/logs/", 136 | "ui": true, 137 | "bind_addr": "0.0.0.0", 138 | "client_addr": "0.0.0.0", 139 | "retry_interval": "10s", 140 | "raft_protocol": 3, 141 | "enable_debug": false, 142 | "rejoin_after_leave": true, 143 | "enable_syslog": false 144 | } 145 | EOF 146 | 147 | # 多个ip地址时,将bind_addr 改为一个内网的ip 148 | 149 | # 写入service文件 150 | cat < /etc/systemd/system/consul.service 151 | [Unit] 152 | Description=consul server 153 | Wants=network-online.target 154 | After=network-online.target 155 | 156 | [Service] 157 | ExecStart=/usr/bin/consul agent -config-file=/opt/app/consul/single_server.json 158 | StandardOutput=syslog 159 | StandardError=syslog 160 | SyslogIdentifier=consul 161 | [Install] 162 | WantedBy=default.target 163 | EOF 164 | 165 | # 启动服务 166 | systemctl daemon-reload && systemctl start consul 167 | 168 | systemctl status consul 169 | 170 | 171 | ``` 172 | 173 | #### 验证访问 174 | 175 | - http://localhost:8500/ 176 | 177 | ## 03 将pre_query 放到中控机上 178 | - all_prome_query 中的prometheus query ip改为自己的 179 | - prometheus query 开启query log 180 | ```yaml 181 | global: 182 | query_log_file: /App/logs/prometheus_query.log 183 | ``` 184 | - config.yaml 填写相关配置项 185 | 186 | ## 04 执行pre_query中的分析record命令 187 | > 将promtool 复制到当前目录用作 record promql的check 188 | - /bin/cp -f /opt/app/prometheus/promtool 189 | 190 | 191 | > pre_query目录下执行ansible命令 192 | ```shell script 193 | ansible-playbook -i all_prome_query prome_heavy_expr_parse.yaml 194 | ``` 195 | 196 | > 检查本地record yaml 197 | ```shell script 198 | [root@k8s-master01 pre_query]# ll local_record_yml_dir/ 199 | total 12 200 | -rw-r--r-- 1 root root 551 Sep 13 15:53 record_2_2021-09-13.yml 201 | -rw-r--r-- 1 root root 5455 Sep 13 15:53 record_26_2021-09-13.yml 202 | [root@k8s-master01 pre_query]# head local_record_yml_dir/record_26_2021-09-13.yml 203 | groups: 204 | - name: heavy_expr_record 205 | rules: 206 | - record: hke:heavy_expr:082a631dfddb7cf65ddd0fb4923ab17e 207 | expr: rate(mysql_global_status_sort_scan{instance=~"172.20.70.205:9104"}[5s]) 208 | or irate(mysql_global_status_sort_scan{instance=~"172.20.70.205:9104"}[5m]) 209 | - record: hke:heavy_expr:1416fc3de389e2a5c36aa5c8c376391f 210 | expr: mysql_global_status_threads_cached{instance=~"172.20.70.205:9104"} 211 | - record: hke:heavy_expr:14e8a540527123cc11ad96c5faa03f43 212 | expr: irate(mysql_slave_status_relay_log_pos{instance=~"172.20.70.205:9104"}[5m]) 213 | ``` 214 | 215 | > 检查consul中的记录 216 | ```shell script 217 | curl http://localhost:8500/v1/kv/prometheus/record?recurse= |python -m json.tool 218 | { 219 | "CreateIndex": 585468, 220 | "Flags": 0, 221 | "Key": "prometheus/records/6", 222 | "LockIndex": 0, 223 | "ModifyIndex": 585468, 224 | "Value": 
"eyJyZWNvcmQiOiAiaGtlOmhlYXZ5X2V4cHI6MjY1YzUwMzMxZjRiNzk4MzRjMzc1MDY2ZTY2NWQ4NDYiLCAiZXhwciI6ICJyYXRlKG15c3FsX2dsb2JhbF9zdGF0dXNfY3JlYXRlZF90bXBfdGFibGVze2luc3RhbmNlPX5cIjE3Mi4yMC43MC4yMDU6OTEwNFwifVs1c10pIG9yIGlyYXRlKG15c3FsX2dsb2JhbF9zdGF0dXNfY3JlYXRlZF90bXBfdGFibGVze2luc3RhbmNlPX5cIjE3Mi4yMC43MC4yMDU6OTEwNFwifVs1bV0pIn0=" 225 | }, 226 | { 227 | "CreateIndex": 585468, 228 | "Flags": 0, 229 | "Key": "prometheus/records/7", 230 | "LockIndex": 0, 231 | "ModifyIndex": 585468, 232 | "Value": "eyJyZWNvcmQiOiAiaGtlOmhlYXZ5X2V4cHI6MjZkODYwNzY4NzcxOTUyOTc3ZGNiZjUzYzU3ZWZhNTUiLCAiZXhwciI6ICJyYXRlKG15c3FsX2dsb2JhbF9zdGF0dXNfcXVlcmllc3tpbnN0YW5jZT1+XCIxNzIuMjAuNzAuMjA1OjkxMDRcIn1bNXNdKSBvciBpcmF0ZShteXNxbF9nbG9iYWxfc3RhdHVzX3F1ZXJpZXN7aW5zdGFuY2U9flwiMTcyLjIwLjcwLjIwNTo5MTA0XCJ9WzVtXSkifQ==" 233 | }, 234 | ``` 235 | 236 | 237 | 238 | > 检测部署了confd的 prometheus record 上的record文件内容 239 | ```shell script 240 | [root@k8s-master01 pre_query]# cat /opt/app/prometheus/confd_record.yml |head 241 | groups: 242 | 243 | - name: hke:heavy_expr:082a631dfddb7cf65ddd0fb4923ab17e 244 | rules: 245 | - record: hke:heavy_expr:082a631dfddb7cf65ddd0fb4923ab17e 246 | expr: rate(mysql_global_status_sort_scan{instance=~"172.20.70.205:9104"}[5s]) or irate(mysql_global_status_sort_scan{instance=~"172.20.70.205:9104"}[5m]) 247 | 248 | - name: hke:heavy_expr:4b93ce0bd3db2848e1b6d330a03272f7 249 | rules: 250 | - record: hke:heavy_expr:4b93ce0bd3db2848e1b6d330a03272f7 251 | ``` 252 | 253 | > prometheus record页面上检查 聚合规则并查询数据 254 | - 截图 255 | 256 | > 检查redis中的key 257 | ```shell script 258 | [root@k8s-master01 pre_query]# redis-cli keys "hke:heavy_expr*" 259 | 1) "hke:heavy_expr:bc7775bb5e33bf84afa9a1d4c0c45a9a" 260 | 2) "hke:heavy_expr:de2548ae6a00a90b1c2f85f8d6d9f13b" 261 | 3) "hke:heavy_expr:d86e3aa799b6a84790e133aa8a306e96" 262 | 4) "hke:heavy_expr:4fe8ee091e7823b66b475ba05b5fd030" 263 | 5) "hke:heavy_expr:b96a96befac765f6c00743a82ffae053" 264 | 6) "hke:heavy_expr:513ddfbf6f83d1ba1dd9b0b4a21a43bf" 265 | 7) "hke:heavy_expr:2998d2677fc1873a0e46802cbdd1bfee" 266 | 8) "hke:heavy_expr:22ccf0a71b6651763d1b7c16f5c05365" 267 | 9) "hke:heavy_expr:0d8c4be4ea8dccb9f06389246a02c6b3" 268 | 10) "hke:heavy_expr:f30b7b481bb0fdee0466902b9abb3b35" 269 | 11) "hke:heavy_expr:298afe40c3479e217b0b0b3666bd6904" 270 | 12) "hke:heavy_expr:bebca671decc9d5954af35628a05baa2" 271 | 13) "hke:heavy_expr:db9f0c1be81f91c95d9eb617ab70da36" 272 | 14) "hke:heavy_expr:45d5dc64bef02cf3f515481747cccd80" 273 | 15) "hke:heavy_expr:d797f93ad8ec0f7c80a5617eb5e4f3d8" 274 | 16) "hke:heavy_expr:eb1637bfe8f1388e99659d4621a79367" 275 | 17) "hke:heavy_expr:26d860768771952977dcbf53c57efa55" 276 | 18) "hke:heavy_expr:25bc18bd90a1a69d950802d937d337a0" 277 | 19) "hke:heavy_expr:d8aaf244a86fcfae8e51aeeb6935a5a5" 278 | 20) "hke:heavy_expr:189831b5aaa2d688c49a9c717fbf8b3d" 279 | ``` 280 | 281 | 282 | ## 05 confd分片功能演示 283 | > 默认不开启分片 ,shards 和num注释掉就可以 284 | - confd配置文件 /etc/confd/conf.d/records.yml.toml 285 | ```yaml 286 | [template] 287 | prefix = "/prometheus" 288 | src = "records.yml.tmpl" 289 | dest = "/opt/app/prometheus/confd_record.yml" 290 | #shards=2 291 | #num=0 292 | keys = [ 293 | "/records" 294 | ] 295 | reload_cmd = "curl -X POST http://localhost:9090/-/reload" 296 | 297 | 298 | ``` 299 | - prometheus record 通过的结果 46个 300 | ```shell script 301 | [root@k8s-master01 conf.d]# confd -onetime --backend consul --node localhost:8500 302 | 2021-09-13T16:45:15+08:00 k8s-master01 confd[30010]: INFO Backend set to consul 303 | 2021-09-13T16:45:15+08:00 k8s-master01 confd[30010]: 
INFO Starting confd 304 | 2021-09-13T16:45:15+08:00 k8s-master01 confd[30010]: INFO Backend source(s) set to localhost:8500 305 | 2021-09-13T16:45:15+08:00 k8s-master01 confd[30010]: INFO t.shards:0,t.nums:0 306 | [root@k8s-master01 conf.d]# /opt/app/prometheus/promtool check rules /opt/app/prometheus/confd_record.yml 307 | Checking /opt/app/prometheus/confd_record.yml 308 | SUCCESS: 46 rules found 309 | 310 | ``` 311 | 312 | > 开启分片 配置 shards=2 num=0 代表 2分片中的第一个 313 | - confd配置文件 /etc/confd/conf.d/records.yml.toml 314 | ```yaml 315 | [template] 316 | prefix = "/prometheus" 317 | src = "records.yml.tmpl" 318 | dest = "/opt/app/prometheus/confd_record.yml" 319 | shards=2 320 | num=0 321 | keys = [ 322 | "/records" 323 | ] 324 | reload_cmd = "curl -X POST http://localhost:9090/-/reload" 325 | ``` 326 | 327 | 328 | - prometheus record 通过的结果 23个 329 | ```shell script 330 | [root@k8s-master01 conf.d]# confd -onetime --backend consul --node localhost:8500 331 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Backend set to consul 332 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Starting confd 333 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Backend source(s) set to localhost:8500 334 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO t.shards:2,t.nums:0 335 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO /opt/app/prometheus/confd_record.yml has md5sum a0c39c7a73d741ec911b64a6eb5d1b8c should be 50ad6045ba32557c64037702bbc2613c 336 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Target config /opt/app/prometheus/confd_record.yml out of sync 337 | 2021-09-13T16:47:16+08:00 k8s-master01 confd[32350]: INFO Target config /opt/app/prometheus/confd_record.yml has been updated 338 | [root@k8s-master01 conf.d]# /opt/app/prometheus/promtool check rules /opt/app/prometheus/confd_record.yml 339 | Checking /opt/app/prometheus/confd_record.yml 340 | SUCCESS: 23 rules found 341 | 342 | [root@k8s-master01 conf.d 343 | 344 | ``` 345 | 346 | 347 | 348 | ## 06 openresty和lua组件,新增grafana数据源 349 | 350 | > 安装openresty ,准备lua环境 351 | ```shell script 352 | yum install yum-utils -y 353 | yum-config-manager --add-repo https://openresty.org/package/centos/openresty.repo 354 | yum install openresty openresty-resty -y 355 | ``` 356 | 357 | 358 | 359 | > 修改信息 360 | - 修改prome_redirect.lua 文件中的 27 行 localhost redis地址为你自己的 361 | - 修改ngx_prome_redirect.conf文件中 真实real_prometheus后端,使用前请修改 362 | 363 | > 将nginx配置和lua文件放到指定目录 364 | ```shell script 365 | 366 | mkdir -pv /usr/local/openresty/nginx/conf/conf.d/ 367 | mkdir -pv /usr/local/openresty/nginx/lua_files/ 368 | /bin/cp -f ngx_prome_redirect.conf /usr/local/openresty/nginx/conf/conf.d/ 369 | /bin/cp -f nginx.conf /usr/local/openresty/nginx/conf/ 370 | /bin/cp -f prome_redirect.lua /usr/local/openresty/nginx/lua_files/ 371 | 372 | ``` 373 | 374 | > 启动openresty 375 | ```shell script 376 | systemctl enable openresty 377 | systemctl start openresty 378 | ``` 379 | 380 | > 请求OpenResty 9992端口 ,出现/graph则正常 381 | ```shell script 382 | [root@k8s-master01 pre_query]# curl localhost:9992/ 383 | Found. 
384 | ``` 385 | 386 | > openresty查看日志 387 | ```shell script 388 | tail -f /usr/local/openresty/nginx/logs/access.log 389 | ``` 390 | 391 | > 修改grafana数据源,将原来的指向真实prometheus地址改为指向openresty的9992端口 392 | - 截图 393 | 394 | 395 | > 之前查询慢的大盘导出一份,再导入,选择新的9992数据源 查看对比 396 | - 截图 397 | 398 | 399 | 400 | ## 运维指南 401 | ``` 402 | # 查看redis中的heavy_query记录 403 | redis-cli -h $redis_host keys hke:heavy_expr* 404 | # 查看consul中的heavy_query记录 405 | curl http://$consul_addr:8500/v1/kv/prometheus/record?recurse= |python -m json.tool 406 | # 根据一个heavy_record文件恢复记录 407 | python3 recovery_by_local_yaml.py local_record_yml/record_to_keep.yml 408 | # 根据一个metric_name前缀删除record记录 409 | bash -x recovery_heavy_metrics.sh $metric_name 410 | ``` 411 | 412 | 413 | ## 总结 414 | - 使用OpenResty的数据源 不会影响未配置预聚合的图 415 | - 因为只是nginx代理了一下,如果redis中没有要替换的expr就会以原查询ql查询 416 | --------------------------------------------------------------------------------
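To make the summary above concrete: the whole rewrite hinges on the md5 convention shared by parse_prome_query_log.py (which names every record hke:heavy_expr:<md5-of-expr>) and prome_redirect.lua (which hashes the incoming query parameter and looks that key up in redis). The commands below are only a manual sanity check, assuming redis, consul and OpenResty on localhost and using the sample expr from the code's docstrings; the key will only exist if that exact expr was actually captured as a heavy query on your setup.

```shell script
# md5 of the raw query string == the suffix of the corresponding hke:heavy_expr:<md5> key
echo -n 'sum(kafka_log_log_size{project=~"metis-main1-sg2"}) by (topic)' | md5sum

# list what the parser has published to redis and consul
redis-cli keys "hke:heavy_expr:*"
curl -s http://localhost:8500/v1/kv/prometheus/records?recurse= | python -m json.tool

# watch requests arriving at OpenResty while comparing a dashboard on the 9992 data source;
# on a redis hit the lua rewrites query=<expr> to query=hke:heavy_expr:<md5>, otherwise the
# request is proxied to real_prometheus unchanged
tail -f /usr/local/openresty/nginx/logs/access.log
```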