├── .dockerignore ├── .gitignore ├── Dockerfile ├── LICENSE ├── Makefile ├── README.md ├── ecs_container_exporter │   ├── __init__.py │   ├── cpu_metrics.py │   ├── io_metrics.py │   ├── main.py │   ├── memory_metrics.py │   ├── network_metrics.py │   └── utils.py ├── requirements.txt └── setup.py /.dockerignore: -------------------------------------------------------------------------------- 1 | .git 2 | .dockerignore 3 | tests 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | *.pyc 3 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3-alpine 2 | ARG WORK_DIR=/usr/src/app 3 | 4 | WORKDIR ${WORK_DIR} 5 | 6 | RUN pip install --upgrade pip 7 | 8 | # TODO: find a way to do --no-cache install with setup.py 9 | # instead of using requirements.txt 10 | COPY . . 11 | RUN pip install --no-cache-dir -r requirements.txt 12 | 13 | ENV EXPORTER_PORT=9545 14 | EXPOSE ${EXPORTER_PORT} 15 | 16 | ENV LOG_LEVEL=info 17 | ENV PYTHONUNBUFFERED=True 18 | ENV EXCLUDE="ecs-container-exporter,~internal~ecs~pause" 19 | CMD ["ecs-container-exporter"] 20 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Spaced-Out 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: build test 2 | 3 | TAG ?= latest 4 | LOG_LEVEL ?= DEBUG 5 | 6 | build: 7 | docker build . 
-t raags/ecs-container-exporter:${TAG} 8 | 9 | clean: 10 | rm -rf dist/* 11 | 12 | package: clean 13 | pip install --upgrade build 14 | python3 -m build 15 | 16 | pypi: 17 | python3 -m twine upload dist/* 18 | 19 | run: 20 | docker run --rm --name ecs-container-exporter -p 9545:9545 -e ECS_CONTAINER_METADATA_URI=http://10.200.10.1:8080 raags/ecs-container-exporter:${TAG} 21 | # --log-level debug 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ecs-container-exporter 2 | [AWS ECS](https://aws.amazon.com/ecs/) sidecar that exports ECS container-level docker stats metrics to [Prometheus](https://prometheus.io), and can also publish them via StatsD. 3 | 4 | # Motivation 5 | The default metrics available in AWS ECS are limited, and mostly at the task level, aggregated across all containers in a task; container-level metrics are not available. In addition, more detailed cgroup metrics are also missing, such as per-cpu usage and the memory usage breakdown into cache, rss, etc. 6 | 7 | Luckily AWS exposes the [docker stats](https://docs.docker.com/engine/api/v1.40/#operation/ContainerInspect) data via a [Task metadata endpoint](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint.html). 8 | 9 | The ecs-container-exporter parses this data, and can expose it to Prometheus or push it via StatsD. 10 | 11 | # Usage 12 | Install via Pip: 13 | 14 | ``` 15 | $ pip3 install ecs-container-exporter 16 | ``` 17 | 18 | or via docker: 19 | 20 | ``` 21 | $ docker pull raags/ecs-container-exporter 22 | ``` 23 | 24 | On ECS, add the following json to the task definition: 25 | 26 | ```json 27 | { 28 | "name": "ecs-container-exporter", 29 | "image": "raags/ecs-container-exporter:latest", 30 | "portMappings": [ 31 | { 32 | "hostPort": 0, 33 | "protocol": "tcp", 34 | "containerPort": 9545 35 | } 36 | ], 37 | "command": [], 38 | "cpu": 256, 39 | "dockerLabels": { 40 | "PROMETHEUS_EXPORTER_PORT": "9545" 41 | } 42 | } 43 | ``` 44 | The `PROMETHEUS_EXPORTER_PORT` label is for ECS discovery via https://github.com/teralytics/prometheus-ecs-discovery 45 | 46 | To include or exclude application containers, use the `INCLUDE` or `EXCLUDE` environment variables. By default, `ecs-container-exporter` 47 | and `~internal~ecs~pause` (a Fargate internal sidecar) are excluded. 48 | 49 | ## Statsd 50 | 51 | Version `2.0.0` adds StatsD support via the `--use-statsd` flag or the `USE_STATSD` env var. Metrics are emitted in the DogStatsd tag format. 52 | 53 | 54 | # Metrics 55 | 56 | The metrics are sampled twice per the configured `interval` (default 55s), and then aggregated over this interval. This should be set to match the Prometheus scrape interval. 57 | 58 | 59 | ## CPU 60 | 61 | CPU usage ratio is calculated and scaled as per the applicable container or task cpu limit: 62 | 63 | | Task Limit | Container Limit | Task Metric | Container Metric | 64 | | --- | --- | --- | --- | 65 | | 0 | 0 | no scaling | no scaling | 66 | | 0 | x | no scaling | scale cpu (can burst above 100%) | 67 | | x | 0 | scale as per limit | scale as per task limit | 68 | | x | x | scale as per limit | scale as per container limit (can burst above 100%) | 69 | 70 | Note that unlike the `docker stats` command and others, CPU usage is not scaled to 71 | the number of CPUs. This means a task with 4 CPUs, all fully utilized, will 72 | show up as 400% in `docker stats`, but as 100% here. 
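In code, the scaling boils down to the following sketch (a condensed version of `normalize_cpu_usage` and `scale_cpu_usage` in `cpu_metrics.py`; the `scale` helper name and the example numbers are illustrative):

```python
# The raw ratio is container cpu time delta / host cpu time delta, i.e. the
# fraction of *total host* CPU used. It is then rescaled against the
# applicable cpu shares limit; host_cpu_count * 1024 is the host's total
# available cpu shares.
def scale(usage_ratio, host_cpu_count, task_cpu_limit, container_cpu_limit):
    if task_cpu_limit == 0 and container_cpu_limit == 0:
        return usage_ratio  # no limits: ratio of total host CPU
    limit = container_cpu_limit or task_cpu_limit  # container limit wins
    return (usage_ratio * host_cpu_count * 1024) / limit

# 10% of a 4-CPU host, against a 1024-share (1 vCPU) container limit:
print(scale(0.1, 4, 0, 1024))  # 0.4, i.e. 40% of the container's allotment
```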
73 | 74 | ## Memory 75 | 76 | Memory usage and cache are emitted separately; note that memory usage also includes cache, so 77 | subtract cache from it to plot application memory usage specifically. 78 | 79 | ## IO 80 | Disk read and write `bytes` and `iops` totals are emitted per container, and summed at the task level. 81 | ## Network 82 | 83 | Network metrics are available from Fargate 1.4.0 and ECS agent 1.41.0 onwards. 84 | 85 | # TODO 86 | - [ ] Support non-ECS docker host containers -------------------------------------------------------------------------------- /ecs_container_exporter/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '2.4.0' 2 | -------------------------------------------------------------------------------- /ecs_container_exporter/cpu_metrics.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from ecs_container_exporter.utils import create_metric, task_metric_tags, TASK_CONTAINER_NAME_TAG 3 | 4 | log = logging.getLogger(__name__) 5 | 6 | 7 | def calculate_cpu_metrics(docker_stats, task_cpu_limit, task_container_limits, 8 | task_container_tags): 9 | """ 10 | Calculate cpu metrics based on the docker container stats json. 11 | 12 | The precpu_stats sample from the docker stats api covers only a 1s interval, 13 | which is too small; instead we sample twice over a longer interval. 14 | https://github.com/moby/moby/blob/b50ba3da1239a56456634f74660f43a27df6b3f2/daemon/daemon.go#L1055 15 | 16 | `docker_stats` is a list with two samples over an interval. CPU usage 17 | is calculated by comparing the `cpu_stats` of the first and second sample. 18 | 19 | Task and Container CPU usage is `scaled` against the applicable CPU 20 | limits. This is more useful than the raw values, which give 21 | utilization against the Host CPU. 22 | 23 | Specifically, the calculation is based on the following data. 
24 | 25 | "cpu_stats": { 26 | "cpu_usage": { 27 | "percpu_usage": [ 28 | 826860687, 29 | 830807540, 30 | 823365887, 31 | 844077056 32 | ], 33 | "total_usage": 3325111170, 34 | "usage_in_kernelmode": 1620000000, 35 | "usage_in_usermode": 1600000000 36 | }, 37 | "online_cpus": 4, 38 | "system_cpu_usage": 35595977360000000, 39 | "throttling_data": { 40 | "periods": 0, 41 | "throttled_periods": 0, 42 | "throttled_time": 0 43 | } 44 | }, 45 | 46 | """ 47 | # Container_id to metrics mapping 48 | metrics_by_container = {} 49 | # Total task cpu usage 50 | task_cpu_usage_ratio = 0.0 51 | 52 | prev_stats = docker_stats[0] 53 | stats = docker_stats[1] 54 | online_cpus = get_online_cpus(stats) 55 | for container_id, container_stats in stats.items(): 56 | metrics = [] 57 | tags = task_container_tags[container_id] 58 | 59 | cpu_stats = container_stats.get('cpu_stats', {}) 60 | prev_cpu_stats = prev_stats[container_id].get('cpu_stats', {}) 61 | log.debug(f'CPU Stats: {container_id} - curr {cpu_stats} - prev {prev_cpu_stats}') 62 | 63 | # `cpu_stats` may sometimes be empty when the container has just started 64 | if not prev_cpu_stats: 65 | continue 66 | 67 | usage_delta = cpu_usage_diff(cpu_stats, prev_cpu_stats) 68 | # system_cpu_usage is the sum of `all` cpu usage on the `host`: 69 | # https://github.com/moby/moby/blob/54d88a7cd366fd8169b8a96bec5d9f303d57c425/daemon/stats/collector_unix.go#L31 70 | # 71 | system_delta = system_cpu_usage_diff(cpu_stats, prev_cpu_stats) 72 | log.debug(f'usage_delta {usage_delta} system_delta {system_delta}') 73 | 74 | cpu_usage_ratio = calculate_cpu_usage(usage_delta, system_delta) 75 | log.debug(f'Usage Ratio {cpu_usage_ratio} {online_cpus}') 76 | # add to total task cpu usage 77 | task_cpu_usage_ratio += cpu_usage_ratio 78 | 79 | # Scale cpu usage against the Task or Container Limit 80 | container_cpu_limit = task_container_limits[container_id]['cpu'] 81 | value = normalize_cpu_usage(cpu_usage_ratio, online_cpus, 82 | task_cpu_limit, 83 | container_cpu_limit) 84 | 85 | metric = create_metric('cpu_usage_ratio', round(value, 4), tags, 'gauge', 86 | 'CPU usage ratio') 87 | metrics.append(metric) 88 | 89 | # Per cpu metrics 90 | metrics.extend( 91 | percpu_metrics(cpu_stats, prev_cpu_stats, system_delta, online_cpus, 92 | task_cpu_limit, container_cpu_limit, tags) 93 | ) 94 | 95 | metrics.extend( 96 | cpu_user_kernelmode_metrics(cpu_stats, prev_cpu_stats, tags) 97 | ) 98 | 99 | # CPU throttling 100 | # TODO: figure out how this will look at the Task level 101 | metrics.extend( 102 | cpu_throttle_metrics(cpu_stats, tags) 103 | ) 104 | 105 | metrics_by_container[container_id] = metrics 106 | 107 | # total task cpu ratio 108 | if task_cpu_limit != 0: 109 | task_cpu_usage_ratio = scale_cpu_usage(task_cpu_usage_ratio, online_cpus, task_cpu_limit) 110 | 111 | tags = task_metric_tags() 112 | metric = create_metric('cpu_usage_ratio', round(task_cpu_usage_ratio, 4), tags, 'gauge', 113 | 'CPU usage ratio') 114 | 115 | metrics_by_container[TASK_CONTAINER_NAME_TAG] = [metric] 116 | 117 | return metrics_by_container 118 | 119 | 120 | def cpu_throttle_metrics(cpu_stats, tags): 121 | metrics = [] 122 | throttling_data = cpu_stats.get('throttling_data') 123 | for mkey, mvalue in throttling_data.items(): 124 | metric = create_metric('cpu_throttle_' + mkey, mvalue, tags, 'gauge') 125 | metrics.append(metric) 126 | 127 | return metrics 128 | 129 | 130 | def cpu_user_kernelmode_metrics(cpu_stats, prev_cpu_stats, tags): 131 | # user and kernel mode split 132 | curr_stats = cpu_stats['cpu_usage'] 133 | 
prev_stats = prev_cpu_stats['cpu_usage'] 134 | kernelmode_delta = curr_prev_diff(curr_stats, prev_stats, 'usage_in_kernelmode') 135 | kernelmode_metric = create_metric('cpu_kernelmode', kernelmode_delta, tags, 'gauge', 136 | 'cpu usage in kernel mode') 137 | 138 | usermode_delta = curr_prev_diff(curr_stats, prev_stats, 'usage_in_usermode') 139 | usermode_metric = create_metric('cpu_usermode', usermode_delta, tags, 'gauge', 140 | 'cpu usage in user mode') 141 | 142 | return (kernelmode_metric, usermode_metric) 143 | 144 | 145 | def percpu_metrics(cpu_stats, prev_cpu_stats, system_delta, online_cpus, 146 | task_cpu_limit, container_cpu_limit, tags): 147 | metrics = [] 148 | percpu_usage = cpu_stats.get('cpu_usage', {}).get('percpu_usage', []) 149 | prev_percpu_usage = prev_cpu_stats.get('cpu_usage', {}).get('percpu_usage', []) 150 | 151 | if percpu_usage and prev_percpu_usage: 152 | for i, value in enumerate(percpu_usage): 153 | # skip inactive CPUs - https://github.com/torvalds/linux/commit/5ca3726 154 | if value != 0: 155 | cpu_tags = tags.copy() 156 | cpu_tags['cpu'] = str(i) 157 | if prev_percpu_usage[i] and system_delta: 158 | usage_delta = float(value) - float(prev_percpu_usage[i]) 159 | usage_ratio = usage_delta / system_delta 160 | else: 161 | usage_ratio = 0.0 162 | 163 | value = normalize_cpu_usage(usage_ratio, online_cpus, 164 | task_cpu_limit, 165 | container_cpu_limit) 166 | # report the normalized value, not the raw host ratio 167 | metric = create_metric('percpu_usage_ratio', round(value, 4), cpu_tags, 'gauge', 168 | 'Per CPU usage ratio') 169 | metrics.append(metric) 170 | 171 | return metrics 172 | 173 | 174 | def calculate_cpu_usage(usage_delta, system_delta): 175 | if usage_delta and system_delta: 176 | # Keep it to 100% instead of scaling by number of cpus : 177 | # https://github.com/moby/moby/issues/29306#issuecomment-405198198 178 | # 179 | cpu_usage_ratio = usage_delta / system_delta 180 | else: 181 | cpu_usage_ratio = 0.0 182 | 183 | return cpu_usage_ratio 184 | 185 | 186 | def system_cpu_usage_diff(cpu_stats, prev_cpu_stats): 187 | return curr_prev_diff(cpu_stats, prev_cpu_stats, 'system_cpu_usage') 188 | 189 | 190 | def cpu_usage_diff(cpu_stats, prev_cpu_stats): 191 | return curr_prev_diff(cpu_stats['cpu_usage'], prev_cpu_stats['cpu_usage'], 192 | 'total_usage') 193 | 194 | 195 | def curr_prev_diff(curr_stats, prev_stats, metric_key): 196 | curr = curr_stats[metric_key] 197 | prev = prev_stats[metric_key] 198 | if curr and prev: 199 | return curr - prev 200 | else: 201 | return 0.0 202 | 203 | 204 | def scale_cpu_usage(cpu_usage_ratio, host_cpu_count, cpu_share_limit): 205 | """ 206 | Scales a cpu ratio originally calculated against total host cpu capacity, 207 | by the corresponding cpu shares limit (at the task or container level). 208 | 209 | host_cpu_count is multiplied by 1024 to get the total available cpu shares. 210 | 211 | """ 212 | scaled_cpu_usage_ratio = (cpu_usage_ratio * host_cpu_count * 1024) / cpu_share_limit 213 | return scaled_cpu_usage_ratio 214 | 215 | 216 | def normalize_cpu_usage(cpu_usage_ratio, host_cpu_count, 217 | task_cpu_limit, container_cpu_limit): 218 | """ 219 | Task Limit - Container Limit - Cpu metric 220 | 0 - 0 - no scaling 221 | 0 - x - scale against container limit 222 | x - x - scale against container limit 223 | x - 0 - scale against task limit 224 | 225 | """ 226 | log.debug(f'Normalize CPU: {cpu_usage_ratio} - {host_cpu_count} - {task_cpu_limit} - {container_cpu_limit}') 227 | if task_cpu_limit == 0 and container_cpu_limit == 0: 228 | # no scaling, task can use all of host CPU 
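# (the returned ratio is then relative to total host capacity, e.g. 0.25 means 25% of all host CPUs combined) 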
229 | return cpu_usage_ratio 230 | 231 | if container_cpu_limit != 0: 232 | # case 2 and 3 above 233 | # this can go above 100% when task_cpu_limit is 0 234 | # 235 | return scale_cpu_usage(cpu_usage_ratio, host_cpu_count, container_cpu_limit) 236 | 237 | if task_cpu_limit != 0: 238 | # case 4, should not go above 100% as cpu usage 239 | # is limited by task_cpu_limit 240 | # 241 | return scale_cpu_usage(cpu_usage_ratio, host_cpu_count, task_cpu_limit) 242 | 243 | 244 | def get_online_cpus(stats): 245 | """ 246 | Return online_cpus from the first container; assume this number is the same for all containers. 247 | 248 | """ 249 | for container_id, container_stats in stats.items(): 250 | return container_stats['cpu_stats']['online_cpus'] 251 | -------------------------------------------------------------------------------- /ecs_container_exporter/io_metrics.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from collections import defaultdict 3 | from ecs_container_exporter.utils import create_metric, create_task_metrics, TASK_CONTAINER_NAME_TAG 4 | 5 | log = logging.getLogger() 6 | 7 | 8 | def calculate_io_metrics(stats, task_container_tags): 9 | """ 10 | Calculate IO metrics from the below data: 11 | 12 | "blkio_stats": { 13 | "io_merged_recursive": [], 14 | "io_queue_recursive": [], 15 | "io_service_bytes_recursive": [ 16 | { 17 | "major": 259, 18 | "minor": 0, 19 | "op": "Read", 20 | "value": 10653696 21 | }, 22 | { 23 | "major": 259, 24 | "minor": 0, 25 | "op": "Write", 26 | "value": 0 27 | }, 28 | { 29 | "major": 259, 30 | "minor": 0, 31 | "op": "Sync", 32 | "value": 10653696 33 | }, 34 | { 35 | "major": 259, 36 | "minor": 0, 37 | "op": "Async", 38 | "value": 0 39 | }, 40 | { 41 | "major": 259, 42 | "minor": 0, 43 | "op": "Total", 44 | "value": 10653696 45 | } 46 | ], 47 | "io_service_time_recursive": [], 48 | "io_serviced_recursive": [ 49 | { 50 | "major": 259, 51 | "minor": 0, 52 | "op": "Read", 53 | "value": 164 54 | }, 55 | { 56 | "major": 259, 57 | "minor": 0, 58 | "op": "Write", 59 | "value": 0 60 | }, 61 | { 62 | "major": 259, 63 | "minor": 0, 64 | "op": "Sync", 65 | "value": 164 66 | }, 67 | { 68 | "major": 259, 69 | "minor": 0, 70 | "op": "Async", 71 | "value": 0 72 | }, 73 | { 74 | "major": 259, 75 | "minor": 0, 76 | "op": "Total", 77 | "value": 164 78 | } 79 | ], 80 | "io_time_recursive": [], 81 | "io_wait_time_recursive": [], 82 | "sectors_recursive": [] 83 | }, 84 | """ 85 | metrics_by_container = {} 86 | # task level metrics 87 | task_metrics = defaultdict(int) 88 | # assume these will always be gauge 89 | metric_type = 'gauge' 90 | for container_id, container_stats in stats.items(): 91 | metrics = [] 92 | blkio_stats = container_stats.get('blkio_stats') 93 | 94 | iostats = {'io_service_bytes_recursive': 'bytes', 'io_serviced_recursive': 'iops'} 95 | for blk_key, blk_type in iostats.items(): 96 | tags = task_container_tags[container_id] 97 | read_counter = write_counter = 0 98 | for blk_stat in blkio_stats.get(blk_key, []): 99 | if blk_stat['op'] == 'Read' and 'value' in blk_stat: 100 | read_counter += blk_stat['value'] 101 | elif blk_stat['op'] == 'Write' and 'value' in blk_stat: 102 | write_counter += blk_stat['value'] 103 | 104 | metrics.append( 105 | create_metric('disk_read_' + blk_type + '_total', read_counter, tags, 106 | metric_type, 'Total disk read ' + blk_type) 107 | ) 108 | metrics.append( 109 | create_metric('disk_written_' + blk_type + '_total', write_counter, tags, 110 | metric_type, 'Total disk written ' + blk_type) 111 | ) 112 | 
task_metrics['disk_read_' + blk_type + '_total'] += read_counter 113 | task_metrics['disk_written_' + blk_type + '_total'] += write_counter 114 | 115 | metrics_by_container[container_id] = metrics 116 | 117 | # task level metrics 118 | metrics_by_container[TASK_CONTAINER_NAME_TAG] = create_task_metrics(task_metrics, metric_type) 119 | 120 | return metrics_by_container 121 | -------------------------------------------------------------------------------- /ecs_container_exporter/main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import sys 3 | import time 4 | import click 5 | import pickle 6 | import signal 7 | from requests.compat import urljoin 8 | 9 | from prometheus_client import start_http_server 10 | from prometheus_client.core import REGISTRY 11 | 12 | from ecs_container_exporter import utils 13 | from ecs_container_exporter.utils import create_metric 14 | from ecs_container_exporter.cpu_metrics import calculate_cpu_metrics 15 | from ecs_container_exporter.memory_metrics import calculate_memory_metrics 16 | from ecs_container_exporter.io_metrics import calculate_io_metrics 17 | from ecs_container_exporter.network_metrics import calculate_network_metrics 18 | 19 | import logging 20 | log = logging.getLogger(__name__) 21 | 22 | 23 | class ECSContainerExporter(object): 24 | 25 | include_containers = [] 26 | exclude_containers = [] 27 | # 1 - healthy, 0 - unhealthy 28 | exporter_status = 1 29 | # initial task metrics that do not change 30 | static_task_metrics = [] 31 | # individual container tags 32 | task_container_tags = {} 33 | # task limits 34 | task_cpu_limit = 0 35 | task_mem_limit = 0 36 | # individual container limits 37 | task_container_limits = {} 38 | # the Task level metrics are included by default 39 | include_container_ids = [utils.TASK_CONTAINER_NAME_TAG] 40 | # stats are collected and aggregated across this interval 41 | # default to 55s for a scrape_interval and scrape_timeout of 1m 42 | # should be set to 60 when using statsd, which doesn't have this issue 43 | interval = 55 44 | 45 | def __init__(self, metadata_url=None, include_containers=None, exclude_containers=None, interval=55, http_timeout=60): 46 | 47 | self.task_metadata_url = urljoin(metadata_url + '/', 'task') 48 | # For testing 49 | # self.task_stats_url = urljoin(metadata_url + '/', 'stats') 50 | self.task_stats_url = urljoin(metadata_url + '/', 'task/stats') 51 | 52 | if exclude_containers: 53 | self.exclude_containers = exclude_containers 54 | if include_containers: 55 | self.include_containers = include_containers 56 | 57 | self.interval = interval 58 | self.http_timeout = http_timeout 59 | 60 | self.log = logging.getLogger(__name__) 61 | # caching, mainly for the statsd metric type, since prometheus will run this 62 | # only once in its lifetime 63 | self.cache_file_path = '/tmp/ecs-container-exporter.cache' 64 | self.collect_static_metrics() 65 | 66 | def start_prometheus_exporter(self, exporter_port): 67 | self.log.info(f'Exporter initialized with ' 68 | f'metadata_url: {self.task_metadata_url}, ' 69 | f'task_stats_url: {self.task_stats_url}, ' 70 | f'http_timeout: {self.http_timeout}, ' 71 | f'include_containers: {self.include_containers}, ' 72 | f'exclude_containers: {self.exclude_containers}') 73 | 74 | REGISTRY.register(self) 75 | # Start exporter http service 76 | start_http_server(int(exporter_port)) 77 | while True: 78 | time.sleep(10) 79 | 80 | def send_statsd_metrics(self, statsd_host, statsd_port): 81 | 
utils.init_statsd_client(statsd_host, statsd_port) 82 | for metric in self.collect_all_metrics(): 83 | utils.send_statsd(metric) 84 | 85 | def cache_load_task_metadata(self): 86 | try: 87 | with open(self.cache_file_path, 'rb') as cf: 88 | return pickle.load(cf) 89 | except FileNotFoundError: 90 | pass 91 | 92 | return None 93 | 94 | def cache_write_task_metadata(self, data): 95 | with open(self.cache_file_path, 'wb') as cf: 96 | pickle.dump(data, cf) 97 | 98 | def collect_static_metrics(self): 99 | metadata = self.cache_load_task_metadata() 100 | if metadata: 101 | self.log.debug(f'Using cached task metadata response from {self.cache_file_path}') 102 | else: 103 | retries = 24 104 | while retries > 0: 105 | # wait for the task to be in running state 106 | time.sleep(5) 107 | retries -= 1 108 | try: 109 | metadata = utils.ecs_task_metadata(self.task_metadata_url, self.http_timeout) 110 | self.cache_write_task_metadata(metadata) 111 | except Exception as e: 112 | self.exporter_status = 0 113 | self.log.exception(e) 114 | continue 115 | 116 | if metadata.get('KnownStatus') != 'RUNNING': 117 | self.log.warning(f'ECS Task not yet in RUNNING state, current status is: {metadata["KnownStatus"]}') 118 | continue 119 | else: 120 | break 121 | 122 | self.exporter_status = 1 123 | 124 | self.log.debug(f'Discovered Task metadata: {metadata}') 125 | self.parse_task_metadata(metadata) 126 | 127 | def parse_task_metadata(self, metadata): 128 | self.static_task_metrics = [] 129 | self.task_container_tags = {} 130 | self.task_container_limits = {} 131 | 132 | # task cpu/mem limit 133 | task_tag = utils.task_metric_tags() 134 | self.task_cpu_limit, self.task_mem_limit = self.cpu_mem_limit(metadata, normalise_cpu=True) 135 | 136 | metric = create_metric('cpu_limit', self.task_cpu_limit, task_tag, 'gauge', 'Task CPU limit') 137 | self.static_task_metrics.append(metric) 138 | 139 | metric = create_metric('mem_limit', self.task_mem_limit, task_tag, 'gauge', 'Task Memory limit') 140 | self.static_task_metrics.append(metric) 141 | 142 | # container tags and limits 143 | for container in metadata['Containers']: 144 | container_id = container['DockerId'] 145 | container_name = container['Name'] 146 | 147 | if self.should_process_container(container_name, 148 | self.include_containers, 149 | self.exclude_containers): 150 | self.log.info(f'Processing stats for container: {container_name} - {container_id}') 151 | self.include_container_ids.append(container_id) 152 | else: 153 | self.log.info(f'Excluding container: {container_name} - {container_id} as per exclusion rules') 154 | 155 | self.task_container_tags[container_id] = {'container_name': container_name} 156 | 157 | # container cpu/mem limit 158 | cpu_value, mem_value = self.cpu_mem_limit(container) 159 | self.task_container_limits[container_id] = {'cpu': cpu_value, 'mem': mem_value} 160 | 161 | if container_id in self.include_container_ids: 162 | metric = create_metric('cpu_limit', cpu_value, self.task_container_tags[container_id], 163 | 'gauge', 'Container CPU limit in cpu shares') 164 | self.static_task_metrics.append(metric) 165 | 166 | metric = create_metric('mem_limit', mem_value, self.task_container_tags[container_id], 167 | 'gauge', 'Container Memory limit in MBs') 168 | self.static_task_metrics.append(metric) 169 | 170 | def should_process_container(self, container_name, include_containers, exclude_containers): 171 | if container_name in exclude_containers: 172 | return False 173 | else: 174 | if include_containers: 175 | if container_name in 
include_containers: 176 | return True 177 | else: 178 | return False 179 | else: 180 | return True 181 | 182 | def cpu_mem_limit(self, metadata, normalise_cpu=False): 183 | # normalise to `cpu shares` 184 | cpu_limit = metadata.get('Limits', {}).get('CPU', 0) * (1024 if normalise_cpu else 1) 185 | mem_limit = metadata.get('Limits', {}).get('Memory', 0) 186 | 187 | return ( 188 | cpu_limit, mem_limit 189 | ) 190 | 191 | # All metrics are collected here 192 | def collect_all_metrics(self): 193 | container_metrics = self.collect_container_metrics() 194 | 195 | # exporter status metric 196 | metric = create_metric('exporter_status', self.exporter_status, {}, 197 | 'gauge', 'Exporter Status') 198 | container_metrics.append(metric) 199 | 200 | return self.static_task_metrics + container_metrics 201 | 202 | # prometheus exporter collect function 203 | def collect(self): 204 | for metric in self.collect_all_metrics(): 205 | yield utils.create_prometheus_metric(metric) 206 | 207 | def collect_container_metrics(self): 208 | metrics = [] 209 | docker_stats = [] 210 | # collect two stats samples as per the configured interval 211 | try: 212 | docker_stats.append( 213 | utils.ecs_task_metadata(self.task_stats_url, self.http_timeout) 214 | ) 215 | # the ECS stats API returns immediately, but docker stats waits ~1s; adjust this accordingly 216 | time.sleep(self.interval) 217 | 218 | docker_stats.append( 219 | utils.ecs_task_metadata(self.task_stats_url, self.http_timeout) 220 | ) 221 | if not all(docker_stats): 222 | raise Exception('Some stats are empty, try again') 223 | 224 | self.exporter_status = 1 225 | 226 | except Exception as e: 227 | self.exporter_status = 0 228 | self.log.exception(e) 229 | return metrics 230 | 231 | container_metrics_all = self.parse_container_metadata(docker_stats, 232 | self.task_cpu_limit, 233 | self.task_container_limits, 234 | self.task_container_tags) 235 | 236 | # flatten and filter excluded containers 237 | filtered_container_metrics = [] 238 | for metrics_by_container in container_metrics_all: 239 | for container_id, metrics in metrics_by_container.items(): 240 | if container_id in self.include_container_ids: 241 | filtered_container_metrics.extend(metrics) 242 | 243 | return filtered_container_metrics 244 | 245 | def parse_container_metadata(self, docker_stats, task_cpu_limit, 246 | task_container_limits, task_container_tags): 247 | """ 248 | More details on the exposed docker metrics 249 | https://github.com/moby/moby/blob/c1d090fcc88fa3bc5b804aead91ec60e30207538/api/types/stats.go 250 | 251 | """ 252 | container_metrics_all = [] 253 | 254 | # ignore stats for containers in STOPPED state 255 | for container_id in list(docker_stats[1].keys()): 256 | if not docker_stats[1][container_id]: 257 | del docker_stats[0][container_id] 258 | del docker_stats[1][container_id] 259 | self.log.debug(f'Ignoring null stats for container_id {container_id}') 260 | 261 | try: 262 | # CPU metrics 263 | container_metrics_all.append( 264 | calculate_cpu_metrics(docker_stats, 265 | task_cpu_limit, 266 | task_container_limits, 267 | task_container_tags) 268 | ) 269 | 270 | # Memory metrics 271 | container_metrics_all.append( 272 | calculate_memory_metrics(docker_stats[1], task_container_tags) 273 | ) 274 | 275 | # I/O metrics 276 | container_metrics_all.append( 277 | calculate_io_metrics(docker_stats[1], task_container_tags) 278 | ) 279 | 280 | # network metrics 281 | container_metrics_all.append( 282 | calculate_network_metrics(docker_stats[1], task_container_tags) 283 | ) 284 | 285 | 
except Exception as e: 286 | self.log.warning("Could not retrieve metrics for {}: {}".format(task_container_tags, e), exc_info=True) 287 | self.exporter_status = 0 288 | 289 | return container_metrics_all 290 | 291 | 292 | def shutdown(sig_number, frame): 293 | log.info("Received signal {}, shutting down".format(sig_number)) 294 | sys.exit(0) 295 | 296 | 297 | @click.command() 298 | @click.option('--metadata-url', envvar='ECS_CONTAINER_METADATA_URI', type=str, default=None, 299 | help='Override ECS Metadata Url') 300 | @click.option('--exporter-port', envvar='EXPORTER_PORT', type=int, default=9545, 301 | help='Change exporter listen port') 302 | @click.option('--use-statsd', envvar='USE_STATSD', is_flag=True, type=bool, default=False, 303 | help='Emit metrics to statsd instead of starting Prometheus exporter') 304 | @click.option('--statsd-host', envvar='STATSD_HOST', type=str, default='localhost', 305 | help='Override Statsd Host') 306 | @click.option('--statsd-port', envvar='STATSD_PORT', type=int, default=8125, 307 | help='Override Statsd Port') 308 | @click.option('--include', envvar='INCLUDE', type=str, default=None, 309 | help='Comma-separated list of container names to include, or use env var INCLUDE') 310 | @click.option('--exclude', envvar='EXCLUDE', type=str, default=None, 311 | help='Comma-separated list of container names to exclude, or use env var EXCLUDE') 312 | @click.option('--interval', envvar='INTERVAL', type=int, default=55, 313 | help='Stats collection and aggregation interval in seconds (specifically for CPU stats)') 314 | @click.option('--log-level', envvar='LOG_LEVEL', type=str, default='INFO', 315 | help='Log level, default: INFO') 316 | def main( 317 | metadata_url=None, exporter_port=9545, use_statsd=False, statsd_port=8125, statsd_host='localhost', 318 | include=None, exclude=None, interval=55, log_level='INFO' 319 | ): 320 | if not metadata_url: 321 | sys.exit('AWS environment variable ECS_CONTAINER_METADATA_URI not found ' 322 | 'nor is --metadata-url set') 323 | 324 | signal.signal(signal.SIGTERM, shutdown) 325 | signal.signal(signal.SIGINT, shutdown) 326 | 327 | logging.basicConfig( 328 | format='%(asctime)s:%(levelname)s:%(message)s', 329 | ) 330 | logging.getLogger().setLevel( 331 | getattr(logging, log_level.upper()) 332 | ) 333 | 334 | if exclude: 335 | exclude = exclude.strip().split(',') 336 | if include: 337 | include = include.strip().split(',') 338 | 339 | collector = ECSContainerExporter(metadata_url=metadata_url, 340 | include_containers=include, 341 | exclude_containers=exclude, 342 | interval=interval) 343 | 344 | if use_statsd: 345 | collector.send_statsd_metrics(statsd_host, statsd_port) 346 | else: 347 | collector.start_prometheus_exporter(exporter_port) 348 | 349 | 350 | if __name__ == '__main__': 351 | main() 352 | -------------------------------------------------------------------------------- /ecs_container_exporter/memory_metrics.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from collections import defaultdict 3 | from ecs_container_exporter.utils import create_metric, TASK_CONTAINER_NAME_TAG, create_task_metrics 4 | 5 | log = logging.getLogger() 6 | # Default value is maxed out for some cgroup metrics 7 | CGROUP_NO_VALUE = 0x7FFFFFFFFFFFF000 8 | 9 | 10 | def calculate_memory_metrics(stats, task_container_tags): 11 | metrics_by_container = {} 12 | # task level metrics 13 | task_metrics = defaultdict(int) 14 | metric_type = 'gauge' 15 | for container_id, container_stats in 
stats.items(): 16 | metrics = [] 17 | tags = task_container_tags[container_id] 18 | memory_stats = container_stats.get('memory_stats', {}) 19 | log.debug(f'Memory Stats: {container_id} - {memory_stats}') 20 | 21 | for mkey in ['cache']: 22 | value = memory_stats.get('stats', {}).get(mkey) 23 | # some values have a default garbage value; missing keys return None 24 | if value is not None and value < CGROUP_NO_VALUE: 25 | metric = create_metric('mem_' + mkey, value, tags, metric_type) 26 | metrics.append(metric) 27 | if mkey == 'cache': 28 | task_metrics['mem_cache'] += value 29 | 30 | for mkey in ['usage']: 31 | value = memory_stats.get(mkey) 32 | # some values have a default garbage value; missing keys return None 33 | if value is not None and value < CGROUP_NO_VALUE: 34 | metric = create_metric('mem_' + mkey, value, tags, metric_type) 35 | metrics.append(metric) 36 | if mkey == 'usage': 37 | task_metrics['mem_usage'] += value 38 | 39 | metrics_by_container[container_id] = metrics 40 | 41 | # task level metrics 42 | metrics_by_container[TASK_CONTAINER_NAME_TAG] = create_task_metrics(task_metrics, metric_type) 43 | 44 | return metrics_by_container 45 | -------------------------------------------------------------------------------- /ecs_container_exporter/network_metrics.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from collections import defaultdict 3 | from .utils import create_metric, create_task_metrics, TASK_CONTAINER_NAME_TAG 4 | 5 | log = logging.getLogger() 6 | 7 | 8 | def calculate_network_metrics(stats, task_container_tags): 9 | """ 10 | "networks": { 11 | "eth1": { 12 | "rx_bytes": 564655295, 13 | "rx_packets": 384960, 14 | "rx_errors": 0, 15 | "rx_dropped": 0, 16 | "tx_bytes": 3043269, 17 | "tx_packets": 54355, 18 | "tx_errors": 0, 19 | "tx_dropped": 0 20 | } 21 | } 22 | 23 | """ 24 | metrics_by_container = {} 25 | # task level metrics 26 | task_metrics = defaultdict(int) 27 | # assume these will always be gauge 28 | metric_type = 'gauge' 29 | for container_id, container_stats in stats.items(): 30 | metrics = [] 31 | tags = task_container_tags[container_id] 32 | network_stats = container_stats.get('networks') 33 | if not network_stats: 34 | continue 35 | 36 | for iface, iface_stats in network_stats.items(): 37 | iface_tag = tags.copy() 38 | iface_tag['iface'] = iface 39 | 40 | for stat, value in iface_stats.items(): 41 | metrics.append( 42 | create_metric('network_' + stat + '_total', value, iface_tag, 43 | metric_type, 'Network ' + stat) 44 | ) 45 | # assume single iface 46 | task_metrics['network_' + stat + '_total'] += value 47 | 48 | metrics_by_container[container_id] = metrics 49 | 50 | # Task metrics 51 | metrics_by_container[TASK_CONTAINER_NAME_TAG] = create_task_metrics(task_metrics, metric_type) 52 | 53 | return metrics_by_container 54 | -------------------------------------------------------------------------------- /ecs_container_exporter/utils.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import logging 3 | from collections import namedtuple 4 | from datadog.dogstatsd.base import DogStatsd 5 | from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily 6 | 7 | # special container name tag for Task level (instead of container) metrics 8 | TASK_CONTAINER_NAME_TAG = '_task_' 9 | METRIC_PREFIX = 'ecs_task' 10 | 11 | logging.basicConfig( 12 | format='%(asctime)s:%(levelname)s:%(message)s', 13 | ) 14 | 15 | Metric = namedtuple('Metric', ['name', 'value', 'tags', 'type', 'desc']) 16 | 17 | statsd = None 18 | 19 | 20 | def 
init_statsd_client(statsd_host='localhost', statsd_port=8125): 21 | global statsd 22 | if not statsd: 23 | statsd = DogStatsd( 24 | statsd_host, statsd_port, 25 | use_ms=True, 26 | namespace=METRIC_PREFIX 27 | ) 28 | 29 | 30 | def get_logger(log_level): 31 | return logging.getLogger() 32 | 33 | 34 | def create_metric(name, value, tags, type, desc=''): 35 | return Metric(name, value, tags, type, desc) 36 | 37 | 38 | def create_prometheus_metric(metric): 39 | metric_name = METRIC_PREFIX + '_' + metric.name 40 | if metric.type == 'counter': 41 | pm = CounterMetricFamily(metric_name, metric.desc, labels=metric.tags.keys()) 42 | elif metric.type == 'gauge': 43 | pm = GaugeMetricFamily(metric_name, metric.desc, labels=metric.tags.keys()) 44 | else: 45 | raise Exception(f'Unknown metric type: {metric.type}') 46 | 47 | pm.add_metric(labels=metric.tags.values(), value=metric.value) 48 | return pm 49 | 50 | 51 | def format_dogstatsd_tags(tags): 52 | """ 53 | {k: v} => ['k:v'] 54 | 55 | """ 56 | return [f'{k}:{v}' for k, v in tags.items()] 57 | 58 | 59 | def send_statsd(metric): 60 | global statsd 61 | tags = format_dogstatsd_tags(metric.tags) 62 | if metric.type == 'counter': 63 | statsd.increment(metric.name, metric.value, tags=tags) 64 | elif metric.type == 'gauge': 65 | statsd.gauge(metric.name, metric.value, tags=tags) 66 | else: 67 | raise Exception(f'Unknown metric type: {metric.type}') 68 | 69 | 70 | def create_task_metrics(task_metrics, metric_type): 71 | """ 72 | Convert the task_metrics key, value hash into actual metrics 73 | 74 | """ 75 | metrics = [] 76 | tags = task_metric_tags() 77 | for mkey, mvalue in task_metrics.items(): 78 | metrics.append( 79 | create_metric(mkey, mvalue, tags, metric_type) 80 | ) 81 | 82 | return metrics 83 | 84 | 85 | def ecs_task_metadata(url, timeout): 86 | response = requests.get(url, timeout=timeout) 87 | 88 | if response.status_code != 200: 89 | raise Exception(f'Error: Non 200 response from url {url}') 90 | 91 | try: 92 | return response.json() 93 | 94 | except ValueError as e: 95 | raise Exception(f'Error decoding json response from url {url}, response {response.text}: {e}') 96 | 97 | 98 | def task_metric_tags(): 99 | """ 100 | Special tag that will apply to `task` level metrics, as opposed to 101 | `container` level metrics. 102 | 103 | """ 104 | return {'container_name': TASK_CONTAINER_NAME_TAG} 105 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | . 
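# '.' installs the package itself; see setup.py for the pinned dependencies 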
2 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import re 2 | import ast 3 | from setuptools import setup 4 | 5 | 6 | _version_re = re.compile(r'__version__\s+=\s+(.*)') 7 | 8 | with open('ecs_container_exporter/__init__.py', 'rb') as f: 9 | version = str(ast.literal_eval(_version_re.search(f.read().decode('utf-8')).group(1))) 10 | 11 | with open('./README.md') as f: 12 | long_desc = f.read() 13 | 14 | 15 | setup( 16 | name='ecs-container-exporter', 17 | version=version, 18 | author='Raghu Udiyar', 19 | author_email='raghusiddarth@gmail.com', 20 | url='https://github.com/Spaced-Out/ecs-container-exporter', 21 | license='MIT License', 22 | description='Prometheus exporter for AWS ECS Task and Container Metrics', 23 | long_description=long_desc, 24 | long_description_content_type="text/markdown", 25 | install_requires=['setuptools>=36.0.0', 26 | 'Click==7.0', 27 | 'prometheus-client==0.7.1', 28 | 'requests==2.22.0', 29 | 'datadog==0.39.0'], 30 | packages=['ecs_container_exporter'], 31 | entry_points={ 32 | 'console_scripts': [ 33 | 'ecs-container-exporter = ecs_container_exporter.main:main' 34 | ] 35 | }, 36 | classifiers=[ 37 | 'Environment :: Console', 38 | 'Intended Audience :: Developers', 39 | 'Programming Language :: Python', 40 | 'Programming Language :: Python :: 3.6', 41 | 'Programming Language :: Python :: 3.7', 42 | ] 43 | ) 44 | --------------------------------------------------------------------------------