├── .dockerignore ├── .gitignore ├── Dockerfile ├── LICENSE ├── Makefile ├── README.md ├── ecs_container_exporter │   ├── __init__.py │   ├── cpu_metrics.py │   ├── io_metrics.py │   ├── main.py │   ├── memory_metrics.py │   ├── network_metrics.py │   └── utils.py ├── requirements.txt └── setup.py /.dockerignore: -------------------------------------------------------------------------------- 1 | .git 2 | .dockerignore 3 | tests 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | *.pyc 3 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3-alpine 2 | ARG WORK_DIR=/usr/src/app 3 | 4 | WORKDIR ${WORK_DIR} 5 | 6 | RUN pip install --upgrade pip 7 | 8 | # TODO: find a way to do --no-cache install with setup.py 9 | # instead of using requirements.txt 10 | COPY . . 11 | RUN pip install --no-cache-dir -r requirements.txt 12 | 13 | ENV EXPORTER_PORT=9545 14 | EXPOSE ${EXPORTER_PORT} 15 | 16 | ENV LOG_LEVEL=info 17 | ENV PYTHONUNBUFFERED=True 18 | ENV EXCLUDE="ecs-container-exporter,~internal~ecs~pause" 19 | CMD ["ecs-container-exporter"] 20 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Spaced-Out 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: build test 2 | 3 | TAG ?= latest 4 | LOG_LEVEL ?= DEBUG 5 | 6 | build: 7 | docker build . 
-t raags/ecs-container-exporter:${TAG} 8 | 9 | clean: 10 | rm -rf dist/* 11 | 12 | package: clean 13 | pip install --upgrade build 14 | python3 -m build 15 | 16 | pypi: 17 | python3 -m twine upload dist/* 18 | 19 | run: 20 | docker run --rm --name ecs-container-exporter -p 9545:9545 -e ECS_CONTAINER_METADATA_URI=http://10.200.10.1:8080 raags/ecs-container-exporter:${TAG} 21 | # --log-level debug 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ecs-container-exporter 2 | [AWS ECS](https://aws.amazon.com/ecs/) sidecar that exports ECS container-level docker stats metrics to [Prometheus](https://prometheus.io), and can also publish them via StatsD. 3 | 4 | # Motivation 5 | The default metrics available in AWS ECS are limited, and mostly at the task level, aggregated across all containers in a task; container-level metrics are not available. In addition, more detailed cgroup metrics are also missing, such as per-cpu usage and the memory usage breakdown into cache, rss, etc. 6 | 7 | Luckily AWS exposes the [docker stats](https://docs.docker.com/engine/api/v1.40/#operation/ContainerInspect) data via a [Task metadata endpoint](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint.html). 8 | 9 | The ecs-container-exporter parses this data, and can expose it to Prometheus or push it via StatsD. 10 | 11 | # Usage 12 | Install via Pip: 13 | 14 | ``` 15 | $ pip3 install ecs-container-exporter 16 | ``` 17 | 18 | or via docker: 19 | 20 | ``` 21 | $ docker pull raags/ecs-container-exporter 22 | ``` 23 | 24 | On ECS, add the following json to the task definition: 25 | 26 | ```json 27 | { 28 | "name": "ecs-container-exporter", 29 | "image": "raags/ecs-container-exporter:latest", 30 | "portMappings": [ 31 | { 32 | "hostPort": 0, 33 | "protocol": "tcp", 34 | "containerPort": 9545 35 | } 36 | ], 37 | "command": [], 38 | "cpu": 256, 39 | "dockerLabels": { 40 | "PROMETHEUS_EXPORTER_PORT": "9545" 41 | } 42 | } 43 | ``` 44 | The `PROMETHEUS_EXPORTER_PORT` label is for ECS discovery via https://github.com/teralytics/prometheus-ecs-discovery 45 | 46 | To include or exclude application containers, use the `INCLUDE` or `EXCLUDE` environment variables. By default, `ecs-container-exporter` 47 | and `~internal~ecs~pause` (a Fargate internal sidecar) are excluded. 48 | 49 | ## Statsd 50 | 51 | Version `2.0.0` adds StatsD support via the `--use-statsd` flag or the `USE_STATSD` env var. Metrics are emitted in the DogStatsd tag format. 52 | 53 | 54 | # Metrics 55 | 56 | The metrics are sampled twice per the configured `interval` (default 55s), and then aggregated over this interval. This should be set to match the Prometheus scrape interval. 57 | 58 | 59 | ## CPU 60 | 61 | CPU usage ratio is calculated and scaled as per the applicable container or task cpu limit: 62 | 63 | | Task Limit | Container Limit | Task Metric | Container Metric | 64 | | --- | --- | --- | --- | 65 | | 0 | 0 | no scaling | no scaling | 66 | | 0 | x | no scaling | scale cpu (can burst above 100%) | 67 | | x | 0 | scale as per limit | scale as per task limit | 68 | | x | x | scale as per limit | scale as per container limit (can burst above 100%) | 69 | 70 | Note that unlike the `docker stats` command and others, CPU usage is not scaled to 71 | the number of CPUs. This means a task with 4 CPUs, all fully utilized, will 72 | show up as 400% in `docker stats`, but as 100% here. 
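In code, the scaling boils down to the following sketch (a condensed version of `normalize_cpu_usage` and `scale_cpu_usage` in `cpu_metrics.py`; the `scale` helper name and the example numbers are illustrative):

```python
# The raw ratio is container cpu time delta / host cpu time delta, i.e. the
# fraction of *total host* CPU used. It is then rescaled against the
# applicable cpu shares limit; host_cpu_count * 1024 is the host's total
# available cpu shares.
def scale(usage_ratio, host_cpu_count, task_cpu_limit, container_cpu_limit):
    if task_cpu_limit == 0 and container_cpu_limit == 0:
        return usage_ratio  # no limits: ratio of total host CPU
    limit = container_cpu_limit or task_cpu_limit  # container limit wins
    return (usage_ratio * host_cpu_count * 1024) / limit

# 10% of a 4-CPU host, against a 1024-share (1 vCPU) container limit:
print(scale(0.1, 4, 0, 1024))  # 0.4, i.e. 40% of the container's allotment
```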
73 | 74 | ## Memory 75 | 76 | Memory usage and cache are emitted separately; note that memory usage also includes cache, so 77 | subtract cache from it to plot application memory usage specifically. 78 | 79 | ## IO 80 | Disk read and write `bytes` and `iops` totals are emitted per container, and summed at the task level. 81 | ## Network 82 | 83 | Network metrics are available from Fargate 1.4.0 and ECS agent 1.41.0 onwards. 84 | 85 | # TODO 86 | - [ ] Support non-ECS docker host containers -------------------------------------------------------------------------------- /ecs_container_exporter/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '2.4.0' 2 | -------------------------------------------------------------------------------- /ecs_container_exporter/cpu_metrics.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from ecs_container_exporter.utils import create_metric, task_metric_tags, TASK_CONTAINER_NAME_TAG 3 | 4 | log = logging.getLogger(__name__) 5 | 6 | 7 | def calculate_cpu_metrics(docker_stats, task_cpu_limit, task_container_limits, 8 | task_container_tags): 9 | """ 10 | Calculate cpu metrics based on the docker container stats json. 11 | 12 | The precpu_stats sample from the docker stats api covers only a 1s interval, 13 | which is too small; instead we sample twice over a longer interval. 14 | https://github.com/moby/moby/blob/b50ba3da1239a56456634f74660f43a27df6b3f2/daemon/daemon.go#L1055 15 | 16 | `docker_stats` is a list with two samples over an interval. CPU usage 17 | is calculated by comparing the `cpu_stats` of the first and second sample. 18 | 19 | Task and Container CPU usage is `scaled` against the applicable CPU 20 | limits. This is more useful than the raw values, which give 21 | utilization against the Host CPU. 22 | 23 | Specifically, the calculation is based on the following data. 
24 | 25 | "cpu_stats": { 26 | "cpu_usage": { 27 | "percpu_usage": [ 28 | 826860687, 29 | 830807540, 30 | 823365887, 31 | 844077056 32 | ], 33 | "total_usage": 3325111170, 34 | "usage_in_kernelmode": 1620000000, 35 | "usage_in_usermode": 1600000000 36 | }, 37 | "online_cpus": 4, 38 | "system_cpu_usage": 35595977360000000, 39 | "throttling_data": { 40 | "periods": 0, 41 | "throttled_periods": 0, 42 | "throttled_time": 0 43 | } 44 | }, 45 | 46 | """ 47 | # Container_id to metrics mapping 48 | metrics_by_container = {} 49 | # Total task cpu usage 50 | task_cpu_usage_ratio = 0.0 51 | 52 | prev_stats = docker_stats[0] 53 | stats = docker_stats[1] 54 | online_cpus = get_online_cpus(stats) 55 | for container_id, container_stats in stats.items(): 56 | metrics = [] 57 | tags = task_container_tags[container_id] 58 | 59 | cpu_stats = container_stats.get('cpu_stats', {}) 60 | prev_cpu_stats = prev_stats[container_id].get('cpu_stats', {}) 61 | log.debug(f'CPU Stats: {container_id} - curr {cpu_stats} - prev {prev_cpu_stats}') 62 | 63 | # `cpu_stats` may sometimes be empty when the container has just started 64 | if not prev_cpu_stats: 65 | continue 66 | 67 | usage_delta = cpu_usage_diff(cpu_stats, prev_cpu_stats) 68 | # system_cpu_usage is the sum of `all` cpu usage on the `host`: 69 | # https://github.com/moby/moby/blob/54d88a7cd366fd8169b8a96bec5d9f303d57c425/daemon/stats/collector_unix.go#L31 70 | # 71 | system_delta = system_cpu_usage_diff(cpu_stats, prev_cpu_stats) 72 | log.debug(f'usage_delta {usage_delta} system_delta {system_delta}') 73 | 74 | cpu_usage_ratio = calculate_cpu_usage(usage_delta, system_delta) 75 | log.debug(f'Usage Ratio {cpu_usage_ratio} {online_cpus}') 76 | # add to total task cpu usage 77 | task_cpu_usage_ratio += cpu_usage_ratio 78 | 79 | # Scale cpu usage against the Task or Container Limit 80 | container_cpu_limit = task_container_limits[container_id]['cpu'] 81 | value = normalize_cpu_usage(cpu_usage_ratio, online_cpus, 82 | task_cpu_limit, 83 | container_cpu_limit) 84 | 85 | metric = create_metric('cpu_usage_ratio', round(value, 4), tags, 'gauge', 86 | 'CPU usage ratio') 87 | metrics.append(metric) 88 | 89 | # Per cpu metrics 90 | metrics.extend( 91 | percpu_metrics(cpu_stats, prev_cpu_stats, system_delta, online_cpus, 92 | task_cpu_limit, container_cpu_limit, tags) 93 | ) 94 | 95 | metrics.extend( 96 | cpu_user_kernelmode_metrics(cpu_stats, prev_cpu_stats, tags) 97 | ) 98 | 99 | # CPU throttling 100 | # TODO: figure out how this will look at the Task level 101 | metrics.extend( 102 | cpu_throttle_metrics(cpu_stats, tags) 103 | ) 104 | 105 | metrics_by_container[container_id] = metrics 106 | 107 | # total task cpu ratio 108 | if task_cpu_limit != 0: 109 | task_cpu_usage_ratio = scale_cpu_usage(task_cpu_usage_ratio, online_cpus, task_cpu_limit) 110 | 111 | tags = task_metric_tags() 112 | metric = create_metric('cpu_usage_ratio', round(task_cpu_usage_ratio, 4), tags, 'gauge', 113 | 'CPU usage ratio') 114 | 115 | metrics_by_container[TASK_CONTAINER_NAME_TAG] = [metric] 116 | 117 | return metrics_by_container 118 | 119 | 120 | def cpu_throttle_metrics(cpu_stats, tags): 121 | metrics = [] 122 | throttling_data = cpu_stats.get('throttling_data') 123 | for mkey, mvalue in throttling_data.items(): 124 | metric = create_metric('cpu_throttle_' + mkey, mvalue, tags, 'gauge') 125 | metrics.append(metric) 126 | 127 | return metrics 128 | 129 | 130 | def cpu_user_kernelmode_metrics(cpu_stats, prev_cpu_stats, tags): 131 | # user and kernel mode split 132 | curr_stats = cpu_stats['cpu_usage'] 133 | 
prev_stats = prev_cpu_stats['cpu_usage'] 134 | kernelmode_delta = curr_prev_diff(curr_stats, prev_stats, 'usage_in_kernelmode') 135 | kernelmode_metric = create_metric('cpu_kernelmode', kernelmode_delta, tags, 'gauge', 136 | 'cpu usage in kernel mode') 137 | 138 | usermode_delta = curr_prev_diff(curr_stats, prev_stats, 'usage_in_usermode') 139 | usermode_metric = create_metric('cpu_usermode', usermode_delta, tags, 'gauge', 140 | 'cpu usage in user mode') 141 | 142 | return (kernelmode_metric, usermode_metric) 143 | 144 | 145 | def percpu_metrics(cpu_stats, prev_cpu_stats, system_delta, online_cpus, 146 | task_cpu_limit, container_cpu_limit, tags): 147 | metrics = [] 148 | percpu_usage = cpu_stats.get('cpu_usage', {}).get('percpu_usage', []) 149 | prev_percpu_usage = prev_cpu_stats.get('cpu_usage', {}).get('percpu_usage', []) 150 | 151 | if percpu_usage and prev_percpu_usage: 152 | for i, value in enumerate(percpu_usage): 153 | # skip inactive CPUs - https://github.com/torvalds/linux/commit/5ca3726 154 | if value != 0: 155 | cpu_tags = tags.copy() 156 | cpu_tags['cpu'] = str(i) 157 | if prev_percpu_usage[i] and system_delta: 158 | usage_delta = float(value) - float(prev_percpu_usage[i]) 159 | usage_ratio = usage_delta / system_delta 160 | else: 161 | usage_ratio = 0.0 162 | 163 | value = normalize_cpu_usage(usage_ratio, online_cpus, 164 | task_cpu_limit, 165 | container_cpu_limit) 166 | # report the normalized value, not the raw host ratio 167 | metric = create_metric('percpu_usage_ratio', round(value, 4), cpu_tags, 'gauge', 168 | 'Per CPU usage ratio') 169 | metrics.append(metric) 170 | 171 | return metrics 172 | 173 | 174 | def calculate_cpu_usage(usage_delta, system_delta): 175 | if usage_delta and system_delta: 176 | # Keep it to 100% instead of scaling by number of cpus : 177 | # https://github.com/moby/moby/issues/29306#issuecomment-405198198 178 | # 179 | cpu_usage_ratio = usage_delta / system_delta 180 | else: 181 | cpu_usage_ratio = 0.0 182 | 183 | return cpu_usage_ratio 184 | 185 | 186 | def system_cpu_usage_diff(cpu_stats, prev_cpu_stats): 187 | return curr_prev_diff(cpu_stats, prev_cpu_stats, 'system_cpu_usage') 188 | 189 | 190 | def cpu_usage_diff(cpu_stats, prev_cpu_stats): 191 | return curr_prev_diff(cpu_stats['cpu_usage'], prev_cpu_stats['cpu_usage'], 192 | 'total_usage') 193 | 194 | 195 | def curr_prev_diff(curr_stats, prev_stats, metric_key): 196 | curr = curr_stats[metric_key] 197 | prev = prev_stats[metric_key] 198 | if curr and prev: 199 | return curr - prev 200 | else: 201 | return 0.0 202 | 203 | 204 | def scale_cpu_usage(cpu_usage_ratio, host_cpu_count, cpu_share_limit): 205 | """ 206 | Scales a cpu ratio originally calculated against total host cpu capacity, 207 | by the corresponding cpu shares limit (at the task or container level). 208 | 209 | host_cpu_count is multiplied by 1024 to get the total available cpu shares. 210 | 211 | """ 212 | scaled_cpu_usage_ratio = (cpu_usage_ratio * host_cpu_count * 1024) / cpu_share_limit 213 | return scaled_cpu_usage_ratio 214 | 215 | 216 | def normalize_cpu_usage(cpu_usage_ratio, host_cpu_count, 217 | task_cpu_limit, container_cpu_limit): 218 | """ 219 | Task Limit - Container Limit - Cpu metric 220 | 0 - 0 - no scaling 221 | 0 - x - scale against container limit 222 | x - x - scale against container limit 223 | x - 0 - scale against task limit 224 | 225 | """ 226 | log.debug(f'Normalize CPU: {cpu_usage_ratio} - {host_cpu_count} - {task_cpu_limit} - {container_cpu_limit}') 227 | if task_cpu_limit == 0 and container_cpu_limit == 0: 228 | # no scaling, task can use all of host CPU 
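# (the returned ratio is then relative to total host capacity, e.g. 0.25 means 25% of all host CPUs combined) 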
229 | return cpu_usage_ratio 230 | 231 | if container_cpu_limit != 0: 232 | # case 2 and 3 above 233 | # this can go above 100% when task_cpu_limit is 0 234 | # 235 | return scale_cpu_usage(cpu_usage_ratio, host_cpu_count, container_cpu_limit) 236 | 237 | if task_cpu_limit != 0: 238 | # case 4, should not go above 100% as cpu usage 239 | # is limited by task_cpu_limit 240 | # 241 | return scale_cpu_usage(cpu_usage_ratio, host_cpu_count, task_cpu_limit) 242 | 243 | 244 | def get_online_cpus(stats): 245 | """ 246 | Return online_cpus from the first container; assume this number is the same for all containers. 247 | 248 | """ 249 | for container_id, container_stats in stats.items(): 250 | return container_stats['cpu_stats']['online_cpus'] 251 | -------------------------------------------------------------------------------- /ecs_container_exporter/io_metrics.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from collections import defaultdict 3 | from ecs_container_exporter.utils import create_metric, create_task_metrics, TASK_CONTAINER_NAME_TAG 4 | 5 | log = logging.getLogger() 6 | 7 | 8 | def calculate_io_metrics(stats, task_container_tags): 9 | """ 10 | Calculate IO metrics from the below data: 11 | 12 | "blkio_stats": { 13 | "io_merged_recursive": [], 14 | "io_queue_recursive": [], 15 | "io_service_bytes_recursive": [ 16 | { 17 | "major": 259, 18 | "minor": 0, 19 | "op": "Read", 20 | "value": 10653696 21 | }, 22 | { 23 | "major": 259, 24 | "minor": 0, 25 | "op": "Write", 26 | "value": 0 27 | }, 28 | { 29 | "major": 259, 30 | "minor": 0, 31 | "op": "Sync", 32 | "value": 10653696 33 | }, 34 | { 35 | "major": 259, 36 | "minor": 0, 37 | "op": "Async", 38 | "value": 0 39 | }, 40 | { 41 | "major": 259, 42 | "minor": 0, 43 | "op": "Total", 44 | "value": 10653696 45 | } 46 | ], 47 | "io_service_time_recursive": [], 48 | "io_serviced_recursive": [ 49 | { 50 | "major": 259, 51 | "minor": 0, 52 | "op": "Read", 53 | "value": 164 54 | }, 55 | { 56 | "major": 259, 57 | "minor": 0, 58 | "op": "Write", 59 | "value": 0 60 | }, 61 | { 62 | "major": 259, 63 | "minor": 0, 64 | "op": "Sync", 65 | "value": 164 66 | }, 67 | { 68 | "major": 259, 69 | "minor": 0, 70 | "op": "Async", 71 | "value": 0 72 | }, 73 | { 74 | "major": 259, 75 | "minor": 0, 76 | "op": "Total", 77 | "value": 164 78 | } 79 | ], 80 | "io_time_recursive": [], 81 | "io_wait_time_recursive": [], 82 | "sectors_recursive": [] 83 | }, 84 | """ 85 | metrics_by_container = {} 86 | # task level metrics 87 | task_metrics = defaultdict(int) 88 | # assume these will always be gauge 89 | metric_type = 'gauge' 90 | for container_id, container_stats in stats.items(): 91 | metrics = [] 92 | blkio_stats = container_stats.get('blkio_stats') 93 | 94 | iostats = {'io_service_bytes_recursive': 'bytes', 'io_serviced_recursive': 'iops'} 95 | for blk_key, blk_type in iostats.items(): 96 | tags = task_container_tags[container_id] 97 | read_counter = write_counter = 0 98 | for blk_stat in blkio_stats.get(blk_key, []): 99 | if blk_stat['op'] == 'Read' and 'value' in blk_stat: 100 | read_counter += blk_stat['value'] 101 | elif blk_stat['op'] == 'Write' and 'value' in blk_stat: 102 | write_counter += blk_stat['value'] 103 | 104 | metrics.append( 105 | create_metric('disk_read_' + blk_type + '_total', read_counter, tags, 106 | metric_type, 'Total disk read ' + blk_type) 107 | ) 108 | metrics.append( 109 | create_metric('disk_written_' + blk_type + '_total', write_counter, tags, 110 | metric_type, 'Total disk written ' + blk_type) 111 | ) 112 | 
task_metrics['disk_read_' + blk_type + '_total'] += read_counter 113 | task_metrics['disk_written_' + blk_type + '_total'] += write_counter 114 | 115 | metrics_by_container[container_id] = metrics 116 | 117 | # task level metrics 118 | metrics_by_container[TASK_CONTAINER_NAME_TAG] = create_task_metrics(task_metrics, metric_type) 119 | 120 | return metrics_by_container 121 | -------------------------------------------------------------------------------- /ecs_container_exporter/main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import sys 3 | import time 4 | import click 5 | import pickle 6 | import signal 7 | from requests.compat import urljoin 8 | 9 | from prometheus_client import start_http_server 10 | from prometheus_client.core import REGISTRY 11 | 12 | from ecs_container_exporter import utils 13 | from ecs_container_exporter.utils import create_metric 14 | from ecs_container_exporter.cpu_metrics import calculate_cpu_metrics 15 | from ecs_container_exporter.memory_metrics import calculate_memory_metrics 16 | from ecs_container_exporter.io_metrics import calculate_io_metrics 17 | from ecs_container_exporter.network_metrics import calculate_network_metrics 18 | 19 | import logging 20 | log = logging.getLogger(__name__) 21 | 22 | 23 | class ECSContainerExporter(object): 24 | 25 | include_containers = [] 26 | exclude_containers = [] 27 | # 1 - healthy, 0 - unhealthy 28 | exporter_status = 1 29 | # initial task metrics that do not change 30 | static_task_metrics = [] 31 | # individual container tags 32 | task_container_tags = {} 33 | # task limits 34 | task_cpu_limit = 0 35 | task_mem_limit = 0 36 | # individual container limits 37 | task_container_limits = {} 38 | # the Task level metrics are included by default 39 | include_container_ids = [utils.TASK_CONTAINER_NAME_TAG] 40 | # stats are collected and aggregated across this interval 41 | # default to 55s for a scrape_interval and scrape_timeout of 1m 42 | # should be set to 60 when using statsd, which doesn't have this issue 43 | interval = 55 44 | 45 | def __init__(self, metadata_url=None, include_containers=None, exclude_containers=None, interval=55, http_timeout=60): 46 | 47 | self.task_metadata_url = urljoin(metadata_url + '/', 'task') 48 | # For testing 49 | # self.task_stats_url = urljoin(metadata_url + '/', 'stats') 50 | self.task_stats_url = urljoin(metadata_url + '/', 'task/stats') 51 | 52 | if exclude_containers: 53 | self.exclude_containers = exclude_containers 54 | if include_containers: 55 | self.include_containers = include_containers 56 | 57 | self.interval = interval 58 | self.http_timeout = http_timeout 59 | 60 | self.log = logging.getLogger(__name__) 61 | # caching, mainly for the statsd metric type, since prometheus will run this 62 | # only once in its lifetime 63 | self.cache_file_path = '/tmp/ecs-container-exporter.cache' 64 | self.collect_static_metrics() 65 | 66 | def start_prometheus_exporter(self, exporter_port): 67 | self.log.info(f'Exporter initialized with ' 68 | f'metadata_url: {self.task_metadata_url}, ' 69 | f'task_stats_url: {self.task_stats_url}, ' 70 | f'http_timeout: {self.http_timeout}, ' 71 | f'include_containers: {self.include_containers}, ' 72 | f'exclude_containers: {self.exclude_containers}') 73 | 74 | REGISTRY.register(self) 75 | # Start exporter http service 76 | start_http_server(int(exporter_port)) 77 | while True: 78 | time.sleep(10) 79 | 80 | def send_statsd_metrics(self, statsd_host, statsd_port): 81 | 
utils.init_statsd_client(statsd_host, statsd_port) 82 | for metric in self.collect_all_metrics(): 83 | utils.send_statsd(metric) 84 | 85 | def cache_load_task_metadata(self): 86 | try: 87 | with open(self.cache_file_path, 'rb') as cf: 88 | return pickle.load(cf) 89 | except FileNotFoundError: 90 | pass 91 | 92 | return None 93 | 94 | def cache_write_task_metadata(self, data): 95 | with open(self.cache_file_path, 'wb') as cf: 96 | pickle.dump(data, cf) 97 | 98 | def collect_static_metrics(self): 99 | metadata = self.cache_load_task_metadata() 100 | if metadata: 101 | self.log.debug(f'Using cached task metadata response from {self.cache_file_path}') 102 | else: 103 | retries = 24 104 | while retries > 0: 105 | # wait for the task to be in running state 106 | time.sleep(5) 107 | retries -= 1 108 | try: 109 | metadata = utils.ecs_task_metadata(self.task_metadata_url, self.http_timeout) 110 | self.cache_write_task_metadata(metadata) 111 | except Exception as e: 112 | self.exporter_status = 0 113 | self.log.exception(e) 114 | continue 115 | 116 | if metadata.get('KnownStatus') != 'RUNNING': 117 | self.log.warning(f'ECS Task not yet in RUNNING state, current status is: {metadata["KnownStatus"]}') 118 | continue 119 | else: 120 | break 121 | 122 | self.exporter_status = 1 123 | 124 | self.log.debug(f'Discovered Task metadata: {metadata}') 125 | self.parse_task_metadata(metadata) 126 | 127 | def parse_task_metadata(self, metadata): 128 | self.static_task_metrics = [] 129 | self.task_container_tags = {} 130 | self.task_container_limits = {} 131 | 132 | # task cpu/mem limit 133 | task_tag = utils.task_metric_tags() 134 | self.task_cpu_limit, self.task_mem_limit = self.cpu_mem_limit(metadata, normalise_cpu=True) 135 | 136 | metric = create_metric('cpu_limit', self.task_cpu_limit, task_tag, 'gauge', 'Task CPU limit') 137 | self.static_task_metrics.append(metric) 138 | 139 | metric = create_metric('mem_limit', self.task_mem_limit, task_tag, 'gauge', 'Task Memory limit') 140 | self.static_task_metrics.append(metric) 141 | 142 | # container tags and limits 143 | for container in metadata['Containers']: 144 | container_id = container['DockerId'] 145 | container_name = container['Name'] 146 | 147 | if self.should_process_container(container_name, 148 | self.include_containers, 149 | self.exclude_containers): 150 | self.log.info(f'Processing stats for container: {container_name} - {container_id}') 151 | self.include_container_ids.append(container_id) 152 | else: 153 | self.log.info(f'Excluding container: {container_name} - {container_id} as per exclusion rules') 154 | 155 | self.task_container_tags[container_id] = {'container_name': container_name} 156 | 157 | # container cpu/mem limit 158 | cpu_value, mem_value = self.cpu_mem_limit(container) 159 | self.task_container_limits[container_id] = {'cpu': cpu_value, 'mem': mem_value} 160 | 161 | if container_id in self.include_container_ids: 162 | metric = create_metric('cpu_limit', cpu_value, self.task_container_tags[container_id], 163 | 'gauge', 'Container CPU limit in cpu shares') 164 | self.static_task_metrics.append(metric) 165 | 166 | metric = create_metric('mem_limit', mem_value, self.task_container_tags[container_id], 167 | 'gauge', 'Container Memory limit in MBs') 168 | self.static_task_metrics.append(metric) 169 | 170 | def should_process_container(self, container_name, include_containers, exclude_containers): 171 | if container_name in exclude_containers: 172 | return False 173 | else: 174 | if include_containers: 175 | if container_name in 
include_containers: 176 | return True 177 | else: 178 | return False 179 | else: 180 | return True 181 | 182 | def cpu_mem_limit(self, metadata, normalise_cpu=False): 183 | # normalise to `cpu shares` 184 | cpu_limit = metadata.get('Limits', {}).get('CPU', 0) * (1024 if normalise_cpu else 1) 185 | mem_limit = metadata.get('Limits', {}).get('Memory', 0) 186 | 187 | return ( 188 | cpu_limit, mem_limit 189 | ) 190 | 191 | # All metrics are collected here 192 | def collect_all_metrics(self): 193 | container_metrics = self.collect_container_metrics() 194 | 195 | # exporter status metric 196 | metric = create_metric('exporter_status', self.exporter_status, {}, 197 | 'gauge', 'Exporter Status') 198 | container_metrics.append(metric) 199 | 200 | return self.static_task_metrics + container_metrics 201 | 202 | # prometheus exporter collect function 203 | def collect(self): 204 | for metric in self.collect_all_metrics(): 205 | yield utils.create_prometheus_metric(metric) 206 | 207 | def collect_container_metrics(self): 208 | metrics = [] 209 | docker_stats = [] 210 | # collect two stats samples as per the configured interval 211 | try: 212 | docker_stats.append( 213 | utils.ecs_task_metadata(self.task_stats_url, self.http_timeout) 214 | ) 215 | # the ECS stats API returns immediately, but docker stats waits ~1s; adjust this accordingly 216 | time.sleep(self.interval) 217 | 218 | docker_stats.append( 219 | utils.ecs_task_metadata(self.task_stats_url, self.http_timeout) 220 | ) 221 | if not all(docker_stats): 222 | raise Exception('Some stats are empty, try again') 223 | 224 | self.exporter_status = 1 225 | 226 | except Exception as e: 227 | self.exporter_status = 0 228 | self.log.exception(e) 229 | return metrics 230 | 231 | container_metrics_all = self.parse_container_metadata(docker_stats, 232 | self.task_cpu_limit, 233 | self.task_container_limits, 234 | self.task_container_tags) 235 | 236 | # flatten and filter excluded containers 237 | filtered_container_metrics = [] 238 | for metrics_by_container in container_metrics_all: 239 | for container_id, metrics in metrics_by_container.items(): 240 | if container_id in self.include_container_ids: 241 | filtered_container_metrics.extend(metrics) 242 | 243 | return filtered_container_metrics 244 | 245 | def parse_container_metadata(self, docker_stats, task_cpu_limit, 246 | task_container_limits, task_container_tags): 247 | """ 248 | More details on the exposed docker metrics 249 | https://github.com/moby/moby/blob/c1d090fcc88fa3bc5b804aead91ec60e30207538/api/types/stats.go 250 | 251 | """ 252 | container_metrics_all = [] 253 | 254 | # ignore stats for containers in STOPPED state 255 | for container_id in list(docker_stats[1].keys()): 256 | if not docker_stats[1][container_id]: 257 | del docker_stats[0][container_id] 258 | del docker_stats[1][container_id] 259 | self.log.debug(f'Ignoring null stats for container_id {container_id}') 260 | 261 | try: 262 | # CPU metrics 263 | container_metrics_all.append( 264 | calculate_cpu_metrics(docker_stats, 265 | task_cpu_limit, 266 | task_container_limits, 267 | task_container_tags) 268 | ) 269 | 270 | # Memory metrics 271 | container_metrics_all.append( 272 | calculate_memory_metrics(docker_stats[1], task_container_tags) 273 | ) 274 | 275 | # I/O metrics 276 | container_metrics_all.append( 277 | calculate_io_metrics(docker_stats[1], task_container_tags) 278 | ) 279 | 280 | # network metrics 281 | container_metrics_all.append( 282 | calculate_network_metrics(docker_stats[1], task_container_tags) 283 | ) 284 | 285 | 
except Exception as e: 286 | self.log.warning("Could not retrieve metrics for {}: {}".format(task_container_tags, e), exc_info=True) 287 | self.exporter_status = 0 288 | 289 | return container_metrics_all 290 | 291 | 292 | def shutdown(sig_number, frame): 293 | log.info("Received signal {}, shutting down".format(sig_number)) 294 | sys.exit(0) 295 | 296 | 297 | @click.command() 298 | @click.option('--metadata-url', envvar='ECS_CONTAINER_METADATA_URI', type=str, default=None, 299 | help='Override ECS Metadata Url') 300 | @click.option('--exporter-port', envvar='EXPORTER_PORT', type=int, default=9545, 301 | help='Change exporter listen port') 302 | @click.option('--use-statsd', envvar='USE_STATSD', is_flag=True, type=bool, default=False, 303 | help='Emit metrics to statsd instead of starting Prometheus exporter') 304 | @click.option('--statsd-host', envvar='STATSD_HOST', type=str, default='localhost', 305 | help='Override Statsd Host') 306 | @click.option('--statsd-port', envvar='STATSD_PORT', type=int, default=8125, 307 | help='Override Statsd Port') 308 | @click.option('--include', envvar='INCLUDE', type=str, default=None, 309 | help='Comma-separated list of container names to include, or use env var INCLUDE') 310 | @click.option('--exclude', envvar='EXCLUDE', type=str, default=None, 311 | help='Comma-separated list of container names to exclude, or use env var EXCLUDE') 312 | @click.option('--interval', envvar='INTERVAL', type=int, default=55, 313 | help='Stats collection and aggregation interval in seconds (specifically for CPU stats)') 314 | @click.option('--log-level', envvar='LOG_LEVEL', type=str, default='INFO', 315 | help='Log level, default: INFO') 316 | def main( 317 | metadata_url=None, exporter_port=9545, use_statsd=False, statsd_port=8125, statsd_host='localhost', 318 | include=None, exclude=None, interval=55, log_level='INFO' 319 | ): 320 | if not metadata_url: 321 | sys.exit('AWS environment variable ECS_CONTAINER_METADATA_URI not found ' 322 | 'nor is --metadata-url set') 323 | 324 | signal.signal(signal.SIGTERM, shutdown) 325 | signal.signal(signal.SIGINT, shutdown) 326 | 327 | logging.basicConfig( 328 | format='%(asctime)s:%(levelname)s:%(message)s', 329 | ) 330 | logging.getLogger().setLevel( 331 | getattr(logging, log_level.upper()) 332 | ) 333 | 334 | if exclude: 335 | exclude = exclude.strip().split(',') 336 | if include: 337 | include = include.strip().split(',') 338 | 339 | collector = ECSContainerExporter(metadata_url=metadata_url, 340 | include_containers=include, 341 | exclude_containers=exclude, 342 | interval=interval) 343 | 344 | if use_statsd: 345 | collector.send_statsd_metrics(statsd_host, statsd_port) 346 | else: 347 | collector.start_prometheus_exporter(exporter_port) 348 | 349 | 350 | if __name__ == '__main__': 351 | main() 352 | -------------------------------------------------------------------------------- /ecs_container_exporter/memory_metrics.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from collections import defaultdict 3 | from ecs_container_exporter.utils import create_metric, TASK_CONTAINER_NAME_TAG, create_task_metrics 4 | 5 | log = logging.getLogger() 6 | # Default value is maxed out for some cgroup metrics 7 | CGROUP_NO_VALUE = 0x7FFFFFFFFFFFF000 8 | 9 | 10 | def calculate_memory_metrics(stats, task_container_tags): 11 | metrics_by_container = {} 12 | # task level metrics 13 | task_metrics = defaultdict(int) 14 | metric_type = 'gauge' 15 | for container_id, container_stats in 
stats.items(): 16 | metrics = [] 17 | tags = task_container_tags[container_id] 18 | memory_stats = container_stats.get('memory_stats', {}) 19 | log.debug(f'Memory Stats: {container_id} - {memory_stats}') 20 | 21 | for mkey in ['cache']: 22 | value = memory_stats.get('stats', {}).get(mkey) 23 | # some values have a default garbage value; missing keys return None 24 | if value is not None and value < CGROUP_NO_VALUE: 25 | metric = create_metric('mem_' + mkey, value, tags, metric_type) 26 | metrics.append(metric) 27 | if mkey == 'cache': 28 | task_metrics['mem_cache'] += value 29 | 30 | for mkey in ['usage']: 31 | value = memory_stats.get(mkey) 32 | # some values have a default garbage value; missing keys return None 33 | if value is not None and value < CGROUP_NO_VALUE: 34 | metric = create_metric('mem_' + mkey, value, tags, metric_type) 35 | metrics.append(metric) 36 | if mkey == 'usage': 37 | task_metrics['mem_usage'] += value 38 | 39 | metrics_by_container[container_id] = metrics 40 | 41 | # task level metrics 42 | metrics_by_container[TASK_CONTAINER_NAME_TAG] = create_task_metrics(task_metrics, metric_type) 43 | 44 | return metrics_by_container 45 | -------------------------------------------------------------------------------- /ecs_container_exporter/network_metrics.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from collections import defaultdict 3 | from .utils import create_metric, create_task_metrics, TASK_CONTAINER_NAME_TAG 4 | 5 | log = logging.getLogger() 6 | 7 | 8 | def calculate_network_metrics(stats, task_container_tags): 9 | """ 10 | "networks": { 11 | "eth1": { 12 | "rx_bytes": 564655295, 13 | "rx_packets": 384960, 14 | "rx_errors": 0, 15 | "rx_dropped": 0, 16 | "tx_bytes": 3043269, 17 | "tx_packets": 54355, 18 | "tx_errors": 0, 19 | "tx_dropped": 0 20 | } 21 | } 22 | 23 | """ 24 | metrics_by_container = {} 25 | # task level metrics 26 | task_metrics = defaultdict(int) 27 | # assume these will always be gauge 28 | metric_type = 'gauge' 29 | for container_id, container_stats in stats.items(): 30 | metrics = [] 31 | tags = task_container_tags[container_id] 32 | network_stats = container_stats.get('networks') 33 | if not network_stats: 34 | continue 35 | 36 | for iface, iface_stats in network_stats.items(): 37 | iface_tag = tags.copy() 38 | iface_tag['iface'] = iface 39 | 40 | for stat, value in iface_stats.items(): 41 | metrics.append( 42 | create_metric('network_' + stat + '_total', value, iface_tag, 43 | metric_type, 'Network ' + stat) 44 | ) 45 | # assume single iface 46 | task_metrics['network_' + stat + '_total'] += value 47 | 48 | metrics_by_container[container_id] = metrics 49 | 50 | # Task metrics 51 | metrics_by_container[TASK_CONTAINER_NAME_TAG] = create_task_metrics(task_metrics, metric_type) 52 | 53 | return metrics_by_container 54 | -------------------------------------------------------------------------------- /ecs_container_exporter/utils.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import logging 3 | from collections import namedtuple 4 | from datadog.dogstatsd.base import DogStatsd 5 | from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily 6 | 7 | # special container name tag for Task level (instead of container) metrics 8 | TASK_CONTAINER_NAME_TAG = '_task_' 9 | METRIC_PREFIX = 'ecs_task' 10 | 11 | logging.basicConfig( 12 | format='%(asctime)s:%(levelname)s:%(message)s', 13 | ) 14 | 15 | Metric = namedtuple('Metric', ['name', 'value', 'tags', 'type', 'desc']) 16 | 17 | statsd = None 18 | 19 | 20 | def 
init_statsd_client(statsd_host='localhost', statsd_port=8125): 21 | global statsd 22 | if not statsd: 23 | statsd = DogStatsd( 24 | statsd_host, statsd_port, 25 | use_ms=True, 26 | namespace=METRIC_PREFIX 27 | ) 28 | 29 | 30 | def get_logger(log_level): 31 | return logging.getLogger() 32 | 33 | 34 | def create_metric(name, value, tags, type, desc=''): 35 | return Metric(name, value, tags, type, desc) 36 | 37 | 38 | def create_prometheus_metric(metric): 39 | metric_name = METRIC_PREFIX + '_' + metric.name 40 | if metric.type == 'counter': 41 | pm = CounterMetricFamily(metric_name, metric.desc, labels=metric.tags.keys()) 42 | elif metric.type == 'gauge': 43 | pm = GaugeMetricFamily(metric_name, metric.desc, labels=metric.tags.keys()) 44 | else: 45 | raise Exception(f'Unknown metric type: {metric.type}') 46 | 47 | pm.add_metric(labels=metric.tags.values(), value=metric.value) 48 | return pm 49 | 50 | 51 | def format_dogstatsd_tags(tags): 52 | """ 53 | {k: v} => ['k:v'] 54 | 55 | """ 56 | return [f'{k}:{v}' for k, v in tags.items()] 57 | 58 | 59 | def send_statsd(metric): 60 | global statsd 61 | tags = format_dogstatsd_tags(metric.tags) 62 | if metric.type == 'counter': 63 | statsd.increment(metric.name, metric.value, tags=tags) 64 | elif metric.type == 'gauge': 65 | statsd.gauge(metric.name, metric.value, tags=tags) 66 | else: 67 | raise Exception(f'Unknown metric type: {metric.type}') 68 | 69 | 70 | def create_task_metrics(task_metrics, metric_type): 71 | """ 72 | Convert the task_metrics key, value hash into actual metrics 73 | 74 | """ 75 | metrics = [] 76 | tags = task_metric_tags() 77 | for mkey, mvalue in task_metrics.items(): 78 | metrics.append( 79 | create_metric(mkey, mvalue, tags, metric_type) 80 | ) 81 | 82 | return metrics 83 | 84 | 85 | def ecs_task_metadata(url, timeout): 86 | response = requests.get(url, timeout=timeout) 87 | 88 | if response.status_code != 200: 89 | raise Exception(f'Error: Non 200 response from url {url}') 90 | 91 | try: 92 | return response.json() 93 | 94 | except ValueError as e: 95 | raise Exception(f'Error decoding json response from url {url}, response {response.text}: {e}') 96 | 97 | 98 | def task_metric_tags(): 99 | """ 100 | Special tag that will apply to `task` level metrics, as opposed to 101 | `container` level metrics. 102 | 103 | """ 104 | return {'container_name': TASK_CONTAINER_NAME_TAG} 105 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | . 
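# '.' installs the package itself; see setup.py for the pinned dependencies 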
2 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import re 2 | import ast 3 | from setuptools import setup 4 | 5 | 6 | _version_re = re.compile(r'__version__\s+=\s+(.*)') 7 | 8 | with open('ecs_container_exporter/__init__.py', 'rb') as f: 9 | version = str(ast.literal_eval(_version_re.search(f.read().decode('utf-8')).group(1))) 10 | 11 | with open('./README.md') as f: 12 | long_desc = f.read() 13 | 14 | 15 | setup( 16 | name='ecs-container-exporter', 17 | version=version, 18 | author='Raghu Udiyar', 19 | author_email='raghusiddarth@gmail.com', 20 | url='https://github.com/Spaced-Out/ecs-container-exporter', 21 | license='MIT License', 22 | description='Prometheus exporter for AWS ECS Task and Container Metrics', 23 | long_description=long_desc, 24 | long_description_content_type="text/markdown", 25 | install_requires=['setuptools>=36.0.0', 26 | 'Click==7.0', 27 | 'prometheus-client==0.7.1', 28 | 'requests==2.22.0', 29 | 'datadog==0.39.0'], 30 | packages=['ecs_container_exporter'], 31 | entry_points={ 32 | 'console_scripts': [ 33 | 'ecs-container-exporter = ecs_container_exporter.main:main' 34 | ] 35 | }, 36 | classifiers=[ 37 | 'Environment :: Console', 38 | 'Intended Audience :: Developers', 39 | 'Programming Language :: Python', 40 | 'Programming Language :: Python :: 3.6', 41 | 'Programming Language :: Python :: 3.7', 42 | ] 43 | ) 44 | --------------------------------------------------------------------------------