├── Dockerfile
├── README.md
├── deployment.yml
├── requirements.txt
└── scheduler.py

/Dockerfile:
--------------------------------------------------------------------------------
FROM ubuntu:14.04

RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install pip --upgrade

COPY requirements.txt .
COPY scheduler.py .
RUN pip3 install -r requirements.txt

ENV NODE_FILTER_QUERY labelSelector=kubernetes.io/role=node

CMD ["python3", "scheduler.py"]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
`sticky-node-scheduler` is a custom [Kubernetes](http://kubernetes.io/)
scheduler that schedules all variants of a given pod on the same node.

If no variant of a pod is running yet, we pick the node that is currently
running the fewest of our sticky-scheduled pods. If any variants of a pod
*are* running, then we place all copies of the pod on that same node.

If you're using e.g. EBS volumes, which can only be attached to a single node
at a time, and need seamless updates of your stateful Deployments, this might
be useful to you.

This scheduler only works with Kubernetes >= 1.6, which supports custom
schedulers as a beta feature.

NOTE: This is **not** ready for production usage.


How to use
----------

First, deploy the scheduler into your cluster:

    $ kubectl apply -f deployment.yml

Then, for each Deployment or ReplicationController you want to schedule
using this scheduler, add:

    schedulerName: stickToExistingNodeScheduler

to the pod spec. For instance:

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: my-stateful-service
    spec:
      replicas: 2
      template:
        metadata:
          labels:
            app: my-stateful-service
        spec:
          schedulerName: stickToExistingNodeScheduler
          containers:
          - name: my-stateful-service
            image: stateful-service-image


The scheduler identifies 'variants' of your pods based on the
`metadata.labels` entry in your pod template (the automatically generated
`pod-template-hash` label is ignored).


Options
-------

The `NODE_FILTER_QUERY` environment variable controls which nodes are
selected for scheduling. By default, it is set to
`labelSelector=kubernetes.io/role=node`, which selects only worker nodes
(so, for example, we won't schedule onto your master nodes).

The `POLL_FREQUENCY` environment variable controls how long we wait between
polls of the Kubernetes API for changes, in seconds. The default is `0.5`.


License
-------

Copyright (c) 2016, 2017 Shotwell Labs, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/deployment.yml:
--------------------------------------------------------------------------------
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: sticky-node-scheduler
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: sticky-node-scheduler
    spec:
      containers:
      - name: sticky-node-scheduler
        image: shotwell/kubernetes-sticky-node-scheduler:latest
  strategy:
    # We only want one scheduler running at a time, or we will have
    # race conditions.
    type: Recreate
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
requests>=2.10.0
--------------------------------------------------------------------------------
/scheduler.py:
--------------------------------------------------------------------------------
import os
import time
from urllib.parse import urljoin, quote
from collections import defaultdict
import json
import random
from copy import copy, deepcopy
import logging

import requests


API_URL = 'https://kubernetes/api/v1/'
OUR_SCHEDULER_NAME = 'stickToExistingNodeScheduler'
DEFAULT_KUBERNETES_SCHEDULER = 'default-scheduler'
NAMESPACE = 'default'

TOKEN_LOCATION = '/var/run/secrets/kubernetes.io/serviceaccount/token'
CA_BUNDLE_LOCATION = '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'

logging.basicConfig(level=logging.INFO)
_log = logging.getLogger(__name__)

NODE_FILTER_QUERY = os.environ.get('NODE_FILTER_QUERY', '')
SUPPORT_MINIKUBE = int(os.environ.get('SUPPORT_MINIKUBE', '1'))
POLL_FREQUENCY = float(os.environ.get('POLL_FREQUENCY', '0.5'))

logging.getLogger('requests').setLevel(logging.WARNING)
logging.getLogger('urllib3').setLevel(logging.WARNING)

_nodes_scheduled_to = defaultdict(list)
_nodes_to_skip = defaultdict(list)


class ErrorSchedulingPod(Exception):
    pass


class ErrorDeletingPod(Exception):
    pass


class ErrorCreatingPod(Exception):
    pass


class NoValidNodesToScheduleTo(Exception):
    pass


def k8_token_content():
    with open(TOKEN_LOCATION) as f:
        return f.read()


def k8_request(method, url, headers=None, **kwargs):
    if headers is None:
        headers = {}
    f = getattr(requests, method)
    our_headers = copy(headers)
    our_headers['Authorization'] = 'Bearer {}'.format(k8_token_content())

    return f(url, headers=our_headers, verify=CA_BUNDLE_LOCATION, **kwargs)


def get_unscheduled_pods():
    pending_pods_url = urljoin(API_URL, 'pods?fieldSelector=spec.nodeName=')
    r = k8_request('get', pending_pods_url)
    pending_pods_info = r.json()
    return pending_pods_info.get('items', [])
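

# Illustrative sketch only (the name and label values below are hypothetical):
# the handlers in this module rely on just a handful of fields of each pod
# object returned by the API, roughly of this shape:
#
#   {'metadata': {'name': 'my-stateful-service-1234',
#                 'labels': {'app': 'my-stateful-service'}},
#    'spec': {'schedulerName': 'stickToExistingNodeScheduler', 'nodeName': ''},
#    'status': {'phase': 'Pending'}}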


def get_failed_pods():
    failed_pods = urljoin(API_URL, 'pods?fieldSelector=status.phase=Failed')
    r = k8_request('get', failed_pods)
    failed_pods_info = r.json()
    return failed_pods_info.get('items', [])


def escape_jsonpatch_value(value):
    return value.replace('/', '~1')


def _is_pod_running_or_pending(label_selector, pod_name):
    url = urljoin(API_URL, 'pods?labelSelector={}'.format(
        quote(label_selector)))
    r = k8_request('get', url)
    pods_for_selector = r.json()

    for pod in pods_for_selector.get('items', []):
        if pod['metadata']['name'] != pod_name:
            continue
        status = pod.get('status', {}).get('phase', '')
        return status.lower() in ('running', 'pending')

    return True


def get_pod_selector(pod):
    """
    Returns:
        The string representing the labelSelector to select pods of this
        general type.
    """
    labels = pod['metadata'].get('labels', {})
    # We don't use the 'pod-template-hash' label because we want to schedule
    # our new pods onto the node they're currently running on, even if the
    # pod template has been updated (in our case, that's the point!)
    if 'pod-template-hash' in labels:
        del labels['pod-template-hash']

    selector = []
    for k in sorted(labels.keys()):
        selector.append('{}={}'.format(k, labels[k]))
    return ','.join(selector)


def get_nodes():
    """
    Returns:
        A set of node name strings.
    """
    url = urljoin(API_URL, 'nodes?{}'.format(NODE_FILTER_QUERY))
    r = k8_request('get', url)
    result = r.json()
    nodes = result['items']

    nodes = set([n['metadata']['name'] for n in nodes])

    # Support for minikube, which doesn't have the usual
    # 'kubernetes.io/role=node' label
    if SUPPORT_MINIKUBE:
        url = urljoin(API_URL, 'nodes?labelSelector=kubernetes.io/hostname=minikube')
        r = k8_request('get', url)
        result = r.json()
        nodes = nodes.union(set([n['metadata']['name'] for n in result['items']]))

    return nodes


def get_node_running_pod(pod):
    nodes = set()

    label_selector = get_pod_selector(pod)
    url = urljoin(API_URL, 'pods?labelSelector={}'.format(
        quote(label_selector)))
    r = k8_request('get', url)
    pods_for_selector = r.json()

    for candidate in pods_for_selector.get('items', []):
        node_name = candidate['spec'].get('nodeName')
        if not node_name:
            continue
        status = candidate.get('status', {}).get('phase', '')
        if status.lower() not in ('running', 'pending'):
            continue
        nodes.add(node_name)

    assert len(nodes) <= 1, "Pod variants should be running on at most one node"
    return nodes.pop() if nodes else None


def create_pod_definition(pod):
    """
    Args:
        pod: A dictionary describing a pod.

    Returns:
        A pod definition suitable for a create request from the API.
    """
    pod = deepcopy(pod)

    # Remove elements that are not needed in the pod creation
    # definition, or elements that aren't allowed in the pod
    # creation definition.
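    # (For illustration: the pops below strip fields that the API server
    # populates itself -- e.g. creationTimestamp, resourceVersion, selfLink,
    # uid -- along with the old status and the 'kubernetes.io/created-by'
    # annotation, so the copy can be POSTed as a fresh pod.)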
    pod.pop('status', None)
    if 'annotations' in pod['metadata']:
        pod['metadata']['annotations'].pop('kubernetes.io/created-by', None)
    #pod['metadata'].pop('name', None)
    #pod['metadata'].pop('generateName', None)
    pod['metadata'].pop('creationTimestamp', None)
    pod['metadata'].pop('generateTime', None)
    #pod['metadata'].pop('ownerReferences', None)
    pod['metadata'].pop('resourceVersion', None)
    pod['metadata'].pop('selfLink', None)
    pod['metadata'].pop('uid', None)

    return pod


def set_default_scheduler_on_pod(pod):
    # It's currently not possible to change the scheduler on an existing
    # pod -- see https://github.com/kubernetes/kubernetes/issues/24913
    # Because of this, we delete the pod and re-create it with the default
    # scheduler set.
    label_selector = get_pod_selector(pod)

    # We first create the new pod, because otherwise a RC/Deployment
    # may re-create the deleted pod before we can create it.
    new_pod = create_pod_definition(pod)
    del new_pod['spec']['schedulerName']
    new_pod['metadata']['name'] += '-rescheduled'
    create_pod(new_pod)
    record_as_default_scheduled(label_selector, new_pod)
    delete_pod(pod)


def create_pod(pod):
    pod_name = pod['metadata']['name']
    label_selector = get_pod_selector(pod)
    _log.info('Creating pod {} ({})'.format(
        pod_name, label_selector))

    url = urljoin(API_URL, 'namespaces/{}/pods'.format(
        NAMESPACE))

    r = k8_request('post', url, json=pod)
    if r.status_code != 201:
        raise ErrorCreatingPod(
            'There was an error creating pod {}.'.format(
                pod_name))


def delete_pod(pod):
    pod_name = pod['metadata']['name']
    label_selector = get_pod_selector(pod)
    _log.info('Deleting pod {} ({})'.format(
        pod_name, label_selector))

    url = urljoin(API_URL, 'namespaces/{}/pods/{}'.format(
        NAMESPACE, pod_name))

    payload = {
        'apiVersion': 'v1',
        'gracePeriodSeconds': 0,
    }

    r = k8_request('delete', url, json=payload)
    if r.status_code != 200:
        raise ErrorDeletingPod(
            'There was an error deleting pod {}.'.format(
                pod_name))


def bind_pod_to_node(pod, node_running_pod):
    pod_name = pod['metadata']['name']
    label_selector = get_pod_selector(pod)
    _log.info('Binding pod {} ({}) to node {}'.format(
        pod_name, label_selector, node_running_pod))

    url = urljoin(API_URL, 'namespaces/{}/pods/{}/binding'.format(
        NAMESPACE, pod_name))

    payload = {
        'apiVersion': 'v1',
        'kind': 'Binding',
        'metadata': {
            'name': pod_name,
        },
        'target': {
            'apiVersion': 'v1',
            'kind': 'Node',
            'name': node_running_pod,
        }
    }

    r = k8_request('post', url, json=payload)
    if r.status_code != 201:
        raise ErrorSchedulingPod(
            'There was an error scheduling pod {} on node {}.'.format(
                pod_name, node_running_pod))


def pick_node_to_schedule_to(pod):
    label_selector = get_pod_selector(pod)
    nodes = get_nodes()

    # Remove nodes that are now gone
    old_nodes = set([n for n in _nodes_scheduled_to])
    new_nodes = set(nodes)

    for node in (new_nodes - old_nodes):
        # Add new nodes
        _nodes_scheduled_to[node] = []

    for node in (old_nodes - new_nodes):
        # Delete nodes that are gone
        del _nodes_scheduled_to[node]

    nodes_to_skip = _nodes_to_skip[label_selector]

    # Pick a node with the smallest number of our pods
    # scheduled to it.
    nodes = copy(_nodes_scheduled_to)
    for node in nodes_to_skip:
        if node in nodes:
            del nodes[node]
    nodes = list(nodes.items())
    nodes.sort(key=lambda x: len(_nodes_scheduled_to[x[0]]))

    if not nodes:
        raise NoValidNodesToScheduleTo('No more valid nodes to schedule to.')

    return nodes[0][0]


def mark_pod_as_scheduled(pod, node_name):
    label_selector = get_pod_selector(pod)
    _nodes_scheduled_to[node_name].append(label_selector)


def unmark_pod_as_scheduled(pod, node_name):
    label_selector = get_pod_selector(pod)
    if label_selector in _nodes_scheduled_to[node_name]:
        _nodes_scheduled_to[node_name].remove(label_selector)


def process_unscheduled_pods(pods):
    for pod in pods:
        spec = pod.get('spec', {})
        pod_scheduler_name = spec.get('schedulerName')
        label_selector = get_pod_selector(pod)

        # We only schedule pods that are set to use this scheduler.
        if pod_scheduler_name == OUR_SCHEDULER_NAME:
            node_running_pod = get_node_running_pod(pod)

            if node_running_pod:
                node_to_schedule_to = node_running_pod
            else:
                # If the pod isn't already running somewhere, then we pick a
                # node to schedule the pod to.
                try:
                    node_to_schedule_to = pick_node_to_schedule_to(pod)
                except NoValidNodesToScheduleTo:
                    # We will re-try the scheduling again in the parent loop,
                    # but for now we skip it.
                    _log.info(
                        'Skipping scheduling pod of form {} for now.'.format(
                            label_selector))
                    # We clear out the tainted nodes to avoid the scheduler
                    # getting stuck -- we want to re-try previously
                    # failed nodes, now.
                    if label_selector in _nodes_to_skip:
                        del _nodes_to_skip[label_selector]

                    return

                _log.info(
                    'No node currently running pod of form {}. Scheduling it to '
                    'node {}'.format(label_selector, node_to_schedule_to))

            try:
                bind_pod_to_node(pod, node_to_schedule_to)
            except ErrorSchedulingPod:
                if not node_running_pod:
                    # We want to now taint the node we attempted to schedule
                    # on to, so that we will rotate over to a new node
                    # when we try to schedule again.
                    _nodes_to_skip[label_selector].append(node_to_schedule_to)
            else:
                mark_pod_as_scheduled(pod, node_to_schedule_to)
                # Because we were able to schedule the pod, let's clear out the
                # nodes to skip on this label selector. This will allow us to
                # try nodes that may now be schedulable next time we schedule.
                if not node_running_pod:
                    if _nodes_to_skip[label_selector]:
                        del _nodes_to_skip[label_selector]


def process_failed_pods(pods):
    for pod in pods:
        spec = pod.get('spec', {})
        pod_scheduler_name = spec.get('schedulerName')
        label_selector = get_pod_selector(pod)

        # We only deal with pods that are set to use this scheduler.
        if pod_scheduler_name != OUR_SCHEDULER_NAME:
            continue

        status = pod.get('status')
        _log.error(
            'Pod of type {} failed. Deleting pod and hoping it will '
            're-spawn correctly. Full pod status:\n{}'.format(label_selector, status))

        # Delete the failed pod. Hopefully it's wired up to a replication
        # controller that will re-spawn it or something.
        delete_pod(pod)

        node_name = pod.get('spec', {}).get('nodeName')
        unmark_pod_as_scheduled(pod, node_name)


def run_loop():
    while True:
        time.sleep(POLL_FREQUENCY)

        unscheduled_pods = get_unscheduled_pods()
        # Shuffle into random order to prevent infinite scheduling loops
        # when pods fail to schedule.
        random.shuffle(unscheduled_pods)

        process_unscheduled_pods(unscheduled_pods)
        failed_pods = get_failed_pods()
        process_failed_pods(failed_pods)


if __name__ == '__main__':
    run_loop()
--------------------------------------------------------------------------------
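
For illustration, here is a minimal sketch of how `get_pod_selector` derives a
pod's 'variant' key, assuming `scheduler.py` is importable from the working
directory and using a hypothetical pod dict:

    import scheduler  # assumes scheduler.py is on the import path

    # Hypothetical pod: only metadata.labels matters to get_pod_selector().
    pod = {'metadata': {'labels': {'app': 'my-stateful-service',
                                   'pod-template-hash': '1234567890'}}}

    # The generated 'pod-template-hash' label is dropped and the remaining
    # labels are joined into a sorted labelSelector string.
    print(scheduler.get_pod_selector(pod))  # -> 'app=my-stateful-service'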