├── .gitignore
├── LICENSE
├── README.md
├── bootstrap.sh
├── docs
│   └── images
│       ├── cloud_config_aws_ec2.png
│       └── cluster_layout.png
├── heatit
│   ├── cloud-config.template.yml
│   ├── files
│   │   ├── custom-environment
│   │   └── todo-dockerhub-login
│   ├── generate-configs.sh
│   ├── params.example.yml
│   └── systemd-units
│       └── cluster-bootstrap.service
├── manager-systemd-units.txt
├── systemd-units
│   ├── node-lifecycle.service
│   ├── vault-sk.service
│   └── vault.service
├── tools
│   ├── join-swarm-cluster.sh
│   ├── node-down.sh
│   ├── node-up.sh
│   └── send-message-to-slack.sh
└── worker-systemd-units.txt

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
heatit/heatit
heatit/manager.yml
heatit/worker.yml
heatit/params.yml

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 Pavlo Lysov

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# coreos-docker-swarm-cluster

This project is a set of tools that automate the setup and maintenance of a Docker Swarm cluster on CoreOS.

## Cluster layout

The cluster layout looks like this:

![Cluster Layout](https://github.com/pavlo/coreos-docker-swarm-cluster/raw/develop/docs/images/cluster_layout.png)

A cluster has either 1, 3 (shown on the diagram), or 5 manager nodes, plus an arbitrary number of worker nodes.

### Each manager node:

* Runs an `etcd` service
* Is a Docker Swarm manager node
* Runs a [Vault](https://www.vaultproject.io) service with `etcd` as its storage backend (optional)

### Each worker node:

* Runs `etcd` in [proxy mode](https://coreos.com/etcd/docs/latest/v2/proxy.html), so it does not participate in consensus but can still read and write keys and values (see the sketch below)
* Is a Docker Swarm worker node
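For illustration, a worker's local proxy behaves like any other `etcd` endpoint, so a key written on a worker is visible cluster-wide. A minimal sketch (the `/example/greeting` key is purely illustrative):

```
# On a worker node the local etcd proxy forwards requests to the manager quorum:
etcdctl set /example/greeting "hello"   # write goes through the proxy to the managers
etcdctl get /example/greeting           # reads are proxied the same way
```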
## Disclaimer

The project makes no assumptions about where you host the cluster or how you provision the boxes. DigitalOcean, AWS, or any other provider that lets you provision boxes with `user-data`/`cloud-config` can be used.

Setting up a secure network for the cluster is outside the scope of the project - that is something one needs to do separately, e.g. using a VPC with public/private subnets if AWS is used.

## How it works

The job is done in two distinct phases - generation of a `cloud-config` file and node bootstrapping. The two topics are discussed in detail in the following sections.

### Generating the `cloud-config` file

A `cloud-config` file is used to bootstrap a CoreOS node. It is essentially a declaration of what a CoreOS node should look like and consist of. Once generated, it can be used to provision CoreOS nodes.

Note that two cloud-config files need to be generated - one for provisioning *manager* nodes and another for provisioning *workers*.

The generation routine looks like this:

1. Generate a cluster discovery token (this assumes a cluster with 3 managers):

```
> curl https://discovery.etcd.io/new?size=3
https://discovery.etcd.io/3c105a68331369250997be6369adac6c
```

2. Prepare the `params.yml` file that will be used as a source of configuration parameters during generation:

```
> mv ./heatit/params.example.yml ./heatit/params.yml
```

3. Edit `./heatit/params.yml` - insert the cluster token generated in step 1 as the value of the `coreos-cluster-token` parameter and set up the Slack notification settings; see the sample `params.yml` at the end of this section (todo: describe the Slack setup in more detail).

4. Generate the two cloud-config files with a single command:

```
> cd ./heatit; ./generate-configs.sh
```

This produces two files, `manager.yml` and `worker.yml`, in the `./heatit` directory. Use them to provision your boxes. For example, if AWS EC2 is used, you can paste the contents of the file into the UI:

![EC2 Cloud Config](https://github.com/pavlo/coreos-docker-swarm-cluster/raw/develop/docs/images/cloud_config_aws_ec2.png)

This is a one-time operation - once the two files are generated, make sure to store them in a safe place. They will be needed whenever more workers have to be provisioned, or managers added or replaced.
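For reference, a filled-in `params.yml` might look like the sketch below; the token and Slack values are placeholders, not real credentials, and `node-role` is overridden per role by `generate-configs.sh`:

```
node-role: manager  # overridden by generate-configs.sh for each output file
coreos-cluster-token: 3c105a68331369250997be6369adac6c
slack-webhook-url: "https://hooks.slack.com/services/T0000/B0000/XXXXXXXX"
slack-channel: "#cluster-health"
```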
### Node bootstrapping

This is where all the interesting things begin! When a node that has been provisioned with either the `manager.yml` or `worker.yml` cloud-config file boots up, it runs a single systemd unit called *Cluster Bootstrap*. See its declaration in [heatit/systemd-units/cluster-bootstrap.service](heatit/systemd-units/cluster-bootstrap.service). This service performs two actions:

1. Clones *this* Git repository into the `/etc/coreos-docker-swarm-cluster` folder on the CoreOS node
2. Executes the `/etc/coreos-docker-swarm-cluster/bootstrap.sh` script

This simple approach makes it easy to change the node's configuration, services, etc. without re-creating the node, which in some circumstances saves a lot of time. All one needs to do is reboot the node and let it re-clone the repository and run the bootstrap routine from scratch, while the node keeps its etcd and swarm cluster membership.

The [/etc/coreos-docker-swarm-cluster/bootstrap.sh](bootstrap.sh) script, in its turn, reads a list of systemd units and starts them in order on the node. For the list of units to run on a *manager* node it reads the [manager-systemd-units.txt](manager-systemd-units.txt) file; for worker nodes it reads [worker-systemd-units.txt](worker-systemd-units.txt).

So, given the following content of `./manager-systemd-units.txt`:

    node-lifecycle.service
    vault.service
    vault-sk.service

each time you provision a new node with the `manager.yml` file, or reboot an existing one, it will start `node-lifecycle.service`, `vault.service` and then `vault-sk.service` on the node. The same applies to worker nodes, except that the `./worker-systemd-units.txt` file is read instead. A typical change-rollout flow is sketched below.
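For example, rolling out a new service to all managers could look like this; the `my-app.service` unit name is hypothetical:

```
# In this repository: add systemd-units/my-app.service, then register it:
echo "my-app.service" >> manager-systemd-units.txt
git commit -am "Add my-app unit" && git push

# On each manager node: reboot, so the bootstrap re-clones the repository
# and starts the newly listed unit.
sudo reboot
```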
## Tooling

Deprecated: To get it up and running you need the [Heatit!](https://github.com/pavlo/heatit) tool, which compiles the templates into proper `cloud-config` files.

todo: Provide direct links to download the Heatit! binary

--------------------------------------------------------------------------------
/bootstrap.sh:
--------------------------------------------------------------------------------
#!/bin/bash

echo "------------------------------"
echo "BOOTSTRAP FOR $1 IS RUNNING"

# Wait until no other node holds the bootstrap lock in etcd
while [ "`curl -s http://${COREOS_PRIVATE_IPV4}:4001/v2/keys/nodes/bootstrapping | jq -r '.node.value'`" != "null" ]; do
  echo "Another node is bootstrapping, waiting 5 seconds..."
  sleep 5
done

# Set a lock (the TTL ensures a crashed node cannot hold it forever)
etcdctl set /nodes/bootstrapping ${COREOS_PRIVATE_IPV4} --ttl 180

set -o allexport
source /etc/custom-environment
set +o allexport

cluster_config_dir=/etc/coreos-docker-swarm-cluster
slack=$cluster_config_dir/tools/send-message-to-slack.sh

$slack -m "_${COREOS_PRIVATE_IPV4}_: Bootstrapping a *$1* node" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"

# Sleep a random number of seconds to prevent two managers from kicking off
# at the same time. This happens when the etcd cluster waits for all manager
# nodes to come up - at that point they all get launched nearly simultaneously.
sleep $(( RANDOM % 10 + 1 ))s

$cluster_config_dir/tools/join-swarm-cluster.sh $1

unit_files="$cluster_config_dir/$1-systemd-units.txt"
cat $unit_files | while read unit
do
  echo "Starting unit: $unit..."
  cp -rf $cluster_config_dir/systemd-units/$unit /etc/systemd/system
  systemctl daemon-reload
  systemctl start $unit
  $slack -m "_${COREOS_PRIVATE_IPV4}_: Started unit _${unit}_ on ${COREOS_PRIVATE_IPV4}" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"
done

$slack -m "_${COREOS_PRIVATE_IPV4}_: Bootstrap completed for *$1* node at ${COREOS_PRIVATE_IPV4}!" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"

# Release the lock
etcdctl rm /nodes/bootstrapping

echo "BOOTSTRAP FOR $1 COMPLETED"
echo "------------------------------"
--------------------------------------------------------------------------------
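The `/nodes/bootstrapping` key used by bootstrap.sh above is a plain etcd TTL key acting as a best-effort lock; a sketch of the same pattern in isolation (the `/lock` key name is illustrative):

    etcdctl set /lock "$HOSTNAME" --ttl 180    # acquire; the key expires on its own if the holder crashes
    # ...critical section...
    etcdctl rm /lock                           # release explicitly when done

Note that `etcdctl set` is not an atomic compare-and-swap, which is why bootstrap.sh additionally sleeps for a random interval to reduce the chance of two managers racing for the lock.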
/docs/images/cloud_config_aws_ec2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavlo/coreos-docker-swarm-cluster/04ac40eb8c11bb123e0386bc24729513e922118e/docs/images/cloud_config_aws_ec2.png

--------------------------------------------------------------------------------
/docs/images/cluster_layout.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavlo/coreos-docker-swarm-cluster/04ac40eb8c11bb123e0386bc24729513e922118e/docs/images/cluster_layout.png

--------------------------------------------------------------------------------
/heatit/cloud-config.template.yml:
--------------------------------------------------------------------------------
#cloud-config

coreos:
  etcd2:
    discovery: https://discovery.etcd.io/@param:coreos-cluster-token
    advertise-client-urls: http://$private_ipv4:2379
    initial-advertise-peer-urls: http://$private_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://$private_ipv4:2380,http://$private_ipv4:7001
  units:
    - name: etcd2.service
      command: restart
    - name: cluster-bootstrap.service
      command: start
      content: |
        @insert: file:./systemd-units/cluster-bootstrap.service
  update:
    reboot-strategy: "etcd-lock"
write_files:
  - path: "/etc/custom-environment"
    permissions: "755"
    owner: "root"
    content: |
      @insert: file:./files/custom-environment

--------------------------------------------------------------------------------
/heatit/files/custom-environment:
--------------------------------------------------------------------------------
NODE_ROLE=@param:node-role
SLACK_WEBHOOK_URL=@param:slack-webhook-url
SLACK_CHANNEL=@param:slack-channel

--------------------------------------------------------------------------------
/heatit/files/todo-dockerhub-login:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavlo/coreos-docker-swarm-cluster/04ac40eb8c11bb123e0386bc24729513e922118e/heatit/files/todo-dockerhub-login

--------------------------------------------------------------------------------
/heatit/generate-configs.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Render each role's cloud-config and make sure the output starts with the
# mandatory #cloud-config header
./heatit process -s ./cloud-config.template.yml -p ./params.yml -d manager.yml --param-override=node-role=manager
echo -e "#cloud-config\n$(cat manager.yml)" > manager.yml

./heatit process -s ./cloud-config.template.yml -p ./params.yml -d worker.yml --param-override=node-role=worker
echo -e "#cloud-config\n$(cat worker.yml)" > worker.yml

--------------------------------------------------------------------------------
/heatit/params.example.yml:
--------------------------------------------------------------------------------
node-role: either "manager" or "worker"
coreos-cluster-token:
slack-webhook-url: "Slack webhook URL"
slack-channel: "#Slack channel to post cluster health notifications to (starts with #)"

--------------------------------------------------------------------------------
/heatit/systemd-units/cluster-bootstrap.service:
--------------------------------------------------------------------------------
[Unit]
Description=Cluster Bootstrap
Requires=docker.service
Requires=etcd2.service
After=docker.service
After=etcd2.service

[Service]
EnvironmentFile=-/etc/environment
Type=oneshot
RemainAfterExit=yes
ExecStartPre=-/usr/bin/rm -rf /etc/coreos-docker-swarm-cluster
ExecStartPre=/usr/bin/git clone https://github.com/pavlo/coreos-docker-swarm-cluster.git /etc/coreos-docker-swarm-cluster
ExecStart=/etc/coreos-docker-swarm-cluster/bootstrap.sh @param:node-role

[Install]
WantedBy=multi-user.target

--------------------------------------------------------------------------------
/manager-systemd-units.txt:
--------------------------------------------------------------------------------
node-lifecycle.service
vault.service
vault-sk.service

--------------------------------------------------------------------------------
/systemd-units/node-lifecycle.service:
--------------------------------------------------------------------------------
[Unit]
Description=Node lifecycle service
Requires=etcd2.service
Requires=network-online.target
Requires=systemd-resolved.service
After=etcd2.service
After=network-online.target
After=systemd-resolved.service

[Service]
EnvironmentFile=-/etc/environment
EnvironmentFile=-/etc/custom-environment
ExecStart=/etc/coreos-docker-swarm-cluster/tools/node-up.sh
ExecStop=/etc/coreos-docker-swarm-cluster/tools/node-down.sh

[Install]
WantedBy=multi-user.target

--------------------------------------------------------------------------------
/systemd-units/vault-sk.service:
--------------------------------------------------------------------------------
[Unit]
Description=Sidekick Unit for Vault
BindsTo=vault.service
Requires=vault.service
After=vault.service

[Service]
TimeoutStartSec=0
EnvironmentFile=/etc/environment
EnvironmentFile=/etc/custom-environment

# Every 45 seconds: if Vault is unsealed, (re)register it in etcd under a TTL
# key; if it is sealed, raise an alert in Slack. The jq filter and the
# true/false tokens are deliberately unquoted so that no single quotes get
# nested inside the single-quoted command line.
ExecStart=/bin/bash -c 'while true; do \
  if [ `curl -s http://localhost:8200/v1/sys/seal-status | jq .sealed` = false ]; \
  then /usr/bin/etcdctl set /services/vault/${COREOS_PRIVATE_IPV4} "http://${COREOS_PRIVATE_IPV4}:8200" --ttl 60; \
  fi; \
  if [ `curl -s http://localhost:8200/v1/sys/seal-status | jq .sealed` = true ]; \
  then /etc/coreos-docker-swarm-cluster/tools/send-message-to-slack.sh -m "_${COREOS_PRIVATE_IPV4}_: Vault is SEALED!" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"; \
  fi; \
  sleep 45; \
done'

ExecStop=/usr/bin/etcdctl rm /services/vault/${COREOS_PRIVATE_IPV4}

[Install]
WantedBy=multi-user.target
--------------------------------------------------------------------------------
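The sidekick's health probe above is simply Vault's seal-status endpoint. A manual spot check on a manager node might look like this (assuming Vault listens on localhost:8200 as configured in vault.service below):

    curl -s http://localhost:8200/v1/sys/seal-status | jq .sealed    # prints true or false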
--------------------------------------------------------------------------------
/systemd-units/vault.service:
--------------------------------------------------------------------------------
[Unit]
Description=Vault Server
Requires=docker.service
Requires=etcd2.service
After=docker.service
After=etcd2.service

[Service]
EnvironmentFile=-/etc/environment
EnvironmentFile=-/etc/custom-environment
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill vault
ExecStartPre=-/usr/bin/docker rm vault
ExecStartPre=/usr/bin/docker pull vault:0.6.4
ExecStart=/usr/bin/docker run --rm -p 8200:8200 --name vault --cap-add=IPC_LOCK -e 'VAULT_LOCAL_CONFIG={"listener": [{"tcp": {"address": "0.0.0.0:8200","tls_disable": 1}}],"storage": {"etcd": {"address": "http://${COREOS_PRIVATE_IPV4}:2379","etcd_api": "v2"}}}' vault server
ExecStop=-/usr/bin/docker stop vault

[Install]
WantedBy=multi-user.target

--------------------------------------------------------------------------------
/tools/join-swarm-cluster.sh:
--------------------------------------------------------------------------------
#!/bin/bash

cluster_config_dir=/etc/coreos-docker-swarm-cluster
slack=$cluster_config_dir/tools/send-message-to-slack.sh

# If no manager has registered itself in etcd yet, this node initializes the
# swarm; otherwise it joins the first registered manager using the join
# tokens stored in etcd.
MANAGER_ADVERTISE_ADDR=`curl -s http://${COREOS_PRIVATE_IPV4}:4001/v2/keys/nodes/managers | jq -r '.node.nodes[0].value'`

role=$1

if [ "$role" == "manager" ]; then

  if [ "$MANAGER_ADVERTISE_ADDR" = "null" ]; then
    echo "INITIALIZING DOCKER SWARM..."
    $slack -m "_${COREOS_PRIVATE_IPV4}_: Initializing docker swarm cluster" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"

    docker swarm init --advertise-addr ${COREOS_PRIVATE_IPV4}
    etcdctl set /swarm/manager-join-token `docker swarm join-token -q manager`
    etcdctl set /swarm/worker-join-token `docker swarm join-token -q worker`
  else
    echo "JOINING DOCKER SWARM..."
    $slack -m "_${COREOS_PRIVATE_IPV4}_: Joining existing docker swarm cluster as $role" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"
    docker swarm join --token `curl -s http://${COREOS_PRIVATE_IPV4}:4001/v2/keys/swarm/manager-join-token | jq -r '.node.value'` ${MANAGER_ADVERTISE_ADDR}:2377
  fi

elif [ "$role" == "worker" ]; then

  token=`curl -s http://${COREOS_PRIVATE_IPV4}:4001/v2/keys/swarm/worker-join-token | jq -r '.node.value'`
  if [ "$token" == "null" ]; then
    echo "Failed to find a join token for a worker node in the swarm cluster!"
    exit 1
  else
    $slack -m "_${COREOS_PRIVATE_IPV4}_: Joining existing docker swarm cluster as $role" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"
    docker swarm join --token $token ${MANAGER_ADVERTISE_ADDR}:2377
  fi

else
  echo "Unexpected role given: $role, expected either 'manager' or 'worker'!"
  exit 1
fi
--------------------------------------------------------------------------------
/tools/node-down.sh:
--------------------------------------------------------------------------------
#!/bin/bash

set -o allexport
source /etc/custom-environment
set +o allexport

cluster_config_dir=/etc/coreos-docker-swarm-cluster
slack=$cluster_config_dir/tools/send-message-to-slack.sh

# Unregister the node from etcd so nothing keeps routing to it
/usr/bin/etcdctl rm /nodes/${NODE_ROLE}s/${COREOS_PRIVATE_IPV4}

if [ "$NODE_ROLE" == "worker" ]; then
  docker swarm leave
fi

$slack -m "_${COREOS_PRIVATE_IPV4}_ is DOWN ($NODE_ROLE)" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"

--------------------------------------------------------------------------------
/tools/node-up.sh:
--------------------------------------------------------------------------------
#!/bin/bash

set -o allexport
source /etc/custom-environment
set +o allexport

cluster_config_dir=/etc/coreos-docker-swarm-cluster
slack=$cluster_config_dir/tools/send-message-to-slack.sh

$slack -m "_${COREOS_PRIVATE_IPV4}_: Node is UP ($NODE_ROLE)" -u $SLACK_WEBHOOK_URL -c "$SLACK_CHANNEL"

# Keep re-registering the node in etcd under a TTL key, so the record
# disappears automatically if the node dies without running node-down.sh
while true; do
  /usr/bin/etcdctl set /nodes/${NODE_ROLE}s/${COREOS_PRIVATE_IPV4} "${COREOS_PRIVATE_IPV4}" --ttl 60
  sleep 45
done

--------------------------------------------------------------------------------
/tools/send-message-to-slack.sh:
--------------------------------------------------------------------------------
#!/bin/bash

TEMP=`getopt -o m:u:i::c: -- "$@"`
eval set -- "$TEMP"

icon=":white_check_mark:"

while true ; do
  case "$1" in
    -m)
      message=$2 ; shift 2 ;;
    -u)
      url=$2 ; shift 2 ;;
    -i)
      icon=$2 ; shift 2 ;;
    -c)
      channel=$2 ; shift 2 ;;
    --) shift ; break ;;
    *) echo "Internal error!" ; exit 1 ;;
  esac
done

echo "Message: $message"
echo "URL: $url"
echo "Icon: $icon"

# Slack incoming webhooks expect a JSON payload, so keys and values must be
# double-quoted
curl -X POST --data-urlencode "payload={\"channel\": \"$channel\", \"username\": \"webhookbot\", \"text\": \"$message\", \"icon_emoji\": \"$icon\"}" $url

--------------------------------------------------------------------------------
/worker-systemd-units.txt:
--------------------------------------------------------------------------------
node-lifecycle.service
--------------------------------------------------------------------------------