├── .gitignore
├── Makefile
├── RATIONALE.md
├── README.md
├── bin
│   └── .keep
├── samples
│   ├── basic.cr
│   ├── http.cr
│   └── stubborn.cr
├── shard.yml
└── src
    ├── logger.cr
    ├── main.cr
    ├── monitor.cr
    ├── monitor_process.cr
    ├── panzer.cr
    ├── process.cr
    ├── timeout.cr
    └── worker.cr

/.gitignore:
--------------------------------------------------------------------------------
1 | .crystal
2 | /bin
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | CRYSTAL_BIN ?= $(shell which crystal)
2 |
3 | all: bin/panzer
4 |
5 | basic: all bin/basic
6 | 	./bin/panzer "$(PWD)/bin/basic"
7 |
8 | stubborn: all bin/stubborn
9 | 	./bin/panzer "$(PWD)/bin/stubborn"
10 |
11 | http: all bin/http
12 | 	./bin/panzer "$(PWD)/bin/http"
13 |
14 | bin/panzer: src/main.cr
15 | 	$(CRYSTAL_BIN) build -o bin/panzer src/main.cr
16 |
17 | bin/basic: samples/basic.cr
18 | 	$(CRYSTAL_BIN) build -o bin/basic samples/basic.cr
19 |
20 | bin/stubborn: samples/stubborn.cr
21 | 	$(CRYSTAL_BIN) build -o bin/stubborn samples/stubborn.cr
22 |
23 | bin/http: samples/http.cr
24 | 	$(CRYSTAL_BIN) build -o bin/http --release samples/http.cr
25 |
26 | clean:
27 | 	rm -f bin/panzer bin/basic bin/stubborn bin/http
28 |
--------------------------------------------------------------------------------
/RATIONALE.md:
--------------------------------------------------------------------------------
1 | # PANZER
2 |
3 | ## Worker: Monitor starts a worker process:
4 |
5 | 1. run a Worker instance in its own `fork`
6 |
7 | ## Monitoring: Panzer starts a monitor process:
8 |
9 | 1. load application, execute global configuration (e.g. `TCPServer.new`)
10 | 2. when the monitor has loaded: send SIGVTALRM to the parent (main) process
11 | 3. fork N worker processes
12 | 4. on SIGCHLD:
13 |    1. collect zombie worker processes
14 |    2. fork worker processes to replenish the pool
15 | 5. on SIGTERM:
16 |    1. send SIGTERM to worker processes (exit gracefully)
17 |    2. don't replenish the pool anymore
18 |    3. send SIGINT to worker processes (exit now) after timeout
19 |    4. exit
20 | 6. on SIGINT (first time):
21 |    1. behave as SIGTERM
22 |
23 | TODO:
24 |
25 | 7. on SIGINT (second time):
26 |    1. send SIGINT to worker processes (exit now)
27 |    2. exit
28 | 8. on SIGTTIN:
29 |    1. print monitor status
30 |    2. print workers' status
31 |
32 | ## Zero downtime: Panzer starts a main process:
33 |
34 | 1. fork/exec a monitor process
35 | 2. on SIGUSR1:
36 |    1. fork/exec a new monitor process (which then forks new worker processes)
37 | 3. on SIGVTALRM:
38 |    1. tell the old monitor process to terminate gracefully (SIGTERM)
39 |    2. tell the old monitor process to exit after timeout (SIGINT)
40 | 4. on SIGTERM:
41 |    1. send SIGTERM to monitor process (exit gracefully)
42 |    2. exit
43 | 5. on SIGINT (first time):
44 |    1. behave as SIGTERM
45 | 6. on SIGINT (second time):
46 |    1. send SIGINT to monitor process (exit now)
47 |    2. exit
48 |
49 | TODO:
50 |
51 | 7. on SIGTTIN:
52 |    1. print main process status
53 |    2. send SIGTTIN to monitor process
54 |
55 | ## TREE
56 |
57 | Running state:
58 |
59 | - panzer:main
60 |   - panzer:monitor
61 |     - panzer:worker (1)
62 |     - panzer:worker (2)
63 |     - panzer:worker (3)
64 |     - panzer:worker (4)
65 |
66 | Restarting state:
67 |
68 | - panzer:main
69 |   - panzer:monitor (exiting)
70 |     - panzer:worker (2, exiting)
71 |     - panzer:worker (4, exiting)
72 |   - panzer:monitor (new)
73 |     - panzer:worker (1)
74 |     - panzer:worker (2)
75 |     - panzer:worker (3)
76 |     - panzer:worker (4)
77 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Panzer
2 |
3 | Multi-process service monitor for Crystal, to squeeze all the juice from modern
4 | multicore CPUs; also featuring zero-downtime restarts.
5 |
6 | Multi-process is achieved using a "monitor process" that loads the
7 | application, so that configuration can be loaded and servers created once,
8 | then forks itself into a specified number of worker processes. The
9 | "monitor process" monitors the workers, and restarts them whenever one fails,
10 | so the number of running workers stays constant.
11 |
12 | Zero-downtime is achieved using a "main process" whose sole job is to start a
13 | "monitor process", starting a new one in parallel when asked to (on `SIGUSR1`),
14 | then killing the previous one once the new one has started. It also restarts
15 | the "monitor process" if the current one ever crashes.
16 |
17 | Please note that zero-downtime is still a work in progress for incoming socket
18 | connections. A restart of the example provided below seems never to miss a
19 | connection, but it still generates some read/write errors.
20 |
21 | ## Install
22 |
23 | Add `panzer` to your shard's dependencies, then run `shards install`. This will
24 | build and install a `bin/panzer` executable into your project.
25 |
26 | ```yaml
27 | dependencies:
28 |   panzer:
29 |     github: ysbaddaden/panzer
30 | ```
31 |
32 | ## Usage
33 |
34 | You need a worker process.
35 |
36 | It's a simple class that implements a `run` method that mustn't return,
37 | otherwise your worker will terminate. This `run` method will be executed in
38 | each worker, forked from the monitor process, but the class initializer and
39 | any code loaded by your worker will only be executed once.
40 |
41 | In each worker you may want to delay creating, or to reconnect, resources
42 | that can't be shared between processes (e.g. database
43 | connections).
44 |
45 | Example:
46 |
47 | ```crystal
48 | require "http/server"
49 | require "panzer/monitor"
50 |
51 | class MyWorker
52 |   include Panzer::Worker
53 |
54 |   getter port : Int32
55 |   private getter server : HTTP::Server
56 |
57 |   # The worker is initialized once
58 |   def initialize(@port)
59 |     @server = HTTP::Server.new(port) do |ctx|
60 |       ctx.response.content_type = "text/plain"
61 |       ctx.response.print "Hello world, got #{ctx.request.path}!"
62 |     end
63 |
64 |     # force creation of the underlying TCPServer
65 |     @server.bind
66 |   end
67 |
68 |   # The run method is executed for *each* worker
69 |   def run
70 |     logger.info "listening on #{server.local_address}"
71 |     server.listen
72 |   end
73 | end
74 |
75 | # Start the monitor that will manage worker processes:
76 | Panzer::Monitor.run(MyWorker.new(8080), count: 8)
77 | ```
78 |
79 | You may now build and run your application:
80 |
81 | ```shell
82 | $ crystal build --release src/my_worker.cr -o bin/my_worker
83 | $ bin/panzer bin/my_worker
84 | ```
85 |
86 | You may restart your application by sending the `SIGUSR1` signal to the main
87 | process, where `1234` is the PID of the main process:
88 |
89 | ```shell
90 | $ kill -USR1 1234
91 | ```
92 |
93 | You may tell your application to exit gracefully by sending the `SIGTERM`
94 | signal, which will be propagated down to each worker:
95 |
96 | ```shell
97 | $ kill -TERM 1234
98 | ```
99 |
100 | ## TODO
101 |
102 | main process:
103 |
104 | - [ ] support a YAML file (`config/panzer.yml`) to read configuration from (?)
105 | - [ ] --quiet and --verbose CLI start options
106 | - [ ] --timeout CLI option
107 | - [ ] retry delay to restart monitor on successive crashes (--delay option)
108 | - [ ] restart main process itself on SIGUSR2
109 |
110 | monitor process:
111 |
112 | - [ ] detect CPU number and use it as default workers count
113 | - [ ] retry delay to restart workers on successive crashes
114 | - [ ] print worker status on SIGTTIN
115 |
116 | panzerctl helper (?):
117 |
118 | - [ ] requires main process to save `tmp/panzer.pid` file
119 | - [ ] `--pid=/tmp/panzer.pid` option
120 | - [ ] status (send SIGTTIN)
121 | - [ ] reload (send SIGUSR1)
122 | - [ ] restart (send SIGUSR2)
123 | - [ ] shutdown (send SIGTERM)
124 | - [ ] exit (send SIGINT)
125 |
126 | tests:
127 |
128 | - [ ] integration tests that will build/run/kill/assert processes
129 |
130 | ## Authors
131 |
132 | - Julien Portalier (creator, maintainer)
133 |
--------------------------------------------------------------------------------
/bin/.keep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ysbaddaden/panzer/536f19a42b947add17a53012a77d6e34df46d11b/bin/.keep
--------------------------------------------------------------------------------
/samples/basic.cr:
--------------------------------------------------------------------------------
1 | require "../src/monitor"
2 |
3 | class MyWorker
4 |   include Panzer::Worker
5 |
6 |   def initialize
7 |     logger.debug { "initializing (#{Process.pid})" }
8 |   end
9 |
10 |   def run
11 |     logger.progname = "panzer:worker"
12 |     logger.debug { "running (#{Process.pid})" }
13 |
14 |     loop do
15 |       sleep #rand(2..10)
16 |       break
17 |     end
18 |
19 |     logger.debug { "exiting: (#{Process.pid})" }
20 |   end
21 | end
22 |
23 | Panzer.logger.progname = "panzer:monitor"
24 | Panzer.logger.level = Logger::Severity::DEBUG
25 |
26 | Panzer::Monitor.run(MyWorker.new, count: 8)
27 |
--------------------------------------------------------------------------------
/samples/http.cr:
--------------------------------------------------------------------------------
1 | require "http"
2 | require "../src/monitor"
3 |
4 | class MyWorker
5 |   include Panzer::Worker
6 |
7 |   def initialize(port = 8080)
8 |     @server = HTTP::Server.new(port) do |context|
9 |       context.response.content_type = "text/plain"
10 |       context.response << "Hello World 2!\n"
11 |     end
12 |     @server.bind
13 |   end
14 |
15 |   def run
16 |     logger.progname = "http"
17 |     @server.listen
18 |   end
19 | end
20 |
21 | Panzer.logger.progname = "panzer:monitor"
22 | Panzer.logger.level = Logger::Severity::DEBUG
23 |
24 | Panzer::Monitor.run(MyWorker.new, count: 4)
25 |
--------------------------------------------------------------------------------
/samples/stubborn.cr:
--------------------------------------------------------------------------------
1 | require "../src/monitor"
2 |
3 | class StubbornWorker
4 |   include Panzer::Worker
5 |
6 |   def run
7 |     Panzer.logger.progname = "panzer:worker"
8 |
9 |     Signal::TERM.trap do
10 |       logger.warn { "Received SIGTERM, but I won't die that easily!" }
11 |     end
12 |
13 |     Signal::INT.trap do
14 |       logger.warn { "okay, I give up" }
15 |       exit # actually give up once the monitor escalates to SIGINT
16 |     end
17 |
18 |     sleep # forever
19 |   end
20 | end
21 |
22 | Panzer.logger.progname = "panzer:monitor"
23 | Panzer.logger.level = Logger::Severity::DEBUG
24 |
25 | Panzer::Monitor.run(StubbornWorker.new, count: 1, timeout: 2.seconds)
26 |
--------------------------------------------------------------------------------
/shard.yml:
--------------------------------------------------------------------------------
1 | name: panzer
2 | version: 0.1.0
3 |
4 | authors:
5 |   - Julien Portalier
6 |
7 | description: |
8 |   Multi-process workers with zero downtime restarts.
9 |
10 | #executables:
11 | #  - panzer
12 | #  - panzerctl
13 |
14 | targets:
15 |   panzer:
16 |     main: src/main.cr
17 |   #panzerctl:
18 |   #  main: src/helper.cr
19 |
20 | scripts:
21 |   postinstall: |
22 |     shards build --debug --release && mkdir -p ../../bin && cp bin/panzer* ../../bin
23 |
24 | license: Apache-2.0
25 |
--------------------------------------------------------------------------------
/src/logger.cr:
--------------------------------------------------------------------------------
1 | require "logger"
2 |
3 | module Panzer
4 |   @@logger : Logger?
5 |
6 |   def self.logger
7 |     @@logger ||= begin
8 |       logger = Logger.new(STDOUT)
9 |       logger.level = Logger::Severity::INFO
10 |       logger.progname = "panzer"
11 |       logger
12 |     end
13 |   end
14 |
15 |   def self.logger=(@@logger)
16 |   end
17 | end
18 |
--------------------------------------------------------------------------------
/src/main.cr:
--------------------------------------------------------------------------------
1 | require "mutex"
2 | require "./monitor_process"
3 | require "./logger"
4 |
5 | module Panzer
6 |   def self.run(command)
7 |     mutex = Mutex.new
8 |     old_monitor = nil
9 |     monitor = Panzer::MonitorProcess.new(command)
10 |
11 |     # Restart application.
12 |     Signal::USR1.trap do
13 |       old_monitor, monitor = monitor, Panzer::MonitorProcess.new(command)
14 |     end
15 |
16 |     # Application started: exit previous application.
17 |     Signal::VTALRM.trap do
18 |       old_monitor.try(&.exit)
19 |     end
20 |
21 |     # Exit gracefully.
22 |     Signal::TERM.trap do
23 |       monitor.terminate
24 |       exit 0
25 |     end
26 |
27 |     # Exit quickly.
28 |     Signal::INT.trap do
29 |       monitor.interrupt
30 |       exit 0
31 |     end
32 |
33 |     # Application failed or previous application exited.
34 |     Signal::CHLD.trap do
35 |       mutex.synchronize do
36 |         unless monitor.running?
37 |           Panzer.logger.error { "monitor #{monitor.pid} exited" }
38 |           monitor = Panzer::MonitorProcess.new(command)
39 |         end
40 |       end
41 |
42 |       if old_monitor
43 |         old_monitor = nil
44 |       end
45 |     end
46 |
47 |     sleep # forever
48 |   end
49 | end
50 |
51 | # TODO: OptionParser
52 |
53 | Panzer.logger.progname = "panzer:main"
54 | Panzer.logger.level = Logger::Severity::DEBUG
55 | Panzer.run(command: ARGV[0])
56 |
--------------------------------------------------------------------------------
/src/monitor.cr:
--------------------------------------------------------------------------------
1 | require "mutex"
2 | require "./logger"
3 | require "./process"
4 | require "./worker"
5 |
6 | module Panzer
7 |   class Monitor
8 |     def self.run(worker, count, timeout = 60.seconds)
9 |       monitor = new(worker, count, timeout)
10 |       monitor.fill
11 |
12 |       # Collect zombie workers, refill the worker pool.
13 |       Signal::CHLD.trap do
14 |         monitor.collect
15 |         monitor.fill
16 |       end
17 |
18 |       # Exit gracefully.
19 |       Signal::TERM.trap do
20 |         monitor.terminate
21 |         exit 0
22 |       end
23 |
24 |       # Exit gracefully.
25 |       Signal::INT.trap do
26 |         monitor.terminate
27 |         exit 0
28 |       end
29 |
30 |       # TODO: Print worker status
31 |       #Signal::TTIN.trap do
32 |       #end
33 |
34 |       # Notify the parent process that we are running.
35 |       Process.kill(Signal::VTALRM, Process.ppid)
36 |
37 |       sleep # forever
38 |     end
39 |
40 |     getter worker : Worker
41 |     getter count : Int32
42 |     getter timeout : Time::Span
43 |
44 |     def initialize(@worker, @count, @timeout = 60.seconds)
45 |       @pool = [] of LibC::PidT
46 |       @mutex = Mutex.new
47 |       @exiting = false
48 |     end
49 |
50 |     def logger
51 |       Panzer.logger
52 |     end
53 |
54 |     def fill
55 |       return if exiting?
56 |       @mutex.synchronize do
57 |         logger.debug { "filling pool (#{@pool.size} -> #{count})" }
58 |         until @pool.size >= count
59 |           @pool << worker.spawn.pid
60 |         end
61 |       end
62 |     end
63 |
64 |     def terminate
65 |       return if exiting?
66 |       @exiting = true
67 |
68 |       logger.debug { "stopping workers" }
69 |       timer = Timeout.new(timeout)
70 |
71 |       @pool.each do |pid|
72 |         terminate_worker(pid)
73 |       end
74 |
75 |       until @pool.empty? || timer.elapsed?
76 |         collect(timer)
77 |       end
78 |
79 |       @pool.each do |pid|
80 |         kill_worker(pid)
81 |       end
82 |     end
83 |
84 |     private def terminate_worker(pid)
85 |       begin
86 |         logger.debug { "terminating worker #{pid}" }
87 |         Process.kill(Signal::TERM, pid)
88 |       rescue ex : Errno
89 |         raise ex unless ex.errno == Errno::ESRCH
90 |       end
91 |     end
92 |
93 |     private def kill_worker(pid)
94 |       begin
95 |         logger.debug { "killing worker #{pid}" }
96 |         Process.kill(Signal::INT, pid)
97 |       rescue ex : Errno
98 |         raise ex unless ex.errno == Errno::ESRCH
99 |       end
100 |     end
101 |
102 |     def collect(timer = nil)
103 |       while ret = Process.waitpid(-1)
104 |         pid, exit_code = ret
105 |
106 |         @mutex.synchronize do
107 |           @pool.delete(pid)
108 |         end
109 |
110 |         unless exiting?
111 |           logger.error { "worker #{pid} terminated (exit code: #{exit_code})" }
112 |         end
113 |       end
114 |     end
115 |
116 |     def exiting?
117 |       @exiting
118 |     end
119 |   end
120 | end
121 |
--------------------------------------------------------------------------------
/src/monitor_process.cr:
--------------------------------------------------------------------------------
1 | require "./process"
2 |
3 | module Panzer
4 |   class MonitorProcess
5 |     @process : Process
6 |     @timeout : Time::Span
7 |
8 |     def initialize(command, @timeout = 60.seconds)
9 |       @process = fork do
10 |         Process.exec(command)
11 |       end
12 |       @sigint = 0
13 |       logger.info { "spawned monitor #{pid}" }
14 |     end
15 |
16 |     def pid
17 |       @process.pid
18 |     end
19 |
20 |     def logger
21 |       Panzer.logger
22 |     end
23 |
24 |     def exit
25 |       terminate
26 |
27 |       begin
28 |         Process.wait(pid, @timeout)
29 |       rescue Timeout::Error
30 |         kill
31 |       end
32 |
33 |       logger.info { "monitor #{pid} exited" }
34 |     end
35 |
36 |     def terminate
37 |       logger.debug { "terminating monitor #{pid}" }
38 |       Process.kill(Signal::TERM, pid)
39 |     end
40 |
41 |     def interrupt
42 |       case @sigint += 1
43 |       when 1 then self.exit
44 |       when 2 then self.kill
45 |       end
46 |     end
47 |
48 |     def kill
49 |       logger.debug { "killing monitor #{pid}" }
50 |       Process.kill(Signal::KILL, pid)
51 |     end
52 |
53 |     def running?
54 |       Process.running?(pid)
55 |     end
56 |   end
57 | end
58 |
--------------------------------------------------------------------------------
/src/panzer.cr:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ysbaddaden/panzer/536f19a42b947add17a53012a77d6e34df46d11b/src/panzer.cr
--------------------------------------------------------------------------------
/src/process.cr:
--------------------------------------------------------------------------------
1 | require "./timeout"
2 |
3 | class Process
4 |   # Waits for process `pid`. Returns the process exit code, or raises `Timeout::Error` if *timeout* elapses first.
5 |   def self.wait(pid, timeout = nil)
6 |     timer = Timeout.new(timeout) if timeout
7 |     loop do
8 |       if args = waitpid(pid)
9 |         return args[1]
10 |       else
11 |         timer.try(&.verify!)
12 |         Fiber.yield
13 |       end
14 |     end
15 |   end
16 |
17 |   # Tries to collect process *pid*, or any child process if *pid* is -1. Returns
18 |   # the collected process pid if any, otherwise returns `nil`.
19 |   def self.waitpid(pid = -1)
20 |     case child_pid = LibC.waitpid(pid, out exit_code, LibC::WNOHANG)
21 |     when -1
22 |       if pid == -1 && Errno.value == Errno::ECHILD
23 |         # no child processes
24 |         return
25 |       end
26 |       raise Errno.new("waitpid")
27 |     when 0
28 |       return
29 |     else
30 |       return {child_pid, exit_code}
31 |     end
32 |   end
33 |
34 |   # Returns true if process *pid* exists and is accessible (e.g. child process),
35 |   # and isn't in a zombie state or equivalent. Returns false otherwise.
36 |   def self.running?(pid)
37 |     case LibC.waitpid(pid, out exit_code, LibC::WNOHANG)
38 |     when 0
39 |       return true
40 |     when -1
41 |       unless Errno.value == Errno::ECHILD
42 |         raise Errno.new("waitpid")
43 |       end
44 |     end
45 |     false
46 |   end
47 | end
48 |
--------------------------------------------------------------------------------
/src/timeout.cr:
--------------------------------------------------------------------------------
1 | # FIXME: use a monotonic clock
2 | struct Timeout
3 |   class Error < Exception
4 |   end
5 |
6 |   def self.new(seconds : Int, message = nil)
7 |     new(seconds.seconds, message)
8 |   end
9 |
10 |   def initialize(@span : Time::Span, @message : String? = nil)
11 |     @start = Time.now
12 |   end
13 |
14 |   def elapsed?
15 |     (Time.now - @start) > @span
16 |   end
17 |
18 |   def verify!
19 |     raise Error.new(@message || "Reached #{@span} timeout") if elapsed?
20 |   end
21 | end
22 |
--------------------------------------------------------------------------------
/src/worker.cr:
--------------------------------------------------------------------------------
1 | module Panzer
2 |   module Worker
3 |     abstract def run
4 |
5 |     def spawn
6 |       process = fork do
7 |         Signal::CHLD.reset
8 |         Signal::TERM.reset
9 |         Signal::INT.reset
10 |         Signal::TTIN.reset
11 |         run
12 |       end
13 |       logger.info { "started worker (#{process.pid})" }
14 |       process
15 |     end
16 |
17 |     def logger
18 |       Panzer.logger
19 |     end
20 |   end
21 | end
22 |
--------------------------------------------------------------------------------
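
Note: the core mechanism behind `src/process.cr` and `Monitor#collect` is the POSIX non-blocking `waitpid(-1, WNOHANG)` reaping loop. For readers unfamiliar with it, here is a rough Python sketch of that loop; the `collect_exited_children` helper name is ours, for illustration only, and this is not part of the panzer sources:

```python
import os
import time

def collect_exited_children():
    """Reap every child process that has already exited, without blocking.

    Mirrors the WNOHANG loop in src/process.cr: waitpid(-1, WNOHANG) returns
    (0, 0) while children exist but none have exited yet, and fails with
    ECHILD (ChildProcessError in Python) when there are no children left.
    Returns a list of (pid, exit_code) tuples for the collected zombies.
    """
    collected = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no child processes at all
        if pid == 0:
            break  # children exist, but none are zombies right now
        collected.append((pid, os.WEXITSTATUS(status)))
    return collected

if __name__ == "__main__":
    # Fork a child that exits immediately, then reap it from the parent,
    # the same way the monitor collects a crashed worker on SIGCHLD.
    child = os.fork()
    if child == 0:
        os._exit(7)  # pretend a worker crashed with exit code 7
    time.sleep(0.2)  # give the child time to exit
    print(collect_exited_children())
```

The monitor can call such a loop from its SIGCHLD handler: it drains all pending zombies in one pass (several children may share a single signal delivery) and returns immediately when nothing is left to reap, which is why `fill` can then safely replenish the pool.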