├── .formatter.exs
├── .gitignore
└── README.md

/.formatter.exs:
--------------------------------------------------------------------------------
1 | [
2 |   import_deps: [:ecto, :ecto_sql, :phoenix],
3 |   subdirectories: ["priv/*/migrations"],
4 |   plugins: [Phoenix.LiveView.HTMLFormatter],
5 |   inputs: ["*.{heex,ex,exs}", "{config,lib,test}/**/*.{heex,ex,exs}", "priv/*/seeds.exs"]
6 | ]
7 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # The directory Mix will write compiled artifacts to.
2 | /_build/
3 | 
4 | # If you run "mix test --cover", coverage assets end up here.
5 | /cover/
6 | 
7 | # The directory Mix downloads your dependencies sources to.
8 | /deps/
9 | 
10 | # Where 3rd-party dependencies like ExDoc output generated docs.
11 | /doc/
12 | 
13 | # Ignore .fetch files in case you like to edit your project deps locally.
14 | /.fetch
15 | 
16 | # If the VM crashes, it generates a dump, let's ignore it too.
17 | erl_crash.dump
18 | 
19 | # Also ignore archive artifacts (built via "mix archive.build").
20 | *.ez
21 | 
22 | # Temporary files, for example, from tests.
23 | /tmp/
24 | 
25 | # Ignore package tarball (built via "mix hex.build").
26 | fucking_dave-*.tar
27 | 
28 | # Ignore assets that are produced by build tools.
29 | /priv/static/assets/
30 | 
31 | # Ignore digested assets cache.
32 | /priv/static/cache_manifest.json
33 | 
34 | # In case you use Node.js/npm, you want to ignore these.
35 | npm-debug.log
36 | /assets/node_modules/
37 | 
38 | *.sw*
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Writing A Job Runner (In Elixir) (Again) (10 years later)
2 | Ten years ago, [I wrote a job runner in Elixir after some inspiration from Jose](https://github.com/ybur-yug/genstage_tutorial/blob/master/README.md)
3 | 
4 | This is an update on that post.
5 | 
6 | Almost no code has changed, but I wrote it up a lot better and added some more detail.
7 | 
8 | I find it wildly amusing that it held up this well, and I felt like re-sharing it with everyone to see if someone with fresh eyes might get some enjoyment or learn a bit from it.
9 | 
10 | ### I also take things quite a bit further
11 | 
12 | 
13 | ## Who is this for?
14 | Are you curious?
15 | 
16 | If you know a little bit of Elixir, this is a great "levelling up" piece.
17 | 
18 | If you're seasoned, it might be fun to implement yourself if you never have.
19 | 
20 | If you don't know Elixir, it will hopefully be an interesting case study and sales pitch.
21 | 
22 | Anyone with a Claude or OpenAI subscription can easily follow along knowing no Elixir.
23 | 
24 | ## Work?
25 | Applications must do work. This is true of just about any program that reaches a sufficient size. In order to do that work, sometimes it's desirable to have it happen *elsewhere*. If you have built software, you have probably needed a background job.
26 | 
27 | In this situation, you are fundamentally using code to run other code. Erlang has a nice format for this, called the Erlang Term Format: it can store code and data in a way that can be passed around and run by other nodes. We are going to examine doing this in Elixir with "tools in the shed". We will have a single dependency called `gen_stage` that is built and maintained by the language's creator, Jose Valim.
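To make that concrete before any GenStage appears, here is a tiny, hypothetical sketch (the `Greeter` module is a placeholder) of the trick this whole post leans on: an `{module, function, args}` tuple can be turned into bytes with the Erlang Term Format, stored or shipped anywhere, and applied later:

```elixir
# "Call Greeter.hello("Dave") later, somewhere else" - serialized to plain bytes
payload = :erlang.term_to_binary({Greeter, :hello, ["Dave"]})

# ...the binary can sit in a database column, cross the network, or land on another node...

{module, function, args} = :erlang.binary_to_term(payload)
apply(module, function, args)
```

This is exactly what our job queue will do later, just with a table in front of it.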
28 | 
29 | For beginners, we will first cover a bit about Elixir and what it offers that might make this appealing.
30 | 
31 | ## The Landscape of Job Processing
32 | 
33 | In Ruby, you might reach for Sidekiq. It's battle-tested, using Redis for storage and threads for concurrency. Jobs are JSON objects, workers pull from queues, and if something crashes, you hope your monitoring catches it. It works well until you need to scale beyond a single Redis instance or handle complex job dependencies.
34 | 
35 | Python developers often turn to Celery. It's more distributed by design, supporting multiple brokers and result backends. But the complexity shows - you're configuring RabbitMQ, dealing with serialization formats, and debugging issues across multiple moving parts. When a worker dies mid-job, recovery depends on how well you've configured acknowledgments and retries.
36 | 
37 | Go developers might use machinery or asynq, leveraging goroutines for concurrency. The static typing helps catch errors early, but you're still manually managing worker pools and carefully handling panics to prevent the whole process from dying.
38 | 
39 | Each solution reflects its language's strengths and limitations. They all converge on similar patterns: a persistent queue, worker processes, and lots of defensive programming. What if the language itself provided better primitives for this problem?
40 | 
41 | # Thinking About Job Runners, Producers, Consumers, and Events
42 | 
43 | ## The Architecture of Work
44 | 
45 | At its core, a job runner is a meta-concept. It is code that runs code. There will always be work to be done in any given system that has users. But ensuring work gets done when it cannot be handled in a blocking, synchronous manner (where you have the time to await results) is much harder than it looks. The devil is in these details. How do you handle failure? What is our plan when we have a situation that could overwhelm our worker pool? We will seek out answers to these questions as we dive in.
46 | 
47 | GenStage answers the questions we have asked so far, in general, with demand-driven architecture. Instead of pushing work out, workers pull when *they* are ready. This inversion becomes a very elegant abstraction in practice.
48 | 
49 | ## Understanding Producer-Consumer Patterns
50 | 
51 | The producer-consumer pattern isn't unique to Elixir. It's a fundamental pattern in distributed systems:
52 | 
53 | **In Apache Spark**, RDDs (Resilient Distributed Datasets) flow through transformations. Each transformation is essentially a consumer of the previous stage and a producer for the next. Spark handles backpressure through its task scheduler - if executors are busy, new tasks wait.
54 | 
55 | **In Kafka Streams**, topics act as buffers between producers and consumers. Consumers track their offset, pulling messages at their own pace. The broker handles persistence and replication.
56 | 
57 | **In Go channels**, goroutines communicate through typed channels. A goroutine blocks when sending to a full channel or receiving from an empty one. This provides natural backpressure but requires careful capacity planning.
58 | 
59 | GenStage takes a different approach. There are no intermediate buffers or brokers. Producers and consumers negotiate directly:
60 | 
61 | 1. Consumer asks producer for work (specifying how much it can handle)
62 | 2. Producer responds with up to that many events
63 | 3. Consumer processes events and asks for more
64 | 
65 | This creates a pull-based system with automatic flow control.
No queues filling up, no brokers to manage, no capacity planning. The system self-regulates based on actual processing speed. 66 | 67 | ## What We're Actually Building 68 | 69 | ### Why Elixir Works for Job Processing 70 | 71 | **Processes are the unit of concurrency.** Not threads, not coroutines - processes. Each process has its own heap, runs concurrently, and can't corrupt another's memory. Starting one is measured in microseconds and takes about 2KB of memory. You don't manage a pool of workers; you spawn a process per job. 72 | 73 | **Failure is isolated by default.** When a process crashes, it dies alone. No corrupted global state, no locked mutexes, no zombie threads. The supervisor sees the death, logs it, and starts a fresh process. Your job processor doesn't need defensive try-catch blocks everywhere - it needs a good supervision tree. 74 | 75 | **Message passing is the only way to communicate.** No shared memory means no locks, no race conditions, no memory barriers. A process either receives a message or it doesn't. This constraint simplifies concurrent programming dramatically - you can reason about each process in isolation. 76 | 77 | **The scheduler handles fairness.** The BEAM VM runs its own scheduler, preemptively switching between processes every 2000 reductions. One process can't starve others by hogging the CPU. This is why Phoenix can handle millions of WebSocket connections - each connection is just another lightweight process. 78 | 79 | **Distribution is built-in.** Connect nodes with one function call. Send messages across the network with the same syntax as local messages. The Erlang Term Format serializes any data structure, including function references. Your job queue can span multiple machines without changing the core logic. 80 | 81 | **Hot code reloading works.** Deploy new code without stopping the system. The BEAM can run two versions of a module simultaneously, migrating processes gracefully. Your job processor can be upgraded while it's processing jobs. 82 | 83 | **Introspection is exceptional.** Connect to a running system and inspect any process. See its message queue, memory usage, current function. The observer GUI shows your entire system's health in real-time. When production misbehaves, you can debug it live. 84 | 85 | These aren't features bolted on top - they're fundamental to how the BEAM VM works. When you build a job processor in Elixir, you're not fighting the language to achieve reliability and concurrency. You're using it as designed. 86 | 87 | Our job runner will have three core components: 88 | 89 | **Producers** - These generate or fetch work. In our case, they'll pull jobs from a database table. A producer doesn't decide who gets work - it simply responds to demand. When a consumer asks for 10 jobs, the producer queries the database for 10 unclaimed jobs and returns them. 90 | 91 | **Consumers** - These execute jobs. Each consumer is a separate Elixir process, isolated from others. When a consumer is ready for work, it asks its producer for events. After processing, it asks for more. If a consumer crashes while processing a job, only that job is affected. 92 | 93 | **Events** - The unit of work flowing through the system. In GenStage, everything is an event. For our job runner, an event is a job to be executed. Events flow from producers to consumers based on demand, never faster than consumers can handle. 
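As a preview, here is a rough skeleton of how those three roles are declared (module names are placeholders, and we will build real versions shortly). The producer-consumer is included only to show the shape of a middle stage:

```elixir
defmodule MyPipeline.Producer do
  use GenStage
  def init(state), do: {:producer, state}
  # Return up to `demand` events when a consumer asks
  def handle_demand(_demand, state), do: {:noreply, [], state}
end

defmodule MyPipeline.Transformer do
  use GenStage
  def init(state), do: {:producer_consumer, state, subscribe_to: [MyPipeline.Producer]}
  # Receive events from upstream, transform them, emit them downstream
  def handle_events(events, _from, state), do: {:noreply, events, state}
end

defmodule MyPipeline.Consumer do
  use GenStage
  def init(state), do: {:consumer, state, subscribe_to: [MyPipeline.Transformer]}
  # Receive events and do the actual work; consumers never emit
  def handle_events(_events, _from, state), do: {:noreply, [], state}
end
```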
94 | 95 | ## The Beauty of Modeling Everything as Events 96 | 97 | When you model work as events, powerful patterns emerge: 98 | 99 | **Composition** - You can chain stages together. A consumer can also be a producer for another stage. Want to add a step that enriches jobs before execution? Insert a producer-consumer between your current stages. 100 | 101 | **Fan-out/Fan-in** - One producer can feed multiple consumers (fan-out). Multiple producers can feed one consumer (fan-in). The demand mechanism ensures fair distribution. 102 | 103 | **Buffering** - Need a buffer? Add a producer-consumer that accumulates events before passing them on. The buffer only fills as fast as downstream consumers can drain it. 104 | 105 | **Filtering** - A producer-consumer can selectively forward events. Only want to process high-priority jobs? Filter them in a middle stage. 106 | 107 | ``` 108 | Event Flow Pipeline: Social Media Processing 109 | 110 | [BlueSky] ──┐ 111 | [Twitter] ──┼──→ [Producer] ═══→ [ProducerConsumer] ═══→ [Consumer] ──→ [Database] 112 | [TikTok] ──┘ (transformation) 113 | 114 | Flow: Social media posts → Producer → Transformation → Consumer → Storage 115 | ``` 116 | 117 | ## Why This Matters for Job Processing 118 | 119 | Traditional job processors push jobs into queues. Workers poll these queues, hoping to grab work. This creates several problems: 120 | 121 | 1. **Queue overflow** - Producers can overwhelm the queue if consumers are slow 122 | 2. **Unfair distribution** - Fast workers might grab all the work 123 | 3. **Visibility** - Hard to see where bottlenecks are 124 | 4. **Error handling** - What happens to in-flight jobs when a worker dies? 125 | 126 | GenStage's demand-driven model solves these elegantly: 127 | 128 | 1. **No overflow** - Producers only generate what's demanded 129 | 2. **Fair distribution** - Each consumer gets what it asks for 130 | 3. **Clear bottlenecks** - Slow stages naturally build up demand 131 | 4. **Clean errors** - Crashed consumers simply stop demanding; their work remains unclaimed 132 | 133 | This isn't theoretical. Telecom systems have used these patterns for decades. When you make a phone call, switches don't push calls through the network - each hop pulls when ready. This prevents network overload even during disasters when everyone tries to call at once. 134 | 135 | We're applying the same battle-tested patterns to job processing. The result is a system that's naturally resilient, self-balancing, and surprisingly simple to reason about. 136 | 137 | Ready to see how this translates to code? Let's build our first producer. 138 | 139 | # Building the Foundation 140 | 141 | ## Step 1: Creating Your Phoenix Project 142 | 143 | Let's start fresh with a new Phoenix project. Open your terminal and run: 144 | 145 | ``` 146 | mix phx.new job_processor --live 147 | cd job_processor 148 | ``` 149 | 150 | We're keeping it lean - no dashboard or mailer for now. When prompted to install dependencies, say yes. 151 | 152 | Why Phoenix? We're not building a web app, but Phoenix gives us: 153 | - A supervision tree already set up 154 | - Configuration management 155 | - A database connection (Ecto) 156 | - LiveView for our monitoring dashboard (later) 157 | 158 | Think of Phoenix as our application framework, not just a web framework. 
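Concretely, the "supervision tree already set up" part lives in `lib/job_processor/application.ex`, and that file is where our stages will eventually go. It looks roughly like this (the exact children vary a bit by Phoenix version):

```elixir
defmodule JobProcessor.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      JobProcessorWeb.Telemetry,
      JobProcessor.Repo,
      {Phoenix.PubSub, name: JobProcessor.PubSub},
      # ...other generated children...
      JobProcessorWeb.Endpoint
    ]

    # Everything in this app, including the stages we add later, hangs off this supervisor
    opts = [strategy: :one_for_one, name: JobProcessor.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```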
159 | 160 | ## Step 2: Adding GenStage 161 | 162 | Open `mix.exs` and add GenStage to your dependencies: 163 | 164 | ```elixir 165 | defp deps do 166 | [ 167 | {:phoenix, "~> 1.7.12"}, 168 | {:phoenix_ecto, "~> 4.5"}, 169 | {:ecto_sql, "~> 3.11"}, 170 | {:postgrex, ">= 0.0.0"}, 171 | {:phoenix_html, "~> 4.1"}, 172 | {:phoenix_live_reload, "~> 1.2", only: :dev}, 173 | {:phoenix_live_view, "~> 0.20.14"}, 174 | {:telemetry_metrics, "~> 0.6"}, 175 | {:telemetry_poller, "~> 1.0"}, 176 | {:jason, "~> 1.2"}, 177 | {:bandit, "~> 1.2"}, 178 | 179 | # Add this line 180 | {:gen_stage, "~> 1.2"} 181 | ] 182 | end 183 | ``` 184 | 185 | Now fetch the dependency: 186 | 187 | ``` 188 | mix deps.get 189 | ``` 190 | 191 | That's it. One dependency. GenStage is maintained by the Elixir core team, so it follows the same design principles as the language itself. 192 | 193 | ## Step 3: Understanding GenStage's Mental Model 194 | 195 | Before we write code, let's cement the mental model. GenStage orchestrates three types of processes: 196 | 197 | **Producers** emit events. They don't push events anywhere - they hold them until a consumer asks. Think of a producer as a lazy river of data. The water (events) only flows when someone downstream opens a valve (demands). 198 | 199 | **Consumers** receive events. They explicitly ask producers for a specific number of events. This is the key insight: consumers control the flow rate, not producers. 200 | 201 | **Producer-Consumers** do both. They receive events from upstream, transform them, and emit to downstream. Perfect for building pipelines. 202 | 203 | Every GenStage process follows this lifecycle: 204 | 1. Start and connect to other stages 205 | 2. Consumer sends demand upstream 206 | 3. Producer receives demand and emits events 207 | 4. Consumer receives and processes events 208 | 5. Repeat from step 2 209 | 210 | The demand mechanism is what makes this special. In a traditional queue, you might have: 211 | 212 | ```elixir 213 | # Traditional approach - producer decides when to push 214 | loop do 215 | job = create_job() 216 | Queue.push(job) # What if queue is full? 217 | end 218 | ``` 219 | 220 | With GenStage: 221 | 222 | ```elixir 223 | # GenStage approach - consumer decides when to pull 224 | def handle_demand(demand, state) do 225 | jobs = create_jobs(demand) # Only create what's asked for 226 | {:noreply, jobs, state} 227 | end 228 | ``` 229 | 230 | The consumer is in control. It's impossible to overwhelm a consumer because it only gets what it asked for. 231 | 232 | ## Step 4: Creating the Producer 233 | 234 | Now for the meat of it. Let's build a producer that understands our job processing needs. Create a new file at `lib/job_processor/producer.ex`: 235 | 236 | ```elixir 237 | defmodule JobProcessor.Producer do 238 | use GenStage 239 | require Logger 240 | 241 | @doc """ 242 | Starts the producer with an initial state. 243 | 244 | The state can be anything, but we'll use a counter to start simple. 
245 | """ 246 | def start_link(initial \\ 0) do 247 | GenStage.start_link(__MODULE__, initial, name: __MODULE__) 248 | end 249 | 250 | @impl true 251 | def init(counter) do 252 | Logger.info("Producer starting with counter: #{counter}") 253 | {:producer, counter} 254 | end 255 | 256 | @impl true 257 | def handle_demand(demand, state) do 258 | Logger.info("Producer received demand for #{demand} events") 259 | 260 | # Generate events to fulfill demand 261 | events = Enum.to_list(state..(state + demand - 1)) 262 | 263 | # Update our state 264 | new_state = state + demand 265 | 266 | # Return events and new state 267 | {:noreply, events, new_state} 268 | end 269 | end 270 | ``` 271 | 272 | Let's dissect this line by line: 273 | 274 | **`use GenStage`** - This macro brings in the GenStage behavior. It's like `use GenServer` but for stages. It requires us to implement certain callbacks. 275 | 276 | **`start_link/1`** - Standard OTP pattern. We name the process after its module so we can find it easily. In production, you might want multiple producers, so you'd make the name configurable. 277 | 278 | **`init/1`** - The crucial part: `{:producer, counter}`. The first element declares this as a producer. The second is our initial state. GenStage now knows this process will emit events when asked. 279 | 280 | **`handle_demand/2`** - The heart of a producer. This callback fires when consumers ask for events. The arguments are: 281 | - `demand` - How many events the consumer wants 282 | - `state` - Our current state 283 | 284 | The return value `{:noreply, events, new_state}` means: 285 | - `:noreply` - We're responding to demand, not a synchronous call 286 | - `events` - The list of events to emit (must be a list) 287 | - `new_state` - Our updated state 288 | 289 | ### The Demand Buffer 290 | 291 | Here's something subtle but important: GenStage maintains an internal demand buffer. If multiple consumers ask for events before you can fulfill them, GenStage aggregates the demand. 292 | 293 | For example: 294 | 1. Consumer A asks for 10 events 295 | 2. Consumer B asks for 5 events 296 | 3. Your `handle_demand/2` receives demand for 15 events 297 | 298 | This batching is efficient and prevents your producer from being called repeatedly for small demands. 299 | 300 | ### What if You Can't Fulfill Demand? 301 | 302 | Sometimes you can't produce as many events as demanded. That's fine: 303 | 304 | ```elixir 305 | def handle_demand(demand, state) do 306 | available = calculate_available_work() 307 | 308 | if available >= demand do 309 | events = fetch_events(demand) 310 | {:noreply, events, state} 311 | else 312 | # Can only partially fulfill demand 313 | events = fetch_events(available) 314 | {:noreply, events, state} 315 | end 316 | end 317 | ``` 318 | 319 | GenStage tracks unfulfilled demand. If you return fewer events than demanded, it remembers. The next time you have events available, you can emit them even without new demand: 320 | 321 | ```elixir 322 | def handle_info(:new_data_available, state) do 323 | events = fetch_available_events() 324 | {:noreply, events, state} 325 | end 326 | ``` 327 | 328 | ### Producer Patterns 329 | 330 | Our simple counter producer is just the beginning. 
Real-world producers follow several patterns:
331 | 
332 | **Database Polling Producer:**
333 | ```elixir
334 | def handle_demand(demand, state) do
335 |   # Claim rows inside a transaction so the SKIP LOCKED row locks hold until the update
336 |   {:ok, jobs} =
337 |     Repo.transaction(fn ->
338 |       jobs =
339 |         Repo.all(
340 |           from j in Job, where: j.status == "pending", limit: ^demand, lock: "FOR UPDATE SKIP LOCKED"
341 |         )
342 | 
343 |       job_ids = Enum.map(jobs, & &1.id)
344 |       Repo.update_all(from(j in Job, where: j.id in ^job_ids), set: [status: "processing"])
345 | 
346 |       jobs
347 |     end)
348 | 
349 |   {:noreply, jobs, state}
350 | end
351 | ```
352 | 
353 | **Rate-Limited Producer:**
354 | ```elixir
355 | def handle_demand(demand, %{rate_limit: limit} = state) do
356 |   now = System.monotonic_time(:millisecond)
357 |   time_passed = now - state.last_emit
358 | 
359 |   allowed = min(demand, div(time_passed * limit, 1000))
360 | 
361 |   if allowed > 0 do
362 |     events = generate_events(allowed)
363 |     {:noreply, events, %{state | last_emit: now}}
364 |   else
365 |     # Schedule retry
366 |     Process.send_after(self(), :retry_demand, 100)
367 |     {:noreply, [], state}
368 |   end
369 | end
370 | ```
371 | 
372 | **Buffering Producer:**
373 | ```elixir
374 | def handle_demand(demand, %{buffer: buffer} = state) do
375 |   {to_emit, remaining} = Enum.split(buffer, demand)
376 | 
377 |   if length(to_emit) < demand do
378 |     # Buffer exhausted, try to refill
379 |     new_events = fetch_more_events()
380 |     all_events = to_emit ++ new_events
381 |     {to_emit_now, to_buffer} = Enum.split(all_events, demand)
382 |     {:noreply, to_emit_now, %{state | buffer: to_buffer}}
383 |   else
384 |     {:noreply, to_emit, %{state | buffer: remaining}}
385 |   end
386 | end
387 | ```
388 | 
389 | ### Testing Your Producer
390 | 
391 | Let's make sure our producer works. Create `test/job_processor/producer_test.exs`:
392 | 
393 | ```elixir
394 | defmodule JobProcessor.ProducerTest do
395 |   use ExUnit.Case
396 |   alias JobProcessor.Producer
397 | 
398 |   test "producer emits events on demand" do
399 |     {:ok, producer} = Producer.start_link(0)
400 |     ref = make_ref()
401 |     # Subscribe and ask for events using GenStage's documented message protocol
402 |     send(producer, {:"$gen_producer", {self(), ref}, {:subscribe, nil, []}})
403 |     send(producer, {:"$gen_producer", {self(), ref}, {:ask, 5}})
404 |     # We should receive 5 events (0 through 4)
405 |     assert_receive {:"$gen_consumer", {_, ^ref}, [0, 1, 2, 3, 4]}
406 |   end
407 | 
408 |   test "producer maintains state across demands" do
409 |     {:ok, producer} = Producer.start_link(10)
410 |     ref = make_ref()
411 |     send(producer, {:"$gen_producer", {self(), ref}, {:subscribe, nil, []}})
412 |     # First demand
413 |     send(producer, {:"$gen_producer", {self(), ref}, {:ask, 3}})
414 |     assert_receive {:"$gen_consumer", {_, ^ref}, [10, 11, 12]}
415 |     # Second demand should continue from where we left off
416 |     send(producer, {:"$gen_producer", {self(), ref}, {:ask, 2}})
417 |     assert_receive {:"$gen_consumer", {_, ^ref}, [13, 14]}
418 |   end
419 | end
420 | ```
421 | 
422 | Run the tests with `mix test`.
423 | 
424 | ### The Power of Stateful Producers
425 | 
426 | Our producer maintains state - a simple counter. But state can be anything:
427 | 
428 | - A database connection for polling
429 | - A buffer of pre-fetched events
430 | - Rate limiting information
431 | - Metrics and telemetry data
432 | 
433 | Because each producer is just an Erlang process, it's isolated. If one producer crashes, others continue. The supervisor restarts the crashed producer with a fresh state.
434 | 
435 | This is different from thread-based systems where shared state requires locks. Each producer owns its state exclusively. No locks, no race conditions, no defensive programming.
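As one more illustration of that last bullet - metrics living right in the producer's state - here is a hedged sketch (it assumes the state is a map rather than the bare counter we used, and the event name is made up) of a producer that counts what it has served and reports each demand via `:telemetry`, which a Phoenix app already has available:

```elixir
def handle_demand(demand, %{counter: counter, served: served} = state) do
  events = Enum.to_list(counter..(counter + demand - 1))

  # Handlers attached to this event can turn it into logs, metrics, or dashboards
  :telemetry.execute([:job_processor, :producer, :served], %{count: demand}, %{})

  {:noreply, events, %{state | counter: counter + demand, served: served + demand}}
end
```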
436 | 437 | ### What We've Built 438 | 439 | Our producer is deceptively simple, but it demonstrates core principles: 440 | 441 | 1. **Demand-driven** - Only produces when asked 442 | 2. **Stateful** - Maintains its own isolated state 443 | 3. **Supervised** - Can crash and restart safely 444 | 4. **Testable** - Easy to verify behavior 445 | 446 | In the next section, we'll build consumers that process these events. But the producer is the foundation - it controls the flow of work through our system. 447 | 448 | # Building A Consumer 449 | 450 | Now that we have a producer emitting events, we need something to consume them. This is where consumers come in - they're the workers that actually process the events flowing through our system. 451 | 452 | But here's the beautiful thing about GenStage consumers: they're not passive recipients waiting for work to be thrown at them. They're active participants in the flow control. A consumer decides how much work it can handle and explicitly asks for that amount. No more, no less. 453 | 454 | Think about how this changes the dynamics. In a traditional message queue, producers blast messages into a queue, hoping consumers can keep up. If consumers fall behind, the queue grows. If consumers are faster than expected, they sit idle waiting for work. It's a constant balancing act with lots of manual tuning. 455 | 456 | GenStage flips this completely. Consumers know their own capacity better than anyone else. They know if they're currently processing a heavy job, if they're running low on memory, or if they're about to restart. So they ask for exactly what they can handle right now. 457 | 458 | ## The Consumer's Lifecycle 459 | 460 | A GenStage consumer follows a simple but powerful lifecycle: 461 | 462 | 1. **Subscribe** - Connect to one or more producers 463 | 2. **Demand** - Ask for a specific number of events 464 | 3. **Receive** - Get events from producers (never more than requested) 465 | 4. **Process** - Handle each event 466 | 5. **Repeat** - Ask for more events when ready 467 | 468 | The key insight is step 4: processing happens between demands. The consumer processes its current batch completely before asking for more. This creates natural backpressure - slow consumers automatically reduce the flow rate. 469 | 470 | ## Building Our First Consumer 471 | 472 | Let's build a consumer that processes the events from our producer. Create a new file at `lib/job_processor/consumer.ex`: 473 | 474 | ```elixir 475 | defmodule JobProcessor.Consumer do 476 | use GenStage 477 | require Logger 478 | 479 | @doc """ 480 | Starts the consumer. 481 | 482 | Like producers, consumers are just GenServer-like processes. 483 | The state can be anything you need for processing. 
484 | """ 485 | def start_link(opts \\ []) do 486 | GenStage.start_link(__MODULE__, opts) 487 | end 488 | 489 | @impl true 490 | def init(opts) do 491 | # The key difference: we declare ourselves as a :consumer 492 | # and specify which producer(s) to subscribe to 493 | {:consumer, opts, subscribe_to: [JobProcessor.Producer]} 494 | end 495 | 496 | @impl true 497 | def handle_events(events, _from, state) do 498 | Logger.info("Consumer received #{length(events)} events") 499 | 500 | # Process each event 501 | for event <- events do 502 | process_event(event, state) 503 | end 504 | 505 | # Always return {:noreply, [], state} for consumers 506 | # The empty list means we don't emit any events (we're not a producer) 507 | {:noreply, [], state} 508 | end 509 | 510 | defp process_event(event, state) do 511 | # For now, just log what we received 512 | Logger.info("Processing event: #{event}") 513 | IO.inspect({self(), event, state}, label: "Consumer processed") 514 | end 515 | end 516 | ``` 517 | 518 | ## Understanding the Consumer Architecture 519 | 520 | Let's break down what makes this consumer work: 521 | 522 | **`use GenStage`** - Just like producers, consumers use the GenStage behavior. But the callbacks they implement are different. 523 | 524 | **`init/1` returns `{:consumer, state, options}`** - The crucial difference from producers. The first element declares this process as a consumer. The `subscribe_to` option tells GenStage which producers to connect to. 525 | 526 | **`handle_events/3` instead of `handle_demand/2`** - Consumers implement `handle_events/3`, which receives: 527 | - `events` - The list of events to process 528 | - `from` - Which producer sent these events (usually ignored) 529 | - `state` - The consumer's current state 530 | 531 | **The return value `{:noreply, [], state}`** - Consumers don't emit events (that's producers' job), so the events list is always empty. They just process and update their state. 532 | 533 | ## The Magic of Subscription 534 | 535 | Notice the `subscribe_to: [JobProcessor.Producer]` option. This does several important things: 536 | 537 | **Automatic connection** - GenStage handles finding and connecting to the producer. No manual process linking or monitoring. 538 | 539 | **Automatic demand** - The consumer automatically asks the producer for events. By default, it requests batches of up to 1000 events, but you can tune this. 540 | 541 | **Fault tolerance** - If the producer crashes and restarts, the consumer automatically reconnects. If the consumer crashes, it doesn't take down the producer. 542 | 543 | **Flow control** - The consumer won't receive more events than it asks for. If it's slow processing the current batch, no new events arrive until it's ready. 544 | 545 | ## Tuning Consumer Demand 546 | 547 | You can control how many events a consumer requests at once: 548 | 549 | ```elixir 550 | def init(opts) do 551 | {:consumer, opts, 552 | subscribe_to: [ 553 | {JobProcessor.Producer, min_demand: 5, max_demand: 50} 554 | ]} 555 | end 556 | ``` 557 | 558 | **`min_demand`** - Don't ask for more events until we have fewer than this many 559 | **`max_demand`** - Never ask for more than this many events at once 560 | 561 | This creates a buffering effect. The consumer will receive events in batches between min_demand and max_demand, giving you control over throughput vs. latency tradeoffs. 
562 | 
563 | For job processing, you might want smaller batches to reduce memory usage:
564 | 
565 | ```elixir
566 | subscribe_to: [
567 |   {JobProcessor.Producer, min_demand: 1, max_demand: 10}
568 | ]
569 | ```
570 | 
571 | Or larger batches for higher throughput:
572 | 
573 | ```elixir
574 | subscribe_to: [
575 |   {JobProcessor.Producer, min_demand: 100, max_demand: 1000}
576 | ]
577 | ```
578 | 
579 | ## Why This Design Matters
580 | 
581 | The producer-consumer subscription model solves several classic distributed systems problems:
582 | 
583 | **Backpressure** - Slow consumers naturally slow down the entire pipeline. No queues overflow, no memory explosions.
584 | 
585 | **Dynamic scaling** - Add more consumers and they automatically start receiving events. Remove consumers and the remaining ones pick up the slack.
586 | 
587 | **Fault isolation** - A crashing consumer doesn't affect others. A crashing producer can be restarted without losing in-flight work.
588 | 
589 | **Observable performance** - You can see exactly where bottlenecks are by monitoring demand patterns. High accumulated demand = bottleneck downstream.
590 | 
591 | ## Consumer Patterns
592 | 
593 | Real-world consumers follow several common patterns:
594 | 
595 | **Database Writing Consumer:**
596 | ```elixir
597 | def handle_events(events, _from, state) do
598 |   # Batch insert for efficiency
599 |   records = Enum.map(events, &transform_event/1)
600 |   Repo.insert_all(MyTable, records)
601 | 
602 |   {:noreply, [], state}
603 | end
604 | ```
605 | 
606 | **HTTP API Consumer:**
607 | ```elixir
608 | def handle_events(events, _from, state) do
609 |   for event <- events do
610 |     case HTTPoison.post(state.webhook_url, Jason.encode!(event)) do
611 |       {:ok, %{status_code: status}} when status in 200..299 -> :ok
612 |       other -> Logger.error("Webhook failed: #{inspect(other)}")
613 |     end
614 |   end
615 | 
616 |   {:noreply, [], state}
617 | end
618 | ```
619 | 
620 | **File Processing Consumer:**
621 | ```elixir
622 | def handle_events(events, _from, state) do
623 |   for event <- events do
624 |     file_path = "/tmp/processed_#{event.id}.json"
625 |     File.write!(file_path, Jason.encode!(event))
626 |   end
627 | 
628 |   {:noreply, [], state}
629 | end
630 | ```
631 | 
632 | ## Error Handling in Consumers
633 | 
634 | What happens when event processing fails? In traditional queue systems, you need complex retry logic, dead letter queues, and careful state management.
635 | 
636 | With GenStage consumers, it's simpler to reason about. If a consumer crashes while processing events, it dies alone and the supervisor restarts it with a clean state. Note that GenStage itself does not redeliver the in-flight events - which is exactly why our job runner will track job status in the database. A job that was claimed but never completed can be detected and re-queued instead of silently disappearing.
637 | 
638 | For more sophisticated error handling, you can catch exceptions:
639 | 
640 | ```elixir
641 | def handle_events(events, _from, state) do
642 |   for event <- events do
643 |     try do
644 |       process_event(event)
645 |     rescue
646 |       e ->
647 |         Logger.error("Failed to process event #{event.id}: #{inspect(e)}")
648 |         # Could send to dead letter queue, retry later, etc.
649 |     end
650 |   end
651 | 
652 |   {:noreply, [], state}
653 | end
654 | ```
655 | 
656 | But often, letting the process crash and restart is the right approach. It's simple, it clears any corrupted state, and the supervisor handles the restart automatically.
657 | 
658 | # Wiring It Together
659 | 
660 | Now we have both pieces: a producer that emits events and a consumer that processes them. But they're just modules sitting in files. We need to start them as processes and connect them.
661 | 662 | This is where OTP's supervision trees shine. We'll add both processes to our application's supervision tree, and OTP will ensure they start in the right order and restart if they crash. 663 | 664 | Open `lib/job_processor/application.ex` and modify the `start/2` function: 665 | 666 | ```elixir 667 | def start(_type, _args) do 668 | children = [ 669 | # Start the Producer first 670 | JobProcessor.Producer, 671 | 672 | # Then start the Consumer 673 | # The consumer will automatically connect to the producer 674 | JobProcessor.Consumer, 675 | 676 | # Other children like Ecto, Phoenix endpoint, etc. 677 | JobProcessorWeb.Endpoint 678 | ] 679 | 680 | opts = [strategy: :one_for_one, name: JobProcessor.Supervisor] 681 | Supervisor.start_link(children, opts) 682 | end 683 | ``` 684 | 685 | That's it! The supervision tree will: 686 | 687 | 1. Start the producer 688 | 2. Start the consumer 689 | 3. The consumer automatically subscribes to the producer 690 | 4. Events start flowing immediately 691 | 692 | ## Why This Supervision Strategy Works 693 | 694 | The `:one_for_one` strategy means if one process crashes, only that process is restarted. This is perfect for our producer-consumer setup: 695 | 696 | **Producer crashes** - The consumer notices the connection is lost and waits. When the supervisor restarts the producer, the consumer automatically reconnects. 697 | 698 | **Consumer crashes** - The producer keeps running, just stops emitting events. When the supervisor restarts the consumer, it reconnects and processing resumes. 699 | 700 | This is fault isolation in action. Problems in one part of the system don't cascade to other parts. 701 | 702 | ## Testing the Connection 703 | 704 | Let's see our producer and consumer working together. Start the application: 705 | 706 | ``` 707 | mix phx.server 708 | ``` 709 | 710 | You should see logs showing the consumer processing events from the producer. Each event will be displayed with the process ID, event number, and state - something like this: 711 | 712 | ``` 713 | Consumer processed: {#PID<0.234.0>, 0, []} 714 | Consumer processed: {#PID<0.234.0>, 1, []} 715 | Consumer processed: {#PID<0.234.0>, 2, []} 716 | Consumer processed: {#PID<0.234.0>, 3, []} 717 | Consumer processed: {#PID<0.234.0>, 4, []} 718 | ... 719 | ``` 720 | 721 | Notice something important: the same PID processes every event. This is because we have a single consumer. Our counter increments predictably from 0, 1, 2, 3, 4... and all events flow to the same process. 722 | 723 | ## The Single Consumer Scenario 724 | 725 | With one consumer, we get: 726 | - **Predictable ordering** - Events are processed in the exact order they're generated 727 | - **Sequential processing** - Each event is fully processed before the next one begins 728 | - **Simple state management** - Only one process to reason about 729 | - **Potential bottleneck** - If processing is slow, the entire pipeline slows down 730 | 731 | ``` 732 | Single Consumer Pattern: Sequential Processing 733 | 734 | [Producer] ──→ ⓪ ──→ ① ──→ ② ──→ ③ ──→ ④ ──→ ⑤ ──→ [Consumer] 735 | (Emits 0,1,2,3,4...) (Processes Sequentially) 736 | 737 | Timeline: 738 | t0: ████████ Process Event 0 739 | t1: ████████ Process Event 1 740 | t2: ████████ Process Event 2 741 | t3: ░░░░░░░░░░░░░░░░░░░ Events 3,4,5... 
waiting
742 | 
743 | Key Characteristics:
744 | ✓ Predictable ordering - Events processed in exact sequence
745 | ✓ Sequential processing - One event completes before next begins
746 | ✓ Simple state management - Single process to track
747 | ⚠ Potential bottleneck - Slow processing blocks entire pipeline
748 | ```
749 | 
750 | This is perfect for scenarios where order matters or when you're just getting started. But what happens when we add more consumers?
751 | 
752 | ## Scaling to Multiple Consumers
753 | 
754 | Let's see what happens with multiple consumers. Add this to your supervision tree in `lib/job_processor/application.ex`:
755 | 
756 | ```elixir
757 | def start(_type, _args) do
758 |   children = [
759 |     # Start the Producer first
760 |     JobProcessor.Producer,
761 | 
762 |     # Start multiple consumers - each child spec needs a unique id
763 |     Supervisor.child_spec({JobProcessor.Consumer, []}, id: :consumer_1),
764 |     Supervisor.child_spec({JobProcessor.Consumer, []}, id: :consumer_2),
765 |     Supervisor.child_spec({JobProcessor.Consumer, []}, id: :consumer_3),
766 | 
767 |     # Other children
768 |     JobProcessorWeb.Endpoint
769 |   ]
770 | 
771 |   opts = [strategy: :one_for_one, name: JobProcessor.Supervisor]
772 |   Supervisor.start_link(children, opts)
773 | end
774 | ```
775 | 
776 | Now restart your application and watch the logs:
777 | 
778 | ```
779 | Consumer processed: {#PID<0.234.0>, 0, []}
780 | Consumer processed: {#PID<0.234.0>, 1, []}
781 | Consumer processed: {#PID<0.235.0>, 2, []}
782 | Consumer processed: {#PID<0.236.0>, 3, []}
783 | Consumer processed: {#PID<0.234.0>, 4, []}
784 | Consumer processed: {#PID<0.235.0>, 5, []}
785 | ...
786 | ```
787 | 
788 | Notice the different PIDs! Events are now distributed across multiple consumer processes. The distribution depends on which consumer asks for work first and how fast each consumer processes its events.
789 | 
790 | ```
791 | Multiple Consumer Pattern: Parallel Processing with Load Balancing
792 | 
793 |                     ┌──→ Consumer 1 (PID <0.234.0>) ─→ Events: 0, 1, 4...
794 |                     │        ↑ demand
795 | [Producer] ─────────┼──→ Consumer 2 (PID <0.235.0>) ─→ Events: 2, 5, 8...
796 | (First-Come-        │        ↑ demand
797 |  First-Served)      └──→ Consumer 3 (PID <0.236.0>) ─→ Events: 3, 6, 7...
798 |                              ↑ demand
799 | 
800 | Timeline: Parallel Processing
801 | Consumer 1: ██████████ Event 0 ████████ Event 1 ████████████ Event 4
802 | Consumer 2:     ████████████ Event 2          ██████████ Event 5
803 | Consumer 3:   ████████ Event 3      ████████████ Event 6
804 | 
805 | ✓ Benefits:                            ⚠ Challenges:
806 | • Higher throughput - parallel         • No ordering guarantees
807 | • Fault tolerance - others continue    • Shared resource contention
808 | • Natural load balancing               • Debugging complexity
809 | • Better resource utilization          • Potential race conditions
810 | ```
811 | 
812 | ## Understanding Event Distribution
813 | 
814 | GenStage's default dispatcher (DemandDispatcher) uses a "first-come, first-served" approach:
815 | 
816 | 1. Consumer A finishes its current batch and asks for 10 more events
817 | 2. Producer sends events 0-9 to Consumer A
818 | 3. Consumer B asks for 10 events
819 | 4. Producer sends events 10-19 to Consumer B
820 | 5. Consumer A finishes and asks for more, gets events 20-29
821 | 
822 | This creates natural load balancing - faster consumers get more work. If Consumer A is processing heavy jobs slowly, Consumers B and C will pick up the slack.
823 | 824 | ## The Trade-offs 825 | 826 | **Benefits of Multiple Consumers:** 827 | - **Throughput** - More work gets done in parallel 828 | - **Fault tolerance** - If one consumer crashes, others continue 829 | - **Natural load balancing** - Fast consumers get more work 830 | - **Resource utilization** - Better use of multi-core systems 831 | 832 | **Challenges:** 833 | - **No ordering guarantees** - Event 5 might finish before event 3 834 | - **Shared resources** - Multiple consumers might compete for database connections 835 | - **Debugging complexity** - Multiple processes to track 836 | 837 | ## Different Distribution Strategies 838 | 839 | You can change how events are distributed by modifying the producer's dispatcher. Add this to your producer's `init/1` function: 840 | 841 | ```elixir 842 | def init(counter) do 843 | Logger.info("Producer starting with counter: #{counter}") 844 | {:producer, counter, dispatcher: GenStage.BroadcastDispatcher} 845 | end 846 | ``` 847 | 848 | Now restart and watch what happens: 849 | 850 | ``` 851 | Consumer processed: {#PID<0.234.0>, 0, []} 852 | Consumer processed: {#PID<0.235.0>, 0, []} 853 | Consumer processed: {#PID<0.236.0>, 0, []} 854 | Consumer processed: {#PID<0.234.0>, 1, []} 855 | Consumer processed: {#PID<0.235.0>, 1, []} 856 | Consumer processed: {#PID<0.236.0>, 1, []} 857 | ... 858 | ``` 859 | 860 | With BroadcastDispatcher, every consumer receives every event! This is useful for scenarios like: 861 | - Multiple consumers writing to different databases 862 | - One consumer processing events, another collecting metrics 863 | - Broadcasting notifications to multiple systems 864 | 865 | ``` 866 | BroadcastDispatcher: Every Consumer Receives Every Event 867 | 868 | ┌─→ Database Writer (PID <0.234.0>) ─→ Events: 0, 1, 2, 3... 869 | │ 870 | [Producer] ═══━━━━━━┼─→ Metrics Collector (PID <0.235.0>) ─→ Events: 0, 1, 2, 3... 871 | (Broadcasting) │ 872 | └─→ Notification Service (PID <0.236.0>) ─→ Events: 0, 1, 2, 3... 873 | 874 | Timeline: All Consumers Process Same Events Simultaneously 875 | Database Writer: ████████ Event 0 ████████ Event 1 ████████ Event 2 876 | Metrics Collector: ████████ Event 0 ████████ Event 1 ████████ Event 2 877 | Notification Service:████████ Event 0 ████████ Event 1 ████████ Event 2 878 | 879 | 🔄 Broadcasting Use Cases: 880 | • Multiple databases - Each consumer writes to different database 881 | • Parallel processing - One processes data, another collects metrics 882 | • Notification fanout - Broadcasting alerts to multiple services 883 | • Audit trails - Simultaneous logging to multiple destinations 884 | 885 | Key Differences from Load Balancing: 886 | ✓ Every consumer gets EVERY event (no distribution) 887 | ✓ Perfect for parallel processing different aspects 888 | ✓ Higher total throughput but more resource usage 889 | ⚠ N times more processing (N = number of consumers) 890 | ``` 891 | 892 | But we're still just processing numbers. In the next section, we'll replace our simple counter with a real job processing system that can execute arbitrary code. 893 | 894 | # From Toy Examples to Real Job Processing 895 | 896 | We've built a solid foundation with our producer-consumer setup, but we're still just processing incrementing numbers. That's useful for understanding the mechanics, but real job processing needs persistent storage, job queuing, and the ability to execute arbitrary code. 897 | 898 | This is where things get interesting. 
We're going to transform our simple counter into a full job processing system that can serialize function calls, store them in a database, and execute them across multiple workers. Think of it as building your own mini-Sidekiq, but with GenStage's elegant backpressure handling. 899 | 900 | ## Why We Need a Database 901 | 902 | Right now, our producer generates events from memory (a simple counter). But real job processors need persistence for several reasons: 903 | 904 | **Durability** - Jobs shouldn't disappear if the system restarts. When you queue a job to send an email, you expect it to survive server reboots. 905 | 906 | **Coordination** - Multiple producer processes might be running across different servers. They need a shared source of truth for what work exists. 907 | 908 | **Status tracking** - Jobs have lifecycles: queued, running, completed, failed. You need to track this state somewhere. 909 | 910 | **Debugging and monitoring** - When jobs fail, you need to see what went wrong and potentially retry them. 911 | 912 | The database becomes our job queue's persistent storage layer, but GenStage handles all the flow control and distribution logic. 913 | 914 | ## Setting Up Our Job Storage 915 | 916 | Since we're using Phoenix, we already have Ecto configured. But we need to set up our job storage table. The beauty of Elixir's job processing is that we can serialize entire function calls as binary data using the Erlang Term Format. 917 | 918 | Let's create a migration for our jobs table: 919 | 920 | ``` 921 | mix ecto.gen.migration create_jobs 922 | ``` 923 | 924 | Now edit the migration file: 925 | 926 | ```elixir 927 | defmodule JobProcessor.Repo.Migrations.CreateJobs do 928 | use Ecto.Migration 929 | 930 | def change do 931 | create table(:jobs) do 932 | add :status, :string, null: false, default: "queued" 933 | add :payload, :binary, null: false 934 | add :attempts, :integer, default: 0 935 | add :max_attempts, :integer, default: 3 936 | add :scheduled_at, :utc_datetime 937 | add :started_at, :utc_datetime 938 | add :completed_at, :utc_datetime 939 | add :error_message, :text 940 | 941 | timestamps() 942 | end 943 | 944 | create index(:jobs, [:status]) 945 | create index(:jobs, [:scheduled_at]) 946 | create index(:jobs, [:status, :scheduled_at]) 947 | end 948 | end 949 | ``` 950 | 951 | This gives us a robust job storage system: 952 | 953 | - **status** - Track job lifecycle (queued, running, completed, failed) 954 | - **payload** - The serialized function call 955 | - **attempts/max_attempts** - Retry logic 956 | - **scheduled_at** - Support for delayed jobs 957 | - **Timestamps** - Monitor performance and debug issues 958 | 959 | Run the migration: 960 | 961 | ``` 962 | mix ecto.migrate 963 | ``` 964 | 965 | ## Modeling Jobs 966 | 967 | Let's create an Ecto schema for our jobs. 
Create `lib/job_processor/job.ex`: 968 | 969 | ```elixir 970 | defmodule JobProcessor.Job do 971 | use Ecto.Schema 972 | import Ecto.Changeset 973 | 974 | schema "jobs" do 975 | field :status, :string, default: "queued" 976 | field :payload, :binary 977 | field :attempts, :integer, default: 0 978 | field :max_attempts, :integer, default: 3 979 | field :scheduled_at, :utc_datetime 980 | field :started_at, :utc_datetime 981 | field :completed_at, :utc_datetime 982 | field :error_message, :string 983 | 984 | timestamps() 985 | end 986 | 987 | def changeset(job, attrs) do 988 | job 989 | |> cast(attrs, [:status, :payload, :attempts, :max_attempts, 990 | :scheduled_at, :started_at, :completed_at, :error_message]) 991 | |> validate_required([:payload]) 992 | |> validate_inclusion(:status, ["queued", "running", "completed", "failed"]) 993 | end 994 | 995 | @doc """ 996 | Serialize a function call into a job payload. 997 | 998 | This is where the magic happens - we can serialize any module, function, 999 | and arguments into binary data that can be stored and executed later. 1000 | """ 1001 | def encode_job(module, function, args) do 1002 | {module, function, args} |> :erlang.term_to_binary() 1003 | end 1004 | 1005 | @doc """ 1006 | Deserialize a job payload back into a function call. 1007 | """ 1008 | def decode_job(payload) do 1009 | :erlang.binary_to_term(payload) 1010 | end 1011 | end 1012 | ``` 1013 | 1014 | ## Building the Job Queue Interface 1015 | 1016 | Now we need an interface for interacting with jobs. This is where we abstract the database operations and provide a clean API for enqueueing and processing jobs. Create `lib/job_processor/job_queue.ex`: 1017 | 1018 | ```elixir 1019 | defmodule JobProcessor.JobQueue do 1020 | import Ecto.Query 1021 | alias JobProcessor.{Repo, Job} 1022 | 1023 | @doc """ 1024 | Enqueue a job for processing. 1025 | 1026 | This is the public API that applications use to submit work. 1027 | """ 1028 | def enqueue(module, function, args, opts \\ []) do 1029 | payload = Job.encode_job(module, function, args) 1030 | 1031 | attrs = %{ 1032 | payload: payload, 1033 | max_attempts: Keyword.get(opts, :max_attempts, 3), 1034 | scheduled_at: Keyword.get(opts, :scheduled_at, DateTime.utc_now()) 1035 | } 1036 | 1037 | %Job{} 1038 | |> Job.changeset(attrs) 1039 | |> Repo.insert() 1040 | end 1041 | 1042 | @doc """ 1043 | Fetch available jobs for processing. 1044 | 1045 | This is called by our GenStage producer to get work. 1046 | Uses FOR UPDATE SKIP LOCKED to avoid race conditions. 1047 | """ 1048 | def fetch_jobs(limit) do 1049 | now = DateTime.utc_now() 1050 | 1051 | Repo.transaction(fn -> 1052 | # Find available jobs 1053 | job_ids = 1054 | from(j in Job, 1055 | where: j.status == "queued" and j.scheduled_at <= ^now, 1056 | limit: ^limit, 1057 | select: j.id, 1058 | lock: "FOR UPDATE SKIP LOCKED" 1059 | ) 1060 | |> Repo.all() 1061 | 1062 | # Mark them as running and return the full job data 1063 | {count, jobs} = 1064 | from(j in Job, where: j.id in ^job_ids) 1065 | |> Repo.update_all( 1066 | [set: [status: "running", started_at: DateTime.utc_now()]], 1067 | returning: [:id, :payload, :attempts, :max_attempts] 1068 | ) 1069 | 1070 | {count, jobs} 1071 | end) 1072 | end 1073 | 1074 | @doc """ 1075 | Mark a job as completed successfully. 
1076 | """ 1077 | def complete_job(job_id) do 1078 | from(j in Job, where: j.id == ^job_id) 1079 | |> Repo.update_all( 1080 | set: [status: "completed", completed_at: DateTime.utc_now()] 1081 | ) 1082 | end 1083 | 1084 | @doc """ 1085 | Mark a job as failed and handle retry logic. 1086 | """ 1087 | def fail_job(job_id, error_message, attempts \\ 1) do 1088 | job = Repo.get!(Job, job_id) 1089 | 1090 | if attempts >= job.max_attempts do 1091 | # Permanently failed 1092 | from(j in Job, where: j.id == ^job_id) 1093 | |> Repo.update_all( 1094 | set: [ 1095 | status: "failed", 1096 | error_message: error_message, 1097 | attempts: attempts, 1098 | completed_at: DateTime.utc_now() 1099 | ] 1100 | ) 1101 | else 1102 | # Retry later 1103 | retry_at = DateTime.add(DateTime.utc_now(), 60 * attempts, :second) 1104 | 1105 | from(j in Job, where: j.id == ^job_id) 1106 | |> Repo.update_all( 1107 | set: [ 1108 | status: "queued", 1109 | error_message: error_message, 1110 | attempts: attempts, 1111 | scheduled_at: retry_at 1112 | ] 1113 | ) 1114 | end 1115 | end 1116 | end 1117 | ``` 1118 | 1119 | ## The Power of FOR UPDATE SKIP LOCKED 1120 | 1121 | Notice that crucial line: `lock: "FOR UPDATE SKIP LOCKED"`. This is a PostgreSQL feature that's essential for job processing systems. 1122 | 1123 | Here's what happens without it: 1124 | 1. Consumer A queries for jobs, gets job #123 1125 | 2. Consumer B queries for jobs, gets the same job #123 1126 | 3. Both consumers try to process job #123 simultaneously 1127 | 4. Chaos ensues 1128 | 1129 | With `FOR UPDATE SKIP LOCKED`: 1130 | 1. Consumer A queries for jobs, locks job #123 1131 | 2. Consumer B queries for jobs, skips locked job #123, gets job #124 1132 | 3. Each job is processed exactly once 1133 | 4. No race conditions, no duplicate processing 1134 | 1135 | This is why PostgreSQL (and similar databases) are preferred for job processing systems. The database handles the coordination for us. 1136 | 1137 | ## Updating Our Producer 1138 | 1139 | Now we can update our producer to fetch real jobs from the database instead of generating counter events. 
Update `lib/job_processor/producer.ex`: 1140 | 1141 | ```elixir 1142 | defmodule JobProcessor.Producer do 1143 | use GenStage 1144 | require Logger 1145 | alias JobProcessor.JobQueue 1146 | 1147 | def start_link(_opts) do 1148 | GenStage.start_link(__MODULE__, :ok, name: __MODULE__) 1149 | end 1150 | 1151 | @impl true 1152 | def init(:ok) do 1153 | Logger.info("Job Producer starting") 1154 | {:producer, %{}, dispatcher: GenStage.DemandDispatcher} 1155 | end 1156 | 1157 | @impl true 1158 | def handle_demand(demand, state) when demand > 0 do 1159 | Logger.info("Producer received demand for #{demand} jobs") 1160 | 1161 | case JobQueue.fetch_jobs(demand) do 1162 | {:ok, {count, jobs}} when count > 0 -> 1163 | Logger.info("Fetched #{count} jobs from database") 1164 | {:noreply, jobs, state} 1165 | 1166 | {:ok, {0, []}} -> 1167 | # No jobs available, schedule a check for later 1168 | Process.send_after(self(), :check_for_jobs, 1000) 1169 | {:noreply, [], state} 1170 | 1171 | {:error, reason} -> 1172 | Logger.error("Failed to fetch jobs: #{inspect(reason)}") 1173 | {:noreply, [], state} 1174 | end 1175 | end 1176 | 1177 | @impl true 1178 | def handle_info(:check_for_jobs, state) do 1179 | # This allows us to produce events even when there's no pending demand 1180 | # if jobs become available 1181 | case JobQueue.fetch_jobs(10) do 1182 | {:ok, {count, jobs}} when count > 0 -> 1183 | {:noreply, jobs, state} 1184 | _ -> 1185 | Process.send_after(self(), :check_for_jobs, 1000) 1186 | {:noreply, [], state} 1187 | end 1188 | end 1189 | end 1190 | ``` 1191 | 1192 | ## Understanding the Producer's Evolution 1193 | 1194 | Our producer has evolved significantly: 1195 | 1196 | **Database-driven** - Instead of generating events from memory, we fetch them from persistent storage 1197 | 1198 | **Handles empty queues gracefully** - When no jobs are available, we schedule a check for later instead of blocking 1199 | 1200 | **Error handling** - Database operations can fail, so we handle those cases 1201 | 1202 | **Polling mechanism** - The `:check_for_jobs` message lets us produce events even when there's no pending demand 1203 | 1204 | This polling approach works well for most job processing systems. For higher throughput systems, you could use PostgreSQL's LISTEN/NOTIFY to get push notifications when new jobs arrive. 1205 | 1206 | ## Updating Our Consumer 1207 | 1208 | Now our consumer needs to execute real job payloads instead of just logging numbers. 
Update `lib/job_processor/consumer.ex`: 1209 | 1210 | ```elixir 1211 | defmodule JobProcessor.Consumer do 1212 | use GenStage 1213 | require Logger 1214 | alias JobProcessor.{Job, JobQueue} 1215 | 1216 | def start_link(opts) do 1217 | GenStage.start_link(__MODULE__, opts) 1218 | end 1219 | 1220 | @impl true 1221 | def init(opts) do 1222 | {:consumer, opts, subscribe_to: [JobProcessor.Producer]} 1223 | end 1224 | 1225 | @impl true 1226 | def handle_events(jobs, _from, state) do 1227 | Logger.info("Consumer received #{length(jobs)} jobs") 1228 | 1229 | for job <- jobs do 1230 | execute_job(job) 1231 | end 1232 | 1233 | {:noreply, [], state} 1234 | end 1235 | 1236 | defp execute_job(%{id: job_id, payload: payload, attempts: attempts}) do 1237 | try do 1238 | {module, function, args} = Job.decode_job(payload) 1239 | 1240 | Logger.info("Executing job #{job_id}: #{module}.#{function}") 1241 | 1242 | # Execute the job 1243 | result = apply(module, function, args) 1244 | 1245 | # Mark as completed 1246 | JobQueue.complete_job(job_id) 1247 | 1248 | Logger.info("Job #{job_id} completed successfully") 1249 | 1250 | result 1251 | rescue 1252 | error -> 1253 | error_message = Exception.format(:error, error, __STACKTRACE__) 1254 | Logger.error("Job #{job_id} failed: #{error_message}") 1255 | 1256 | # Mark as failed (with retry logic) 1257 | JobQueue.fail_job(job_id, error_message, attempts + 1) 1258 | end 1259 | end 1260 | end 1261 | ``` 1262 | 1263 | ## The Magic of Code Serialization 1264 | 1265 | The real power of this system is in those two lines: 1266 | 1267 | ```elixir 1268 | {module, function, args} = Job.decode_job(payload) 1269 | result = apply(module, function, args) 1270 | ``` 1271 | 1272 | We're deserializing a function call that was stored as binary data and executing it. This means you can queue any function call: 1273 | 1274 | ```elixir 1275 | # Send an email 1276 | JobQueue.enqueue(MyApp.Mailer, :send_welcome_email, [user_id: 123]) 1277 | 1278 | # Process an image 1279 | JobQueue.enqueue(MyApp.ImageProcessor, :resize_image, ["/path/to/image.jpg", 300, 200]) 1280 | 1281 | # Call an API 1282 | JobQueue.enqueue(MyApp.ApiClient, :sync_user_data, [user_id: 456]) 1283 | 1284 | # Even complex data structures 1285 | JobQueue.enqueue(MyApp.ReportGenerator, :generate_report, [%{ 1286 | user_id: 789, 1287 | date_range: Date.range(~D[2024-01-01], ~D[2024-01-31]), 1288 | format: :pdf 1289 | }]) 1290 | ``` 1291 | 1292 | Each of these becomes a row in the database, gets picked up by our GenStage producer, distributed to available consumers, and executed. The serialization handles all the complex data structures automatically. 1293 | 1294 | ## What We've Built 1295 | 1296 | We now have a complete job processing system with: 1297 | 1298 | - **Persistent storage** - Jobs survive restarts 1299 | - **Automatic retries** - Failed jobs are retried with exponential backoff 1300 | - **Concurrent processing** - Multiple consumers process jobs in parallel 1301 | - **Backpressure handling** - GenStage ensures consumers aren't overwhelmed 1302 | - **Race condition prevention** - Database locking ensures each job runs exactly once 1303 | - **Delayed jobs** - Support for scheduling jobs to run later 1304 | - **Error tracking** - Failed jobs are logged with error messages 1305 | 1306 | And the beautiful part? GenStage handles all the complex coordination. We just focus on the business logic of our jobs. 1307 | 1308 | ## Testing Our Job System 1309 | 1310 | Let's create a simple job to test our system. 
Add this to `lib/job_processor/test_job.ex`: 1311 | 1312 | ```elixir 1313 | defmodule JobProcessor.TestJob do 1314 | require Logger 1315 | 1316 | def hello(name) do 1317 | Logger.info("Hello, #{name}!") 1318 | Process.sleep(1000) # Simulate some work 1319 | "Greeted #{name}" 1320 | end 1321 | 1322 | def failing_job do 1323 | Logger.info("This job will fail...") 1324 | raise "Intentional failure for testing" 1325 | end 1326 | 1327 | def heavy_job(duration_ms) do 1328 | Logger.info("Starting heavy job for #{duration_ms}ms") 1329 | Process.sleep(duration_ms) 1330 | Logger.info("Heavy job completed") 1331 | "Completed heavy work" 1332 | end 1333 | end 1334 | ``` 1335 | 1336 | Now you can queue jobs from the console: 1337 | 1338 | ```elixir 1339 | iex -S mix 1340 | 1341 | # Queue a simple job 1342 | JobProcessor.JobQueue.enqueue(JobProcessor.TestJob, :hello, ["World"]) 1343 | 1344 | # Queue a failing job (to test retry logic) 1345 | JobProcessor.JobQueue.enqueue(JobProcessor.TestJob, :failing_job, []) 1346 | 1347 | # Queue multiple jobs to see parallel processing 1348 | for i <- 1..10 do 1349 | JobProcessor.JobQueue.enqueue(JobProcessor.TestJob, :hello, ["Person #{i}"]) 1350 | end 1351 | ``` 1352 | 1353 | Watch the logs to see jobs being processed, failures being retried, and the natural load balancing across multiple consumers. 1354 | 1355 | We've transformed our simple counter example into a production-ready job processing system. The core GenStage concepts remained the same, but now we're processing real work with persistence, error handling, and retry logic. 1356 | 1357 | # Bringing It All Together: From Tutorial to Production 1358 | 1359 | Our tutorial system works well, but production systems need additional sophistication. Here's where you'd take this next: 1360 | 1361 | ## Multiple Job Types with Dedicated Queues 1362 | 1363 | Real applications have different types of work with different characteristics: 1364 | 1365 | ```elixir 1366 | # High-priority user-facing jobs 1367 | JobQueue.enqueue(:email_queue, Mailer, :send_welcome_email, [user_id]) 1368 | 1369 | # Background data processing 1370 | JobQueue.enqueue(:analytics_queue, Analytics, :process_events, [batch_id]) 1371 | 1372 | # Heavy computational work 1373 | JobQueue.enqueue(:ml_queue, ModelTrainer, :train_model, [dataset_id]) 1374 | ``` 1375 | 1376 | Each queue gets its own producer, consumer pool, and configuration: 1377 | 1378 | ```elixir 1379 | defmodule JobProcessor.QueueSupervisor do 1380 | use Supervisor 1381 | 1382 | def start_link(init_arg) do 1383 | Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__) 1384 | end 1385 | 1386 | def init(_init_arg) do 1387 | children = [ 1388 | # Email queue - fast, lightweight 1389 | queue_spec(:email_queue, max_consumers: 5, max_demand: 1), 1390 | 1391 | # Analytics queue - batch processing 1392 | queue_spec(:analytics_queue, max_consumers: 3, max_demand: 100), 1393 | 1394 | # ML queue - heavy computation 1395 | queue_spec(:ml_queue, max_consumers: 1, max_demand: 1) 1396 | ] 1397 | 1398 | Supervisor.init(children, strategy: :one_for_one) 1399 | end 1400 | 1401 | defp queue_spec(queue_name, opts) do 1402 | %{ 1403 | id: :"#{queue_name}_supervisor", 1404 | start: {JobProcessor.QueueManager, :start_link, [queue_name, opts]}, 1405 | type: :supervisor 1406 | } 1407 | end 1408 | end 1409 | ``` 1410 | 1411 | ## Dynamic Consumer Scaling 1412 | 1413 | Scale consumers based on queue depth and system load: 1414 | 1415 | ```elixir 1416 | defmodule JobProcessor.AutoScaler do 1417 | use 
GenServer 1418 | # Sketch: start_link/1, schedule_check/0, scale_up/2 and scale_down/2 are omitted here 1419 | def init(queue_name) do 1420 | schedule_check() 1421 | {:ok, %{queue: queue_name, consumers: [], target_consumers: 2}} 1422 | end 1423 | 1424 | def handle_info(:check_scaling, state) do 1425 | queue_depth = JobQueue.queue_depth(state.queue) 1426 | current_consumers = length(state.consumers) 1427 | 1428 | target = calculate_target_consumers(queue_depth, current_consumers) 1429 | 1430 | new_state = 1431 | cond do 1432 | target > current_consumers -> scale_up(state, target - current_consumers) 1433 | target < current_consumers -> scale_down(state, current_consumers - target) 1434 | true -> state 1435 | end 1436 | 1437 | schedule_check() 1438 | {:noreply, new_state} 1439 | end 1440 | 1441 | defp calculate_target_consumers(queue_depth, current) do 1442 | cond do 1443 | queue_depth > 1000 -> min(current + 2, 10) 1444 | queue_depth > 100 -> min(current + 1, 10) 1445 | queue_depth < 10 -> max(current - 1, 1) 1446 | true -> current 1447 | end 1448 | end 1449 | end 1450 | ``` 1451 | 1452 | ## Worker Registries and Health Monitoring 1453 | 1454 | Track worker health and performance: 1455 | 1456 | ```elixir 1457 | defmodule JobProcessor.WorkerRegistry do 1458 | use GenServer 1459 | require Logger 1460 | def start_link(_) do 1461 | GenServer.start_link(__MODULE__, %{workers: %{}}, name: __MODULE__) 1462 | end 1463 | def init(state), do: {:ok, state} 1464 | def register_worker(queue, pid, metadata \\ %{}) do 1465 | GenServer.cast(__MODULE__, {:register, queue, pid, metadata}) 1466 | end 1467 | 1468 | def get_workers(queue) do 1469 | GenServer.call(__MODULE__, {:get_workers, queue}) 1470 | end 1471 | 1472 | def get_worker_stats do 1473 | GenServer.call(__MODULE__, :get_stats) 1474 | end 1475 | # handle_call clauses for :get_workers and :get_stats are omitted for brevity 1476 | def handle_cast({:register, queue, pid, metadata}, state) do 1477 | Process.monitor(pid) 1478 | 1479 | worker_info = %{ 1480 | pid: pid, 1481 | queue: queue, 1482 | started_at: DateTime.utc_now(), 1483 | jobs_processed: 0, 1484 | last_job_at: nil, 1485 | metadata: metadata 1486 | } 1487 | 1488 | new_workers = Map.put(state.workers, pid, worker_info) 1489 | {:noreply, %{state | workers: new_workers}} 1490 | end 1491 | 1492 | def handle_info({:DOWN, _ref, :process, pid, reason}, state) do 1493 | Logger.warning("Worker #{inspect(pid)} died: #{inspect(reason)}") 1494 | new_workers = Map.delete(state.workers, pid) 1495 | {:noreply, %{state | workers: new_workers}} 1496 | end 1497 | end 1498 | ``` 1499 | 1500 | ## Advanced Error Handling 1501 | 1502 | Circuit breakers for failing job types: 1503 | 1504 | ```elixir 1505 | defmodule JobProcessor.CircuitBreaker do 1506 | use GenServer 1507 | # Sketch: init/1, the :success and :failure handle_cast clauses, and the circuit helpers are omitted 1508 | def should_process_job?(job_type) do 1509 | GenServer.call(__MODULE__, {:should_process, job_type}) 1510 | end 1511 | 1512 | def record_success(job_type) do 1513 | GenServer.cast(__MODULE__, {:success, job_type}) 1514 | end 1515 | 1516 | def record_failure(job_type, error) do 1517 | GenServer.cast(__MODULE__, {:failure, job_type, error}) 1518 | end 1519 | 1520 | def handle_call({:should_process, job_type}, _from, state) do 1521 | circuit_state = Map.get(state.circuits, job_type, :closed) 1522 | 1523 | case circuit_state do 1524 | :closed -> {:reply, true, state} 1525 | :open -> 1526 | if circuit_should_retry?(state, job_type) do 1527 | {:reply, true, transition_to_half_open(state, job_type)} 1528 | else 1529 | {:reply, false, state} 1530 | end 1531 | :half_open -> {:reply, true, state} 1532 | end 1533 | end 1534 | end 1535 | ``` 1536 | 1537 | ## Dead Letter Queues 1538 | 1539 | Handle permanently failed jobs: 1540 | 1541 | ```elixir 1542 | defmodule
JobProcessor.DeadLetterQueue do 1543 | def handle_permanent_failure(job, final_error) do 1544 | dead_job = %{ 1545 | original_job: job, 1546 | failed_at: DateTime.utc_now(), 1547 | final_error: final_error, 1548 | attempt_history: job.attempt_history || [], 1549 | forensics: collect_forensics(job) 1550 | } 1551 | 1552 | Repo.insert(%DeadJob{data: dead_job}) # assumes a DeadJob Ecto schema with a map :data column 1553 | JobProcessor.Notifications.send_dead_letter_alert(dead_job) 1554 | end 1555 | 1556 | defp collect_forensics(job) do 1557 | %{ 1558 | system_load: :erlang.statistics(:run_queue), # total run queue length is a cheap load signal 1559 | memory_usage: :erlang.memory(), 1560 | queue_depths: JobQueue.all_queue_depths(), 1561 | recent_errors: JobProcessor.ErrorTracker.recent_errors(job.module) 1562 | } 1563 | end 1564 | end 1565 | ``` 1566 | 1567 | ## Observability 1568 | 1569 | Comprehensive monitoring with telemetry: 1570 | 1571 | ```elixir 1572 | defmodule JobProcessor.Telemetry do 1573 | def setup do 1574 | events = [ 1575 | [:job_processor, :job, :start], 1576 | [:job_processor, :job, :stop], 1577 | [:job_processor, :job, :exception], 1578 | [:job_processor, :queue, :depth] 1579 | ] 1580 | 1581 | :telemetry.attach_many("job-processor-metrics", events, &handle_event/4, nil) 1582 | end 1583 | 1584 | def handle_event([:job_processor, :job, :stop], measurements, metadata, _config) do 1585 | JobProcessor.Metrics.record_job_duration(metadata.queue, measurements.duration) 1586 | JobProcessor.Metrics.increment_jobs_completed(metadata.queue) 1587 | JobProcessor.Metrics.record_job_success(metadata.module, metadata.function) 1588 | end 1589 | 1590 | # A catch-all clause keeps the other attached events from crashing (and detaching) this handler 1591 | def handle_event(_event, _measurements, _metadata, _config), do: :ok 1592 | end 1593 | ``` 1594 | 1595 | GenStage's demand-driven architecture naturally handles backpressure, load balancing, and fault isolation. These production patterns build on that foundation, giving you the tools to run job processing at scale. The same principles that made our tutorial system work - processes, supervision, and message passing - scale to enterprise deployments. 1596 | --------------------------------------------------------------------------------