├── .formatter.exs
├── .gitignore
└── README.md

/.formatter.exs:
--------------------------------------------------------------------------------
1 | [
2 |   import_deps: [:ecto, :ecto_sql, :phoenix],
3 |   subdirectories: ["priv/*/migrations"],
4 |   plugins: [Phoenix.LiveView.HTMLFormatter],
5 |   inputs: ["*.{heex,ex,exs}", "{config,lib,test}/**/*.{heex,ex,exs}", "priv/*/seeds.exs"]
6 | ]
7 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # The directory Mix will write compiled artifacts to.
2 | /_build/
3 | 
4 | # If you run "mix test --cover", coverage assets end up here.
5 | /cover/
6 | 
7 | # The directory Mix downloads your dependencies sources to.
8 | /deps/
9 | 
10 | # Where 3rd-party dependencies like ExDoc output generated docs.
11 | /doc/
12 | 
13 | # Ignore .fetch files in case you like to edit your project deps locally.
14 | /.fetch
15 | 
16 | # If the VM crashes, it generates a dump, let's ignore it too.
17 | erl_crash.dump
18 | 
19 | # Also ignore archive artifacts (built via "mix archive.build").
20 | *.ez
21 | 
22 | # Temporary files, for example, from tests.
23 | /tmp/
24 | 
25 | # Ignore package tarball (built via "mix hex.build").
26 | fucking_dave-*.tar
27 | 
28 | # Ignore assets that are produced by build tools.
29 | /priv/static/assets/
30 | 
31 | # Ignore digested assets cache.
32 | /priv/static/cache_manifest.json
33 | 
34 | # In case you use Node.js/npm, you want to ignore these.
35 | npm-debug.log
36 | /assets/node_modules/
37 | 
38 | *.sw*
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Writing A Job Runner (In Elixir) (Again) (10 years later)
2 | Ten years ago, [I wrote a job runner in Elixir after some inspiration from Jose](https://github.com/ybur-yug/genstage_tutorial/blob/master/README.md)
3 | 
4 | This is an update on that post.
5 | 
6 | Almost no code has changed, but I wrote it up a lot better and added some more detail.
7 | 
8 | I find it wildly amusing that it held up this well, and I felt like re-sharing it with everyone to see if someone with fresh eyes might get some enjoyment or learn a bit from it.
9 | 
10 | ### I also take things quite a bit further
11 | 
12 | 
13 | ## Who is this for?
14 | Are you curious?
15 | 
16 | If you know a little bit of Elixir, this is a great "levelling up" piece.
17 | 
18 | If you're seasoned, it might be fun to implement yourself if you never have.
19 | 
20 | If you don't know Elixir, it will hopefully be an interesting case study and sales pitch.
21 | 
22 | Anyone with a Claude or OpenAI subscription can easily follow along knowing no Elixir.
23 | 
24 | ## Work?
25 | Applications must do work. This is true of just about any program that reaches a sufficient size. In order to do that work, sometimes it's desirable to have it happen *elsewhere*. If you have built software, you have probably needed a background job.
26 | 
27 | In this situation, you are fundamentally using code to run other code. Erlang has a nice format for this, called the Erlang Term Format: it can store code and data in a way that can be passed around and run by other nodes. We are going to examine doing this in Elixir with "tools in the shed". We will have a single dependency called `gen_stage` that is built and maintained by the language's creator, Jose Valim.
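To make that concrete before any GenStage appears, here is a tiny, hypothetical sketch (the `Greeter` module is a placeholder) of the trick this whole post leans on: an `{module, function, args}` tuple can be turned into bytes with the Erlang Term Format, stored or shipped anywhere, and applied later:

```elixir
# "Call Greeter.hello("Dave") later, somewhere else" - serialized to plain bytes
payload = :erlang.term_to_binary({Greeter, :hello, ["Dave"]})

# ...the binary can sit in a database column, cross the network, or land on another node...

{module, function, args} = :erlang.binary_to_term(payload)
apply(module, function, args)
```

This is exactly what our job queue will do later, just with a table in front of it.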
28 | 
29 | For beginners, we will first cover a bit about Elixir and what it offers that might make this appealing.
30 | 
31 | ## The Landscape of Job Processing
32 | 
33 | In Ruby, you might reach for Sidekiq. It's battle-tested, using Redis for storage and threads for concurrency. Jobs are JSON objects, workers pull from queues, and if something crashes, you hope your monitoring catches it. It works well until you need to scale beyond a single Redis instance or handle complex job dependencies.
34 | 
35 | Python developers often turn to Celery. It's more distributed by design, supporting multiple brokers and result backends. But the complexity shows - you're configuring RabbitMQ, dealing with serialization formats, and debugging issues across multiple moving parts. When a worker dies mid-job, recovery depends on how well you've configured acknowledgments and retries.
36 | 
37 | Go developers might use machinery or asynq, leveraging goroutines for concurrency. The static typing helps catch errors early, but you're still manually managing worker pools and carefully handling panics to prevent the whole process from dying.
38 | 
39 | Each solution reflects its language's strengths and limitations. They all converge on similar patterns: a persistent queue, worker processes, and lots of defensive programming. What if the language itself provided better primitives for this problem?
40 | 
41 | # Thinking About Job Runners, Producers, Consumers, and Events
42 | 
43 | ## The Architecture of Work
44 | 
45 | At its core, a job runner is a meta-concept. It is code that runs code. There will always be work to be done in any given system that has users. But ensuring work gets done when it cannot be handled in a blocking, synchronous manner (where you have the time to await results) is much harder than it looks. The devil is in these details. How do you handle failure? What is our plan when we have a situation that could overwhelm our worker pool? We will seek out answers to these questions as we dive in.
46 | 
47 | GenStage answers the questions we have asked so far, in general, with demand-driven architecture. Instead of pushing work out, workers pull when *they* are ready. This inversion becomes a very elegant abstraction in practice.
48 | 
49 | ## Understanding Producer-Consumer Patterns
50 | 
51 | The producer-consumer pattern isn't unique to Elixir. It's a fundamental pattern in distributed systems:
52 | 
53 | **In Apache Spark**, RDDs (Resilient Distributed Datasets) flow through transformations. Each transformation is essentially a consumer of the previous stage and a producer for the next. Spark handles backpressure through its task scheduler - if executors are busy, new tasks wait.
54 | 
55 | **In Kafka Streams**, topics act as buffers between producers and consumers. Consumers track their offset, pulling messages at their own pace. The broker handles persistence and replication.
56 | 
57 | **In Go channels**, goroutines communicate through typed channels. A goroutine blocks when sending to a full channel or receiving from an empty one. This provides natural backpressure but requires careful capacity planning.
58 | 
59 | GenStage takes a different approach. There are no intermediate buffers or brokers. Producers and consumers negotiate directly:
60 | 
61 | 1. Consumer asks producer for work (specifying how much it can handle)
62 | 2. Producer responds with up to that many events
63 | 3. Consumer processes events and asks for more
64 | 
65 | This creates a pull-based system with automatic flow control.
No queues filling up, no brokers to manage, no capacity planning. The system self-regulates based on actual processing speed. 66 | 67 | ## What We're Actually Building 68 | 69 | ### Why Elixir Works for Job Processing 70 | 71 | **Processes are the unit of concurrency.** Not threads, not coroutines - processes. Each process has its own heap, runs concurrently, and can't corrupt another's memory. Starting one is measured in microseconds and takes about 2KB of memory. You don't manage a pool of workers; you spawn a process per job. 72 | 73 | **Failure is isolated by default.** When a process crashes, it dies alone. No corrupted global state, no locked mutexes, no zombie threads. The supervisor sees the death, logs it, and starts a fresh process. Your job processor doesn't need defensive try-catch blocks everywhere - it needs a good supervision tree. 74 | 75 | **Message passing is the only way to communicate.** No shared memory means no locks, no race conditions, no memory barriers. A process either receives a message or it doesn't. This constraint simplifies concurrent programming dramatically - you can reason about each process in isolation. 76 | 77 | **The scheduler handles fairness.** The BEAM VM runs its own scheduler, preemptively switching between processes every 2000 reductions. One process can't starve others by hogging the CPU. This is why Phoenix can handle millions of WebSocket connections - each connection is just another lightweight process. 78 | 79 | **Distribution is built-in.** Connect nodes with one function call. Send messages across the network with the same syntax as local messages. The Erlang Term Format serializes any data structure, including function references. Your job queue can span multiple machines without changing the core logic. 80 | 81 | **Hot code reloading works.** Deploy new code without stopping the system. The BEAM can run two versions of a module simultaneously, migrating processes gracefully. Your job processor can be upgraded while it's processing jobs. 82 | 83 | **Introspection is exceptional.** Connect to a running system and inspect any process. See its message queue, memory usage, current function. The observer GUI shows your entire system's health in real-time. When production misbehaves, you can debug it live. 84 | 85 | These aren't features bolted on top - they're fundamental to how the BEAM VM works. When you build a job processor in Elixir, you're not fighting the language to achieve reliability and concurrency. You're using it as designed. 86 | 87 | Our job runner will have three core components: 88 | 89 | **Producers** - These generate or fetch work. In our case, they'll pull jobs from a database table. A producer doesn't decide who gets work - it simply responds to demand. When a consumer asks for 10 jobs, the producer queries the database for 10 unclaimed jobs and returns them. 90 | 91 | **Consumers** - These execute jobs. Each consumer is a separate Elixir process, isolated from others. When a consumer is ready for work, it asks its producer for events. After processing, it asks for more. If a consumer crashes while processing a job, only that job is affected. 92 | 93 | **Events** - The unit of work flowing through the system. In GenStage, everything is an event. For our job runner, an event is a job to be executed. Events flow from producers to consumers based on demand, never faster than consumers can handle. 
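As a preview, here is a rough skeleton of how those three roles are declared (module names are placeholders, and we will build real versions shortly). The producer-consumer is included only to show the shape of a middle stage:

```elixir
defmodule MyPipeline.Producer do
  use GenStage
  def init(state), do: {:producer, state}
  # Return up to `demand` events when a consumer asks
  def handle_demand(_demand, state), do: {:noreply, [], state}
end

defmodule MyPipeline.Transformer do
  use GenStage
  def init(state), do: {:producer_consumer, state, subscribe_to: [MyPipeline.Producer]}
  # Receive events from upstream, transform them, emit them downstream
  def handle_events(events, _from, state), do: {:noreply, events, state}
end

defmodule MyPipeline.Consumer do
  use GenStage
  def init(state), do: {:consumer, state, subscribe_to: [MyPipeline.Transformer]}
  # Receive events and do the actual work; consumers never emit
  def handle_events(_events, _from, state), do: {:noreply, [], state}
end
```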
94 | 95 | ## The Beauty of Modeling Everything as Events 96 | 97 | When you model work as events, powerful patterns emerge: 98 | 99 | **Composition** - You can chain stages together. A consumer can also be a producer for another stage. Want to add a step that enriches jobs before execution? Insert a producer-consumer between your current stages. 100 | 101 | **Fan-out/Fan-in** - One producer can feed multiple consumers (fan-out). Multiple producers can feed one consumer (fan-in). The demand mechanism ensures fair distribution. 102 | 103 | **Buffering** - Need a buffer? Add a producer-consumer that accumulates events before passing them on. The buffer only fills as fast as downstream consumers can drain it. 104 | 105 | **Filtering** - A producer-consumer can selectively forward events. Only want to process high-priority jobs? Filter them in a middle stage. 106 | 107 | ``` 108 | Event Flow Pipeline: Social Media Processing 109 | 110 | [BlueSky] ──┐ 111 | [Twitter] ──┼──→ [Producer] ═══→ [ProducerConsumer] ═══→ [Consumer] ──→ [Database] 112 | [TikTok] ──┘ (transformation) 113 | 114 | Flow: Social media posts → Producer → Transformation → Consumer → Storage 115 | ``` 116 | 117 | ## Why This Matters for Job Processing 118 | 119 | Traditional job processors push jobs into queues. Workers poll these queues, hoping to grab work. This creates several problems: 120 | 121 | 1. **Queue overflow** - Producers can overwhelm the queue if consumers are slow 122 | 2. **Unfair distribution** - Fast workers might grab all the work 123 | 3. **Visibility** - Hard to see where bottlenecks are 124 | 4. **Error handling** - What happens to in-flight jobs when a worker dies? 125 | 126 | GenStage's demand-driven model solves these elegantly: 127 | 128 | 1. **No overflow** - Producers only generate what's demanded 129 | 2. **Fair distribution** - Each consumer gets what it asks for 130 | 3. **Clear bottlenecks** - Slow stages naturally build up demand 131 | 4. **Clean errors** - Crashed consumers simply stop demanding; their work remains unclaimed 132 | 133 | This isn't theoretical. Telecom systems have used these patterns for decades. When you make a phone call, switches don't push calls through the network - each hop pulls when ready. This prevents network overload even during disasters when everyone tries to call at once. 134 | 135 | We're applying the same battle-tested patterns to job processing. The result is a system that's naturally resilient, self-balancing, and surprisingly simple to reason about. 136 | 137 | Ready to see how this translates to code? Let's build our first producer. 138 | 139 | # Building the Foundation 140 | 141 | ## Step 1: Creating Your Phoenix Project 142 | 143 | Let's start fresh with a new Phoenix project. Open your terminal and run: 144 | 145 | ``` 146 | mix phx.new job_processor --live 147 | cd job_processor 148 | ``` 149 | 150 | We're keeping it lean - no dashboard or mailer for now. When prompted to install dependencies, say yes. 151 | 152 | Why Phoenix? We're not building a web app, but Phoenix gives us: 153 | - A supervision tree already set up 154 | - Configuration management 155 | - A database connection (Ecto) 156 | - LiveView for our monitoring dashboard (later) 157 | 158 | Think of Phoenix as our application framework, not just a web framework. 
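Concretely, the "supervision tree already set up" part lives in `lib/job_processor/application.ex`, and that file is where our stages will eventually go. It looks roughly like this (the exact children vary a bit by Phoenix version):

```elixir
defmodule JobProcessor.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      JobProcessorWeb.Telemetry,
      JobProcessor.Repo,
      {Phoenix.PubSub, name: JobProcessor.PubSub},
      # ...other generated children...
      JobProcessorWeb.Endpoint
    ]

    # Everything in this app, including the stages we add later, hangs off this supervisor
    opts = [strategy: :one_for_one, name: JobProcessor.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```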
159 | 160 | ## Step 2: Adding GenStage 161 | 162 | Open `mix.exs` and add GenStage to your dependencies: 163 | 164 | ```elixir 165 | defp deps do 166 | [ 167 | {:phoenix, "~> 1.7.12"}, 168 | {:phoenix_ecto, "~> 4.5"}, 169 | {:ecto_sql, "~> 3.11"}, 170 | {:postgrex, ">= 0.0.0"}, 171 | {:phoenix_html, "~> 4.1"}, 172 | {:phoenix_live_reload, "~> 1.2", only: :dev}, 173 | {:phoenix_live_view, "~> 0.20.14"}, 174 | {:telemetry_metrics, "~> 0.6"}, 175 | {:telemetry_poller, "~> 1.0"}, 176 | {:jason, "~> 1.2"}, 177 | {:bandit, "~> 1.2"}, 178 | 179 | # Add this line 180 | {:gen_stage, "~> 1.2"} 181 | ] 182 | end 183 | ``` 184 | 185 | Now fetch the dependency: 186 | 187 | ``` 188 | mix deps.get 189 | ``` 190 | 191 | That's it. One dependency. GenStage is maintained by the Elixir core team, so it follows the same design principles as the language itself. 192 | 193 | ## Step 3: Understanding GenStage's Mental Model 194 | 195 | Before we write code, let's cement the mental model. GenStage orchestrates three types of processes: 196 | 197 | **Producers** emit events. They don't push events anywhere - they hold them until a consumer asks. Think of a producer as a lazy river of data. The water (events) only flows when someone downstream opens a valve (demands). 198 | 199 | **Consumers** receive events. They explicitly ask producers for a specific number of events. This is the key insight: consumers control the flow rate, not producers. 200 | 201 | **Producer-Consumers** do both. They receive events from upstream, transform them, and emit to downstream. Perfect for building pipelines. 202 | 203 | Every GenStage process follows this lifecycle: 204 | 1. Start and connect to other stages 205 | 2. Consumer sends demand upstream 206 | 3. Producer receives demand and emits events 207 | 4. Consumer receives and processes events 208 | 5. Repeat from step 2 209 | 210 | The demand mechanism is what makes this special. In a traditional queue, you might have: 211 | 212 | ```elixir 213 | # Traditional approach - producer decides when to push 214 | loop do 215 | job = create_job() 216 | Queue.push(job) # What if queue is full? 217 | end 218 | ``` 219 | 220 | With GenStage: 221 | 222 | ```elixir 223 | # GenStage approach - consumer decides when to pull 224 | def handle_demand(demand, state) do 225 | jobs = create_jobs(demand) # Only create what's asked for 226 | {:noreply, jobs, state} 227 | end 228 | ``` 229 | 230 | The consumer is in control. It's impossible to overwhelm a consumer because it only gets what it asked for. 231 | 232 | ## Step 4: Creating the Producer 233 | 234 | Now for the meat of it. Let's build a producer that understands our job processing needs. Create a new file at `lib/job_processor/producer.ex`: 235 | 236 | ```elixir 237 | defmodule JobProcessor.Producer do 238 | use GenStage 239 | require Logger 240 | 241 | @doc """ 242 | Starts the producer with an initial state. 243 | 244 | The state can be anything, but we'll use a counter to start simple. 
245 | """ 246 | def start_link(initial \\ 0) do 247 | GenStage.start_link(__MODULE__, initial, name: __MODULE__) 248 | end 249 | 250 | @impl true 251 | def init(counter) do 252 | Logger.info("Producer starting with counter: #{counter}") 253 | {:producer, counter} 254 | end 255 | 256 | @impl true 257 | def handle_demand(demand, state) do 258 | Logger.info("Producer received demand for #{demand} events") 259 | 260 | # Generate events to fulfill demand 261 | events = Enum.to_list(state..(state + demand - 1)) 262 | 263 | # Update our state 264 | new_state = state + demand 265 | 266 | # Return events and new state 267 | {:noreply, events, new_state} 268 | end 269 | end 270 | ``` 271 | 272 | Let's dissect this line by line: 273 | 274 | **`use GenStage`** - This macro brings in the GenStage behavior. It's like `use GenServer` but for stages. It requires us to implement certain callbacks. 275 | 276 | **`start_link/1`** - Standard OTP pattern. We name the process after its module so we can find it easily. In production, you might want multiple producers, so you'd make the name configurable. 277 | 278 | **`init/1`** - The crucial part: `{:producer, counter}`. The first element declares this as a producer. The second is our initial state. GenStage now knows this process will emit events when asked. 279 | 280 | **`handle_demand/2`** - The heart of a producer. This callback fires when consumers ask for events. The arguments are: 281 | - `demand` - How many events the consumer wants 282 | - `state` - Our current state 283 | 284 | The return value `{:noreply, events, new_state}` means: 285 | - `:noreply` - We're responding to demand, not a synchronous call 286 | - `events` - The list of events to emit (must be a list) 287 | - `new_state` - Our updated state 288 | 289 | ### The Demand Buffer 290 | 291 | Here's something subtle but important: GenStage maintains an internal demand buffer. If multiple consumers ask for events before you can fulfill them, GenStage aggregates the demand. 292 | 293 | For example: 294 | 1. Consumer A asks for 10 events 295 | 2. Consumer B asks for 5 events 296 | 3. Your `handle_demand/2` receives demand for 15 events 297 | 298 | This batching is efficient and prevents your producer from being called repeatedly for small demands. 299 | 300 | ### What if You Can't Fulfill Demand? 301 | 302 | Sometimes you can't produce as many events as demanded. That's fine: 303 | 304 | ```elixir 305 | def handle_demand(demand, state) do 306 | available = calculate_available_work() 307 | 308 | if available >= demand do 309 | events = fetch_events(demand) 310 | {:noreply, events, state} 311 | else 312 | # Can only partially fulfill demand 313 | events = fetch_events(available) 314 | {:noreply, events, state} 315 | end 316 | end 317 | ``` 318 | 319 | GenStage tracks unfulfilled demand. If you return fewer events than demanded, it remembers. The next time you have events available, you can emit them even without new demand: 320 | 321 | ```elixir 322 | def handle_info(:new_data_available, state) do 323 | events = fetch_available_events() 324 | {:noreply, events, state} 325 | end 326 | ``` 327 | 328 | ### Producer Patterns 329 | 330 | Our simple counter producer is just the beginning. 
Real-world producers follow several patterns:
331 | 
332 | **Database Polling Producer:**
333 | ```elixir
334 | def handle_demand(demand, state) do
335 |   # Claim rows inside a transaction so the SKIP LOCKED row locks hold until the update
336 |   {:ok, jobs} =
337 |     Repo.transaction(fn ->
338 |       jobs =
339 |         Repo.all(
340 |           from j in Job, where: j.status == "pending", limit: ^demand, lock: "FOR UPDATE SKIP LOCKED"
341 |         )
342 | 
343 |       job_ids = Enum.map(jobs, & &1.id)
344 |       Repo.update_all(from(j in Job, where: j.id in ^job_ids), set: [status: "processing"])
345 | 
346 |       jobs
347 |     end)
348 | 
349 |   {:noreply, jobs, state}
350 | end
351 | ```
352 | 
353 | **Rate-Limited Producer:**
354 | ```elixir
355 | def handle_demand(demand, %{rate_limit: limit} = state) do
356 |   now = System.monotonic_time(:millisecond)
357 |   time_passed = now - state.last_emit
358 | 
359 |   allowed = min(demand, div(time_passed * limit, 1000))
360 | 
361 |   if allowed > 0 do
362 |     events = generate_events(allowed)
363 |     {:noreply, events, %{state | last_emit: now}}
364 |   else
365 |     # Schedule retry
366 |     Process.send_after(self(), :retry_demand, 100)
367 |     {:noreply, [], state}
368 |   end
369 | end
370 | ```
371 | 
372 | **Buffering Producer:**
373 | ```elixir
374 | def handle_demand(demand, %{buffer: buffer} = state) do
375 |   {to_emit, remaining} = Enum.split(buffer, demand)
376 | 
377 |   if length(to_emit) < demand do
378 |     # Buffer exhausted, try to refill
379 |     new_events = fetch_more_events()
380 |     all_events = to_emit ++ new_events
381 |     {to_emit_now, to_buffer} = Enum.split(all_events, demand)
382 |     {:noreply, to_emit_now, %{state | buffer: to_buffer}}
383 |   else
384 |     {:noreply, to_emit, %{state | buffer: remaining}}
385 |   end
386 | end
387 | ```
388 | 
389 | ### Testing Your Producer
390 | 
391 | Let's make sure our producer works. Create `test/job_processor/producer_test.exs`:
392 | 
393 | ```elixir
394 | defmodule JobProcessor.ProducerTest do
395 |   use ExUnit.Case
396 |   alias JobProcessor.Producer
397 | 
398 |   test "producer emits events on demand" do
399 |     {:ok, producer} = Producer.start_link(0)
400 |     ref = make_ref()
401 |     # Subscribe and ask for events using GenStage's documented message protocol
402 |     send(producer, {:"$gen_producer", {self(), ref}, {:subscribe, nil, []}})
403 |     send(producer, {:"$gen_producer", {self(), ref}, {:ask, 5}})
404 |     # We should receive 5 events (0 through 4)
405 |     assert_receive {:"$gen_consumer", {_, ^ref}, [0, 1, 2, 3, 4]}
406 |   end
407 | 
408 |   test "producer maintains state across demands" do
409 |     {:ok, producer} = Producer.start_link(10)
410 |     ref = make_ref()
411 |     send(producer, {:"$gen_producer", {self(), ref}, {:subscribe, nil, []}})
412 |     # First demand
413 |     send(producer, {:"$gen_producer", {self(), ref}, {:ask, 3}})
414 |     assert_receive {:"$gen_consumer", {_, ^ref}, [10, 11, 12]}
415 |     # Second demand should continue from where we left off
416 |     send(producer, {:"$gen_producer", {self(), ref}, {:ask, 2}})
417 |     assert_receive {:"$gen_consumer", {_, ^ref}, [13, 14]}
418 |   end
419 | end
420 | ```
421 | 
422 | Run the tests with `mix test`.
423 | 
424 | ### The Power of Stateful Producers
425 | 
426 | Our producer maintains state - a simple counter. But state can be anything:
427 | 
428 | - A database connection for polling
429 | - A buffer of pre-fetched events
430 | - Rate limiting information
431 | - Metrics and telemetry data
432 | 
433 | Because each producer is just an Erlang process, it's isolated. If one producer crashes, others continue. The supervisor restarts the crashed producer with a fresh state.
434 | 
435 | This is different from thread-based systems where shared state requires locks. Each producer owns its state exclusively. No locks, no race conditions, no defensive programming.
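As one more illustration of that last bullet - metrics living right in the producer's state - here is a hedged sketch (it assumes the state is a map rather than the bare counter we used, and the event name is made up) of a producer that counts what it has served and reports each demand via `:telemetry`, which a Phoenix app already has available:

```elixir
def handle_demand(demand, %{counter: counter, served: served} = state) do
  events = Enum.to_list(counter..(counter + demand - 1))

  # Handlers attached to this event can turn it into logs, metrics, or dashboards
  :telemetry.execute([:job_processor, :producer, :served], %{count: demand}, %{})

  {:noreply, events, %{state | counter: counter + demand, served: served + demand}}
end
```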
436 | 437 | ### What We've Built 438 | 439 | Our producer is deceptively simple, but it demonstrates core principles: 440 | 441 | 1. **Demand-driven** - Only produces when asked 442 | 2. **Stateful** - Maintains its own isolated state 443 | 3. **Supervised** - Can crash and restart safely 444 | 4. **Testable** - Easy to verify behavior 445 | 446 | In the next section, we'll build consumers that process these events. But the producer is the foundation - it controls the flow of work through our system. 447 | 448 | # Building A Consumer 449 | 450 | Now that we have a producer emitting events, we need something to consume them. This is where consumers come in - they're the workers that actually process the events flowing through our system. 451 | 452 | But here's the beautiful thing about GenStage consumers: they're not passive recipients waiting for work to be thrown at them. They're active participants in the flow control. A consumer decides how much work it can handle and explicitly asks for that amount. No more, no less. 453 | 454 | Think about how this changes the dynamics. In a traditional message queue, producers blast messages into a queue, hoping consumers can keep up. If consumers fall behind, the queue grows. If consumers are faster than expected, they sit idle waiting for work. It's a constant balancing act with lots of manual tuning. 455 | 456 | GenStage flips this completely. Consumers know their own capacity better than anyone else. They know if they're currently processing a heavy job, if they're running low on memory, or if they're about to restart. So they ask for exactly what they can handle right now. 457 | 458 | ## The Consumer's Lifecycle 459 | 460 | A GenStage consumer follows a simple but powerful lifecycle: 461 | 462 | 1. **Subscribe** - Connect to one or more producers 463 | 2. **Demand** - Ask for a specific number of events 464 | 3. **Receive** - Get events from producers (never more than requested) 465 | 4. **Process** - Handle each event 466 | 5. **Repeat** - Ask for more events when ready 467 | 468 | The key insight is step 4: processing happens between demands. The consumer processes its current batch completely before asking for more. This creates natural backpressure - slow consumers automatically reduce the flow rate. 469 | 470 | ## Building Our First Consumer 471 | 472 | Let's build a consumer that processes the events from our producer. Create a new file at `lib/job_processor/consumer.ex`: 473 | 474 | ```elixir 475 | defmodule JobProcessor.Consumer do 476 | use GenStage 477 | require Logger 478 | 479 | @doc """ 480 | Starts the consumer. 481 | 482 | Like producers, consumers are just GenServer-like processes. 483 | The state can be anything you need for processing. 
484 | """ 485 | def start_link(opts \\ []) do 486 | GenStage.start_link(__MODULE__, opts) 487 | end 488 | 489 | @impl true 490 | def init(opts) do 491 | # The key difference: we declare ourselves as a :consumer 492 | # and specify which producer(s) to subscribe to 493 | {:consumer, opts, subscribe_to: [JobProcessor.Producer]} 494 | end 495 | 496 | @impl true 497 | def handle_events(events, _from, state) do 498 | Logger.info("Consumer received #{length(events)} events") 499 | 500 | # Process each event 501 | for event <- events do 502 | process_event(event, state) 503 | end 504 | 505 | # Always return {:noreply, [], state} for consumers 506 | # The empty list means we don't emit any events (we're not a producer) 507 | {:noreply, [], state} 508 | end 509 | 510 | defp process_event(event, state) do 511 | # For now, just log what we received 512 | Logger.info("Processing event: #{event}") 513 | IO.inspect({self(), event, state}, label: "Consumer processed") 514 | end 515 | end 516 | ``` 517 | 518 | ## Understanding the Consumer Architecture 519 | 520 | Let's break down what makes this consumer work: 521 | 522 | **`use GenStage`** - Just like producers, consumers use the GenStage behavior. But the callbacks they implement are different. 523 | 524 | **`init/1` returns `{:consumer, state, options}`** - The crucial difference from producers. The first element declares this process as a consumer. The `subscribe_to` option tells GenStage which producers to connect to. 525 | 526 | **`handle_events/3` instead of `handle_demand/2`** - Consumers implement `handle_events/3`, which receives: 527 | - `events` - The list of events to process 528 | - `from` - Which producer sent these events (usually ignored) 529 | - `state` - The consumer's current state 530 | 531 | **The return value `{:noreply, [], state}`** - Consumers don't emit events (that's producers' job), so the events list is always empty. They just process and update their state. 532 | 533 | ## The Magic of Subscription 534 | 535 | Notice the `subscribe_to: [JobProcessor.Producer]` option. This does several important things: 536 | 537 | **Automatic connection** - GenStage handles finding and connecting to the producer. No manual process linking or monitoring. 538 | 539 | **Automatic demand** - The consumer automatically asks the producer for events. By default, it requests batches of up to 1000 events, but you can tune this. 540 | 541 | **Fault tolerance** - If the producer crashes and restarts, the consumer automatically reconnects. If the consumer crashes, it doesn't take down the producer. 542 | 543 | **Flow control** - The consumer won't receive more events than it asks for. If it's slow processing the current batch, no new events arrive until it's ready. 544 | 545 | ## Tuning Consumer Demand 546 | 547 | You can control how many events a consumer requests at once: 548 | 549 | ```elixir 550 | def init(opts) do 551 | {:consumer, opts, 552 | subscribe_to: [ 553 | {JobProcessor.Producer, min_demand: 5, max_demand: 50} 554 | ]} 555 | end 556 | ``` 557 | 558 | **`min_demand`** - Don't ask for more events until we have fewer than this many 559 | **`max_demand`** - Never ask for more than this many events at once 560 | 561 | This creates a buffering effect. The consumer will receive events in batches between min_demand and max_demand, giving you control over throughput vs. latency tradeoffs. 
562 | 
563 | For job processing, you might want smaller batches to reduce memory usage:
564 | 
565 | ```elixir
566 | subscribe_to: [
567 |   {JobProcessor.Producer, min_demand: 1, max_demand: 10}
568 | ]
569 | ```
570 | 
571 | Or larger batches for higher throughput:
572 | 
573 | ```elixir
574 | subscribe_to: [
575 |   {JobProcessor.Producer, min_demand: 100, max_demand: 1000}
576 | ]
577 | ```
578 | 
579 | ## Why This Design Matters
580 | 
581 | The producer-consumer subscription model solves several classic distributed systems problems:
582 | 
583 | **Backpressure** - Slow consumers naturally slow down the entire pipeline. No queues overflow, no memory explosions.
584 | 
585 | **Dynamic scaling** - Add more consumers and they automatically start receiving events. Remove consumers and the remaining ones pick up the slack.
586 | 
587 | **Fault isolation** - A crashing consumer doesn't affect others. A crashing producer can be restarted without losing in-flight work.
588 | 
589 | **Observable performance** - You can see exactly where bottlenecks are by monitoring demand patterns. High accumulated demand = bottleneck downstream.
590 | 
591 | ## Consumer Patterns
592 | 
593 | Real-world consumers follow several common patterns:
594 | 
595 | **Database Writing Consumer:**
596 | ```elixir
597 | def handle_events(events, _from, state) do
598 |   # Batch insert for efficiency
599 |   records = Enum.map(events, &transform_event/1)
600 |   Repo.insert_all(MyTable, records)
601 | 
602 |   {:noreply, [], state}
603 | end
604 | ```
605 | 
606 | **HTTP API Consumer:**
607 | ```elixir
608 | def handle_events(events, _from, state) do
609 |   for event <- events do
610 |     case HTTPoison.post(state.webhook_url, Jason.encode!(event)) do
611 |       {:ok, %{status_code: status}} when status in 200..299 -> :ok
612 |       other -> Logger.error("Webhook failed: #{inspect(other)}")
613 |     end
614 |   end
615 | 
616 |   {:noreply, [], state}
617 | end
618 | ```
619 | 
620 | **File Processing Consumer:**
621 | ```elixir
622 | def handle_events(events, _from, state) do
623 |   for event <- events do
624 |     file_path = "/tmp/processed_#{event.id}.json"
625 |     File.write!(file_path, Jason.encode!(event))
626 |   end
627 | 
628 |   {:noreply, [], state}
629 | end
630 | ```
631 | 
632 | ## Error Handling in Consumers
633 | 
634 | What happens when event processing fails? In traditional queue systems, you need complex retry logic, dead letter queues, and careful state management.
635 | 
636 | With GenStage consumers, it's simpler to reason about. If a consumer crashes while processing events, it dies alone and the supervisor restarts it with a clean state. Note that GenStage itself does not redeliver the in-flight events - which is exactly why our job runner will track job status in the database. A job that was claimed but never completed can be detected and re-queued instead of silently disappearing.
637 | 
638 | For more sophisticated error handling, you can catch exceptions:
639 | 
640 | ```elixir
641 | def handle_events(events, _from, state) do
642 |   for event <- events do
643 |     try do
644 |       process_event(event)
645 |     rescue
646 |       e ->
647 |         Logger.error("Failed to process event #{event.id}: #{inspect(e)}")
648 |         # Could send to dead letter queue, retry later, etc.
649 |     end
650 |   end
651 | 
652 |   {:noreply, [], state}
653 | end
654 | ```
655 | 
656 | But often, letting the process crash and restart is the right approach. It's simple, it clears any corrupted state, and the supervisor handles the restart automatically.
657 | 
658 | # Wiring It Together
659 | 
660 | Now we have both pieces: a producer that emits events and a consumer that processes them. But they're just modules sitting in files. We need to start them as processes and connect them.
661 | 662 | This is where OTP's supervision trees shine. We'll add both processes to our application's supervision tree, and OTP will ensure they start in the right order and restart if they crash. 663 | 664 | Open `lib/job_processor/application.ex` and modify the `start/2` function: 665 | 666 | ```elixir 667 | def start(_type, _args) do 668 | children = [ 669 | # Start the Producer first 670 | JobProcessor.Producer, 671 | 672 | # Then start the Consumer 673 | # The consumer will automatically connect to the producer 674 | JobProcessor.Consumer, 675 | 676 | # Other children like Ecto, Phoenix endpoint, etc. 677 | JobProcessorWeb.Endpoint 678 | ] 679 | 680 | opts = [strategy: :one_for_one, name: JobProcessor.Supervisor] 681 | Supervisor.start_link(children, opts) 682 | end 683 | ``` 684 | 685 | That's it! The supervision tree will: 686 | 687 | 1. Start the producer 688 | 2. Start the consumer 689 | 3. The consumer automatically subscribes to the producer 690 | 4. Events start flowing immediately 691 | 692 | ## Why This Supervision Strategy Works 693 | 694 | The `:one_for_one` strategy means if one process crashes, only that process is restarted. This is perfect for our producer-consumer setup: 695 | 696 | **Producer crashes** - The consumer notices the connection is lost and waits. When the supervisor restarts the producer, the consumer automatically reconnects. 697 | 698 | **Consumer crashes** - The producer keeps running, just stops emitting events. When the supervisor restarts the consumer, it reconnects and processing resumes. 699 | 700 | This is fault isolation in action. Problems in one part of the system don't cascade to other parts. 701 | 702 | ## Testing the Connection 703 | 704 | Let's see our producer and consumer working together. Start the application: 705 | 706 | ``` 707 | mix phx.server 708 | ``` 709 | 710 | You should see logs showing the consumer processing events from the producer. Each event will be displayed with the process ID, event number, and state - something like this: 711 | 712 | ``` 713 | Consumer processed: {#PID<0.234.0>, 0, []} 714 | Consumer processed: {#PID<0.234.0>, 1, []} 715 | Consumer processed: {#PID<0.234.0>, 2, []} 716 | Consumer processed: {#PID<0.234.0>, 3, []} 717 | Consumer processed: {#PID<0.234.0>, 4, []} 718 | ... 719 | ``` 720 | 721 | Notice something important: the same PID processes every event. This is because we have a single consumer. Our counter increments predictably from 0, 1, 2, 3, 4... and all events flow to the same process. 722 | 723 | ## The Single Consumer Scenario 724 | 725 | With one consumer, we get: 726 | - **Predictable ordering** - Events are processed in the exact order they're generated 727 | - **Sequential processing** - Each event is fully processed before the next one begins 728 | - **Simple state management** - Only one process to reason about 729 | - **Potential bottleneck** - If processing is slow, the entire pipeline slows down 730 | 731 | ``` 732 | Single Consumer Pattern: Sequential Processing 733 | 734 | [Producer] ──→ ⓪ ──→ ① ──→ ② ──→ ③ ──→ ④ ──→ ⑤ ──→ [Consumer] 735 | (Emits 0,1,2,3,4...) (Processes Sequentially) 736 | 737 | Timeline: 738 | t0: ████████ Process Event 0 739 | t1: ████████ Process Event 1 740 | t2: ████████ Process Event 2 741 | t3: ░░░░░░░░░░░░░░░░░░░ Events 3,4,5... 
waiting
742 | 
743 | Key Characteristics:
744 | ✓ Predictable ordering - Events processed in exact sequence
745 | ✓ Sequential processing - One event completes before next begins
746 | ✓ Simple state management - Single process to track
747 | ⚠ Potential bottleneck - Slow processing blocks entire pipeline
748 | ```
749 | 
750 | This is perfect for scenarios where order matters or when you're just getting started. But what happens when we add more consumers?
751 | 
752 | ## Scaling to Multiple Consumers
753 | 
754 | Let's see what happens with multiple consumers. Add this to your supervision tree in `lib/job_processor/application.ex`:
755 | 
756 | ```elixir
757 | def start(_type, _args) do
758 |   children = [
759 |     # Start the Producer first
760 |     JobProcessor.Producer,
761 | 
762 |     # Start multiple consumers - each child spec needs a unique id
763 |     Supervisor.child_spec({JobProcessor.Consumer, []}, id: :consumer_1),
764 |     Supervisor.child_spec({JobProcessor.Consumer, []}, id: :consumer_2),
765 |     Supervisor.child_spec({JobProcessor.Consumer, []}, id: :consumer_3),
766 | 
767 |     # Other children
768 |     JobProcessorWeb.Endpoint
769 |   ]
770 | 
771 |   opts = [strategy: :one_for_one, name: JobProcessor.Supervisor]
772 |   Supervisor.start_link(children, opts)
773 | end
774 | ```
775 | 
776 | Now restart your application and watch the logs:
777 | 
778 | ```
779 | Consumer processed: {#PID<0.234.0>, 0, []}
780 | Consumer processed: {#PID<0.234.0>, 1, []}
781 | Consumer processed: {#PID<0.235.0>, 2, []}
782 | Consumer processed: {#PID<0.236.0>, 3, []}
783 | Consumer processed: {#PID<0.234.0>, 4, []}
784 | Consumer processed: {#PID<0.235.0>, 5, []}
785 | ...
786 | ```
787 | 
788 | Notice the different PIDs! Events are now distributed across multiple consumer processes. The distribution depends on which consumer asks for work first and how fast each consumer processes its events.
789 | 
790 | ```
791 | Multiple Consumer Pattern: Parallel Processing with Load Balancing
792 | 
793 |                     ┌──→ Consumer 1 (PID <0.234.0>) ─→ Events: 0, 1, 4...
794 |                     │        ↑ demand
795 | [Producer] ─────────┼──→ Consumer 2 (PID <0.235.0>) ─→ Events: 2, 5, 8...
796 | (First-Come-        │        ↑ demand
797 |  First-Served)      └──→ Consumer 3 (PID <0.236.0>) ─→ Events: 3, 6, 7...
798 |                              ↑ demand
799 | 
800 | Timeline: Parallel Processing
801 | Consumer 1: ██████████ Event 0 ████████ Event 1 ████████████ Event 4
802 | Consumer 2:     ████████████ Event 2          ██████████ Event 5
803 | Consumer 3:   ████████ Event 3      ████████████ Event 6
804 | 
805 | ✓ Benefits:                            ⚠ Challenges:
806 | • Higher throughput - parallel         • No ordering guarantees
807 | • Fault tolerance - others continue    • Shared resource contention
808 | • Natural load balancing               • Debugging complexity
809 | • Better resource utilization          • Potential race conditions
810 | ```
811 | 
812 | ## Understanding Event Distribution
813 | 
814 | GenStage's default dispatcher (DemandDispatcher) uses a "first-come, first-served" approach:
815 | 
816 | 1. Consumer A finishes its current batch and asks for 10 more events
817 | 2. Producer sends events 0-9 to Consumer A
818 | 3. Consumer B asks for 10 events
819 | 4. Producer sends events 10-19 to Consumer B
820 | 5. Consumer A finishes and asks for more, gets events 20-29
821 | 
822 | This creates natural load balancing - faster consumers get more work. If Consumer A is processing heavy jobs slowly, Consumers B and C will pick up the slack.
823 | 824 | ## The Trade-offs 825 | 826 | **Benefits of Multiple Consumers:** 827 | - **Throughput** - More work gets done in parallel 828 | - **Fault tolerance** - If one consumer crashes, others continue 829 | - **Natural load balancing** - Fast consumers get more work 830 | - **Resource utilization** - Better use of multi-core systems 831 | 832 | **Challenges:** 833 | - **No ordering guarantees** - Event 5 might finish before event 3 834 | - **Shared resources** - Multiple consumers might compete for database connections 835 | - **Debugging complexity** - Multiple processes to track 836 | 837 | ## Different Distribution Strategies 838 | 839 | You can change how events are distributed by modifying the producer's dispatcher. Add this to your producer's `init/1` function: 840 | 841 | ```elixir 842 | def init(counter) do 843 | Logger.info("Producer starting with counter: #{counter}") 844 | {:producer, counter, dispatcher: GenStage.BroadcastDispatcher} 845 | end 846 | ``` 847 | 848 | Now restart and watch what happens: 849 | 850 | ``` 851 | Consumer processed: {#PID<0.234.0>, 0, []} 852 | Consumer processed: {#PID<0.235.0>, 0, []} 853 | Consumer processed: {#PID<0.236.0>, 0, []} 854 | Consumer processed: {#PID<0.234.0>, 1, []} 855 | Consumer processed: {#PID<0.235.0>, 1, []} 856 | Consumer processed: {#PID<0.236.0>, 1, []} 857 | ... 858 | ``` 859 | 860 | With BroadcastDispatcher, every consumer receives every event! This is useful for scenarios like: 861 | - Multiple consumers writing to different databases 862 | - One consumer processing events, another collecting metrics 863 | - Broadcasting notifications to multiple systems 864 | 865 | ``` 866 | BroadcastDispatcher: Every Consumer Receives Every Event 867 | 868 | ┌─→ Database Writer (PID <0.234.0>) ─→ Events: 0, 1, 2, 3... 869 | │ 870 | [Producer] ═══━━━━━━┼─→ Metrics Collector (PID <0.235.0>) ─→ Events: 0, 1, 2, 3... 871 | (Broadcasting) │ 872 | └─→ Notification Service (PID <0.236.0>) ─→ Events: 0, 1, 2, 3... 873 | 874 | Timeline: All Consumers Process Same Events Simultaneously 875 | Database Writer: ████████ Event 0 ████████ Event 1 ████████ Event 2 876 | Metrics Collector: ████████ Event 0 ████████ Event 1 ████████ Event 2 877 | Notification Service:████████ Event 0 ████████ Event 1 ████████ Event 2 878 | 879 | 🔄 Broadcasting Use Cases: 880 | • Multiple databases - Each consumer writes to different database 881 | • Parallel processing - One processes data, another collects metrics 882 | • Notification fanout - Broadcasting alerts to multiple services 883 | • Audit trails - Simultaneous logging to multiple destinations 884 | 885 | Key Differences from Load Balancing: 886 | ✓ Every consumer gets EVERY event (no distribution) 887 | ✓ Perfect for parallel processing different aspects 888 | ✓ Higher total throughput but more resource usage 889 | ⚠ N times more processing (N = number of consumers) 890 | ``` 891 | 892 | But we're still just processing numbers. In the next section, we'll replace our simple counter with a real job processing system that can execute arbitrary code. 893 | 894 | # From Toy Examples to Real Job Processing 895 | 896 | We've built a solid foundation with our producer-consumer setup, but we're still just processing incrementing numbers. That's useful for understanding the mechanics, but real job processing needs persistent storage, job queuing, and the ability to execute arbitrary code. 897 | 898 | This is where things get interesting. 
We're going to transform our simple counter into a full job processing system that can serialize function calls, store them in a database, and execute them across multiple workers. Think of it as building your own mini-Sidekiq, but with GenStage's elegant backpressure handling. 899 | 900 | ## Why We Need a Database 901 | 902 | Right now, our producer generates events from memory (a simple counter). But real job processors need persistence for several reasons: 903 | 904 | **Durability** - Jobs shouldn't disappear if the system restarts. When you queue a job to send an email, you expect it to survive server reboots. 905 | 906 | **Coordination** - Multiple producer processes might be running across different servers. They need a shared source of truth for what work exists. 907 | 908 | **Status tracking** - Jobs have lifecycles: queued, running, completed, failed. You need to track this state somewhere. 909 | 910 | **Debugging and monitoring** - When jobs fail, you need to see what went wrong and potentially retry them. 911 | 912 | The database becomes our job queue's persistent storage layer, but GenStage handles all the flow control and distribution logic. 913 | 914 | ## Setting Up Our Job Storage 915 | 916 | Since we're using Phoenix, we already have Ecto configured. But we need to set up our job storage table. The beauty of Elixir's job processing is that we can serialize entire function calls as binary data using the Erlang Term Format. 917 | 918 | Let's create a migration for our jobs table: 919 | 920 | ``` 921 | mix ecto.gen.migration create_jobs 922 | ``` 923 | 924 | Now edit the migration file: 925 | 926 | ```elixir 927 | defmodule JobProcessor.Repo.Migrations.CreateJobs do 928 | use Ecto.Migration 929 | 930 | def change do 931 | create table(:jobs) do 932 | add :status, :string, null: false, default: "queued" 933 | add :payload, :binary, null: false 934 | add :attempts, :integer, default: 0 935 | add :max_attempts, :integer, default: 3 936 | add :scheduled_at, :utc_datetime 937 | add :started_at, :utc_datetime 938 | add :completed_at, :utc_datetime 939 | add :error_message, :text 940 | 941 | timestamps() 942 | end 943 | 944 | create index(:jobs, [:status]) 945 | create index(:jobs, [:scheduled_at]) 946 | create index(:jobs, [:status, :scheduled_at]) 947 | end 948 | end 949 | ``` 950 | 951 | This gives us a robust job storage system: 952 | 953 | - **status** - Track job lifecycle (queued, running, completed, failed) 954 | - **payload** - The serialized function call 955 | - **attempts/max_attempts** - Retry logic 956 | - **scheduled_at** - Support for delayed jobs 957 | - **Timestamps** - Monitor performance and debug issues 958 | 959 | Run the migration: 960 | 961 | ``` 962 | mix ecto.migrate 963 | ``` 964 | 965 | ## Modeling Jobs 966 | 967 | Let's create an Ecto schema for our jobs. 
Create `lib/job_processor/job.ex`: 968 | 969 | ```elixir 970 | defmodule JobProcessor.Job do 971 | use Ecto.Schema 972 | import Ecto.Changeset 973 | 974 | schema "jobs" do 975 | field :status, :string, default: "queued" 976 | field :payload, :binary 977 | field :attempts, :integer, default: 0 978 | field :max_attempts, :integer, default: 3 979 | field :scheduled_at, :utc_datetime 980 | field :started_at, :utc_datetime 981 | field :completed_at, :utc_datetime 982 | field :error_message, :string 983 | 984 | timestamps() 985 | end 986 | 987 | def changeset(job, attrs) do 988 | job 989 | |> cast(attrs, [:status, :payload, :attempts, :max_attempts, 990 | :scheduled_at, :started_at, :completed_at, :error_message]) 991 | |> validate_required([:payload]) 992 | |> validate_inclusion(:status, ["queued", "running", "completed", "failed"]) 993 | end 994 | 995 | @doc """ 996 | Serialize a function call into a job payload. 997 | 998 | This is where the magic happens - we can serialize any module, function, 999 | and arguments into binary data that can be stored and executed later. 1000 | """ 1001 | def encode_job(module, function, args) do 1002 | {module, function, args} |> :erlang.term_to_binary() 1003 | end 1004 | 1005 | @doc """ 1006 | Deserialize a job payload back into a function call. 1007 | """ 1008 | def decode_job(payload) do 1009 | :erlang.binary_to_term(payload) 1010 | end 1011 | end 1012 | ``` 1013 | 1014 | ## Building the Job Queue Interface 1015 | 1016 | Now we need an interface for interacting with jobs. This is where we abstract the database operations and provide a clean API for enqueueing and processing jobs. Create `lib/job_processor/job_queue.ex`: 1017 | 1018 | ```elixir 1019 | defmodule JobProcessor.JobQueue do 1020 | import Ecto.Query 1021 | alias JobProcessor.{Repo, Job} 1022 | 1023 | @doc """ 1024 | Enqueue a job for processing. 1025 | 1026 | This is the public API that applications use to submit work. 1027 | """ 1028 | def enqueue(module, function, args, opts \\ []) do 1029 | payload = Job.encode_job(module, function, args) 1030 | 1031 | attrs = %{ 1032 | payload: payload, 1033 | max_attempts: Keyword.get(opts, :max_attempts, 3), 1034 | scheduled_at: Keyword.get(opts, :scheduled_at, DateTime.utc_now()) 1035 | } 1036 | 1037 | %Job{} 1038 | |> Job.changeset(attrs) 1039 | |> Repo.insert() 1040 | end 1041 | 1042 | @doc """ 1043 | Fetch available jobs for processing. 1044 | 1045 | This is called by our GenStage producer to get work. 1046 | Uses FOR UPDATE SKIP LOCKED to avoid race conditions. 1047 | """ 1048 | def fetch_jobs(limit) do 1049 | now = DateTime.utc_now() 1050 | 1051 | Repo.transaction(fn -> 1052 | # Find available jobs 1053 | job_ids = 1054 | from(j in Job, 1055 | where: j.status == "queued" and j.scheduled_at <= ^now, 1056 | limit: ^limit, 1057 | select: j.id, 1058 | lock: "FOR UPDATE SKIP LOCKED" 1059 | ) 1060 | |> Repo.all() 1061 | 1062 | # Mark them as running and return the full job data 1063 | {count, jobs} = 1064 | from(j in Job, where: j.id in ^job_ids) 1065 | |> Repo.update_all( 1066 | [set: [status: "running", started_at: DateTime.utc_now()]], 1067 | returning: [:id, :payload, :attempts, :max_attempts] 1068 | ) 1069 | 1070 | {count, jobs} 1071 | end) 1072 | end 1073 | 1074 | @doc """ 1075 | Mark a job as completed successfully. 
1076 | """ 1077 | def complete_job(job_id) do 1078 | from(j in Job, where: j.id == ^job_id) 1079 | |> Repo.update_all( 1080 | set: [status: "completed", completed_at: DateTime.utc_now()] 1081 | ) 1082 | end 1083 | 1084 | @doc """ 1085 | Mark a job as failed and handle retry logic. 1086 | """ 1087 | def fail_job(job_id, error_message, attempts \\ 1) do 1088 | job = Repo.get!(Job, job_id) 1089 | 1090 | if attempts >= job.max_attempts do 1091 | # Permanently failed 1092 | from(j in Job, where: j.id == ^job_id) 1093 | |> Repo.update_all( 1094 | set: [ 1095 | status: "failed", 1096 | error_message: error_message, 1097 | attempts: attempts, 1098 | completed_at: DateTime.utc_now() 1099 | ] 1100 | ) 1101 | else 1102 | # Retry later 1103 | retry_at = DateTime.add(DateTime.utc_now(), 60 * attempts, :second) 1104 | 1105 | from(j in Job, where: j.id == ^job_id) 1106 | |> Repo.update_all( 1107 | set: [ 1108 | status: "queued", 1109 | error_message: error_message, 1110 | attempts: attempts, 1111 | scheduled_at: retry_at 1112 | ] 1113 | ) 1114 | end 1115 | end 1116 | end 1117 | ``` 1118 | 1119 | ## The Power of FOR UPDATE SKIP LOCKED 1120 | 1121 | Notice that crucial line: `lock: "FOR UPDATE SKIP LOCKED"`. This is a PostgreSQL feature that's essential for job processing systems. 1122 | 1123 | Here's what happens without it: 1124 | 1. Consumer A queries for jobs, gets job #123 1125 | 2. Consumer B queries for jobs, gets the same job #123 1126 | 3. Both consumers try to process job #123 simultaneously 1127 | 4. Chaos ensues 1128 | 1129 | With `FOR UPDATE SKIP LOCKED`: 1130 | 1. Consumer A queries for jobs, locks job #123 1131 | 2. Consumer B queries for jobs, skips locked job #123, gets job #124 1132 | 3. Each job is processed exactly once 1133 | 4. No race conditions, no duplicate processing 1134 | 1135 | This is why PostgreSQL (and similar databases) are preferred for job processing systems. The database handles the coordination for us. 1136 | 1137 | ## Updating Our Producer 1138 | 1139 | Now we can update our producer to fetch real jobs from the database instead of generating counter events. 
Update `lib/job_processor/producer.ex`: 1140 | 1141 | ```elixir 1142 | defmodule JobProcessor.Producer do 1143 | use GenStage 1144 | require Logger 1145 | alias JobProcessor.JobQueue 1146 | 1147 | def start_link(_opts) do 1148 | GenStage.start_link(__MODULE__, :ok, name: __MODULE__) 1149 | end 1150 | 1151 | @impl true 1152 | def init(:ok) do 1153 | Logger.info("Job Producer starting") 1154 | {:producer, %{}, dispatcher: GenStage.DemandDispatcher} 1155 | end 1156 | 1157 | @impl true 1158 | def handle_demand(demand, state) when demand > 0 do 1159 | Logger.info("Producer received demand for #{demand} jobs") 1160 | 1161 | case JobQueue.fetch_jobs(demand) do 1162 | {:ok, {count, jobs}} when count > 0 -> 1163 | Logger.info("Fetched #{count} jobs from database") 1164 | {:noreply, jobs, state} 1165 | 1166 | {:ok, {0, []}} -> 1167 | # No jobs available, schedule a check for later 1168 | Process.send_after(self(), :check_for_jobs, 1000) 1169 | {:noreply, [], state} 1170 | 1171 | {:error, reason} -> 1172 | Logger.error("Failed to fetch jobs: #{inspect(reason)}") 1173 | {:noreply, [], state} 1174 | end 1175 | end 1176 | 1177 | @impl true 1178 | def handle_info(:check_for_jobs, state) do 1179 | # This allows us to produce events even when there's no pending demand 1180 | # if jobs become available 1181 | case JobQueue.fetch_jobs(10) do 1182 | {:ok, {count, jobs}} when count > 0 -> 1183 | {:noreply, jobs, state} 1184 | _ -> 1185 | Process.send_after(self(), :check_for_jobs, 1000) 1186 | {:noreply, [], state} 1187 | end 1188 | end 1189 | end 1190 | ``` 1191 | 1192 | ## Understanding the Producer's Evolution 1193 | 1194 | Our producer has evolved significantly: 1195 | 1196 | **Database-driven** - Instead of generating events from memory, we fetch them from persistent storage 1197 | 1198 | **Handles empty queues gracefully** - When no jobs are available, we schedule a check for later instead of blocking 1199 | 1200 | **Error handling** - Database operations can fail, so we handle those cases 1201 | 1202 | **Polling mechanism** - The `:check_for_jobs` message lets us produce events even when there's no pending demand 1203 | 1204 | This polling approach works well for most job processing systems. For higher throughput systems, you could use PostgreSQL's LISTEN/NOTIFY to get push notifications when new jobs arrive. 1205 | 1206 | ## Updating Our Consumer 1207 | 1208 | Now our consumer needs to execute real job payloads instead of just logging numbers. 
Update `lib/job_processor/consumer.ex`: 1209 | 1210 | ```elixir 1211 | defmodule JobProcessor.Consumer do 1212 | use GenStage 1213 | require Logger 1214 | alias JobProcessor.{Job, JobQueue} 1215 | 1216 | def start_link(opts) do 1217 | GenStage.start_link(__MODULE__, opts) 1218 | end 1219 | 1220 | @impl true 1221 | def init(opts) do 1222 | {:consumer, opts, subscribe_to: [JobProcessor.Producer]} 1223 | end 1224 | 1225 | @impl true 1226 | def handle_events(jobs, _from, state) do 1227 | Logger.info("Consumer received #{length(jobs)} jobs") 1228 | 1229 | for job <- jobs do 1230 | execute_job(job) 1231 | end 1232 | 1233 | {:noreply, [], state} 1234 | end 1235 | 1236 | defp execute_job(%{id: job_id, payload: payload, attempts: attempts}) do 1237 | try do 1238 | {module, function, args} = Job.decode_job(payload) 1239 | 1240 | Logger.info("Executing job #{job_id}: #{module}.#{function}") 1241 | 1242 | # Execute the job 1243 | result = apply(module, function, args) 1244 | 1245 | # Mark as completed 1246 | JobQueue.complete_job(job_id) 1247 | 1248 | Logger.info("Job #{job_id} completed successfully") 1249 | 1250 | result 1251 | rescue 1252 | error -> 1253 | error_message = Exception.format(:error, error, __STACKTRACE__) 1254 | Logger.error("Job #{job_id} failed: #{error_message}") 1255 | 1256 | # Mark as failed (with retry logic) 1257 | JobQueue.fail_job(job_id, error_message, attempts + 1) 1258 | end 1259 | end 1260 | end 1261 | ``` 1262 | 1263 | ## The Magic of Code Serialization 1264 | 1265 | The real power of this system is in those two lines: 1266 | 1267 | ```elixir 1268 | {module, function, args} = Job.decode_job(payload) 1269 | result = apply(module, function, args) 1270 | ``` 1271 | 1272 | We're deserializing a function call that was stored as binary data and executing it. This means you can queue any function call: 1273 | 1274 | ```elixir 1275 | # Send an email 1276 | JobQueue.enqueue(MyApp.Mailer, :send_welcome_email, [user_id: 123]) 1277 | 1278 | # Process an image 1279 | JobQueue.enqueue(MyApp.ImageProcessor, :resize_image, ["/path/to/image.jpg", 300, 200]) 1280 | 1281 | # Call an API 1282 | JobQueue.enqueue(MyApp.ApiClient, :sync_user_data, [user_id: 456]) 1283 | 1284 | # Even complex data structures 1285 | JobQueue.enqueue(MyApp.ReportGenerator, :generate_report, [%{ 1286 | user_id: 789, 1287 | date_range: Date.range(~D[2024-01-01], ~D[2024-01-31]), 1288 | format: :pdf 1289 | }]) 1290 | ``` 1291 | 1292 | Each of these becomes a row in the database, gets picked up by our GenStage producer, distributed to available consumers, and executed. The serialization handles all the complex data structures automatically. 1293 | 1294 | ## What We've Built 1295 | 1296 | We now have a complete job processing system with: 1297 | 1298 | - **Persistent storage** - Jobs survive restarts 1299 | - **Automatic retries** - Failed jobs are retried with exponential backoff 1300 | - **Concurrent processing** - Multiple consumers process jobs in parallel 1301 | - **Backpressure handling** - GenStage ensures consumers aren't overwhelmed 1302 | - **Race condition prevention** - Database locking ensures each job runs exactly once 1303 | - **Delayed jobs** - Support for scheduling jobs to run later 1304 | - **Error tracking** - Failed jobs are logged with error messages 1305 | 1306 | And the beautiful part? GenStage handles all the complex coordination. We just focus on the business logic of our jobs. 1307 | 1308 | ## Testing Our Job System 1309 | 1310 | Let's create a simple job to test our system. 
Add this to `lib/job_processor/test_job.ex`: 1311 | 1312 | ```elixir 1313 | defmodule JobProcessor.TestJob do 1314 | require Logger 1315 | 1316 | def hello(name) do 1317 | Logger.info("Hello, #{name}!") 1318 | Process.sleep(1000) # Simulate some work 1319 | "Greeted #{name}" 1320 | end 1321 | 1322 | def failing_job do 1323 | Logger.info("This job will fail...") 1324 | raise "Intentional failure for testing" 1325 | end 1326 | 1327 | def heavy_job(duration_ms) do 1328 | Logger.info("Starting heavy job for #{duration_ms}ms") 1329 | Process.sleep(duration_ms) 1330 | Logger.info("Heavy job completed") 1331 | "Completed heavy work" 1332 | end 1333 | end 1334 | ``` 1335 | 1336 | Now you can queue jobs from the console: 1337 | 1338 | ```elixir 1339 | iex -S mix 1340 | 1341 | # Queue a simple job 1342 | JobProcessor.JobQueue.enqueue(JobProcessor.TestJob, :hello, ["World"]) 1343 | 1344 | # Queue a failing job (to test retry logic) 1345 | JobProcessor.JobQueue.enqueue(JobProcessor.TestJob, :failing_job, []) 1346 | 1347 | # Queue multiple jobs to see parallel processing 1348 | for i <- 1..10 do 1349 | JobProcessor.JobQueue.enqueue(JobProcessor.TestJob, :hello, ["Person #{i}"]) 1350 | end 1351 | ``` 1352 | 1353 | Watch the logs to see jobs being processed, failures being retried, and the natural load balancing across multiple consumers. 1354 | 1355 | We've transformed our simple counter example into a production-ready job processing system. The core GenStage concepts remained the same, but now we're processing real work with persistence, error handling, and retry logic. 1356 | 1357 | # Bringing It All Together: From Tutorial to Production 1358 | 1359 | Our tutorial system works well, but production systems need additional sophistication. Here's where you'd take this next: 1360 | 1361 | ## Multiple Job Types with Dedicated Queues 1362 | 1363 | Real applications have different types of work with different characteristics: 1364 | 1365 | ```elixir 1366 | # High-priority user-facing jobs 1367 | JobQueue.enqueue(:email_queue, Mailer, :send_welcome_email, [user_id]) 1368 | 1369 | # Background data processing 1370 | JobQueue.enqueue(:analytics_queue, Analytics, :process_events, [batch_id]) 1371 | 1372 | # Heavy computational work 1373 | JobQueue.enqueue(:ml_queue, ModelTrainer, :train_model, [dataset_id]) 1374 | ``` 1375 | 1376 | Each queue gets its own producer, consumer pool, and configuration: 1377 | 1378 | ```elixir 1379 | defmodule JobProcessor.QueueSupervisor do 1380 | use Supervisor 1381 | 1382 | def start_link(init_arg) do 1383 | Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__) 1384 | end 1385 | 1386 | def init(_init_arg) do 1387 | children = [ 1388 | # Email queue - fast, lightweight 1389 | queue_spec(:email_queue, max_consumers: 5, max_demand: 1), 1390 | 1391 | # Analytics queue - batch processing 1392 | queue_spec(:analytics_queue, max_consumers: 3, max_demand: 100), 1393 | 1394 | # ML queue - heavy computation 1395 | queue_spec(:ml_queue, max_consumers: 1, max_demand: 1) 1396 | ] 1397 | 1398 | Supervisor.init(children, strategy: :one_for_one) 1399 | end 1400 | 1401 | defp queue_spec(queue_name, opts) do 1402 | %{ 1403 | id: :"#{queue_name}_supervisor", 1404 | start: {JobProcessor.QueueManager, :start_link, [queue_name, opts]}, 1405 | type: :supervisor 1406 | } 1407 | end 1408 | end 1409 | ``` 1410 | 1411 | ## Dynamic Consumer Scaling 1412 | 1413 | Scale consumers based on queue depth and system load: 1414 | 1415 | ```elixir 1416 | defmodule JobProcessor.AutoScaler do 1417 | use 
GenServer 1418 | # Sketch: start_link/1, schedule_check/0, scale_up/2 and scale_down/2 are omitted here 1419 | def init(queue_name) do 1420 | schedule_check() 1421 | {:ok, %{queue: queue_name, consumers: [], target_consumers: 2}} 1422 | end 1423 | 1424 | def handle_info(:check_scaling, state) do 1425 | queue_depth = JobQueue.queue_depth(state.queue) 1426 | current_consumers = length(state.consumers) 1427 | 1428 | target = calculate_target_consumers(queue_depth, current_consumers) 1429 | 1430 | new_state = 1431 | cond do 1432 | target > current_consumers -> scale_up(state, target - current_consumers) 1433 | target < current_consumers -> scale_down(state, current_consumers - target) 1434 | true -> state 1435 | end 1436 | 1437 | schedule_check() 1438 | {:noreply, new_state} 1439 | end 1440 | 1441 | defp calculate_target_consumers(queue_depth, current) do 1442 | cond do 1443 | queue_depth > 1000 -> min(current + 2, 10) 1444 | queue_depth > 100 -> min(current + 1, 10) 1445 | queue_depth < 10 -> max(current - 1, 1) 1446 | true -> current 1447 | end 1448 | end 1449 | end 1450 | ``` 1451 | 1452 | ## Worker Registries and Health Monitoring 1453 | 1454 | Track worker health and performance: 1455 | 1456 | ```elixir 1457 | defmodule JobProcessor.WorkerRegistry do 1458 | use GenServer 1459 | require Logger 1460 | def start_link(_) do 1461 | GenServer.start_link(__MODULE__, %{workers: %{}}, name: __MODULE__) 1462 | end 1463 | def init(state), do: {:ok, state} 1464 | def register_worker(queue, pid, metadata \\ %{}) do 1465 | GenServer.cast(__MODULE__, {:register, queue, pid, metadata}) 1466 | end 1467 | 1468 | def get_workers(queue) do 1469 | GenServer.call(__MODULE__, {:get_workers, queue}) 1470 | end 1471 | 1472 | def get_worker_stats do 1473 | GenServer.call(__MODULE__, :get_stats) 1474 | end 1475 | # handle_call clauses for :get_workers and :get_stats are omitted for brevity 1476 | def handle_cast({:register, queue, pid, metadata}, state) do 1477 | Process.monitor(pid) 1478 | 1479 | worker_info = %{ 1480 | pid: pid, 1481 | queue: queue, 1482 | started_at: DateTime.utc_now(), 1483 | jobs_processed: 0, 1484 | last_job_at: nil, 1485 | metadata: metadata 1486 | } 1487 | 1488 | new_workers = Map.put(state.workers, pid, worker_info) 1489 | {:noreply, %{state | workers: new_workers}} 1490 | end 1491 | 1492 | def handle_info({:DOWN, _ref, :process, pid, reason}, state) do 1493 | Logger.warning("Worker #{inspect(pid)} died: #{inspect(reason)}") 1494 | new_workers = Map.delete(state.workers, pid) 1495 | {:noreply, %{state | workers: new_workers}} 1496 | end 1497 | end 1498 | ``` 1499 | 1500 | ## Advanced Error Handling 1501 | 1502 | Circuit breakers for failing job types: 1503 | 1504 | ```elixir 1505 | defmodule JobProcessor.CircuitBreaker do 1506 | use GenServer 1507 | # Sketch: init/1, the :success and :failure handle_cast clauses, and the circuit helpers are omitted 1508 | def should_process_job?(job_type) do 1509 | GenServer.call(__MODULE__, {:should_process, job_type}) 1510 | end 1511 | 1512 | def record_success(job_type) do 1513 | GenServer.cast(__MODULE__, {:success, job_type}) 1514 | end 1515 | 1516 | def record_failure(job_type, error) do 1517 | GenServer.cast(__MODULE__, {:failure, job_type, error}) 1518 | end 1519 | 1520 | def handle_call({:should_process, job_type}, _from, state) do 1521 | circuit_state = Map.get(state.circuits, job_type, :closed) 1522 | 1523 | case circuit_state do 1524 | :closed -> {:reply, true, state} 1525 | :open -> 1526 | if circuit_should_retry?(state, job_type) do 1527 | {:reply, true, transition_to_half_open(state, job_type)} 1528 | else 1529 | {:reply, false, state} 1530 | end 1531 | :half_open -> {:reply, true, state} 1532 | end 1533 | end 1534 | end 1535 | ``` 1536 | 1537 | ## Dead Letter Queues 1538 | 1539 | Handle permanently failed jobs: 1540 | 1541 | ```elixir 1542 | defmodule
JobProcessor.DeadLetterQueue do 1543 | def handle_permanent_failure(job, final_error) do 1544 | dead_job = %{ 1545 | original_job: job, 1546 | failed_at: DateTime.utc_now(), 1547 | final_error: final_error, 1548 | attempt_history: job.attempt_history || [], 1549 | forensics: collect_forensics(job) 1550 | } 1551 | 1552 | Repo.insert(%DeadJob{data: dead_job}) # assumes a DeadJob Ecto schema with a map :data column 1553 | JobProcessor.Notifications.send_dead_letter_alert(dead_job) 1554 | end 1555 | 1556 | defp collect_forensics(job) do 1557 | %{ 1558 | system_load: :erlang.statistics(:run_queue), # total run queue length is a cheap load signal 1559 | memory_usage: :erlang.memory(), 1560 | queue_depths: JobQueue.all_queue_depths(), 1561 | recent_errors: JobProcessor.ErrorTracker.recent_errors(job.module) 1562 | } 1563 | end 1564 | end 1565 | ``` 1566 | 1567 | ## Observability 1568 | 1569 | Comprehensive monitoring with telemetry: 1570 | 1571 | ```elixir 1572 | defmodule JobProcessor.Telemetry do 1573 | def setup do 1574 | events = [ 1575 | [:job_processor, :job, :start], 1576 | [:job_processor, :job, :stop], 1577 | [:job_processor, :job, :exception], 1578 | [:job_processor, :queue, :depth] 1579 | ] 1580 | 1581 | :telemetry.attach_many("job-processor-metrics", events, &handle_event/4, nil) 1582 | end 1583 | 1584 | def handle_event([:job_processor, :job, :stop], measurements, metadata, _config) do 1585 | JobProcessor.Metrics.record_job_duration(metadata.queue, measurements.duration) 1586 | JobProcessor.Metrics.increment_jobs_completed(metadata.queue) 1587 | JobProcessor.Metrics.record_job_success(metadata.module, metadata.function) 1588 | end 1589 | 1590 | # A catch-all clause keeps the other attached events from crashing (and detaching) this handler 1591 | def handle_event(_event, _measurements, _metadata, _config), do: :ok 1592 | end 1593 | ``` 1594 | 1595 | GenStage's demand-driven architecture naturally handles backpressure, load balancing, and fault isolation. These production patterns build on that foundation, giving you the tools to run job processing at scale. The same principles that made our tutorial system work - processes, supervision, and message passing - scale to enterprise deployments. 1596 | --------------------------------------------------------------------------------