├── 2023 ├── 11-12-time-series-databases.md └── 10-29-serialization.md ├── 2024 ├── 04-21-memory-allocation.md ├── 06-02-wasmi.md ├── 06-30-sorting-implementations.md ├── 06-16-foundationdb.md ├── 04-28-aosa-volume1.md ├── 01-21-media-codecs.md ├── 08-11-solidjs.md ├── 10-13-tapl-pt3-4.md ├── 06-23-memory-models.md ├── 09-08-moonray.md ├── 10-20-tapl-pt5-6.md ├── 09-01-garbage-collection.md ├── 10-06-tapl-pt1-2.md ├── 05-19-ceph.md ├── 06-28-kernel-instrumentation.md ├── 06-09-disaggregated-databases.md ├── 08-04-intel-sgx.md ├── 01-14-zstandard.md ├── 11-10-gpu-sharing.md ├── 12-08-simdjson.md ├── 05-12-crush-rados.md ├── 09-29-amazon-s3.md ├── 02-11-linux-executables.md ├── 03-03-wireguard.md └── 10-27-sql-50years.md ├── 2025 ├── 12-hardware-sys-security.md ├── 01-crdts-and-distributed-replication.md ├── 06-29-new-cpython-jit.md ├── 08-ssa-book.md ├── 03-animations-and-interactive-graphics.md ├── 04-v8-javascript-engine.md ├── 11-large-data-formats.md ├── 02-gpu-kernel-programming.md ├── 06-modern-vision-systems.md └── 07-performance-case-studies.md └── README.md /README.md: -------------------------------------------------------------------------------- 1 | # nysrg-notes 2 | 3 | Eric's personal notes from [NYSRG](https://notes.ekzhang.com/events/nysrg). 4 | -------------------------------------------------------------------------------- /2024/04-21-memory-allocation.md: -------------------------------------------------------------------------------- 1 | - Memory allocation — [[April 21st, 2024]] 2 | - External vs internal fragmentation 3 | - jemalloc: size classes are small {tiny, quantum-spaced (n*16 B), sub-page (1 KiB), large (4 KiB), huge (2 MiB)} and handled differently. 4 | - Each 2 MiB chunk is either an __arena__ or a "huge" region. 5 | - Arenas are shared between threads, using 4x the number of threads to avoid performance issues due to mutex contention during OS preemption. 6 | - There is a "current" run for each size class. Size of a run affects external fragmentation. 7 | - Hysteresis based on the __fullness quartile__ of each run to determine deletion. 8 | - __Cache lines__ are the minimum chunk of data that can be concurrently written to by each processor. If two processors try to access the same cache line (usually 64 B, 128 B on M1 processors) concurrently, then they fight over ownership. 9 | - How does thread-local storage work? It's essential for storing arena IDs efficiently. Why do some architectures not support it? 10 | - How does the allocator use sbrk / mmap? They both return a pointer to a huge memory region, and the allocator divides that up into smaller regions. The allocator then builds on top of that, returning a pointer from that region back to the user. 11 | - How is jemalloc different today, 14 years later? "Allocator design and implementation has strong potential as the subject of career-long obsession, and indeed there are people who focus on this subject." 12 | -------------------------------------------------------------------------------- /2024/06-02-wasmi.md: -------------------------------------------------------------------------------- 1 | - Wasmi — [[June 2nd, 2024]] 2 | - 12 people 3 | - Main ideas from the original Wasmi article: There are two main categories of Wasm workloads, translation and compute. New execution engine balances translation and validation and is good for translation-intensive workloads. Might be 5-20x slower for heavy compute though. 
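- To make the stack-machine vs. register-machine distinction concrete (the register-based IR comes up a few bullets below), here is a toy C sketch — not Wasmi's actual IR or bytecode encoding, just an illustration of how register operands replace push/pop traffic in the dispatch loop:
  ```c
  #include <stdio.h>

  /* Stack machine: every operand moves through an explicit value stack. */
  enum { S_PUSH, S_ADD, S_MUL, S_END };
  int run_stack(const int *code) {
      int stack[16], sp = 0;
      for (;;) {
          switch (*code++) {
          case S_PUSH: stack[sp++] = *code++; break;
          case S_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
          case S_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
          case S_END:  return stack[sp - 1];
          }
      }
  }

  /* Register machine: instructions name their operands directly, so there is
     no push/pop traffic and constants/locals can be folded into register slots. */
  enum { R_LOADI, R_ADD, R_MUL, R_END };
  int run_reg(const int *code) {
      int r[16];
      for (;;) {
          switch (*code++) {
          case R_LOADI: { int d = *code++; r[d] = *code++; break; }
          case R_ADD: { int d = *code++, a = *code++, b = *code++; r[d] = r[a] + r[b]; break; }
          case R_MUL: { int d = *code++, a = *code++, b = *code++; r[d] = r[a] * r[b]; break; }
          case R_END:  return r[*code];
          }
      }
  }

  int main(void) {
      /* (1 + 2) * 3 in both encodings; both print 9. */
      int s[] = { S_PUSH, 1, S_PUSH, 2, S_ADD, S_PUSH, 3, S_MUL, S_END };
      int r[] = { R_LOADI, 0, 1, R_LOADI, 1, 2, R_ADD, 0, 0, 1,
                  R_LOADI, 1, 3, R_MUL, 0, 0, 1, R_END, 0 };
      printf("%d %d\n", run_stack(s), run_reg(r));
      return 0;
  }
  ```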
4 | - __"You have to pick the runtime based on what type of code you want to run."__ 5 | - There is an experimental WASM C-API for running WebAssembly and instantiating all of its external functions within a C calling convention, rather than a browser. 6 | - Wasmi relies on a new register-based IR, which is more performant than stack machines. 7 | - Difference between Wasmi — there's no actual "JMP" into the translated program. 8 | - Core is the "engine" while the rest of the package is various features like linear memory and tables and module loading, as well as external functions. 9 | - Relatively easy to read this code. Written by 2 people. 10 | - Engine takes a module/ as input and sends it into translator -> bytecode -> executor. 11 | - Translator produces a __virtual__ value stack or trace while the wasm source code is being visited. Each instruction modifies the virtual value and control stacks. Constants are folded, and other operations are converted into register indices. 12 | - Register allocation is handled through several regions: inputs, locals, dynamics, and preservation. There is a defragmentation step at the end that makes all registers for a function contiguous. 13 | -------------------------------------------------------------------------------- /2024/06-30-sorting-implementations.md: -------------------------------------------------------------------------------- 1 | - Sorting implementations — [[June 30th, 2024]] 2 | - ipnsort and driftsort 3 | - Prior-work on pattern-defeating quicksort (pdqsort). 4 | - Lessons: You can get pretty good performance with architecture-independent code, even if maximizing CPU features would give you maximum performance. Benchmark all the possible cases you can think about, holistically evaluate regressions & improvements. Allow theoretical analyses to guide practical performance tuning and weighing tradeoffs. 5 | - For ipnsort, the benchmarks come from very similar distributions compared to driftsort. Methodology is primarily based on relative speed graphs by size range. 6 | - There are 11 people here now, plus Nathan and his two friends just stopping by. 7 | - Focusing on wide out-of-order execution rather than SIMD / vectorization may offer more portable binaries; it's a valid alternative method of cranking out performance. 8 | - A lot of the authors' contributions are novel performance analyses. This is hard! 9 | - `cmov`: Branchless swap / move instruction, used by the "smallsort" sorting networks. Converts control dependencies into data dependencies, depends on the use case. 10 | - Hardware parameters affect the constants in this implementations. 11 | - VLIW was mentioned by a group member as an alternative hardware architecture. 12 | - Branchless Lomuto partitions describe yet another paper with quite entertaining analysis. Conditional moves make it possible to generate more optimized code for Lomuto, vs Hoare. Cyclic permutations is another trick to avoid the copy-overhead of swaps. 13 | - Branchless Lomuto partitioning: https://orlp.net/blog/branchless-lomuto-partitioning/ 14 | -------------------------------------------------------------------------------- /2023/11-12-time-series-databases.md: -------------------------------------------------------------------------------- 1 | - Time Series Databases — [[November 12th, 2023]] 2 | - Scenario: So you want to store a time series 3 | - slow-folder/ (disk drives) 4 | - ++ 2023-01-01.txt.gz 5 | - fast-folder/ (fast SSD) 6 | - -- 2023-01-01.txt.gz 7 | - 2023-01-02.txt.gz 8 | - .. 
9 | - 2023-01-31.txt.gz 10 | - 2023-02-01.txt.gz 11 | - indices/ 12 | - January/ 13 | - rolled-up-2023-01.txt.gz — __indexed by some column__ 14 | - February/ 15 | - Queries: Filter by Tractor_Brand = Toyota 16 | - Speed for a __specific day__ versus columns over __long date ranges__ 17 | - Finance 18 | - January-ETFs.csv 19 | - January-ETFs-fine-grained-trades.csv 20 | - January-entire-market.csv 21 | - What data are you storing? 22 | - Enum values (integer values) 23 | - Floating-point / numeric data 24 | - Logs? Probably. 25 | - Indices might look different 26 | - Storage format? 27 | - Images and videos, perhaps 28 | - Links / IDs to external storage 29 | - String "vi-123456-on-january-23.mov" 30 | - **Defined in terms of the __access pattern__ over the data.** 31 | - OLAP 32 | - Looking at data in the past, not necessarily by recently 33 | - Lots of data 34 | - Don't need 1ms latency, maybe 100ms, 1s, 5s, 10s, …? 35 | - Prometheus: 36 | - System for scraping, storing, and backing up metrics 37 | - Make queries via a web UI, and install special query "rules" that trigger alerts (email, text message) 38 | - Alternatives: Splunk(?) 39 | -------------------------------------------------------------------------------- /2024/06-16-foundationdb.md: -------------------------------------------------------------------------------- 1 | - FoundationDB — [[June 16th, 2024]] 2 | - Control plane and data plane. Notable feature is that there is __exactly one sequencer__ (HLC service) across the entire cluster, and this is used to get read / commit sequence numbers for all transactions, regardless of their origin. 3 | - Paxos control plane recruits three singletons: Sequencer, DataDistributor, and Ratekeeper. 4 | - Transactions through MVCC + OCC, keeping track of conflicts between read & write sets. 5 | - Reconfigures the system through heartbeats to the control plane, ensuring that it can recover from failures in a few seconds of up to f nodes out of f+1. 6 | - Transaction -> Log -> Storage. A commit is rejected if there are overlapping changes in the sequence number range [read time, commit time]. To avoid racing commits, the resolvers send the changes to log servers, which then ensure that no commits are applied while earlier commits are still outstanding. 7 | - Very strong simulation testing makes them confident in the reliability / correctness. 8 | - Serializable snapshot isolation is guaranteed by checking read sets against write sets. Basically, the read set is compared against all writes concurrent with the transaction. This guarantees strict serializable execution. This is done using a __lastCommit__ skip-list over the entire keyspace, and checking if lastCommit(...) for any transaction interval comes after the read. 9 | - Basically, FoundationDB decouples the control plane from the data plane, in database form. The resolver and log servers are split up into separate services. 10 | - Real innovation in FDB could be seen in its robust testing infrastructure. See blog post from James Baker on Project Loom, and the Antithesis spinoff company that focuses on robust testing, who is organizing a conference in NYC soon. 11 | -------------------------------------------------------------------------------- /2024/04-28-aosa-volume1.md: -------------------------------------------------------------------------------- 1 | - AOSA, Volume 1 — [[April 28th, 2024]] 2 | - Themes are listed below. 
3 | - Audacity: Plugin systems, performance impedance mismatch between threads / buffers, GUI frameworks, decoupling application logic from systems, data structures. 4 | - GUI frameworks today are quite different, likely due to the influence of React. 5 | - Bash: Standardization and backward-compatibility, behavior edge cases, parser considerations, weaker types allowing for a more dynamic language (static and dynamic tension), code being affected by historical changes (Unicode), lack of community dev involvement. 6 | - Related projects: How does it compare to Powershell? More structured, object-oriented. Perhaps stricter and more cumbersome. 7 | - Lisp is similarly dynamic, with self-modifying code. But a lot less popular, likely because of the difference in domain. Interactive exploration tends towards dynamic languages. 8 | - "It's up to the author to just write a readable script." 9 | - CMake: Unifying two build systems, cross-platform identity versus behavior, portabillity, embedded scripting language for build systems, scope and success/failure of various integrated components (analysis, tests, plugins). 10 | - Example from Alex: Uses CMake at work, has a common macro system to develop their own DSL for compiling C++ libraries and executables, specifying dependencies. 11 | - HDFS: Focusing on networking considerations with topology / racks, fault-tolerant replication, control (NameNode) and data separation, daemons to balance utilization and fix corruption, tuning many time/storage/ratio/replication parameters by human operators, "fat client." 12 | - Someone at group: Used HDFS via a deployment on Kubernetes, and queried the NameNode for file locations so they could colocate jobs on pods that had that data. 13 | - LLVM: Code reuse for compiler projects (extensibility / cleanness), "peephole" optimization, fully-specified intermediate representations, infrastructure as libraries instead of an app, tests as a first-class language representation, modularity allows replacement and contribution. 14 | - Mentions MLIR, should we read about it? Look at GPU / FPGA support. 15 | -------------------------------------------------------------------------------- /2024/01-21-media-codecs.md: -------------------------------------------------------------------------------- 1 | - Media codecs — [[January 21st, 2024]] 2 | - Audio 3 | - Sequence of very fast "samples" 4 | - What is a sample? Measuring the __physical movement__ of the speaker over time! 5 | - Bit rate = amount of data per second 6 | - VBR = strategy, change bit rate automatically to match quality, so silence has less bits, noise has many bits 7 | - Relevant for __lossy__ codecs, ~ compression level 8 | - Why do we need multiple channels? 9 | - 1 = mono (one channel) 10 | - 2 = stereo (two ears, headphones) + center? 11 | - more than 2 = speaker systems? or locations in space? 12 | - standard is 5.1 13 | - Dolby Atmos — speaker system can simulate an abstract directional space 14 | - can also have dubs / translations in channels 15 | - Latency number? — compression buffer / chunk size based on time / bitrate 16 | - (32-bit PCM standard in Ableton) 17 | - Each sample is a integer / floating-point 18 | - Formats are important! __gapless looping__ 19 | - For most media work (video editing, game dev) — most tools will standardize all audio imports. 
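- Worked example (assuming CD-quality PCM: 44.1 kHz sample rate, 16-bit samples, 2 channels): uncompressed bit rate is $$44{,}100 \times 16 \times 2 = 1{,}411{,}200$$ bits/s ≈ 1.4 Mbit/s ≈ 10.6 MB per minute — which is why lossy codecs at roughly 128–320 kbit/s are such a large savings.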
20 | - "Falsehoods about distributed systems that programmers believe" 21 | - You might meet these in your life 22 | - Time only goes forward 23 | - There are 365 days in a year 24 | - You will not meet this in your life 25 | - Media is nasty? — let's sacrifice ourselves to learning about this 26 | - What is video 27 | - Images through time 28 | - What is color 29 | - What is audio 30 | - Muxed / demuxed 31 | - PCM 32 | - Bitrate 33 | - Sample rate 34 | - ~~What is infinity~~ 35 | - Encoding, transcoding 36 | - Performance in implementation / energy-efficiency 37 | - Compare different formats 38 | - AV1 — "hot" format 39 | - VP8/VP9 — also open-source, possibly related to AV1 40 | - H.265 vs H.264 — both by the MPEG industry group 41 | - Licensing and compatibility issues? 42 | - What software / hardware do you need in order to support a codec 43 | - Who do you need to pay money to 44 | -------------------------------------------------------------------------------- /2024/08-11-solidjs.md: -------------------------------------------------------------------------------- 1 | - Solid.js — [[August 11th, 2024]] 2 | - Tutorial notes 3 | - onCleanup can be called within any scope, and rendering the component is a top-level singular effect. <- interesting, what does this mean for nesting effects and having calls to effects in loops? what is onMount __really__ doing? 4 | - Event delegation is interesting, what's going on there? Can you emit custom events? Why use delegation versus just native events? (onMouseMove vs on:mousemove) 5 | - There is a button to see the compiled output in the tutorial. What's going on there? There's an _$effect() function that looks relevant, and _$delegateEvents([]). 6 | - Syntactic sugar: Style is nice and consistent, going to native CSS variables etc. classList for dynamic or conditional styling, refs, "use:" directives as a compiler transformation. 7 | - Reactivity breaks with object destructuring, use splitProps. **Hmm… why isn't this an issue when accessing a property of an object?** 8 | - Something funky is going on in the [props/children tutorial](https://www.solidjs.com/tutorial/props_children), take a closer look. 9 | - > The reactive variable 'props.children' should be used within JSX, a tracked scope (like createEffect), or inside an event handler function, or else changes will be ignored. 10 | eslint(solid/reactivity) 11 | - "Nested signals" and "proxies" seem like the magic sauce. "Stores" combine these all into a unified syntax sugar for updating state and making signals. Compare to Immer.js with the produce() store, local mutability. 12 | - Context = dependency injection. 13 | - Source code: about 6000 lines total inside the packages/solid folder, excluding tests 14 | - h, html: jsx runtime fragments for DOM elements 15 | - src: core library (3500 lines) — this is also called "solid-js" 16 | - reactive (1700 lines) 17 | - Use of the [Proxy](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy) builtin object is evident here. 
18 | - Includes patch of S.js, signals library, along with a timing-tuned scheduler 19 | - Owner = computations[], cleanups, contexts and name 20 | - Computation (Init => Next) = effect function, ComputationState (enum 0,1,2 — stale/pending), value, suspense, etc., signal state (sources) each with observers 21 | - render 22 | - client rendering core 23 | - server 24 | - ssr core 25 | - store: store library for mutables / deeply nested reactivity (2500 lines) 26 | - universal: small library to create a custom renderer 27 | - Actually just a re-export of dom-expressions right now 28 | - web: connects the reactivity engine to the DOM, plus SSR support (250 lines) 29 | - dom-expressions -> "rxcore" babel-mapped to solid-js 30 | - babel-plugin-jsx-dom-expressions 31 | -------------------------------------------------------------------------------- /2024/10-13-tapl-pt3-4.md: -------------------------------------------------------------------------------- 1 | - TAPL, parts 3+4 — [[October 13th, 2024]] 2 | - There are 9 people in the room, and we're seated around a table. Starting the reading process. Someone points out that STLC + references (Landin's knot) is enough to be Turing complete, since you get loops by assigning a reference to a function continuation — cool! 3 | - Chapter 15 4 | - First, STLC with subtyping and records: $$\lambda_{<:}$$ 5 | - Subtyping is codifying the __principle of safe substitution__. $$S <: T$$ means that S can be used everywhere that T is. This adds the rule of __subsumption__. 6 | - Width subtyping for records: more fields means a __smaller__ type. (Depth subtyping as well, which is covariance, but note that this one is only sound for immutable records.) Together, these are kind of like "ignoring unknown fields" in a nested JSON struct. 7 | - Function types have contravariant argument type, covariant return type. 8 | - Coercion semantics marked with double-brackets. Translate a language with subtypes into a language without subtypes. 9 | - Chapter 16 10 | - __Algorithmic__ subtyping rules are syntax directed. This is an important part of metatheory of subtyping, since it allows us to efficiently check subtypes, rather than having a more fuzzy set of derivations that is not as obvious how to apply. It's like an `is_subtype()` function. 11 | - Interestingly enough: subtyping only needs to be done at function application time (or ascription), as otherwise we could just delay it. So TA (algorithmic typing) only has the subtype condition in the application rule, rather than a general x : T -> x : U rule. 12 | - Joins are union / supertypes, meets are intersection / subtypes. 13 | - In other words, the "minimal type" of two branches is the join, least upper-bound. 14 | - Algorithmic subtyping needs to add a couple extra rules for $$\bot$$. 15 | - Chapter 20 16 | - `NatList = µX. ;` where $$\mu$$ is an explicit type recursion operator. 17 | - `Hungry = µA. Nat→A;` takes an integer argument, then returns itself. 18 | - `Stream = µA. Unit→{Nat,A};` is an iterator type, it returns itself but also a natural number. 19 | - Construct these types using the fixpoint combinator (Y combinator), named `fix` above. Note that the combinator itself isn't well-typed before, but with recursive types we can write down the type of the `x x` term as `(λx:(µA.A→T). f (x x))`. 20 | - Thus, recursive types break the __strong normalization__ property. 21 | - This is really powerful! Untyped lambda calculus can be embedded in a static language with recursive types. 
(Related to denotational semantics.) `D = µX.X→X` 22 | - Iso-recursive versus equi-recursive logic — iso-recursive makes it easier to write a typechecker since there's explicit isomorphism, but equi-recursive languages need to do complex unification to work out recursive types. 23 | - Ry is here and killing it with his type theory knowledge! “Kajung” is here and she invited him, as a coworker from Jane Street. He works on the OCaml compiler, just joined after graduating in June. 24 | -------------------------------------------------------------------------------- /2025/12-hardware-sys-security.md: -------------------------------------------------------------------------------- 1 | - Hardware systems security (Dec 2025) 2 | - Compiler explorer 3 | - Pretty standard architecture with a frontend, Node.js servers, and a compilation queue processor that runs things in nsjail. 4 | - Compile workers have 4 TB of compilers installed on them, mounted on EFS. 5 | - Also, compilers require a lot of tiny runtime files, this is slow for EFS. To get around this, they build squashfs (compressed) images for each of those files in EFS, and then mount them on the workers as well, as loopback block device mounts — this prevents Linux from repeatedly doing metadata validity checks in NFS. 6 | - Pretty standard hack for reducing NFS latency: group your files into bigger chunks. And then set up read-only caching (à la JuiceFS or Modal). 7 | - Looks like they get about 3 req/s for compile, which is quite a fair bit! Not too much load though, so the support for a bajillion compilers is the main technical feat. Like serverless technology where you need to support a bunch of sparse endpoints. 8 | - Auto-scaling and caching, everything keeps costs down to $3000/mo, nice. 9 | - Seems super organic, nice to see a project grow and come about like this. 10 | - Figma's blog post on seccomp 11 | - Practical considerations on eng cost: "To create the seccomp allowlist, you need to either know all possible syscalls that the program can make, or more typically, empirically construct this list by running the program with a tool like `strace` on a representative corpus of inputs to exercise all possible codepaths." 12 | - seccomp allowlists may be brittle and need updates from people outside security org. 13 | - Can't dereference pointers in seccomp filters, so filtering openat() is impossible for instance as it takes a directory file descriptor. 14 | - They used nsjail and trialed pure-seccomp for their "RenderServer" (editor session backend) but ran into lots of edge cases in a production system. 15 | - Worth noting that seccomp itself is much faster, more lightweight than nsjail, easier to test. 16 | - Trick: Open all your files beforehand, then enter the sandbox. 17 | - Optimizing seccomp in gVisor 18 | - Small gains in seccomp-bpf by checking for non-cacheable syscalls first (Linux kernel emulates cacheable ones and stores them in a static map), and creating a binary tree instead of linear scan of jumps for instruction filtering. 19 | - They can also reduce code size with instruction-level optimizer. cBPF is limited to 4096 instructions, so this will allow them to add more rules in the future. 20 | - Remember that futex() is the most common syscall by far, followed by nanosleep and sendmmsg (for this distributed system I/O benchmark). 21 | - rust-vmm/seccompiler 22 | - Some more details on seccomp, installed via `prctl()` or `seccomp()`. 
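- A minimal C sketch of installing a filter with `prctl()` (hypothetical allowlist that only permits `write` and `exit_group`; error handling omitted, and a real filter would also check `seccomp_data.arch` and usually be generated with something like libseccomp rather than hand-written BPF):
  ```c
  #include <stddef.h>
  #include <unistd.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <linux/filter.h>
  #include <linux/seccomp.h>

  int main(void) {
      struct sock_filter filter[] = {
          /* Load the syscall number from struct seccomp_data. */
          BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
          /* Allow write and exit_group, kill the process on anything else. */
          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS), /* Linux 4.14+ */
          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
      };
      struct sock_fprog prog = {
          .len = sizeof(filter) / sizeof(filter[0]),
          .filter = filter,
      };

      /* Needed so an unprivileged process may install a filter. */
      prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
      prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

      write(1, "still sandboxed and alive\n", 26);  /* allowed */
      syscall(__NR_exit_group, 0);  /* exit directly, skipping libc teardown syscalls */
  }
  ```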
23 | - I searched this up a bit, once you enter seccomp mode on a process, you can't disable or relax it again. You can only install more filters with logical AND (stricter). 24 | - `seccomp()` is the modern API (Linux 3.17+), prctl is legacy cruft. 25 | - Didn't work in Orbstack, but i got it working in a quick [Codespace with this code](https://gist.github.com/ekzhang/12a0456a5e196375e76b06c7446191f9). 26 | -------------------------------------------------------------------------------- /2024/06-23-memory-models.md: -------------------------------------------------------------------------------- 1 | - Memory models — [[June 23rd, 2024]] 2 | - Looks like memory models can be quite confusing on the hardware level, so we rely on shortcuts to reason about their correctness and sequential consistency as programmers. Intel's TSO model reads similar to the __casual consistency__ model of Jepsen, which is a cool connection. __Store buffering__ is a relevant word to remember here. 3 | - ARM/POWER has a very relaxed memory model, but the key thing to remember is that writes to a single memory location are totally ordered. Reads might be reordered though, and processors don't necessarily agree about the order of writes to different locations. 4 | - The answer is to say hardware is __weakly ordered__ respect to a consistency model. This gives us a blueprint for how to write programs that behave as if sequentially consistent. 5 | - Adve and Hill (DRF-SC) was a turning point; it gave a blueprint for hardware to appear consistent, at least for the needs of programmers working with it (or compiler engineers who are generating instructions). 6 | - Programming language memory models are an entirely different interface. The language engineer offers a memory model to the programmer, then they implement it in hardware according to the provisions of that architecture. 7 | - Java (1996) had issues: coherence was too strict, non-synchronizing atomics too loose. 8 | - Java (2004) synchronized around creating threads, locking mutexes, writing to volatile variables. It also defines semantics for data races. This doesn't fully describe things though, and it's not even clear that mutexes guarantee mutual exclusion on variable updates. 9 | - “Inspired by the apparent success of Java's new memory model, many of the same people set out to define a similar memory model for C++, eventually adopted in C++11.” 10 | - C++ is very complicated, with three types of atomics (seqcst, acq/rel coherence, relaxed). 11 | - It is more complicated than Java, and apparently less helpful than Java? 12 | - Data races are undefined behavior. DRF-SC or Catch Fire. 13 | - Russ Cox keeps using the phrase "we should aim higher" and I feel like that comes from a place of frustration with existing languages. Re-executing loads is part of that. 14 | - Acquire-release means that there is a happens-before edge, i.e., coherence. These are free on x86-64, but on ARM64 the best way to implement them is as sequential consistency anyway, so the abstraction is a bit leaky. 15 | - Historical note is that today, modern hardware is very good at supporting sequentially-consistent synchronizing atomics, since they adapted to programming languages. Previously, it would require expensive barriers. (ldar/stlr in ARM) 16 | - JavaScript's memory model for SharedArrayBuffer is equally tricky. Only sequential consistency is implemented, but there are still incompatibilities with UB on ARM. 
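- A tiny C11 sketch of the acquire/release happens-before edge (message passing: the release store to `ready` synchronizes-with the acquire load, so the plain write to `data` is guaranteed visible to the reader):
  ```c
  /* build: cc -std=c11 -pthread */
  #include <stdatomic.h>
  #include <pthread.h>
  #include <stdio.h>

  int data = 0;          /* plain, non-atomic payload */
  atomic_int ready = 0;  /* flag used to publish the payload */

  void *producer(void *arg) {
      (void)arg;
      data = 42;                                               /* 1. plain write */
      atomic_store_explicit(&ready, 1, memory_order_release);  /* 2. release store */
      return NULL;
  }

  void *consumer(void *arg) {
      (void)arg;
      while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
          ;  /* spin until the release store becomes visible */
      /* The acquire load synchronizes-with the release store, so the write
         to `data` happens-before this read: always prints 42, never 0. */
      printf("%d\n", data);
      return NULL;
  }

  int main(void) {
      pthread_t p, c;
      pthread_create(&p, NULL, producer, NULL);
      pthread_create(&c, NULL, consumer, NULL);
      pthread_join(p, NULL);
      pthread_join(c, NULL);
      return 0;
  }
  ```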
17 | - Acquire / Release semantics can also be defined by LoadStore / StoreLoad / StoreStore fences. This is complicated. But Russ Cox omits this detail. The point is that C++ has synchronizing atomics. 18 | - “I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are __obviously__ no deficiencies and the other way is to make it so complicated that there are no __obvious__ deficiencies.” –Tony Hoare 19 | - Go's model has sequentially consistent atomics, and happens-before edges on runtime data structures. It also specifies disallowed compiler optimizations. That's all! 20 | -------------------------------------------------------------------------------- /2024/09-08-moonray.md: -------------------------------------------------------------------------------- 1 | - MoonRay — [[September 8th, 2024]] 2 | - AOV system is interesting. Light path expressions primarily. 3 | - "Material AOV" provides different kinds of syntax for debugging materials. 4 | - Use of different JIT compilation through LLVM to implement their ISPC framework. SPMD computation on an Arras cluster. 5 | - They take a JSON describing the material and compile it down into shaders! 6 | - Kind of cool to have an out-of-the-box optimization. 7 | - ISPC / SPMD is close to a CUDA thread actually, running multiple of them within the same core. Only works for this compiler. It "looks like" gang scheduling, but in reality it's perhaps converting all of the instructions into AVX? 8 | - Exotic execution model, CMU slides: https://www.cs.cmu.edu/afs/cs/academic/class/15418-s18/www/lectures/03_progmodels.pdf 9 | - https://ispc.github.io/ispc_for_xe.html 10 | - This is also why the BVH data structure is important? Intel Embree vs Nvidia RTX 11 | - Vectorized path tracing, i.e., uses all SIMD lanes. 12 | - Relies on several queues for radiance (dedupe samples), rays, and shading. Each queue is processed by a handler when it reaches its maximum size, for efficiency. Shading queue converts AoS -> SoA and then evaluates every shader & texture with SIMD. 13 | - Custom image loader with OpenImageIO (OIIO) 14 | - Integrator spawns rays with Monte Carlo which are incoherent, get placed in a separate queue for sorting. 15 | - Occlusion rays to lights are also generated and handled asynchronously. 16 | - "Vectorized Production Path Tracing" (HPG '17) has more context on this. 17 | - Embree for ray intersections, two uni-directional path tracing implementations based on depth-search (scalar) and wavefront breadth-first (SIMD / ISPC). 18 | - __"do the potential performance benefits gained outweigh the extra work required to harness the vector hardware?"__ 19 | - ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2FffK7c2zelb.png?alt=media&token=2c1fa5de-838f-49db-8619-79435a63a3b4) 20 | - Integrator / secondary rays spawner implements multiple importance sampling, Russian roulette, path splitting. 21 | - "Another key tenet of DOD is __where there is one, there is usually more than one__, or more plainly, we should work in batches where possible." 22 | - Intel TBB is how tasks are spawned. 23 | - Sorting by [light-set index, UDIM tile, mip level, uv coordinates] in a 32-bit integer. 24 | - Up to 128 light sets. 25 | - Then every shader node is vectorized, since they are JIT compiled into an ISPC program and executed on the processor on batches. 
Texture sampling is done with OIIO loading / caching, and they do point sampling between two adjacent MIP levels for "instant" lookup. 26 | - Thread-local primarily to avoid contention. The shade queue is not thread-local, for memory reasons. Queues kind of imply BFS because they are limited in size. 27 | - Vectorized path tracing is about decomposing the system into smaller parts based on their data / memory access patterns. Like a little factory with different machines sending messages. 28 | - Ray queue is only concerned about global geometry, nothing else. 29 | - Shading queue cares about materials, light, sampling … but only at one single point. 30 | - Radiance queue is just environmental illumination. 31 | - Occlusion queue only works on scene lights and importance sampling. 32 | -------------------------------------------------------------------------------- /2024/10-20-tapl-pt5-6.md: -------------------------------------------------------------------------------- 1 | - TAPL, parts 5+6 — [[October 20th, 2024]] 2 | - Say you have a type variable and you define substitutions with type variables. In some ways, this is like applying __abstractions__ (in the variable sense) to the metatheory of types. Then there are two modes of thinking about this. 3 | - Your types must be valid for all T->U: this is __parametric polymorphism__. 4 | - Your type must be valid for at least one T->U, figure out which one: this is __type inference__, otherwise known as __type reconstruction__ in this book. 5 | - __Constraint typing__ and __unification__ are introduced in this chapter, as per Hindley-Milner. 6 | - "let-polymorphism" is the practice of allowing each call to a function variable actually use different types, it's a form of polymorphism since the type inference constraints are solved every time you call the function. An argument can't be polymorphic within its body, though. 7 | - Okay, we talked about the let-polymorphism found in Ocaml, via Hindley-Milner unification. Now let's finally get to System F. The jewel of parametric polymorphism. 8 | - We introduce type abstraction and type application. 9 | - Parametric polymorphism has to do with universal quantification, rather than existential. Each type abstraction turns into a quantifier clause. 10 | - It also gives us recursive types like the omega combinator via self-application. 11 | - No type inference yet, so you need to explicitly write down the type parameters every time you want to apply a polymorphic function. 12 | - Well-typed System F terms are normalizing. (Why?) 13 | - Note that the omega combinator cannot be typed in System F, since it's normalizing. Why? Is there a more direct proof of this, or at least an intuition? 14 | - However, type reconstruction in System F (even of applications) is undecidable. 15 | - Impredicative types include themselves, like $$\lambda X. X \to X$$. 16 | - You can't type the fixed-point combinator in System F. I tried, it just doesn't work. Neither can you type the omega combinator. This means that general recursion is not possible. But you can add a fixed-point combinator __by fiat__, which creates a typed system with recursion. 17 | - Existential types: operationally is a pair of a type S and a term of type $$[X \mapsto S] T$$. 18 | - Introduction and elimination forms: packing a type with a term (e.g., a struct), and then unpacking it by using it in a form that cancels out the type. 19 | - Analogous to importing a module and using its contents, like a C FFI call. 
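- A small worked example of pack/unpack, in the style of the book's counter ADT — the representation type $$\mathtt{Nat}$$ is hidden behind the existentially quantified $$X$$:
  - $$\mathit{counterADT} = \{*\mathtt{Nat},\ \{\mathtt{new}=1,\ \mathtt{get}=\lambda i{:}\mathtt{Nat}.\,i,\ \mathtt{inc}=\lambda i{:}\mathtt{Nat}.\,\mathtt{succ}\ i\}\}\ \mathsf{as}\ \{\exists X,\ \{\mathtt{new}:X,\ \mathtt{get}:X\to\mathtt{Nat},\ \mathtt{inc}:X\to X\}\}$$
  - $$\mathsf{let}\ \{X, c\} = \mathit{counterADT}\ \mathsf{in}\ c.\mathtt{get}\ (c.\mathtt{inc}\ c.\mathtt{new}) \longrightarrow^{*} 2$$ — the client can only manipulate counters through the abstract $$X$$, never as bare naturals.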
20 | - "… mechanisms for modularity and abstraction are almost completely orthogonal to the statefulness or statelessness of the abstractions being defined." 21 | - Existential types can model a form of "abstract" data types, in the OOP privacy sense. 22 | - Another way of thinking about existential types: you can pack interfaces into a heterogeneous list, where the interfaces themselves can have type variables in their signatures — as long as they're self-contained. That's pretty mind-blowing, but it makes sense. 23 | - __Kinding__ is a copy of the simply-typed lambda calculus, one level up in type-land. Type constructors have kind * => *. Terms have types, which have kinds. You can keep going more levels up in __pure type theory__, but for programmers, three levels are plenty! (yay, lucky us, no $$\infty$$-categories in this book…) 24 | - System $$F_\omega$$ is just like System F, but with kinds on each type abstraction. This gives us higher-kinded types like (* => *) => *, which are somewhat esoteric. 25 | - An example of a higher-kinded type is Monad, since each instance operates on type constructors of kind (* => *). 26 | -------------------------------------------------------------------------------- /2024/09-01-garbage-collection.md: -------------------------------------------------------------------------------- 1 | - Garbage collection — [[September 1st, 2024]] 2 | - Garbage-First (G1): Concurrent, thread-local allocation buffers, with a soft real-time requirement and completeness. Uses better metrics to determine which regions to collect first based on ratio of garbage to allocated memory. 3 | - Heap layout is equal-sized __heap regions__, each contiguous range of virtual memory. 4 | - Allocating in a heap region involves incrementing a boundary. Mutator threads create TLABs within an allocation region for small objects. 5 | - Empty regions are arranged in a linked list, finding one is O(1). 6 | - Large objects ("humongous") take up >3/4 of one region and are allocated separately. 7 | - Regions have a __remembered set__ of all locations outside of that region that might contain pointers into the region. Like a reverse pointer list. Mutator threads inform the heap via a __card table__ (512:1 bitset). 8 | - What is the global set of filled RS buffers? 9 | - __Evacuation__ phase: the GC moves live data from a less filled region into a different region and updates all inter-region pointers via its remembered set. 10 | - Recall there is no free() operation, so this is the unit of actual GC work. 11 | - Work-stealing for threads to copy data into their own GCLAB (thread-local) and then write a forwarding pointer to the old location. 12 | - Generational mode: Regions marked as "young" when given to a mutator thread, scanned first during next cycle, which allows us to avoid some work in tracking changes. 13 | - A form of snapshot-at-the-beginning (SATB) __concurrent marking__ that happens while the application is running, unlike evacuation which requires you to stop the world. This helps ensure completeness, since all regions will eventually be marked over time. Bitmaps are gradually constructed in "remark" phases after initial marking. 14 | - Shenandoah (2016) 15 | - Similar to G1GC, but it is newer and optimized for modern workloads. 16 | - Initially not generational. 17 | - "Shenandoah’s key advance over G1 is to do more of its garbage collection cycle work concurrently with the application threads. 
G1 can evacuate its heap regions, that is, move objects, only when the application is paused, while Shenandoah can relocate objects concurrently with the application. To achieve the concurrent relocation, it uses what’s known as a __Brooks pointer__. This pointer is an additional field that each object in the Shenandoah heap has and which points back to the object itself." 18 | - "Shenandoah does this because when it moves an object, it also needs to fix up all the objects in the heap that have references to that object. When Shenandoah moves an object to a new location, it leaves the old Brooks pointer in place, forwarding references to the new location of the object. When an object is referenced, the application follows the forwarding pointer to the new location. Eventually the old object with the forwarding pointer needs to be cleaned up, but by decoupling the cleanup operation from the step of moving the object itself, Shenandoah can more easily accomplish the concurrent relocation of objects." 19 | - Quote from G1GC is interesting since it mentions this technique, too. But they decided not to implement it. 20 | - "To enable this fine-grained interruptibility, an extra header word is dedicated as a forwarding pointer, and the mutator must execute a read barrier that always follows these pointers (in the style of Brooks [9]; Henriksson [20] describes another hard real-time collector using this technique). Garbage-First avoids these space and time costs…" 21 | - Shenandoah has a separate step that updates refs after evacuation, so concurrent evacuation can happen while the user's code is running. 22 | -------------------------------------------------------------------------------- /2024/10-06-tapl-pt1-2.md: -------------------------------------------------------------------------------- 1 | - TAPL, parts 1+2 — [[October 6th, 2024]] 2 | - 11 people here to read a textbook on type theory together. 3 | - Intro and math background 4 | - Type systems help prevent certain classes of unwanted program behavior. They're how we reason mathematically about computation. Typically automatic, with some annotations. 5 | - If it's strong enough, encoding a specification, we get a proof checker. 6 | - Type systems are also part of "module languages" — the API of a library. 7 | - "a safe language is one that protects its own abstractions" 8 | - Question: What does "constructive" mean? 9 | - Ok, like 5-6 more people came. 10 | - Part I: Untyped System 11 | - Metatheory is the formal study of theory. "Metatheory of subtyping" = what subtyping looks like across type systems. 12 | - Grammar can summarize a language syntax. Inference rules can also be used to write it. 13 | - Each term can be "deduced" by the inference rules corresponding to it. 14 | - This means you can induct on terms since they have a total ordering in size. It's like recursively processing an AST. 15 | - Operational and denotational semantics: the steps, versus the "domain" analogue. 16 | - Instance of an evaluation rule, taking steps toward getting values to normal form. Operational semantics allow us to reason about things. We won't talk about denotational semantics in this book at all. 17 | - Small-step (->) and big-step (double down arrow) operational semantics. 18 | - You can prove equivalence with structural induction. 19 | - Untyped lambda calculus 20 | - One of several core calculi, also pi-calculus (message passing concurrency) and object calculus (OOP). 
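- For reference (beta-reduction and evaluation strategy come up a couple of bullets below), the three call-by-value rules, plus a tiny reduction using the Church boolean $$\mathsf{tru} = \lambda t.\,\lambda f.\,t$$:
  - $$\dfrac{t_1 \longrightarrow t_1'}{t_1\ t_2 \longrightarrow t_1'\ t_2}\ (\text{E-App1}) \qquad \dfrac{t_2 \longrightarrow t_2'}{v_1\ t_2 \longrightarrow v_1\ t_2'}\ (\text{E-App2}) \qquad (\lambda x.\,t_{12})\ v_2 \longrightarrow [x \mapsto v_2]\,t_{12}\ (\text{E-AppAbs})$$
  - $$\mathsf{tru}\ v\ w = (\lambda t.\,\lambda f.\,t)\ v\ w \longrightarrow (\lambda f.\,v)\ w \longrightarrow v$$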
21 | - Concrete syntax (characters) to abstract syntax tree simplifies processing. Lexer, parser. 22 | - Variables vs metavariables are based on the context. 23 | - Evaluation is by finding a __reducible expression__ "redex" and doing beta-reduction. 24 | - Evaluation strategy matters for lambda calculus, but it isn't relevant for type systems. So we'll just use call-by-value in this book, since it's familiar to people. 25 | - Part 2: Simple Types 26 | - Types based on static information. Interesting that they say "the evaluation gets stuck" as a precise technical property, haha. 27 | - The typing relationship is __conservative__, some valid expressions don't typecheck. 28 | - "Safety = Progress + Preservation" 29 | - Is not stuck, and the result of the next derivation step will have the same type. 30 | - Typing derivations form a tree, just like evaluation, except it's more conservative. 31 | - The set of __simple types__ has base types and functions. 32 | - Typing context $$\Gamma$$ is used to keep track of the types of variables. 33 | - ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2F_l9Ajm3xv9.png?alt=media&token=1f63aa1a-9ebd-49d7-9e90-df4e09026c0f) 34 | - T-Abs is an __introduction rule__, and T-App is an __elimination rule__. Together, they are a redex. 35 | - This terminology is related to the __Curry-Howard correspondence__. Propositions as types. This comes from propositional logic, introducing and removing variables. 36 | - "In effect, programs are converted back to an untyped form before they are evaluated. This style of semantics can be formalized using an __erasure__ function mapping simply typed terms into the corresponding untyped terms." 37 | - Two styles 38 | - Curry-style defines a semantics, then removes some undesired behaviors. 39 | - Church-style only gives semantics to well-typed terms. We are typing __derivations__, not terms, strictly speaking. 40 | -------------------------------------------------------------------------------- /2024/05-19-ceph.md: -------------------------------------------------------------------------------- 1 | - Ceph — [[May 19th, 2024]] 2 | - “Clients typically interact with a metadata server (MDS) to perform metadata operations (open, rename), while communicating directly with OSDs to perform file I/O (reads and writes), significantly improving overall scalability.” 3 | - “Our target workload may include such extreme cases as tens or hundreds of thousands of hosts concurrently reading from or writing to the same file or creating files in the same directory. Such scenarios, common in scientific applications running on supercomputing clusters, are increasingly indicative of tomorrow’s general purpose workloads.” 4 | - Why is separating data from metadata more scalable? It's not really justified in the paper. Ori questions, what if we just stored the metadata in the same RADOS object store? 5 | - __Dynamic Subtree Partitioning__ is an interesting idea, I had thought about doing something similar for Blobnet at [[Modal]] for instance. I am surprised that all of these ideas are almost 20 years old and used in one of the most popular distributed file systems in the world. 6 | - Sec 3. Some of the capabilities seem fundamentally at odds with each other: cache revocations when multiple clients open a file, for instance. It sounds like Ceph is trying to do __too much__ while also being POSIX-compliant, and that adds complexity. Being a "tool for everyone" is hard. 
7 | - “Most notably, these include an __O_LAZY__ flag for open that allows applications to explicitly relax the usual coherency requirements for a shared-write file.” 8 | - Slightly relaxed consistency through silently caching the readdirplus operation. 9 | - However, stopping multiple writers whenever anyone `stat()` calls the file is kind of a dealbreaker for performance. I guess that's the price you pay, unfortunately. 10 | - lazyio_propagate and lazyio_synchronize 11 | - I mean, yeah, POSIX was designed for disks that are attached to the machine. It's different when you're running a distributed storage service! The requirements vary greatly. 12 | - Sec 4. “Although the MDS cluster aims to satisfy most requests from its in-memory cache, metadata updates must be committed to disk for safety. A set of large, bounded, lazily flushed journals allows each MDS to quickly stream its updated metadata to the OSD cluster in an efficient and distributed manner.” 13 | - If an MDS fails, other nodes can scan from the journal to recover, which is in RADOS. 14 | - Files with multiple hard-links are a special case for the "anchor table" —lol. 15 | - Dynamic subtree partitioning offers hierarchical locality, while still automatically splitting metadata between different MDS servers. 16 | - Selectively replicate popular directories and load balance between them. 17 | - Details are omitted about the actual metadata partitioning. It's a completely different paper (Ori: "secret fourth paper in the trilogy") from the author's PhD. Very complicated. 18 | - Sec 5. CRUSH and RADOS are pretty succinctly described. Failure recovery and metadata updates are the main complications here. 19 | - Storing objects on individual nodes was done with EBOFS, then Btrfs and XFS, and now they have eventually moved to Bluestore (direct disk drivers) and RocksDB. 20 | - “Finally, Ceph’s metadata management architecture addresses one of the most vexing problems in highly scalable storage—how to efficiently provide a single uniform directory hierarchy obeying POSIX semantics with performance that scales with the number of metadata servers.” 21 | - Ori: Mentions that if he did it again today, would probably use something like S3? Just because it's cheaper. And FoundationDB, perhaps, since that is well-tested. 22 | - Cool anecdotes about many companies still using fast on-prem storage, network-attached storage (NAS), NFS, etc., so all of this is still relevant today! 23 | -------------------------------------------------------------------------------- /2024/06-28-kernel-instrumentation.md: -------------------------------------------------------------------------------- 1 | - Kernel instrumentation — [[July 28th, 2024]] 2 | - DTrace is pretty awesome, the idea is that we should have zero-overhead tracing of the kernel even with thousands of trace points. Maximize both flexibility and performance. 3 | - Function entrypoint tracing: register the tracepoint by having an unconditional jump to a nonexistent dtrace symbol, then modify the kernel linker to replace this jump with a no-op and register the tracepoint inside a special kernel table. When the tracepoint is enabled, edit kernel code to replace the nop with a jump to a special trampoline. 4 | - Syscall tracing just modifies the kernel system call handler table. Same for interrupts, probably, just modify the interrupt descriptor table for your processor architecture. 5 | - "D" programs are similar to awk, like pattern-matching blocks with basic conditionals. 
6 | - Clause-local and thread-local variables are prefixed with this-> and self-> 7 | - Seems like a very simple and concise expression language, quite different from eBPF in this way — look at how easy it is to call printf() 8 | - The quantize() function even produces an ASCII art histogram in the terminal 9 | - PID provider depends on OS, but it doesn't require process restarts, instead it produces a trap (signal -> transfers control to kernel mode) on the process 10 | - "Speculative" tracing allows you to record data but discard it later if it is not interesting. Really, it just means you can delete from the associative data structure, not just add to it within a tracepoint handler. 11 | - Case study on using DTrace to debug a complicated bug in X Server code on Solaris. 12 | - What other tracers do people use? How does DTrace compare? 13 | - Compare to Frida tracer on Android, which is useful because Java Virtual Machine has predefined hooks for doing things. It also works for Node.js, but it's limited. 14 | - Expression language is JavaScript. 15 | - "My favorite tracer is printf()" 16 | - How do you instrument JavaScript in the browser? Chrome has a performance tab with every process, and the source of every line of code. 17 | - Intel hardware-specific PT, which you can use with perf. Hardware-specific. 18 | - How you might use it: if something is slow, break and dump the contents of the ring buffer to see what happened in the past few milliseconds. 19 | - Valgrind / Callgrind — record all user-space function calls. 20 | - Fun bullet point in the "What is eBPF?" article: __constant blinding__ to prevent JIT spraying attacks. What an interesting mitigation. Also, a bad pun. I'm constantly blinded. 21 | - One person used it for research, documentation is pretty bad, kind of messy. Tried to reimplement TCP in an XDP program, but it didn't work because they had too much state inside the eBPF maps, didn't handle concurrency super well. 22 | - Had a 30-minute break where we just chatted with group members for a while, and everyone seemed happy to talk. Now we're starting to read the PREVAIL blog post. 23 | - Programming language background (PLDI/POPL slides): different zones for abstract interpretation, like integers, parity, octagons, and polyhedra. Stronger domains can prove assertions for a broader class of programs, but are slower. 24 | - The "join" operation combines two models at different branches. 25 | - Abstract domains must be __relational__ to track relationships between variables. They also need to track memory (register spilling) and avoid path enumeration. 26 | - Regions are tagged as private (stack) or shared (packet, kernel). 27 | - Variables are tagged as scalars, or pointers into various regions. 28 | - Results of PREVAIL are not as fast as the Linux verifier, but they can still verify a fair subset of programs using just the zone domain. It's a nice way to compare different approaches to the verification problem. 29 | - Solana's verifier is also eBPF? How does their verifier work? 30 | -------------------------------------------------------------------------------- /2024/06-09-disaggregated-databases.md: -------------------------------------------------------------------------------- 1 | - Disaggregated Databases — [[June 9th, 2024]] 2 | - Storage disaggregation is popular in practice, and memory disaggregation is a theoretical technology that hasn't seen widespread adoption yet. 3 | - Depends on fundamental numbers in the latency of memory hierarchies. 
4 | - Cache (SRAM) — $100/GB, 1 ns 5 | - DRAM — $5/GB, 100 ns 6 | - SSD — $0.12/GB, 40 us 7 | - HDD — $0.05/GB, 10 ms 8 | - Tape drives — $0.005/GB, 1800000 ms 9 | - SSD to DRAM has a big jump. There have been some research directions in making main memory non-volatile like attaching batteries, or Optane, but those were discontinued. 10 | - "Enabling memory disaggregation is the advancement of fast networking technologies. In particular, recent generations of RDMA achieve sub-microsecond latency and hundreds of Gbps throughput. While this is still lower than the performance of buses, the boundary between local and remote machines is becoming blurred." 11 | - CXL (Compute Express Link) tries to bridge the gap between local memory and RDMA. It's a replacement for PCIe. CXL.memory is a module for expanding main memory. 12 | - Buses are fast because the components are on the same board. 13 | - Is there a single point of failure in this pool? Unclear. But memory disaggregation seems like still a very experimental technology. CXL makes storage buses faster, but it's unclear if they can get to the speed of actual main memory. 14 | - Differences between OLTP and OLAP databases, background mentioned differently. Paper seems more excited about expanding memory pools for OLAP workloads. 15 | - Aurora paper: 2/3 quorums are adequate. "However, the failure of AZ C, due to a fire, roof failure, flood, etc, will break quorum for any of the replicas that concurrently have failures in AZ A or AZ B." 16 | - Engineered for reliability with six-way replication: 2 copies of data in 3 AZs. 17 | - Segments are 10 GB, and volumes can grow up to 64 TB. "A 10GB segment can be repaired in 10 seconds on a 10Gbps network link. We would need to see two such failures in the same 10 second window plus a failure of an AZ not containing either of these two independent failures to lose quorum." 18 | - Storage system materializes databases pages by applying a log of write transactions. These are replicated to all 6 databases. The paper claims traditional replication in a system like MySQL would be "untenable" due to write amplification, but I didn't catch why. Maybe it's just an optimization, they do show numbers from an experiment that showi t's better. 19 | - In any case, Aurora is layered in that the storage system itself reapplies transactions. So the storage is separated from the compute / query engine! Crash recovery is simplified. There would be lower overhead compared to distributed SQL, as long as your writes fit on a single node. 20 | - Snowflake paper (NSDI '20): two key ideas in systems design 21 | - "custom-designed storage system for management and exchange of ephemeral/intermediate data that is exchanged between compute nodes during query execution (e.g., tables exchanged during joins)" 22 | - "Snowflake uses its ephemeral storage system not only for intermediate data, but also as a write-through “cache” for persistent data." 23 | - So basically Snowflake reduces the network load caused by compute-storage disaggregation using their custom ephemeral storage engine & cache layer. 24 | - Snowflake's customers issue a _lot_ of read-write queries, over half of all queries actually. 25 | - Skewed distributions, very large variance in intermediate data sizes, still high cache hit rates though thanks to data warehouse properties. 26 | - Consistent hashing for ephemeral cache between virtual warehouse nodes. Each of the VWs talk to each other, so they kind of share a global set of ephemeral memory. 
This is built on top of the larger, persistent storage but is much lower-latency. Generally the ephemeral storage has a hierarchy of {in-memory, SSD, S3} although spilling to object storage is not ideal. 27 | -------------------------------------------------------------------------------- /2024/08-04-intel-sgx.md: -------------------------------------------------------------------------------- 1 | - Intel SGX — [[August 4th, 2024]] 2 | - Provide integrity and confidentiality guarantees to sensitive computation on a computer, when all of the privileged software is compromised. Including the kernel and hypervisor! 3 | - Q: Does it secure both the code and the data, or just the data? 4 | - Remote computation service verifies an __attestation key__ against an __endorsement certificate__. 5 | - The attestation key is created by Intel and baked onto the chip. So basically, the hardware cryptographically signs the code running in its enclave. 6 | - SGX __enclave__ = "secure container" only containing the private data + code. 7 | - This paper covers SGX 1, but SGX 2 is apparently similar. 8 | - SGX PRM = processor-reserved memory. The CPU doesn't allow anyone outside of enclaves to access that memory. 9 | - EPC = Enclave Page Cache, part of PRM 10 | - OS (untrusted) assigns EPC pages to different enclaves. CPU ensures that each page belongs to exactly one enclave. 11 | - Initial enclave state is copied from host memory, so it's not private. Initialization creates a __measurement hash__, and users can verify that they are communicating with an enclave that has a specific measurement hash. 12 | - How are interrupts and traps handled? AEX = asynchronous enclave exit. 13 | - Criticism due to security concerns, holes in the program, and Intel's hidden licensing behavior due to "launch control" that forces application developers to get permission from Intel before using the feature. 14 | - Does the PRM live on a special processor region, or on main memory? We don't know yet. Seems like main memory after reading. 15 | - Section 2: Thorough crash course in computer processors, operating systems, and virtualization. 16 | - A motherboard is sockets connected by buses. CPUs connected via QPI and with NUMA DDR DRAM, then on the side there's a PCH ("southbridge") with management engine, flash, UEFI, USB, SATA, etc., as well as NIC hardware. 17 | - CPU package has major components as well, broken down into parts. Fetch/decode into microcode and with various execution units. 18 | - Caches implemented in hardware. Coherence protocol and ring interconnect diagrams. 19 | - Q: How does certificate / attestation keys work? What is the chain of trust? 20 | - Intel runs a root CA and intermediate CAs. Each processor's attestation key is unique, and the manufacturer shares public endorsement certificates. 21 | - Q: Can you jmp into the EPC, so you can run code as encrypted data? 22 | - Q: How does the processor provide access to secure memory or secure randomness? Is the actual PRM storage encrypted? 23 | - Q: Why is defending against cache timing attacks so difficult? How does this change in a post-Spectre world? 24 | - Q: If any one person gets the attestation key for one computer, does it compromise all other computers via MITM? We feel like this is too obvious to not be addressed. 25 | - Yes, this is certainly possible with ion beam microscopy on 14 nm chips (chip attack). 26 | - Defend against it by reducing the value of a single compromise. Tamper-resistant hardware. 
It sounds like you can verify the manufacturer's certificate metadata. __This__ is why you need to trust Intel — to only sign chips that are bug-free and hard to physically attack. 27 | - Q: How do you communicate securely with a running enclave? 28 | - Seems like the process running the enclave can communicate with it, and from there you can use a TLS channel or something. It's unclear exactly how this communication works, but maybe you send data through channels in the EENTER / ERESUME instructions. 29 | - Related work 30 | - IBM 4765 secure coprocessor lives inside a Faraday cage enclosure and has software attestation. 31 | - ARM TrustZone involves separate address translation units. IP cores. 32 | - TPM = Auxiliary tamper-resistant chip with no CPU modifications. One enclave measures and attests to the measurement of an OS kernel. Verified Boot vs Trusted Boot features of Intel's Boot Guard in UEFI firmware. Prevents people from just reflashing the firmware. 33 | - Unlike ordinary crypto, which anyone can simulate and run "strong" algorithms, hardware security requires some additional assumptions like having devices that store certain keys, which are __hard to read__ from outside the chip itself (thanks to advanced EE / lithography). 34 | -------------------------------------------------------------------------------- /2024/01-14-zstandard.md: -------------------------------------------------------------------------------- 1 | - Zstandard — [[January 14th, 2024]] 2 | - From Yann Collet at Meta, a compression algorithm! 3 | - Goals 4 | - Big things in small boxes 5 | - Done fast 6 | - How 7 | - Lempel-Ziv (1977) 8 | - Remove common repeated patterns 9 | - More of an algorithm __family__ than a specific algorithm 10 | - Keep a sliding window of some size in bytes, then repeatedly move it forward while extracting the longest common substring as a pointer 11 | - Naively O(NM^2) where M is window size, but can be done in O(N) linear time with string algorithms (suffix arrays) 12 | - Practical efficiency depends on pointer encoding / specific wire format 13 | - How would the window sizes work? 14 | - Window size: zlib uses 32KB by default 15 | - Limited by RAM 16 | - Q: Index from the current position, or from the start of the search buffer? 17 | - Q: Can you initialize it with a dictionary of common words? 18 | - LZW (Lempel-Ziv-Welch): alternative variant using a dictionary 19 | - Q: __How do you evaluate__ a compression algorithm? 20 | - Hutter Prize 21 | - Various text corpuses, all of Dropbox, all of GitHub?? idk 22 | - Size and speed (both compression speed & decompression speed) 23 | - Error correction (Gzip = DEFLATE + CRC-32) 24 | - String transformations applied before compression 25 | - Burrows-Wheeler (linear-time string transformation) 26 | - Moves similar parts of the text near each other 27 | - Used in bzip2 28 | - https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform 29 | - Other permutations of the data 30 | - Run-length encoding (RLE) 31 | - Huffman coding (1951) 32 | - Frequencies are different 33 | - Prefix-free encoding 34 | - **Assign less bits to the most common sequences of characters in the string!** 35 | - Assign __more bits__ to the things that are less frequent 36 | - Entropy 37 | - Information theory 38 | - Huffman coding is suboptimal 39 | - Arithmetic coding is more efficient 40 | - They lose up to a single token in the alphabet due to rounding 41 | - Where is our frequency distribution from? 
42 | - Best is the document, but sometimes you can have a global dictionary 43 | - Finite State Entropy (Collet, 2013) 44 | - Entropy is the minimum number of __bits of information__ to express data. 45 | - Claude Shannon: You can't do better than the [entropy](https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem). 46 | - Suppose we have a random sequence of iid characters drawn a discrete distribution. We want to encode this stream of random iid data. The entropy of this distribution is the lower bound on expected number of bits per token in the stream. 47 | - 48 | - Fast = lots of C / SIMD / architecture magic, also multithreaded(?) 49 | - Compression formats 50 | - Gzip 51 | - Zip 52 | - MP4/MOV/??? "media container" 53 | - Compression algorithms 54 | - Lossless (~4x?) 55 | - Data, text — you save "The Bible" a word can't be missing 56 | - Huffman 57 | - Brotli 58 | - Zlib? 59 | - Snappy 60 | - DEFLATE 61 | - Lossy 62 | - JPEG (20x smaller, 50x) 63 | - H.264/H.265 64 | - Where do we use compression? 65 | - Streaming images and movies 66 | - Reduce payload on websockets (+ HTTP, gRPC) 67 | - Almost all HTTP requests / responses are compressed 68 | - "permessage-deflate" extension 69 | - `Content-Encoding: zstd` 70 | - Backups 71 | - Long-term archival storage 72 | - Tape drives 73 | - Databases 74 | - Parquet / Snappy 75 | - DuckDB also uses Snappy 76 | - Can you operate directly on compressed data? 77 | - Yes — For example, wavelet trees provide a search index but also compression (__succinct data structures__) 78 | - Decompress-on-demand (sectors, byte offsets) 79 | - Trading compute for bandwidth / storage, balance systems 80 | - Different parameters result in people designing different systems / architectures. 81 | -------------------------------------------------------------------------------- /2024/11-10-gpu-sharing.md: -------------------------------------------------------------------------------- 1 | - GPU sharing — [[November 10th, 2024]] 2 | - Hosted by Rene Ravanan. There are 17 people today. 3 | - Start by refreshing on hardware fundamentals: what is a GPU, and what is a CPU? The graphics units have a lot of small cores. Thread blocks and grids. Each block is run by a streaming multiprocessor and can read from shared memory (L1 cache). 4 | - CUDA is the industry standard. Why not GLSL, WGSL, Vulkan, …? Unclear, seems like Nvidia has thoroughly optimized their stack. 5 | - Even if you oversubscribe a GPU, an SM only runs one application at a time. So multiple applications on a GPU (e.g. Chrome, plus a game) would submit kernels for evaluation every so often, but they can just use however many SMs they need. 6 | - __Timeslicing__ is enabled on default on Nvidia GPUs. Allows for context switching every couple milliseconds — this is slow since register files are large (lots of cores). 7 | - CUDA streams allow you to submit jobs to the kernel asynchronously. Compare to the WebGPU API for instance, in `dispatchWorkgroups()`. 8 | - MPS (multi-process sharing) allows different apps to overlap kernel / memcpy ops on a GPU. Helps increase utilization. By default, kernels from only one application can run on a GPU concurrently. Other applications wait. MPS improves concurrency. This allows multiple applications to __submit kernels__ at the same time, not just time slicing. 9 | - This is a setting on your GPU, by default applications get exclusive access to the entire device. So it can use all of its memory, SMs, etc. 
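- To make the streams point above concrete, a tiny CuPy sketch (library choice is mine, not from the session): two matmuls submitted on separate streams so their kernels can overlap inside one process; MPS is what lets separate processes get this kind of overlap. Whether they actually overlap depends on spare SMs, and it assumes CuPy plus a CUDA device.
  ```python
  import cupy as cp  # assumption: CuPy installed and a CUDA device present

  a = cp.ones((2048, 2048), dtype=cp.float32)
  b = cp.ones((2048, 2048), dtype=cp.float32)

  s1 = cp.cuda.Stream(non_blocking=True)
  s2 = cp.cuda.Stream(non_blocking=True)
  with s1:
      c1 = a @ a   # kernel launch is asynchronous on stream 1
  with s2:
      c2 = b @ b   # independent work queued on stream 2, may overlap on free SMs
  s1.synchronize()
  s2.synchronize()
  ```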
10 | - MIG = isolated mini-GPUs, resource contention doesn't affect applications. But this is fixed partitioning, you don't get dynamic sharing. 11 | - Isn't it weird that MPS needs to exist? If you had two different streams in an application, they would already do this. Why would different applications synchronize as the default behavior, by sending things onto the global "default stream"? 12 | - The graphics driver groups the requests from applications, and typically they can figure out how to then allocate device time then. "One application" — why don't they share SMs? 13 | - GPUs have memory protection. 14 | - **Modern alternatives to MPS:** TGS = temporal sharing, REEF = spatial sharing. Overall theme is that they divide applications by priority. If you have a production task that doesn't use the GPU fully, maybe you can try filling in the gaps with other workloads. 15 | - TGS (NSDI '23) 16 | - "It ensures that __production jobs__ are not greatly affected by __opportunistic jobs__ on shared GPUs." 17 | - Integrated with Docker and Kubernetes, meant for container usage. 18 | - Opportunistic jobs can run on spare resources. Lower priority than production jobs. They can only use leftover resources at any given time. 19 | - Transparency means that they intercept system calls from containers to GPUs, then isolates the production job from contention compared to the opportunistic one. 20 | - Memory oversubscription: "MPS fails when the total GPU memory required by containers exceeds the GPU memory size." TGS handles memory swapping underneath and prioritizes the memory of production jobs. 21 | - As good as AntMan, but transparent. 22 | - Adaptive rate control: throttle dequeuing opportunistic kernels based on the __arrival rate__ of production kernels(!) Like TCP congestion control. This might be ok for a homogeneous deep learning training job, but I'm not sure about it being a general solution? 23 | - Benchmarked on training two deep learning models at the same time. You get to squeeze in extra compute for free, while the main model isn't at 100% capacity! 24 | - REEF (OSDI '22) 25 | - Since kernels are idempotent, just preempt them and retry later. 26 | - Using "offline profiling in advance" it can pad a workload with pending best-effort kernels to achieve optimal utilization. 27 | - Not transparent. Hooks into a fork of TVM, the ML compiler framework. Code transformer adds a __preemption flag__ to lazily evict the kernel (kind of like a kill switch), and a set of __proxy kernels__ that are compiled for different register counts. 28 | - Register file size on each CU is 256 KB. State-saving is expensive! 29 | - Maybe you should have an OS scheduler instead? "SM stealing." 30 | - Can concurrently execute the real-time task with opportunistic tasks, on the same SM! So that's why they need the register allocation system. It's not like TGS, where they run in the gaps between kernel runs. 31 | - Assumes that you don't oversubscribe GPU memory! 32 | -------------------------------------------------------------------------------- /2024/12-08-simdjson.md: -------------------------------------------------------------------------------- 1 | - simdjson — [[December 8th, 2024]] 2 | - Specific contributions of the paper center around using SIMD to do tricky tasks. Validating UTF-8, detecting quoted strings with escape sequences; ranges of code point values. vpcmpeqb 3 | - Mison architecture, compared against. simdjson also does validation. 
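- Before the stage-by-stage notes below, a scalar Python sketch of the quote/escape detection mentioned above, just to pin down what the masks mean; simdjson computes the same mask 64 bytes at a time with backslash-run parity and a carryless-multiply prefix XOR over bitmasks.
  ```python
  def in_string_mask(chunk: bytes):
      """True where a byte sits inside a quoted string (opening quote through
      the char before the closing quote). Scalar stand-in for simdjson stage 1."""
      mask, in_string, escaped = [], False, False
      for b in chunk:
          if escaped:                # previous byte was an unescaped backslash
              escaped = False
          elif b == ord("\\"):
              escaped = True
          elif b == ord('"'):
              in_string = not in_string
          mask.append(in_string)
      return mask

  doc = b'{"k": "a \\" quote", "n": 1}'
  print("".join("^" if bit else "." for bit in in_string_mask(doc)))
  ```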
4 | - Instead of a common top-down recursive descent parser, going one character at a time, they use a multi-stage SIMD architecture. 5 | - Stage 1: Validate UTF-8 and find starting location of every primitive. Like tokenization. Also figure out where strings are escaped. 6 | - Stage 2: Convert into a linear __tape__ of primitive values and arrays/objects. Parse bools, string escape sequences -> normalized UTF-8, and floating-point literals. 7 | - Returns a validation result (error code) and tape. 8 | - The parts of stage 1 9 | - Identifying quoted strings depends on escape sequences. An even number of backslashes cancels itself out. 10 | - simdjson works with bitvectors of locations of characters. 11 | - To find positions __in between quotes__ (i.e., strings), we take the unescaped quotes and do a prefix sum over all of the bits. This is a pclmulqdq (carryless multiplication) with -1. 12 | - UTF-8 validation can also be SIMD. Look at the top nibbles. 13 | - "Error variable" is a 32-bit uint. If there are any errors, we bitwise or it with the variable. 14 | - Question: How to handle dependencies between chunks efficiently? The last character of the previous chunk affects the next chunk. Have to read the code to find out. 15 | - Figure shows example of branchless code to detect odd-length sequences of backslashes. It uses 15 instructions in total. 16 | - Outside of UTF-8 strings, all JSON characters must be ASCII. But there might be many strings. So the authors just validate the whole chunk of text as UTF-8 all at once, to have a more consistent performance profile. Uses vpshufb to detect continuations. 17 | - The parts of stage 2 (tape) 18 | - To handle nested arrays and objects, they use a stack / "goto" state machine. 19 | - When there are "at least 8 digits" in the fractional part (i.e., a lot), they switch to a vectorized number parser. Otherwise they just do character-by-character. 20 | - `vpshufb`: vectorized table lookup, and it lets you do two table lookups on the low4 bits of every character, used for testing character classes! Ori thinks this is a big part of the magic. You need to set up a 16-byte table first. Composes well with bitshifts. 21 | - Do they do an ablation study? How much is from SIMD, versus vector comparison, etc? 22 | - How does this generalize to other architectures? Are the instructions the same, or analogous for other architectures. Ordering the instructions matters too, pipelining / reducing data dependencies is inherent here. 23 | - Question: Does this library support streaming? (Ori) Everything they're doing seems like it wouldn't work unless the JSON was in RAM. 24 | - Right now stage 1 is done before stage 2. They could interleave the processes in the future. Also, stage 1 could use multi-threading, since it only requires local knowledge and doesn't build a tape. In both cases it requires in-memory data though. 25 | - __simdjson is slower than memory bandwidth__ of the hardware, so it seems like we still have speed even with large files bigger than the L2 cache size. 26 | - Note: They only test up to 80 MB. So "gigabytes per second" may be a bit unintuitive. 27 | - Code reading 28 | - "inl" header files are inline 29 | - Main high-level passes are in the src folder, but most of the code is implemented in the include folder in headers. "Structural indexes" is an array of uint32_t. 30 | - Data-driven architecture. Buffers are passed around between different objects. 31 | - To manage complexity, different step sizes are placed into a const generic. 
The functions behave as if you stepped with a different constant value, but using the bigger step size might be preferable for speed. (Longer strides.) 32 | - Some comments in the generic folder mention specific architectures. They use template programming to enable specific layout optimizations. Arm64 ISA has a leading versus trailing zero operation, so you may reverse the bits. 33 | - Example of a function different between microarchitectures: [parse_eight_digit_unrolled](https://github.com/search?q=repo%3Asimdjson%2Fsimdjson+%22uint32_t+parse_eight_digits_unrolled%28%22&type=code). Especially cool that the generic one is also used in ARM64, uses SWAR. 34 | - Note about C++ programming, the comment about reinterpret_cast being unsafe is because it's technically undefined behavior to have two type aliases pointing to the same memory location. C++20 introduces bit_cast for this. 35 | -------------------------------------------------------------------------------- /2023/10-29-serialization.md: -------------------------------------------------------------------------------- 1 | - Serialization — [[October 29th, 2023]] 2 | - What is serialization? (Setting) 3 | - **Take object state (complex) and convert it into a sequence of bytes** 4 | - Why bytes? Why not bits? (not sure?) 5 | - Microservices: distributed systems are more common, require more communication 6 | - Traditionally: Sharing memory, data structures (pointers) 7 | - Goals 8 | - Save space (compact data serialization) 9 | - "Space" — serialization over the wire, bandwidth 10 | - Disk storage 11 | - Speed 12 | - Fast — high throughput (~GB/s) 13 | - Less CPU usage (compression requires a lot of CPU) 14 | - Simplicity 15 | - Very complex formats re difficult for programmers to use 16 | - `{"name": "Eric"}` 17 | - `Eric` 18 | - `EricEric2` 19 | - Flexibility 20 | - Represent all the data types — how rich 21 | - List[K] 22 | - Map[K, V] 23 | - HashList[K] — unclear? performance optimization? 24 | - Mutex — ???? probably not 25 | - TCP socket — definitely not, unless… 26 | - `volatile` 27 | - (Should we represent boolean?) 28 | - (Should we represent u32?) 29 | - (Should we represent u128?) 30 | - (Should we represent u2048?) 31 | - (Should we represent GF256?) 32 | - Schemaful vs non-schemaful 33 | - **Schemaful:** 34 | - Versions of the format may change, don't break protocol (relevant for over-the-wire) 35 | - Better type safety 36 | - Efficiency 37 | - Protobuf, Thrift, Kiwi, Cap'n Proto, Arrow 38 | - **Non-schemaful:** 39 | - JSON, XML, MessagePack, CBOR, YAML, TOML 40 | - Quicker to work with, don't need to write a schema 41 | - What does a serialization language / library data format look like? 42 | - `POST /new-user {"fullName": "Eric"}` 43 | - `POST /new-user [1: "Eric"]` 44 | - Encode Protobuf over the wire: 45 | - Marker 46 | - String/Bytes = 1 [1 byte] + Length [varint] 47 | - Int32 = 2 [1 byte] 48 | - Bool = 3 [1 byte] 49 | - ... 50 | - Variable length int 51 | - 0xxxxxxx — literally 0..127 52 | - 1xxxxxxx 0xxxxxxx — next 2^14 values 53 | - 1xxxxxxx 1xxxxxxx 0xxxxxxxx — next 2^21 values 54 | - ```javascript 55 | message SearchRequest { 56 | string query = 1; 57 | int32 page_number = 2; 58 | int32 results_per_page = 3; 59 | // can't add new fields, can't remove fields 60 | int32 results_per_page_new_version = 4; 61 | } 62 | ``` 63 | - Go in order. No gaps in the field numbers allowed. 64 | - query: `01 04 45 72 69 64` 65 | - page_number: `02 00 00 00 07` 66 | - variant? 
`02 07` — if most of the numbers are small, use varint because it's cheaper 67 | - but it might be 5 bytes if all of them are big, so that's like ~25% worse, and might be slower, use more CPU 68 | - results_per_page: `02 00 00 00 C8` 69 | - `[01] 04 45 72 69 64 [02] 00 00 00 07 [02] 00 00 00 C8` — 12 B! 70 | - We don't have to encode the type! 71 | - 9 B — without type descriptors 72 | - Deserializer is hardcoded for this schema. Yes. 73 | - **Protobuf** 74 | - {Tag, Value} pairs 75 | - Tag = field number (the rest of the bits) + "wire type" (3 bits) 76 | - Wire Types = Varint, Fixed32, Fixed64, Byte sequence 77 | - Endianness 78 | - `0x01020304` -> `[0x04, 0x03, 0x02, 0x01]` — little endian 79 | - `0x01020304` -> `[0x01, 0x02, 0x03, 0x04]` — big endian 80 | - Network Byte Order = Big endian 81 | - Compression — LZMA — detect duplicate sequences? 82 | - Bit-level? 3 bit tags instead of 8 bits 83 | - `` 84 | - `[*M1][*M2][*M3]...[M1][M2][M3]` 85 | - `repeated string x = 1;` 86 | - **Cap'n Proto** 87 | - __Infinitely faster__ 88 | - The deserialization is zero-cost, in the sense that there is no copying or dynamic allocation 89 | - If you have a buffer of bytes, you can access any part of that buffer without parsing the entire message and loading into memory 90 | - Embedded or real-time systems 91 | - Inter-process communication (sharing memory is free!) 92 | - Struct pointers are less memory efficient but necessary to encode this complexity 93 | - Cap'n Proto hijacks the language's in-memory representation of its data, since there's no separate serialization step 94 | - CPU versus network ratio 95 | -------------------------------------------------------------------------------- /2025/01-crdts-and-distributed-replication.md: -------------------------------------------------------------------------------- 1 | - CRDTs and distributed replication (Jan 2025) 2 | - [Zero](https://zero.rocicorp.dev/docs/writing-data) is a cool new offering from Replicache, and they seem to have attracted some attention. But their branding seems to have a lot of big talk, not as much concrete argument for why it improves developer flow. 3 | - The basic advantage is that you get automatic __optimistic updates__. 4 | - Yes, it is cool technology. My understand is that it's a differential dataflow query engine, replicated on both the client and server, with streaming updates. So you can not just get point data like Firebase, but also replicate query results in a differential manner. (Why do you want to do this? I think it just allows you to program __exact__ optimistic updates to complex queries.) 5 | - Zero only makes sense if you want a scalable read-write workflow with complex distributed queries. And optimistic UI updates. (Even then, if the optimistic revalidation is easy to implement by hand with SWR / useOptimistic, that seems better?) 6 | - You pay a lot of technical cost: separately deploy and scale a change replication database in addition to the main database, learning a new materialized query language (not SQL). 7 | - Unclear how to measure performance. In a large table, a lack of proper index could result in massive change sets. In ordinary SQL, you would just EXPLAIN ANALYZE it. But for this dataflow system, it's much harder to debug what's going wrong. 8 | - Electric SQL is a cache / similar idea to One. 9 | - I've been reading this for a while, but I think it's really not compelling. When we partially replicate the database into the browser, there are so many sacrifices. 
It feels like creating an inefficient system then trying to muscle your way into tractability. Just simplify your query model to allow for optimistic updates (Firebase), or write the optimistic update yourself — don't fight nature. 10 | - [One](https://onestack.dev/) is another hyped up devtool. (Idk if it's good though.) Why are they rewriting Next.js and using Zero? Maybe plugging in Expo helps. 11 | - In principle I'm not sure if having shared codebases for web and native apps really makes sense. These should have different paradigms of interaction. 12 | - Flutter tried it, they aren't doing so well. 13 | - [pglite](https://electric-sql.com/product/pglite) has utility outside of the streaming replication, since they put a database in the browser. Could be useful as a simple interpreter for SQL. 14 | - Maybe build a Postgres query explainer / education tool like regexper.com with it? 15 | - Or you could use it in browser notebooks like Observable. In that case DuckDB or Polars might be better though; they're built for OLAP. 16 | - CR-SQLite and OrbitDB are comparable replication algorithms for databases. But OrbitDB is cryptographic and decentralized, while CR-SQLite is traditional. (Redis Enterprise does the same thing as CR-SQLite, but it's closed-source. No good docs.) 17 | - How does CR-SQLite maintain a CRDT for database tables? I guess you do LWW by row ID. But what about schema updates + transactions? Hmm… 18 | - Also curious how they support rich text, will need to read more about it. 19 | - Note on CRDT algorithms: RGA is isomorphic to a __causal tree__. (This makes sense!) 20 | - How to resolve conflicts, for example in git merge? For source code, like having separate agents modifying the same codebase. Some innovations like [diifftastic](https://github.com/Wilfred/difftastic), syntax-aware diff might be able to apply changes better. 21 | - Or you could… ask the AI how to resolve the merge conflict? :D 22 | - I guess there's unlimited options there. Perhaps it could even speed up code editing at scale with these agentic flows. (But it's very expensive atm.) 23 | - There are competing ideas / no consensus about OT vs CRDT for collaborative editing. OT has been simple and understandable for along time, especially for client-server applications. (Which is the vast majority of software today.) But the people at Zed like CRDTs, and eg-walker is recent research making them much simpler / faster. 24 | - I guess CRDTs enable true p2p collaborative editing, could be nice. But in most practical scenarios you do have some "primary" that can handle the OT state. 25 | - For instance, to make a collaborative Jupyter notebook, you probably want one person to actually be running the code / or at least to own the remote kernel connection, and also to save the file on their disk. This peer is privileged and can run the OT. 26 | - A code editing session also probably has one "host," they can order the OT. 27 | - Full p2p connections require O(n^2) edges too. You could use Chord but that adds complexity again. 28 | - Idk maybe I'll explore CRDTs as a core data structure, but I have not super high expectations. Can try seeing what comes out of it. Can it plug into state management frameworks like Zustand / the idea of a reactive Solid.js store? That might be convincing. 
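- Since I keep hand-waving about "LWW by row ID", a minimal sketch of what I imagine the per-row CRDT looks like (my guess, not CR-SQLite's actual scheme): each write carries a (timestamp, replica) tag, and merge keeps the larger tag, so replicas converge regardless of merge order.
  ```python
  import time

  class LWWRowMap:
      """Toy last-writer-wins map keyed by row ID."""
      def __init__(self, replica_id):
          self.replica_id = replica_id
          self.rows = {}  # row_id -> ((timestamp, replica_id), value)

      def set(self, row_id, value):
          self.rows[row_id] = ((time.time(), self.replica_id), value)

      def merge(self, other):
          for row_id, (tag, value) in other.rows.items():
              if row_id not in self.rows or tag > self.rows[row_id][0]:
                  self.rows[row_id] = (tag, value)

  a, b = LWWRowMap("a"), LWWRowMap("b")
  a.set(1, {"name": "alice"})
  b.set(1, {"name": "bob"})
  a.merge(b); b.merge(a)
  assert a.rows == b.rows  # converged; timestamp ties break on replica_id
  ```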
29 | -------------------------------------------------------------------------------- /2024/05-12-crush-rados.md: -------------------------------------------------------------------------------- 1 | - CRUSH and RADOS — [[May 12th, 2024]] 2 | - CRUSH is a hierarchical data placement function. You just need a schematic, weighted tree of buckets in order to determine placement. 3 | - Summary: CRUSH is hierarchical RUSH, with additional bucket types & overload flags. 4 | - The requester can choose its hierarchy, as well as the placement spec for how replicas should be laid out. Reorganization happens when new nodes are added or removed, and you can also mark a node as overloaded to offland data stochastically. 5 | - Compare to distributed hashing algorithms: 6 | - Cassandra has a single level of ring-based hashing. But to rebalance, it just divides the most heavily weighted node's work in half. 7 | - Chord is P2P and also has no centralized service, but it's more of a distributed communication algorithm computing the membership of a ring. CRUSH needs to know the membership ahead of time, so it's not meant for instances to join and leave often. 8 | - CRUSH is interesting because it's focused on hierarchical data placement. 9 | - Also, it guarantees equal-probability data balance, whereas ring hashing would have different probabilities even for adjacent nodes in the ring. For ring hashing, the variance is proportional to the number of __nodes__, but CRUSH's variance is proportional to the number of __objects__. 10 | - For large amounts of data, it's approximated by a geometric distribution, sigma ~ Sqrt(mu). 11 | - Four different types of buckets in the hierarchy with placement algorithms: 12 | - Uniform: h(x) % p 13 | - List: repeated hashing, h(x, i) / H_MAX < (bucket_weights[0]) / sum(bucket_weights) — mathematically optimal reorganization efficiency (RUSH_P) 14 | - Tree: like a list, but you only need to update probabilities for stuff gradually going up to the root, so it's O(log n) time to add/remove (RUSH_T) 15 | - Straw: draw a straw for every bucket, pick the bucket with the largest straw 16 | - Remember that Ceph has a top-level hierarchy of nodes, so the bucket type used by each individual node doesn't matter as much, or it's really just a small constant factor. There's +2-4x data reshuffling factor penalty from the multi-level hierarchy anyway. 17 | - I guess bucket type doesn't __really matter__ at all, but it's something to consider for an academic paper when you consider and evaluate all the options. 18 | - This analysis justifies having a map / hierarchy anyway. 19 | - I love how this paper balances practical and theoretical aspects. They do a rigid theoretical analysis, but then you always need to ask: how does this translate to actual operational system improvement? Is it relevant, or does it help actually balance the problem? 20 | - Cool that it models a physical infrastructure via general math. But even the questions we ask are bent towards the domain problem being solved. It's a real systems problem! 21 | - RADOS is the object storage layer built on top of CRUSH. The idea is to distribute the control plane onto the storage nodes, without a controller. 22 | - Summary: RADOS is an object store. Files are grouped into placement groups, which are then assigned to several replicated ODS with CRUSH. A monitor cluster replicated with Paxos is tasked with failure detection and propagating the cluster map. 
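- A toy sketch of the two-step placement just summarized (object -> placement group -> OSDs), with a crude version of the "straw" draw from earlier; the real straw/straw2 bucket uses a different weighting function and walks the whole hierarchy, so treat this as shape only.
  ```python
  import hashlib

  def h(*parts):
      """Stable pseudo-random hash for the sketch (CRUSH uses its own hashes)."""
      s = "/".join(map(str, parts)).encode()
      return int.from_bytes(hashlib.sha256(s).digest()[:8], "big")

  def straw_select(pg, osd_weights, replicas=3):
      """Every OSD draws a weighted straw per placement group; longest straws win.
      Changing one OSD only moves the PGs whose winning straw changes."""
      chosen = []
      for r in range(replicas):
          best = max((o for o in osd_weights if o not in chosen),
                     key=lambda o: h(pg, r, o) * osd_weights[o])
          chosen.append(best)
      return chosen

  def place(obj, pg_bits, osd_weights):
      pg = h(obj) % (1 << pg_bits)   # object -> placement group
      return pg, straw_select(pg, osd_weights)

  print(place("my-object", 7, {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0, "osd.3": 1.0}))
  ```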
23 | - Keys are sharded into 2^k placement groups, which are then assigned via CRUsH. Reminds me of the Redis Cluster sharding algorithm. 24 | - Primary-copy, chain, and hybrid "splay" replication for reader / writer consistency. 25 | - Shuffle the number of placement groups proportional to the number of nodes. Maybe you'll want around 100 placement groups per node, since the variance is inversely proportional to the number of placement groups. For instance, if you want like +-10% load on average, then you'll need 100n (asymptotically, times some constant). 26 | - Each OSD is in {up, down} or {out, in} state. The storage maps are updated by the controllers via Paxos epochs, and then they gossip between the machines lazily in O(log n) time. Messages are tagged with the sender's version. 27 | - Data migration is also distributed, and it's driven through peering by the first OSD in the placement group, which becomes a __replica__ or a __stray__. 28 | - This means that rebalancing only requires that every node checks ~O(100) PGs. 29 | - Nitty gritty details about replication and its relationship with cluster rebalancing. 30 | - Note: No consideration for parallel reads at this level, or breaking files up into chunks to expand past the throughput of a single machine. 31 | - Unanswered question about garbage collection of junk if the write fails in the middle due to a cluster reorganization that makes an OSD no longer the primary. 32 | - Failure recovery is handled by the new primary of the PG. It recovers from the old replicas. 33 | - Note: RADOS writes support PUT operations as well as appends, this detail is left out. In general, it depends on whichever mutations you support. 34 | - A lot of further work is dedicated to snapshotting, GFS-like queues, better load balancing and ECC redundancy, and so on. Fine-tuning the general object store and exploring different avenues for customization. 35 | -------------------------------------------------------------------------------- /2025/06-29-new-cpython-jit.md: -------------------------------------------------------------------------------- 1 | - Copy-and-patch JIT — [[June 29th, 2025]] 2 | - https://fredrikbk.com/publications/copy-and-patch.pdf 3 | - "different points on the startup delay-execution performance Pareto frontier." 4 | - The paper actually draws out this frontier, averaged over Wasm benchmarks and TPC-H on databases! It's pretty interesting to compare LLVM to Wasm, V8, and this approach. 5 | - The copy-and-patch paper makes a distinction between "full compilers" as in a database query language compiler (all of which currently rely on LLVM), and "bytecode assemblers" that work on a linearized bytecode. They make a full compiler, on a C-like language that can be generated by embedded functions in C++, turned into an AST, and then compiled. 6 | - "At a high level, copy-and-patch works by having a pre-built library of composable and parametrizable binary code snippets that we call binary stencils." 7 | - Wasm = 1666 stencils (35 kB), high-level C compiler = 98,831 stencils (17.5 MB). 8 | - So an average stencil is in the range of 10-500 bytes. 9 | - They take stencils and then compose them into a full representation by patching them, including some linearization passes. Each stencil is generated ahead-of-time, and then actually combining them has very little performance impact. 10 | - Can be extended with other optimizations (unrolling, common subexpressions, SIMD, …). 
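- A toy rendering of the core copy-and-patch move, copy a pre-built template and patch its holes; the bytes here are made up, while real stencils are machine code emitted ahead of time by Clang, with linker relocations marking the holes.
  ```python
  import struct
  from dataclasses import dataclass

  @dataclass
  class Stencil:
      code: bytes   # pre-built template ("binary stencil"); fake bytes in this sketch
      holes: dict   # hole name -> offset of a 4-byte little-endian slot

      def instantiate(self, **values):
          buf = bytearray(self.code)                # copy ...
          for name, offset in self.holes.items():
              buf[offset:offset + 4] = struct.pack("<i", values[name])  # ... and patch
          return bytes(buf)

  # hypothetical "push constant" stencil: one opcode byte plus a 4-byte immediate hole
  PUSH_CONST = Stencil(code=b"\x01" + b"\x00" * 4, holes={"imm": 1})
  program = PUSH_CONST.instantiate(imm=42) + PUSH_CONST.instantiate(imm=7)
  ```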
11 | - MetaVar = their system for producing binary stencils, from clean C++ code and LLVM. 12 | - Multiple stencil variants for each AST node. For example, addition with a constant / literal, versus addition with a register or stack value. 13 | - Differs from template JITs in that there are more stencils per node 14 | - Stencils are not generated by hand, there are too many. Each is a binary code fragment with holes for literals, jump addresses, and stack offsets. They represent either an AST node, or a commonly-used AST subtree / bytecode sequence ("supernode"). 15 | - They use continuation-passing style / tail recursion, no returns. Uses Clang TCO to turn the calls into jumps and GHC calling convention for full caller-saved registers. 16 | - Note: GHC calling convention is because Haskell needs lots of continuations. Its IR is based on System F. 17 | - Repurposes calling convention into register allocation: each function argument is implicitly a register passed between stencils. Other shared data is passed in memory via stack offset. 18 | - Each stencil can have "pass-through" arguments, using widest data type. 19 | - So you have a stencil for equality "==" of ints with two registers, with stack argument and constant, with passthrough and any of the above, etc. 20 | - (Isn't this still an exponential explosion? Maybe # passthrough is configurable.) 21 | - Stencil library takes Stencil configuration -> Stencil binary (with patch records). 22 | - Code generation 23 | - Apparently this is extremely fast, maybe even faster than AST construction. They do post-order traversals, first to plan register usage, then to select stencil configurations and construct a compact CPS call graph. 24 | - Once we have the call graph, it does a topological ordering and pastes in the stencils while applying patches to the correct addresses. 25 | - "In our benchmarked implementations, we only use registers to preserve temporary values produced while evaluating an expression." — equivalent to a simple Sethi-Ullman register allocation algorithm, where they just spill over registers that exceed the watermark. 26 | - So actually constructing stencils falls back to Clang and requires some parsing of object files, but once the stencils are there, codegen is a simple scan / pattern match. 27 | - Decided mem2reg (hot local variables -> registers) is not worth it. Registers are only used for expression intermediates, not local variables. 28 | - Stencil generator (MetaVar) internals 29 | - Paper writes a bunch of C++ template metaprogramming, there's constants like numMaxPassthroughs and `OperandType` generic that control the number of variants. 30 | - Holes are represented as linker relocations in the object file. 31 | - LLVM Object File API is used to extract stencils, convert them to compact binary format. 32 | - Now reading PEP 744, it gives some historical background. Since Python 3.11, they've been collecting micro-ops and speculative interpretation with some type information on bytecode. But this is turned off right now due to overheads. Compilers can reduce those overheads in instruction decoding, memory traffic, etc. 33 | - __Despite their reputation, JIT compilers are not magic “go faster” machines.__ 34 | - CPython already uses a C-based DSL to generate most of the interpreter (since 3.12), so maintainers don't have to handwrite boilerplate for each instruction. Then this DSL can also be used to update the copy-and-patch template JIT, reducing maintenance cost. 
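- Side note, not from the PEP: on CPython 3.11+ you can already peek at the tier-1 specialization all of this builds on, via `dis` with `adaptive=True` after warming a function up. The exact specialized opcodes you see vary by version.
  ```python
  import dis

  def poly(x):
      return 3 * x * x + 2 * x + 1

  for i in range(10_000):   # let the specializing interpreter adapt the bytecode
      poly(i)

  dis.dis(poly, adaptive=True)   # shows specialized instructions on CPython 3.11+
  ```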
35 | - Current implementation has about 900-lines of build-time Python, and 500 lines of runtime C. That's a very small amount of code for a JIT system! 36 | - Interesting to see how this all works. Lots of discussion about not making Python itself more complex, while still gaining performance improvements and reusing infrastructure. Adds LLVM to the build dependencies though. Target "5% better" on at least one platform. 37 | - "Improving JIT code quality" ticket https://github.com/python/cpython/issues/115802 38 | -------------------------------------------------------------------------------- /2025/08-ssa-book.md: -------------------------------------------------------------------------------- 1 | - SSA Book (Aug 2025) 2 | - Intros: SSA means every name gets assigned once, and ɸ function runs in parallel at the head of basic blocks as a __pseudo-assignment__ for merging variables. 3 | - Intuitively, SSA took over the world because it simplifies dataflow analyses. Instead of marking properties of each variable at every code location, you only need to deduce facts of variables once, at the location where they are assigned. (sparse dataflow) 4 | - Dataflow information __propagates directly__ from definitions to uses. (__def-use links__) 5 | - Results of dataflow analyses are __more succinct__. 6 | - Def-use and use-def chains, link definitions to usages. SSA reduces x*y links to x+y links. 7 | - __Minimality property__ - for SSA construction, minimize the number of phi() functions inserted. The two-phase construction algorithm has this property. 8 | - Let's work within nodes of a CFG here. 9 | - __Join node__, for two nodes a and b, is node n such that if there exist two nonempty paths from a->n and b->n, such that n is the only node along both of those paths. 10 | - In other words, join nodes are places where two code paths meet. 11 | - __Join set:__ J(S) = {join node of a and b | a, b in S} 12 | - Intuitively, you place ɸ-functions at the join sets of all definitions of a variable v, and this guarantees minimality. Classical techniques also include the entry point in the join set, which inserts more ɸ's ("there are good reasons for this" – see Strict SSA). 13 | - __Strict SSA form__ - every variable is defined before it is used. 14 | - Under SSA, this means that uses are dominated by their definition. From the entry point, you must reach a definition of v before any use of v. 15 | - "a dom b" ~ dominance, and "a sdom b" ~ strict dominance (a != b). 16 | - You can get strict SSA by inserting (v = undefined) at the entry point for every variable. That's why join sets sometimes include the entry node. 17 | - With strict SSA, you get the immediate dominator (idom), which is the unique node that dominates each CFG node, and can construct a __dominator tree__. Then the live range of each variable is its subtree in the dominator tree, which can be queried efficiently. 18 | - You might want this to compute interference / non-overlapping variables. Also, the intersection ranges form a __chordal graph__, which can be colored in linear time, that's great for optimal register allocation. 19 | - Note that the dominance property can be broken by const propagation, needs repair. 20 | - __Pruned SSA form__ - remove ɸ-functions that never get used, or are dead, which happens if a variable gets assigned in two peer places and not used afterward. 21 | - ɸ-webs are partitions of local variables that are related to each other by a ɸ-function. This is the granularity of non-SSA register allocation. 
When pruned, each ɸ-web has exactly one variable live at any given time. 22 | - __C-SSA (conventional SSA)__ - each ɸ-web is interference free. 23 | - __T-SSA__ - can have interference between variables in ɸ-web, makes it harder to "destruct" the SSA because you can't just replace each web with a reassignable variable name. 24 | - To convert from T-SSA -> C-SSA, you insert copy operations to remove interference. This provides a "more accurate" view of resource usage / how many variables are actually live at any given time. 25 | - "most current compilers choose not to maintain the conventional property" 26 | - In LLVM, they generated pruned SSA form. `phi` operation in LLVM is a ɸ-node, which is generated from alloca instructions during the mem2reg pass ([Godbolt example](https://godbolt.org/z/dfnPW93EP)). It turns locals into LLVM registers (%) in SSA form, which can then have machine register allocation. 27 | - Chapter 3: Standard Construction and Destruction Algorithms 28 | - Note: [mem2reg](https://llvm.org/docs/Passes.html#mem2reg-promote-memory-to-register) uses this: "just the standard SSA construction algorithm". 29 | - There are some alternative algorithms that are more efficient, but they're also more complex. The "standard" ones remain popular. 30 | - Dominance frontier DF(n) is the set of nodes v where n doesn't dominate v, but n dominates a predecessor of v — this is the "1 past the boundary" of what is dominated by n. 31 | - If you iterate the dominance frontier, DF+(n), you can compute DF+(Defs(v)) and insert phi-nodes there. This is equivalent to join set, since DF+(Defs(S)) = J(S || {r}). 32 | - Standard algorithm involves a DFS over DF edges, iteratively adding ɸs. 33 | - DF is straightforward to compute from the dominator tree with join-edges (edges from CFG not in dominator subtree tree), a.k.a., the DJ graph. For every join edge a->b, just add it to the dominance frontier while walking up the tree from a. ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2FUqGDcu-Jo-.png?alt=media&token=ba7722f1-37fe-4fbc-b2e1-b27961ec47a2) 34 | - For the construction, second phase is renaming, keep track of "reaching definition." 35 | - There are alternative linear-time construction algorithms (Sreedhar and Gao, 1995), described in Chapter 4. 36 | - ɸ-webs are discovered with union-find or DFS, and then the "critical edge splitting" destruction algorithm splits up edges from nodes with multiple successors, to nodes with multiple predecessors. Also you need to linearize "parallel" ɸ assignments. 37 | -------------------------------------------------------------------------------- /2024/09-29-amazon-s3.md: -------------------------------------------------------------------------------- 1 | - Amazon S3 — [[September 29th, 2024]] 2 | - "S3 is effectively a living, breathing organism. Everything, from developers writing code running next to the hard disks at the bottom of the software stack, to technicians installing new racks of storage capacity in our data centers, to customers tuning applications for performance, everything is one single, continuously evolving system. S3’s customers aren’t buying software, they are buying a service and they expect the experience of using that service to be continuously, predictably fantastic." 3 | - I feel that, some systems are built to be continuously monitored rather than engineered for zero-touch correctness from the start. 
It's the interplay between human operators and computer programs that lets us produce fantastically efficient software. 4 | - “… managing and balancing I/O demand across a really large set of hard drives. In S3, we refer to that problem as __heat management__.” 5 | - “So, with aggregation flattening the overall demand distribution, we need to take this relatively smooth demand rate and translate it into a __similarly smooth level of demand__ across all of our disks, balancing the heat of each workload” 6 | - Note to self: If you have a few heavy-hitting customers though, like at [[Modal]], they could indeed affect aggregate load. So this is important to design for as well. 7 | - Intentionally place different objects on different sets of drives so that they cannot interfere with each other, and so they are a small % of each disk. No hotspots. 8 | - On scale and aggregate workloads: “building at this scale means that any one of those individual workloads is able to burst to a level of performance that just wouldn’t be practical to build if they were building without this scale.“ 9 | - Benefits of scale: **Distribute the hardware across a lot of users.** 10 | - Using techniques like formal verification to allow engineers to __move faster and be more confident__. This kind of argument makes sense for very large, complex systems. 11 | - Traditions to raise quality, hold each other accountable as a team. 12 | - Services are not just the software. They have owners. Pushing on something and driving it to completion, setting a bar for quality. But also offering a service for others—which is important since at scale, nothing you do operates in a vacuum. 13 | - in dialogue w/ Kunda's tech culture. 14 | - American culture: professionals = "independent thinkers;" have __self__ 15 | - “But ultimately, my most successful research projects were never mine. They were my students and I was lucky to be involved.” 16 | - 1 IOPS per 2 TB of data stored, why this ratio? Probably because __load is balanced out__. 17 | - Note that random HDD reads are 100,000x slower than sequential, and random SSD reads are 10,000x slower than sequential. 18 | - Sequential reads are just as fast in either case though, approaching RAM speeds. 19 | - S3 has no built-in caching! It has storage tiers though. 20 | - Taking a break now, what a nice reading. Seems like people have mostly formed groups and are chatting during the break now, except a few people on the sides who are just reading or working. I get it though, socializing can be kind of tiring sometimes. 21 | - "Lightweight" formal verification that's easy to keep updated by non-experts. 22 | - Reference models for each part of the system (LSM -> Hash Map), 1% of total code. 23 | - Property-based testing & model checking between each component and its reference model, which allows you to determine correct, consistent behavior. 24 | - ShardStore maps shard IDs (metadata) to shards of customer object data. 25 | - LSM tree with shard data stored out-of-tree. Takes shard ID -> list of __chunks__, where each is stored within an __extent__. 26 | - Writes to an extent should be sequential. Separate reset operation returns to start. 27 | - Chunk store abstraction is PUT(data)->locator and GET(locator)->data, like blobnet. 
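- The PUT/GET shape above is small enough to write down; a dict-backed sketch like this (content-hash locators are my assumption, not ShardStore's scheme) is basically what an executable reference model looks like, which is the point of the next quote.
  ```python
  import hashlib

  class ChunkStore:
      """In-memory model of PUT(data) -> locator and GET(locator) -> data."""
      def __init__(self):
          self._chunks = {}

      def put(self, data: bytes) -> str:
          locator = hashlib.sha256(data).hexdigest()
          self._chunks[locator] = data
          return locator

      def get(self, locator: str) -> bytes:
          return self._chunks[locator]

  # the round-trip property you'd check the real implementation against
  store = ChunkStore()
  assert store.get(store.put(b"hello")) == b"hello"
  ```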
28 | - “It was only when we discussed long-term maintenance implications with the team that we realized writing the models themselves in Rust was a much better choice, and even later when we realized the reference models could serve double duty as mocks for unit testing.” 29 | - Formal methods rely on some domain knowledge, not just pure fuzzing. 30 | - Other domain knowledge: types of I/O failures, executable reference model. 31 | - "It doesn't seem very formal at all." Or is it? 32 | - Key idea: Modeling out their system simply, check implementation against it, make it a mock in the same language to validate correctness. Compare it in unit tests so it stays in sync, even across frequent changes. 33 | - Key idea 2: ShardStore architecture. There's a garbage collector! Almost looks kind of like an LSM-based file system, with multiple levels / tiers. 34 | - "Soft updates" = dependency graph, comes from past work. 35 | - `fn append(&self, ..., dep: Dependency) -> Dependency` is the _only_ way to write to disk! Used in FFS, compare with journal-based file systems. 36 | - This avoids corruption on crash. 37 | - Maybe it would help Amazon roll out new hardware. 38 | - Key idea 3: Handing things off to non-formal methods experts. 39 | - Currently 18% of the lines, means people are editing it. Good enough, maybe they hit the model mostly correct from the start. 40 | - Currently hundreds of petabytes, ~0.1% of all customer data. (In 2021. Which subset?) 41 | - The shuttle model checker simulates threads, chooses ordering. 42 | -------------------------------------------------------------------------------- /2024/02-11-linux-executables.md: -------------------------------------------------------------------------------- 1 | - Linux executables — [[February 11th, 2024]] 2 | - What is an executable? 3 | - A file with code in it 4 | - Binary 01010101 5 | - It has a _start() method / entrypoint 6 | - "You can double-click on it" (execute it) 7 | - Architecture-specific file that you can execute 8 | - "The OS knows how to run it" 9 | - The kernel has code to parse and load various parts of the executable 10 | - Header, Static data (also arch-specific), machine code (architecture-specific) 11 | - Is a Python file an executable? 12 | - "executable permission" — files have this 13 | - `glibc` 14 | - > An executable is a file containing binary code with a defined entry point, such as a _main() method, designed for a specific architecture, which an operating system can directly execute, often by double-clicking, and includes components like headers, static data, and machine code; its executability may also depend on permissions, and it can involve libraries like glibc. 15 | - ELF Header 16 | - REL = Relocatable file (snippet / "not yet linked" — not quite an executable yet, like a `.o` file) 17 | - EXEC = Executable 18 | - DYN = Shared library (or bin `/bin/true`) 19 | - CORE = Core dump (fake, not an executable) 20 | - **Q: Why use the ELF file format for core dumps?** 21 | - "Why not?" Reusability of CLI tools. 22 | - **Q: Why are shared libraries like .so files still in the ELF format?** 23 | - They aren't executable, they're just in the executable file format. (Not confusing) 24 | - It has "executable energy" 25 | - Also relevant: You can have executables that are actually shared libraries in disguise. 26 | - Entry point is a 64-bit number 27 | - To be explained, but these are not actually direct byte offsets? In some cases? 28 | - Metadata / dynamic sizes for what's to come. 
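- The header fields above are easy to poke at directly; the article does it in Rust with nom (see the code-reading notes below), but a rough Python equivalent for 64-bit little-endian files looks like this.
  ```python
  import struct

  ELF_TYPES = {1: "REL", 2: "EXEC", 3: "DYN", 4: "CORE"}

  def read_elf_header(path):
      with open(path, "rb") as f:
          ident = f.read(16)
          assert ident[:4] == b"\x7fELF", "not an ELF file"
          assert ident[4] == 2 and ident[5] == 1, "sketch assumes ELF64, little-endian"
          e_type, e_machine, e_version, e_entry = struct.unpack("<HHIQ", f.read(16))
          return ELF_TYPES.get(e_type, e_type), hex(e_entry)

  print(read_elf_header("/bin/true"))   # e.g. ('DYN', '0x...') for a PIE
  ```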
29 | - Code reading tips 30 | - Ignore all of the types and generics 31 | - Nom = parser combinator crate, build a binary structure out of various smaller parts. 32 | - Parser combinators help you validate data in a structured way, with context 33 | - The code is just a translation of the ELF header diagram into an executable program. (You could write it in Python, too.) 34 | - Part 2: Memory mapping and mprotect regions 35 | - We finally talk about the end of the header. 36 | - [File header] + [Program header 1] + [Program header 2] + … 37 | - Program headers define memory mapping regions for code and data. 38 | - Why do we need to map things into regions? Security: some parts should be readable and writable, you can't just JMP into a random place. 39 | - The region check is in the OS, but the processor itself also has safety checks (set in rings). 40 | - Cooperation between OS and hardware. Page table bits. 41 | - Program headers = memory mapping regions + protection bits on each region. 42 | - 48-bit addresses 43 | - Loop through all program headers and find which region corresponds to the `entry_point`. 44 | - **Q: Why is calling a function from a shared library more expensive? ("DSO how-to")** 45 | - Follow-up for later 46 | - **Q: Who determines memory ranges?** 47 | - Old executables are at a fixed address. 48 | - New executables look like a shared library, which are "position-independent" — you can offset them +N bytes. 49 | - `-pie` 50 | - OS will just add some random number (ld.so interprets the binary!). 51 | - Paper rec about security: "The geometry of innocent flesh on bone" 52 | - Part 3: Position-independent code 53 | - Trivia: "Running executables without exec()" 54 | - Retracing path: 55 | - Modified program to accept executables with a single code segment, and jump to them. This works! 56 | - Try to iterate over all of the memory ranges, call mmap(MAP_FIXED | MAP_ANON) on all of them, and copy the data into those mappings. This doesn't work! 57 | - Something PIE 58 | - It turns out you actually need to use /lib/ld.so as an interpreter for PIEs! 59 | - Link the program with that ld interpreter, and then run it again, and it works. 60 | - Dynamic linking is literally just mmapping the .so file into memory. 61 | - The rest of the ELF file is also mmapped with various protection bits. 62 | - 64-bit address spaces take offsets from %rip (`lea` instruction). 63 | - This is what happens when you use `LD_PRELOAD` to overwrite some API in a shared library when running a binary. `ld` uses this variable to figure out what to map into virtual memory. 64 | - **Q: What if libc.so.6 gets bigger?** 65 | - It's fine because ld is responsible for dynamically determining addresses. 66 | - Each library is its own ELF object. 67 | - **Q: What happens if you call a linked function? How does your code determine the correct address to jump to at runtime?** 68 | - Dynamic linker has a list of all the addresses and rewrites all of them. 69 | - PLT / GOT — lookup tables for executables for symbols 70 | - **Q: Why would you want to use LD_PRELOAD?** 71 | - Plugging in a different version of malloc (tcmalloc). 72 | - Debugging functions by patching them into an existing binary, print out their arguments. 73 | - "This is not normal" 74 | - __"You wouldn't do that in prod, would you?"__ 75 | - Caveat: You need the right ABI. 
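- For reference, the LD_PRELOAD trick above is just an environment variable; a sketch of swapping in a different malloc, where the library path and binary name are hypothetical.
  ```python
  import os
  import subprocess

  # ld.so maps the preloaded .so before libc, so its malloc/free win symbol lookup
  env = dict(os.environ, LD_PRELOAD="/usr/lib/libtcmalloc.so.4")  # hypothetical path
  subprocess.run(["./myapp"], env=env, check=True)                # hypothetical binary
  ```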
76 | - Part 4: ELF relocations 77 | -------------------------------------------------------------------------------- /2024/03-03-wireguard.md: -------------------------------------------------------------------------------- 1 | - WireGuard — [[March 3rd, 2024]] 2 | - Abel is talking about WireGuard. It's a VPN. :D 3 | - Background is that it's more a __security__ thing, from the background of the author Jason Donenfeld, or so on. But as a system session, we can assume mathematical & cryptography background, but focuses on the actual Linux kernel engineering and data structures. 4 | - Ultimately, design decisions behind using radix trees and offering an interface that anyone can use, without thinking about it. 5 | - Paper is written toward an audience using IPsec / OpenVPN. "These solutions suck." 6 | - Configuration is annoying, needs you to pick a whole cipher suite (__cipher agility__ was important at the time). WireGuard is an opinionated stance on VPNs, less config. 7 | - Why use a VPN? You want to connect things securely, and not over the public Internet. Sometimes you can also abuse it to control the intermediate hops through a network. (e.g., using a CDN to get better latency to your __League of Legends__ game server.) 8 | - PoP — point of presence. 9 | - OSI layer model is mentioned here, especially layer 2 (link layer) and layer 3 (IP). 10 | - PQ = "post-quantum" cryptography, could affect the security of protocols like Diffie-Hellman, which rely on the hardness of factorization / discrete logarithms. 11 | - Section 1 12 | - From the intro — WireGuard tries to be simple, so it emulates a device `wg0` in the Linux kernel. You configure this with your key pair and the public keys of all peers, and it will "just work" after that point, as any IP packets sent to wg0 will be forwarded to the appropriate virtual network peer. The actual communication happens through UDP on a port though. 13 | - Why is kernel-space WireGuard more efficient? Well, there's no memory copying between packets in kernel memory and user memory. But maybe you can use kernel-bypass tricks, like DPDK. 14 | - Section 2 15 | - "Cryptokey routing" — i.e., your routing table is a list of public keys. 16 | - Internal IP ranges correspond with public keys (hosts). 17 | - The practice of making internet endpoints optional for each host enables roaming, in the typical setup of having a "server" with a fixed IP address and a "client" on the other end. 18 | - Section 3 19 | - The allowed source virtual IPs for peers in WireGuard are bidirectional. It's the allowed IPs for validating incoming packets, but it's also how we sent outbound packets. 20 | - You need to have a match in the table for a packet to be sent, including Internet endpoint. 21 | - Question: What if you have A <-> B via WireGuard peers, but then also B <-> C via WireGuard peers? Can A send a message to C, through B? Yes, this will happen if B has C's ip address in its cryptokey routing table. That's just how Linux works. Linux forwards packets it receives to the appropriate network interface (if iptables, iproute2 are used correctly). 22 | - It does this through __virtual IP routing__. But it can also do it through the public Internet on B, if the destination address from the packet on A is in a public range. 
For example, this configuration forwards all Internet traffic through one peer: ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2F5yknsmS27i.png?alt=media&token=b3037878-ab2e-4357-a40e-3e99303513b9) 23 | - Section 4 24 | - The "Basic Usage" section has the first real, concrete configuration we've seen so far using the `ip` command in Linux. 25 | - It creates a wg0 interface, where with IP block `10.192.112.3/24`. This means the virtual network has 254 hosts, and the current one is `10.192.112.3`. 26 | - Possible disadvantages of dropping invalid packets: it can cause a timer stall since no response is sent, so we get no feedback on the host that initiated the connection. This could waste time, in addition to not getting any error message for debugging. 27 | - Limitations: The initial handshake of WireGuard is quite slow unfortunately. Implementing the Internet over WireGuard would be slower than TCP/IP. 28 | - Section 5.4 - 5.7 29 | - Three phases: handshake, cookie (optional, not web browser cookies), and transport. 30 | - Ephemeral versus static key pairs for encryption. The ephemeral key pair is regenerated every few minutes, giving us forward secrecy. 31 | - AEAD = authenticated encryption with associated data, i.e., ChaCha20. 32 | - MAC = message authentication code, i.e., Poly1305. There are two of them for a reason. 33 | - Section 5.1 - 5.3 34 | - Protocol has a 1-RTT handshake, which does the ECDH key exchange with Curve25519. 35 | - Timestamps are used to avoid replay attacks on the initial handshake. You are not allowed to replay the timestamp for a given host. It must be increasing. 36 | - The timestamp also is used later in a small reorder buffer for transport data. 37 | - To avoid being broken by quantum computing in the future, there is a bonus option to strengthen the handshake with a pre-shared symmetric secret. You could then append another protocol to this, and the details are left for future work. 38 | - "Cookies" are a mechanism to avoid DoS. The problem is that ECDH is comparatively slower to the overhead of sending a network packet, like over a 10 Gbit connection. 39 | - The key idea is to __offload compute__ using the cookie. 40 | - There are two MACs. The first MAC is always sent. The second MAC is sent after getting a cookie reply, and it allows the requester to try its handshake again. This MAC proves the source IP address, and then the initiator does some expensive subcomputation. 41 | - There are many problems caused by the cookie. No silence (responding to unauthenticated message), cannot be sent in cleartext, and cookie flood. These are addressed. 42 | - WireGuard is open-source, only 4000 lines of code, and you can read it! 43 | -------------------------------------------------------------------------------- /2024/10-27-sql-50years.md: -------------------------------------------------------------------------------- 1 | - 50 years of SQL — [[October 27th, 2024]] 2 | - To each of the papers, the previous papers are kind of like a "test of time" award recipient, a seminal paper from a previous technological era. 3 | - Also, computers got 32x faster every decade. So that's kind of amazing. Cost per gigaflop went from $15,000,000,000 to $0.30, yet the SQL language work remains the same, and its intended use case is just as salient now than ever. 4 | - SEQUEL: …language or menu selection (3,4) seem to be the most viable alternatives. 
However, there is also a large class of users who, while they are not computer specialists, would be willing to learn to interact with a computer in a reasonably high-level, non-procedural query language. Examples of such users are accountants, engineers, architects, and urban planners. It is for this class of users that SEQUEL is intended. 5 | - Compare to the predicate calculus, SQL tries to be more readable. Needs: Easy-to-maintain, linear notation format. 6 | - This is interesting, reminds me of how computing interface ideas are more timeless than you might think. There's not so much complexity that can be supported in an interface. Humans are simple, and even kids can use an iPhone. 7 | - SQL is a tool, and it connects the user with the compute. The nature of people hasn't changed, and neither has the fundamental nature of computation. 8 | - Programming languages have evolved over decades to be this masterpiece that allows us to reason about and express computational ideas. 9 | - Composition operator makes sense, no lowercase letters. Invented the word "join." 10 | - Now there are 14 people. We're getting on to the second paper, a Critique of SQL. 11 | - Very bolded text, lol: "The criticisms that follow should not be construed as criticisms of the original designers and implementers of the SQL language. The paper is intended solely as a critique of the SQL language as such, and nothing more." 12 | - An old culture of criticizing the idea, rather than the person behind the idea. 13 | - Lol this should be required reading for anyone who doesn't like SQL, they're super matter-of-fact about the lack of constructors / data model clarity / assignments. ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2FCxgidsVs-V.png?alt=media&token=84ba4a3c-5cfd-45fd-8867-69d39a1d12bf) 14 | - But SQL lives on, and these other languages do not. So I mean, what went wrong? Or what did SQL do right that this analysis cannot capture? 15 | - They correctly observe that users might have to transform their query to be "less natural" when they add a level of nesting. It's not a very composable language like others. Maybe `NYC UNION SFO` should be valid, just as `SELECT * FROM NYC UNION SELECT * FROM SFO` — but in practice, the latter is "good enough." 16 | - PRQL-lang.org would have a field day with this, orthogonal language features. 17 | - Good points about COUNT and aggregates being outside, though being inside allows the groupby-count workflow to be simpler, which is perhaps more common! I ran into this awkwardness when designing https://percival.ink as well. 18 | - "SQL does not exploit the exception-handling capabilities of the host" LOL this predates the idea of client-server databases. 19 | - Apparently SQL has no foreign key support? I guess database implementations added that. It that could be a boon though, since it made OLAP flows more flexible. 20 | - It's fascinating to see how much criticism is just "I don't like SELECT * FROM" or something since it's error-prone. At the end of the day, like any programming language, SQL is a compromise: between declarative and imperative, interactive and application-embedded, transactional and analytic, simple and complex, English-like and composable. 21 | - Critique of ANSI SQL Isolation Levels 22 | - First "modern" database conference paper of the bunch. Invented snapshot isolation, characterized the levels in terms of forbidden phenomena. Dirty reads, non-repeatable reads, and phantoms. 
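- Sketching the write-skew anomaly that comes up in the next few bullets (a toy Python simulation, no real database involved):
- ```python
  # Toy simulation of write skew under snapshot isolation (no real database).
  # Invariant we want to preserve: at least one doctor stays on call.
  db = {"alice_on_call": 1, "bob_on_call": 1}

  def go_off_call(me, other):
      snapshot = dict(db)                       # each txn reads its own consistent snapshot
      if snapshot[me] + snapshot[other] >= 2:   # invariant still holds in *my* snapshot
          return {me: 0}                        # so I decide to go off call
      return {}

  # Two concurrent transactions, both started against the same snapshot.
  w1 = go_off_call("alice_on_call", "bob_on_call")
  w2 = go_off_call("bob_on_call", "alice_on_call")

  # Vanilla SI only aborts on write-write conflicts; the write sets are disjoint, so both commit.
  db.update(w1)
  db.update(w2)
  print(db)  # {'alice_on_call': 0, 'bob_on_call': 0}: invariant broken, i.e. write skew
  ```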
23 | - Important is the distinction between RR and SI. In repeatable read, you can have phantom reads (new sets appear), and in snapshot isolation you can have write skew. 24 | - SI can have write skew because it only prevents write-write conflicts. It doesn't do any range checking or prohibit read-write conflicts (at least, vanilla SI doesn't). But if you do a `GET FOR UPDATE` then I suppose you get a bit stronger isolation. 25 | - The paper introduced SI and said, hey, this doesn't fit in your 4-level framework! 26 | - Repeatable read can have phantoms because it doesn't lock the entire range set of the table being selected. It only locks individual rows. Tricky 2PC here. 27 | - Predicate locking is needed for ANSI serializable mode (i.e., range locks + locks on the result set of a filter). 28 | - 2PC locking, versus snapshot isolation + OCC/locking, are the main approaches to a DB. 29 | - C-Store 30 | - First prototype of a column-oriented database in 2005. 31 | - Tuple mover moves data from the WS (writeable store) to the RS (read-optimized store). This gives you fast OLTP insertions in addition to OLAP. 32 | - Overlapping projections of tables, unlike tables, secondary indices (i.e., non-clustered indices), and projections (materialized views). 33 | - Much faster, and thanks to compression, also smaller in disk usage. Saving space via efficient data encodings with strings and other numeric types. 34 | - Tradeoff between read performance and write performance via these phases. 35 | - Also, the same column could be sorted on multiple attributes in separate projections, that's kind of fun to think about. Floating sort-orders. 36 | - Shark (Spark SQL) 37 | - Join strategies implemented on top of Spark RDDs. Enables interactive performance unlike past systems because the data is held in RAM before future computation work. 38 | - Adaptive join optimizer: shuffle join vs map join. 39 | - "Partial DAG Execution" does a few runs, then re-optimizes the query based on a profile. 40 | - I guess that makes sense, joins and aggregates are kind of the key part of a MapReduce algorithm that SQL needs to translate to. The table itself is an RDD partitioned over multiple machines, and then joins operate between those partitions. 41 | - In-memory columnar storage and compression helps reduce I/O and CPU usage. 42 | -------------------------------------------------------------------------------- /2025/03-animations-and-interactive-graphics.md: -------------------------------------------------------------------------------- 1 | - Animation and interactive graphics (Mar 2025) 2 | - Starting off strong by compiling a list of a bunch of websites, pretty neat. 3 | - Animated Vega-Lite (VIS '23) 4 | - Goal is to unify interaction and animation. Remember that these two are related: both of them slice up the data based on some parameter, either time or some other click. 5 | - Time as an __encoding__ versus as an __event stream__. 6 | - You can drag a slider, and that would be equivalent to the clock moving forward in an animation. Kind of like the playback timeline in a video. But interaction can be for things other than time in Vega, like focusing on a particular series. 7 | - Zoom/pan as a "view transformation." 8 | - Vega-Lite has two encodings "x" and "y" for spatial position, as well as other encodings like "hue" and "size" — this paper introduces one for "time" as a field of data. Infer a default for tweening between points based on class of data, so you get animations! 
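- Roughly what such a spec could look like, as a sketch only: everything except the "time" channel is standard Vega-Lite, and the exact syntax of that channel is my assumption, not the paper's grammar.
- ```python
  # Sketch of an animated scatterplot spec. Standard Vega-Lite fields except "time",
  # which stands in for the paper's new encoding channel (syntax assumed).
  spec = {
      "data": {"url": "gapminder.json"},
      "mark": "point",
      "encoding": {
          "x": {"field": "fertility", "type": "quantitative"},
          "y": {"field": "life_expect", "type": "quantitative"},
          "color": {"field": "cluster", "type": "nominal"},
          # Time as just another encoding: "year" drives the animation frames,
          # and tweening between consecutive domain values is inferred.
          "time": {"field": "year", "type": "ordinal"},
      },
  }
  ```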
9 | - Interesting that the default is 500ms per domain value. Information density? 10 | - "rescale" by time: interesting property, is this parameter in the rest of Vega-Lite as well? Feels like it could be applied to other slices / interactions with data. 11 | - Specific examples motivate the design of the grammar. I think this makes sense as a guiding principle. Over time, when designing creative tools / languages, you can look at what people have created in the past, and then discover how to express that again. 12 | - Sidenote: Lol the paper refers to itself as "low viscosity" instead "frictionless"! 13 | - Lottie animated graphics format https://lottie.github.io/lottie-spec/latest/single-page/ 14 | - Recall that the purpose is to be exported from Adobe After Effects as a vector graphic, then embedded into software via the runtime library. 15 | - Compact JSON format. Gradients are a flat array of numbers, color & opacity stops. 16 | - Some properties can be animated. An animated property consists of an array of __keyframes__. Each keyframe is a timestamp, as well as two cubic bezier handle points for the interval before and after that keyframe. They form a cubic curve from from [0,0] to [1,1]. 17 | - Just skimming. Besides keyframes and animations (spec unclear to me right now), the rest of the format is just: SVG shapes, colors, stroke/fill/mask/line caps, and transforms. Simple! 18 | - Animation = name, layers[], framerate, framenumbers, width/height, assets (reference, data URIs), markers (named time ranges), and slots (constants). 19 | - Layers can have children, so they follow the movement, kind of like an SVG group. It forms a transform hierarchy. Also they can have an ip..op range of frames in which they're visible (optimization for sequencing?). 20 | - Types of layers: solid bg, image (by asset ID), precomposition (component with other layers that can be transformed or referenced again), null (group), or shapes. 21 | - Overall a pretty simple format. Basically like SVG, but with precomposition, enter/exit times, and every property (scalars, position, color, gradients) is allowed to be a list of keyframes that are played back in animation. 22 | - Post-processing shaders https://blog.maximeheckel.com/posts/post-processing-as-a-creative-medium/ 23 | - Obviously dithering is possible, but I like this blog post's trick of rounding (u,v) in a shader as their efficient means for pixelating an image. Why downscale it yourself, when you can have the graphics pipeline do it properly for you with implcit mipmaps? 24 | - Lots of ideas with the pixelation, in the rough category of "shapes" or "colors" per pixel. 25 | - Seems fun to design around this for web graphics! They're __very easy__ postprocessing shaders to write, just a few lines of code, and very interactive. 26 | - Sidenote: The on-demand interactive code editors embedded into the website are pretty damn clean. They also rerender immediately. It's understated but very good frontend work to build this well. 27 | - Speaking of advanced image effects, you can do a lot with CSS! These image filters for simulating the light on cards are kind of awesome. https://poke-holo.simey.me/ 28 | - More 3D work to collect from people 29 | - https://darkroom.engineering/ 30 | - https://www.joshuas.world/ 31 | - https://qrshow.nyc/ 32 | - Reminded of 3D work 33 | - Google gravity, the original inspiration for a lot of this. 34 | - Case study, Lusion's connectors. 
https://codesandbox.io/p/sandbox/xy8c8z 35 | - The source of this is actually quite simple. No keyframes, just a physics engine with impulse -x applied on each frame, and a kinematic sphere collider. 36 | - Three different materials. Pull in the glTF for the connectors. 37 | - Some tech here. 38 | - N8AO = SSAO library https://github.com/N8python/n8ao 39 | - "Lightformer" = dynamic construction of HDRI environment lights with React, doesn't cost extra ray intersections like real lights do(?) 40 | - Overall only like ~100 lines of code, but you need to be comfortable knowing where all the controls are. Like setting near/far/fov in Blender, versus props on the scene. 41 | - Can use a design engineering workflow. Start with a main component, gradually add more stuff. But mix it in with creative execution, tweaking stuff. 42 | - Remember that the library editor is one way to construct these graphics, but it requires writing code. There's a bunch of merit to exploring in actual creative tools / software like After Effects, Blender, or Rive. They're purpose-built UIs instead of PL. 43 | - But I guess code helps capture custom logic in a way that isn't so easy. Tends toward anything more __procedural__, … up to generative art, or video games. Even postprocessing can be done better with node graph editors. 44 | - How does N8AO work (SSAO for three.js) 45 | - Two shaders: EffectShader and PoissonBlur, both WebGL. 46 | - There's some glue code to bind all the uniforms and inputs/outputs together. It seems to use textures and involves view / projection matrices, not sure exactly how though. 47 | - Ah okay, it takes the depth, normals, and albedo. Then calculates occlusion based on the depth buffer results. 48 | - TBN matrix = local coordinate system around the normal vector (tangent + bitangent). It uses blue noise jittering around each pixel to grab a few other samples, then weights them and reduces variance. 49 | - SAMPLES = 16, takes random samples along a hemisphere and then projects them back to world position and loads a point from the depth map. 50 | - Second part is a **Poisson blur**. Raw SSAO values are low sample count and noisy / jittered with random blue noise. So then it applies a Poisson disc blur (Poisson = random samples to avoid aliasing), weighted by a distance kernel. 51 | - March 30. Might spend this meeting thinking about maps. 52 | - Questions about tiles: What's the difference between actual terrain information and layers on top of that? How many layers are included in a tile server, and what's the standard? What free tile servers or datasets are available? 53 | - Question (practical): What's the best way to draw / customize a map on a website and to dynamically style it? How about post-processing the image or adding isolines? Surely it's not too hard of a systems problem. 54 | - Frontend: MapLibre GL vs Mapbox GL (both Mapbox API) vs Leaflet (less powerful). 55 | - Seems like Leaflet is probably slower and not optimized with WebGL, probably also not a very modern appearance. 56 | - MapLibre offers real-time data visualization, dynamic styling, GPU acceleration. 57 | - Interesting that [deck.gl](https://deck.gl/) exists as a library too. This looks like a pretty neat API for data visualization on maps. It's like the map itself acts as a canvas, and `react-map-gl` plus the deck.gl / loaders.gl ecosystem display data on top. 
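- The same stack is scriptable from Python via pydeck (the deck.gl bindings); a minimal hexbin-over-basemap sketch, where the CSV columns and view numbers are made up:
- ```python
  import pandas as pd
  import pydeck as pdk

  # Assumed input: a CSV of lng,lat pairs, like the hex bin demo mentioned above.
  df = pd.read_csv("points.csv")  # hypothetical file with columns: lng, lat

  layer = pdk.Layer(
      "HexagonLayer",              # aggregates points into extruded hex bins
      data=df,
      get_position="[lng, lat]",
      radius=200,
      elevation_scale=50,
      extruded=True,
  )

  view = pdk.ViewState(latitude=40.73, longitude=-73.99, zoom=10, pitch=40)

  deck = pdk.Deck(
      layers=[layer],
      initial_view_state=view,
      # The basemap is just a style.json URL; the tile source URLs live inside that file.
      map_style="https://basemaps.cartocdn.com/gl/dark-matter-nolabels-gl-style/style.json",
  )
  deck.to_html("hexbins.html")
  ```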
58 | - "styles.json" loaded from https://basemaps.cartocdn.com/gl/dark-matter-nolabels-gl-style/style.json — see https://carto.com/basemaps 59 | - CSV file in hex bin demo is just a list of lat,lng pairs. Then they're converted to [lat,lng] pairs by JavaScript before being passed in. `gpuAggregation` is interesting. 60 | - `transitions: { elevationScale: 3000 }` for the fade in animation. 61 | - `mapStyle` is set to a style.json URL, but where's the tile server URL when you're using the `` component from Maplibre? I'll check the Network tab to see what's going on here, maybe there's a hardcoded default. 62 | - Oh I just realized that the style file actually has the source as a URL in that file. `"sources": { "carto": { "type": "vector", "url": ".../tiles.json" } }` 63 | - https://tiles.basemaps.cartocdn.com/vector/carto.streets/v1/tiles.json 64 | - https://tiles-b.basemaps.cartocdn.com/vectortiles/carto.streets/v1/5/16/9.mvt 65 | - MVT = Mapbox Vector Tiles, a Protobuf of path winding number-based shapes. 66 | - Okay, getting more into the format here. The rendering library is code that takes ["style" documents](https://maplibre.org/maplibre-style-spec/) and turns them into interactive, rendered maps. 67 | - https://en.wikipedia.org/wiki/World_Geodetic_System 68 | - There are like at most 22-ish levels of zoom in maps (high resolution). This matches the Earth's diameter, which is about 12 million meters or 2^23. 69 | - Z=0 is the world in 1 tile. Z=14 is about 1.5 km per tile, or a neighborhood level. 70 | - At Z=22, you can see individual cars or benches. Each pixel is 2 cm. 71 | - "Web Mercator" (EPSG:3857) is coordinate system for point, linestring, or polygon. Each named layer in the tile has features, and each feature has key-value properties. 72 | - [tippecanoe](https://github.com/mapbox/tippecanoe) turns GeoJSON/Geobuf/CSV features into tiles. 73 | - > The goal of Tippecanoe is to enable making a scale-independent view of your data, so that at any level from the entire world to a single building, you can see the density and texture of the data rather than a simplification from dropping supposedly unimportant features or clustering or aggregating them. 74 | - The style files take a source, take a "source-layer" string within that source, and filter it by one or more conditions (`property == value`). Then you can add fill/background, which are colors graded by the zoom level. 75 | - You might also only show these layers at different minzoom / maxzoom levels. 76 | - Demos now! 77 | - People making small scenes in three.js! 78 | - mc-bench has this thing to render Minecraft scenes with commands in AI models. 79 | - https://design.google/library/exploring-color-google-maps 80 | -------------------------------------------------------------------------------- /2025/04-v8-javascript-engine.md: -------------------------------------------------------------------------------- 1 | - The V8 JavaScript engine (Apr 2025) 2 | - V8 bytecode is a "register machine with an accumulator register." So this is a bit different from Wasm's stack machine, for instance. There are a fixed set of numbered registers. 3 | - What is the accumulator register? Looks like an implicit input-output, like `%rax` in x86. 4 | - "Concise" bytecode, 25-50% the size of machine code. Maybe this was a design goal, but how is it possible for the bytecode to be that much smaller? 5 | - (Oh, well we're loading named properties from a lookup table instead of GEP's.) 
6 | - Originally Ignition did not exist, and V8 had a faster baseline compiler before its full JIT. This changed with their high-performance interpreter: it generates fast, optimized handlers for each opcode. 7 | - "Macro-assembly" reuses some of Turbofan's architecture-independent machine code to generate the interpreter handlers. This lets them have a really fast interpreter (faster than C++, avoids trampolines) but also avoids handwriting assembly for each platform. 8 | - Historically V8 had "full-codegen" and Crankshaft (2010-2016 era). This got replaced with just Ignition and Turbofan around 2017, so Ignition generates bytecode and Turbofan optimizes it after enough passes to trigger the JIT. 9 | - Ignition bytecode is linear, but Turbofan optimizes it via a control flow graph. 10 | - Animation of building the **"sea of nodes" graph** by walking through the operations: https://docs.google.com/presentation/d/1chhN90uB8yPaIhx_h2M3lPyxPgdPmkADqSNAoXYQiVE/edit?usp=sharing 11 | - What V8 bytecode looks like (note: as of 2016, when it was ~100 kLoC): 12 | - [Example of modern V8 bytecode in for..of loop](https://github.com/v8/v8/blob/ef7991fbda9e4f7d26c3b5b556bea99932e8476d/test/unittests/interpreter/bytecode_expectations/ForOfLoop.golden) 13 | - Glossary 14 | - "Ld" = load, "St" = store, "A" = accumulator, "R" = register 15 | - SMI = Smallint. or a small integer value between -2^31 and 2^31-1, which is efficient to pass around and call functions with. 16 | - Strict/Sloppy refers to JS "strict mode" (2009) which was introduced to make the language less janky and easier to optimize. Modern websites [use strict mode](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode). 17 | - Star/Ldar: accumulator, store into or load from register. (Ld/St : Accumulator : Register) 18 | - LdaZero, LdaSmi8 (SMI = Smallint), LdaNull, … = load into accumulator 19 | - LdrUndefined = load into register, micro-optimization for `undefined` since it's common 20 | - Binary operators: Sub/Add/Mul/…: arithmetic operations on accumulator and a register. 21 | - New (objects), CreateClosure (functions) 22 | - LdaGlobal, LdrGlobal, StaGlobal[Strict/Sloppy] 23 | - Unary operators: Inc, Dec, … DeleteProperty[Strict/Sloppy] act on the accumulator 24 | - Call, TailCall, CallRuntime, … 25 | - Conditionals and jumps 26 | - Test[Equal, NotEqual, LessThan, …] 27 | - Jump, Jump[Constant, IfTrue, IfFalse, IfUndefined, IfNotHole, …] 28 | - Context operations 29 | - PushContext, PopContext, [Lda/Ldr/Sta]ContextSlot 30 | - What exactly is a context? Is it used for switching between call stacks in generators perhaps, so you can have efficient async code? 31 | - SuspendGenerator, ResumeGenerator also exist 32 | - Other control flow 33 | - Return: return the accumulator. 34 | - Throw/ReThrow (hm, how are the exceptions caught?) 35 | - ForIn[Prepare,Next,Done,Step] opcodes baked in to support `for ... in` loops (seems unnecessary in modern JS) 36 | - Load/store of object properties 37 | - Separate opcodes for "named" properties or "keyed" accesses by int 38 | - ["Hidden classes" in V8](https://v8.dev/docs/hidden-classes), "shapes" in SM 39 | - Important optimization in JITs. Allows the compiler to infer the shape of a repeated object like `{ a: number, b: string }` and then store that more efficiently. 40 | - Powers the prototype chain, turns dynamic dispatch into more static dispatch. This is what lets you dynamically create objects in JavaScript and be lax about declaring object shapes ahead of time. 
41 | - The JIT is super important here, otherwise you do repetitive work for these kinds of objects. Think of the common pattern of returning arrays of CRUD objects for instance, and filtering or ordering them in a table. 42 | - Got linked [Hermes](https://github.com/facebook/hermes), a project to AoT-compile React Native apps into bytecode since Apple doesn't allow you to run JIT software on iOS for security reasons. 43 | - This is tied closely to __deoptimization__, as you track when conditions change and you need to return to bytecode. Optimized code has guardrails or assumptions, and if you notice one of those is breached then you gotta return to the interpreter. 44 | - Security in V8 (see [sandbox](https://v8.dev/blog/sandbox) blog post) 45 | - Memory safety is a big issue, in particular, subtle logic issues in memory accesses can turn into vulnerabilities. Can't be fixed by a memory-safe language. Of course that could fix the compiler __itself__, but not the JIT-generated code that's immediately evaluated. 46 | - Referred to as "2nd-order logic bugs" in the post. 47 | - __"The basic idea behind the sandbox is to isolate V8's (heap) memory such that any memory corruption there cannot "spread" to other parts of the process' memory."__ 48 | - Ideally you could implement this in hardware, with a mode-switch feature to make it impossible for the CPU to access out-of-sandbox-memory. But no hardware implements this, so they have something software-based instead. 49 | - How it works: all heap pointers are actually 40-bit offsets (1 TiB) into a reserved range of virtual address space for the sandbox. Not stack, just heap, and you have to compile an extra shift+add on every global memory load. (+1% perf overhead) 50 | - This is a good security boundary because it contains the impact of memory bugs in JIT-generated code, while also not penalizing performance much. 51 | - Think about serverless isolates: having strong memory safety is paramount. 52 | - Looking at a [V8 type confusion RCE](https://github.blog/security/vulnerability-research/the-chromium-super-inline-cache-type-confusion/) 53 | - Super inline cache (SuperIC) feature, history of being exploitable. 54 | - "Inline cache" is related to shapes. The Ignition bytecode collects profiling data for speculative optimizations on every run. Feedback is used to generate optimized machine code, so you don't need a dict property lookup. 55 | - Each V8 JS object has a "map" as its first property. Stores type + property offsets. If two objects have the same map (shape), their memory layouts are identical. 56 | - Each bytecode is run through simple IGNITION_HANDLER in `interpreter-generator.cc`. 57 | - Inline cache is IC. LoadIC_BytecodeHandler versus LoadIC_Miss, fallback if there's not enough feedback or the `map` is previously unknown. 58 | - It gets more complex from here. The tl;dr is that IC is great for optimizing property accesses, but if you mess up the validity of the optimized handler when using it on a LoadIC cache hit, you're in for a bunch of memory vulnerabilities. 59 | - Vulnerability background: Behavior of `super` and the prototype chain. In V8, there are a few different objects have fundamentally different implementations: compare `JS_OBJECT_TYPE` and `JS_TYPED_ARRAY_TYPE`. You can set the prototype inheritor to be a totally wacky object, which doesn't have the same properties accessible via `super`. 60 | - So the "super IC" is used for inline caching super property accesses. 
But it's checking the wrong `lookup_start_object_map`, for the object and not its superclass, boom! 61 | - To exploit the vulnerability, gets some JS objects, develop arbitrary memory writes in Blink, and then overwrite a compiled WebAssembly.Instance object code. 62 | - There is a `wasm-memory-protection-keys` flag, but can just overwrite in memory. 63 | - "Being able to discover and exploit these bugs requires a great wealth and depth of knowledge in both Blink and V8, and these bugs give us a glimpse of both the resources and expertise that bad actors possess and perhaps an area where more research is needed." 64 | - [CodeStubAssembler](https://v8.dev/blog/csa) (CSA / Torque) 65 | - Used to generate pre-compiled code implementations for JavaScript builtins, to speed up startup so that TurboFan doesn't have so much work for each operation. 66 | - For example, `string.length` is a builtin function, written in assembly. 67 | - CSA is like a cross-platform, universal assembly language that paints over the difference between architectures. So it uses the optimal instruction on each architecture, and it's a pretty lightweight abstraction. 68 | - CSA involved hand-writing assembly though and was too complicated. Torque is a DSL to make this less buggy / tricky to write. 69 | - See `src/builtins` directory, with .tq files. 70 | - For example, here is the implementation of [Array.from()](https://github.com/v8/v8/blob/master/src/builtins/array-from.tq), 170 lines of code. Kind of like a straight port of the JavaScript TC39 specification for its behavior. 71 | - How does the V8 microtask queue work and how is it different from Node's libuv? 72 | - Looks like microtasks are just a push-only queue of stuff to run after the JavaScript interpreter finishes what it's currently doing. 73 | - Microtasks are all run until completion, before the next task is run (e.g., an event handler or callback). Examples of microtasks are the body of a `new Promise()`, the callback of a MutationObserver, or stuff provided to enqueueMicrotask() manually. 74 | - This way, microtasks are built into V8, rather than being a feature of the surrounding runtime. Like the Blink render engine, or the Node.js event loop based on libuv. 75 | - Can be used for consistent execution order between Promise/non-Promise branches. 76 | - How does CSA actually generate machine code from macros — what are the macros? 77 | - `src/builtins/builtins-util-gen.h` defines TF_BUILTIN macro, used ~300 times 78 | - `src/codegen/define-code-stub-assembler-macros.inc` has `CSA_` macros 79 | - It looks like the implementations of TF_BUILTIN() do emit code via functions on the AssemblerBase class like Goto(), GotoIf(), and Branch(). 80 | - I guess the idea is that CSA needs to generate code for builtins, which is generic over different input types (Smi vs Float for instance) that is only available at runtime. So that's why it can't just be a C++ function. But it's not as dynamic as actual JS code, so it's more efficient to write it in this lower-level language. 81 | - What heuristic does TurboFan use to determine when to optimize code? 82 | - This is defined in the flags. 8x for Sparkplug (feedback register allocation), 400x for Maglev or 1000x on Android, and 3000x for Turbofan. 83 | - I guess it means that constant-factor overhead for Turbofan can be something like 1000x the cost of actually running the code. That gives the optimizer a bit of wiggle-room to deduce stuff about the code and run control-flow analysis. 
84 | - ```c++ 85 | // Tiering: Turbofan. 86 | DEFINE_INT(invocation_count_for_turbofan, 3000, 87 | "invocation count required for optimizing with TurboFan") 88 | DEFINE_INT(invocation_count_for_osr, 500, "invocation count required for OSR") 89 | DEFINE_INT(osr_to_tierup, 1, 90 | "number to decrease the invocation budget by when we follow OSR") 91 | DEFINE_INT(minimum_invocations_after_ic_update, 500, 92 | "How long to minimally wait after IC update before tier up") 93 | DEFINE_INT(minimum_invocations_before_optimization, 2, 94 | "Minimum number of invocations we need before non-OSR optimization") 95 | ``` 96 | - Compiling V8 and making a small change 97 | - https://github.com/pranayga/expl0ring_V8/blob/master/experiments/builtins/math_is42/commit_log.md 98 | - I'll try to add `Math.collatz()`. 99 | - Nevermind, this guide is 8 years out-of-date, and the MathBuiltinsAssembler doesn't exist anymore. Also all of the Math builtins are in Torque as well. 100 | - Difference between builtins and builtins-gen, which are generated by CodeAssembler. 101 | - Seems that the codebase has gotten a good deal more complex in 8 years. 102 | - We found the implementation of `GlobalIsNaN` though, that one is still in CSA. Good example! It's a bit verbose, but the way they generate the code is by basically having a function to emit each assembly instruction, so the C++ mirrors the assembly. 103 | - Pointer compression 104 | - https://v8.dev/blog/oilpan-library 105 | - https://v8.dev/blog/pointer-compression 106 | -------------------------------------------------------------------------------- /2025/11-large-data-formats.md: -------------------------------------------------------------------------------- 1 | - Large data formats (Nov 2025) 2 | - Parquet 3 | - Magic number, followed by a list of "row groups." Each row group (configurable, 128 MB default, can be up to 1 GB for high read throughput) stores a set of columns (equal length), each column is broken up into pages (default 1 MB). 4 | - Metadata is written at the end of the file (version, schema, type) to allow for single-pass writing. This means that reads need to backward scan from EOF. 5 | - Data types: Int32, Int64, Bool, Float32, Float64, Bytes, Fixed len byte array. 6 | - Pages have a compression code (zstd, snappy, lz4), and encoding (plain, RLE, delta, dictionary, or byte stream split). They also have column metadata for definition levels (optional fields) and repetition levels. 7 | - Delta encoding: https://arxiv.org/pdf/1209.2137v5 8 | - Each page can be encrypted and/or checksummed, will omit the details for now, allows for better failure recovery. 9 | - How to concatenate files, or add rows? Hard. 10 | - How to read a column? Go through every row group, read the column one page at a time. Supports "predicate pushdown" — fancy word that means a query optimizer can turn a full table scan into a column scan. 11 | - How to read a few rows? Find the row group, find indices for each column, read the page that the row's data is in for each column. 12 | - https://docs.rs/parquet/latest/parquet/column/index.html 13 | - Arrow 14 | - Basically just the only in-memory format you need, columnar storage, also concatenate variadic fields like strings. 15 | - Reduces copying since every library agrees to use Arrow's format. 16 | - Makes it easier to create libraries and databases, since you can use Arrow's optimized library of primitives (e.g., SIMD vector operations). 
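- A quick pyarrow sketch of the read paths above (file name, column names, and sizes are invented):
- ```python
  import pyarrow as pa
  import pyarrow.parquet as pq

  # Write a small table; row_group_size controls how many rows land in each row group.
  n = 1_000_000
  table = pa.table({"user_id": list(range(n)), "score": [i % 97 for i in range(n)]})
  pq.write_table(table, "example.parquet", row_group_size=128_000, compression="zstd")

  # Read one column: only that column's pages are touched in each row group.
  scores = pq.read_table("example.parquet", columns=["score"])

  # Predicate pushdown: row-group / page statistics let the reader skip chunks
  # that can't satisfy the filter, instead of scanning the whole table.
  hits = pq.read_table("example.parquet", columns=["user_id"], filters=[("score", ">", 90)])

  # Or address a single row group directly via the footer metadata.
  pf = pq.ParquetFile("example.parquet")
  first = pf.read_row_group(0, columns=["user_id"])
  print(pf.metadata.num_row_groups, first.num_rows, scores.num_rows, hits.num_rows)
  ```
- Everything above is driven by the footer metadata (schema, row-group offsets, statistics), which is why the single-pass-write / backward-scan-on-read layout works.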
17 | - Hardware trends - object storage (S3/GCS) 18 | - Object storage is great, infinite throughput / capacity and very reliable to manage. 19 | - Other systems like WarpStream (Kafka on object storage) have this same shape, much easier to manage and better throughput, but has a bit of latency penalty. 20 | - This is what people mean when they call stuff a "data lake" — and then "lakehouse" is like structured querying and libraries on top of the lake. 21 | - Lance / Vortex 22 | - https://lancedb.com/blog/lance-v2/ 23 | - https://docs.vortex.dev/concepts/arrays 24 | - https://docs.rs/vortex/latest/vortex/vtable/trait.VTable.html 25 | - "Layout" is described by the user, can involve either local or remote storage, various forms of chunking. So it's kind of like, building your own data format with the customization that you get. 26 | - Regarding claims like "10x faster scans, 5x faster writes" — don't think too much about it, probably just some marketing material. Could get this benchmark with various kinds of optimizations / reducing overheads, modern compression. 27 | - The file formats themselves are not __substantially__ better, maybe just targeting specific cases (random writes, wide rows like vectors and media). 28 | - Perhaps it's not too inspiring. I guess Parquet is "good enough" in some ways. 29 | - Actual benefits: Faster at random access, GPGPU compute, and better query engine / DB parts like predicate pushdown and fine-tuning storage to the data characteristics. 30 | - Other data formats 31 | - Smallpond: example tiny framework (Spark alt) that companies might build on top of NFS and data tools https://deepseek-ai.github.io/smallpond/api.html#high-level-api 32 | - Lots of functional programming ideas in data processing, especially in low-level APIs. Whether that's Spark or something else, you don't need to rely on optimizations like "predicate pushdown" if the programmer just handles it themselves! 33 | - In general, one way to think about data queries is as structured transformations, like building a logical plan within a database. 34 | - It hides the complexity of distributed scheduling, fault tolerance & I/O. 35 | - Why GPUs? Unclear what the importance is currently, but GPGPU is something people are thinking about in these formats, probably since networking is getting fast. 36 | - Tiktoken / BPE 37 | - Very nice and reviewable/correct implementation of BPE, including an educational one. 38 | - There are simple, basic regexes for initial pre-tokenization that converts text into words based on heuristics (see [pat_str](https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py) in the registry). 39 | - Special tokens are handled separately and initially by another regex-based substring matcher at start time, so they don't get clobbered by any other matching rules. 40 | - Once that's done, the Rust code precomputes "ranks" for each pair of adjacent indices i..i+2 and iterates them to find the lowest rank in the word. It merges that rank, while recomputing ranks for merging i-1 and i (folding i+1 into i). This is faster than recomputing ranks for all indices on each iteration. 41 | - Pretty straightforward, it is O(n^2) but that's ok since the pre-tokenization already makes word-based chunks pretty compact. Parallelism is by non-GIL Python threads. 42 | - You can draw parallels with dictionary compression algorithms like Brotli or Zstd. Tradeoff between vocab size and efficient representation. 
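- A toy version of that merge loop in pure Python, with an invented ranks table (real tiktoken does this in Rust, over byte-level ranks, with the incremental rank-recomputation trick described above):
- ```python
  # Greedy BPE merge over one pre-tokenized "word". Lower rank = merged earlier.
  ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}

  def bpe_merge(word: str) -> list[str]:
      parts = list(word)
      while len(parts) > 1:
          # Rank every adjacent pair; pairs with no merge rule get rank infinity.
          pair_ranks = [ranks.get(p, float("inf")) for p in zip(parts, parts[1:])]
          best = min(range(len(pair_ranks)), key=pair_ranks.__getitem__)
          if pair_ranks[best] == float("inf"):
              break  # nothing mergeable is left
          parts[best : best + 2] = [parts[best] + parts[best + 1]]
      return parts

  print(bpe_merge("lower"))   # ['lower'], merged in rank order: lo, low, er, lower
  ```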
43 | - Harmony renderer 44 | - Turns structured data (storage, main format) into text that can be passed into LLMs. Tokenizes media / allocates space for it. Inserts special tokens between text, likely some lossy encoding involved here. 45 | - Needs to escape special tokens so that users can't inject them haha. 46 | - Tool calling and JSON constraints are part of the renderer as well, I guess this defines the special tokens + response format used by the model, and then the model is fine-tuned to respect or follow those system instructions. 47 | - cuDF 48 | - Nvidia's interested in helping data scientists run their Pandas/Polars code faster, unedited. Just plug in the Nvidia GPU and see things accelerate. 49 | - So cuDF is an implementation of that, it may not be the fastest way to run operations (versus JAX/PyTorch), but it will have good drop-in support for most things and fits into the data science workflow. RAPIDS is the general project including scikit-learn compatibility. 50 | - Note that the operations may fall back to CPU for unsupported wrappers. So it seems like the data buffers still live on CPU, and you're going to be limited by CPU<->GPU link bandwidth (NVLink 50 GB/s, PCIe 32 GB/s). 51 | - Still a ton, and it's going to accelerate even element-wise ops! 52 | - Unless you have a ton of cores and set up NUMA properly. Remember that most things are compute-bound on CPU, that's why SIMD vector ops exist. 53 | - For neural networks though, obviously cuDF isn't the right tool. It's probably not going to take advantage of tensor cores mostly (except where cuBLAS might be used as a subroutine for various ops). 54 | - A dataframe library backed by CUDA cores, standard parallel programming model. 55 | - Why is Polars so fast? 56 | - Hmm, if you read the Polars code, everything looks to be implemented in quite ergonomic Rust. It's a complex library with good docs and code structure. I'm actually surprised how much they use closures and standard Rust / Rayon iterators (+ par_sort()) to back operations that aren't handled by Arrow. 57 | - Series (untyped) / ChunkedArray (typed) is the core data type, which wraps polars-arrow types, and eventually you get all the way down to an Arrow array (via FFI calls). 58 | - Basically, the speed comes from: good code and relying on Rust ecosystem, automatic multithreading (polars is threads=1), Arrow memory format by default, and lazy query plans to enable streaming + predicate pushdown + projection pushdown. 59 | - Kernels are implemented in polars-compute as traits for various dtypes. 60 | - Layering: Operations go to polars-plan (logical planning), which are then optimized (OptimizationRule) and converted to polars-expr (physical plans), which are finally executed by kernels. 61 | - Logical planning consists of `DslPlan` -> arena-allocated `IR`, which also contains `Expr` -> arena-allocated `AExpr`. Optimization rules transform the arena-allocated variants of both so you don't have a bunch of allocation and memory fragmentation. 62 | - Eventually, the arena-allocated IR plan gets sent to `create_physical_plan()` and recursively converts logical nodes into physical executors. 63 | - Logical expressions are lowered in parallel to physical (polars-expr). 64 | - Lazy API has a `describe()` function to inspect the query plan. 65 | - Polars can have more flexible advanced APIs since it doesn't need to "invent" SQL syntax to support them, for instance, rolling groupby. 
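- A small lazy-API sketch (file and column names invented) showing where those pushdowns come from, since the optimizer sees the whole plan before anything is read:
- ```python
  import polars as pl

  # Build a lazy query; nothing is read or computed yet.
  lazy = (
      pl.scan_parquet("events.parquet")          # hypothetical file
      .filter(pl.col("country") == "US")         # predicate pushdown: filter at the scan
      .select(["user_id", "amount"])             # projection pushdown: read only 2 columns
      .group_by("user_id")
      .agg(pl.col("amount").sum().alias("total"))
  )

  print(lazy.explain())   # inspect the optimized logical plan
  df = lazy.collect()     # logical plan -> physical executors -> multithreaded kernels
  ```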
66 | - They have blog posts about parallelism, fast vectorized data formats / storage (e.g., ArrowStr series) and hashing to avoid communication overheads in various compute kernels. Just done very well. 67 | - It's just a good, well-written library! Not much else to say here. The magic is that they implemented foundational building blocks very well and put them together. 68 | - DuckDB storage format 69 | - Oh wow, it stores data in row groups like Parquet. The default is 122,880 rows. I wonder if these are honestly more like column chunks though… 70 | - Hmm, parallelism is on the row-group level. This makes sense, you don't want to use a new thread unless your dataset is sufficiently large. I guess thread communication and startup overhead is around the same order of magnitude as 100k scalar ops. 71 | - [Internals](https://duckdb.org/docs/stable/internals/overview) docs are very concise and a good overview of all the passes. 72 | - Vector databases 73 | - Introductions from Pinecone employee 74 | - At a high-level, grab a bunch of vectors (embedding model outputs, 768/1536-dimensional vectors) and then do nearest neighbors or cosine distance search on them. 75 | - Custom indices that allow you to do this. It started as just a single file, indexing, and then a hosted compute model. Launched in 2019. 76 | - But eventually this got very expensive, moved to serverless over storage rather than pods, put a lot of effort making that fast & cost-effective / more scalable. 77 | - Explosion in usage after 2023, when "RAG" took off, some traffic from recommender engines and other AI work. 78 | - What are the system requirements? 79 | - There are few different kinds of customers, like AI companies or RAG apps. 80 | - Long tail of distributions of workloads. Either having not much data and lots of traffic, or having lots of data and not much traffic. Say, 95% of Notion users (partitioned indices) don't actually have data being accessed, lots of partitions, serverless. 81 | - But __recommender systems__ are a much more static dataset, like an online storefront (Amazon), with much higher query traffic. Semantic search is also different. 82 | - Loading and unloading very partitioned indices is important for user-level vecor search, and that's why multi-tenancy and scaling to zero is nice. 83 | - Pinecone archiecture https://www.pinecone.io/learn/slab-architecture/ 84 | - L0 is memtable / just brute-force search, L1 is 10k-1m vectors (FJLT projections / compression), L2 is 1m-100m vectors (IVF clustering), and then there's something else for L3. With more vectors, it becomes safer to look at less of them. 85 | - Architecture decouples indexing (research, geometry) from systems (files and distribution), slabs and object storage concerns! 86 | - Search in one slab is totally decoupled from that in other slabs. 87 | - Everything is just Parquet files in each slab, as a raw data format. Then the indices are built up on top of it. 88 | - With IVF, you can do 30-40M vectors. After that you need replication. Next step would be to add "graph indexing" (HNSW-like), which is scalable to larger datasets. 89 | - https://www.cs.princeton.edu/~chazelle/pubs/FJLT-sicomp09.pdf — this is just a JL random projection but sparse, 1/sqrt(d) of the entries in the (k x d) projection matrix is actually nonzero. But you add a FWHT (Hadamard matrix) to mix things up. 90 | - The IVF in pgvector is just a k-means algorithm, makes sense. 
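- A minimal IVF sketch in numpy, purely illustrative: cluster with k-means, keep an inverted list per centroid, and at query time scan only the few closest lists. The next bullets are about how k and the centers get chosen in practice.
- ```python
  import numpy as np

  rng = np.random.default_rng(0)
  vecs = rng.normal(size=(5_000, 64)).astype(np.float32)

  # Build: a few rounds of plain k-means (Lloyd's), then an inverted list per centroid.
  k = 32
  centroids = vecs[rng.choice(len(vecs), k, replace=False)].copy()
  for _ in range(10):
      assign = ((vecs[:, None] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
      for c in range(k):
          members = vecs[assign == c]
          if len(members):
              centroids[c] = members.mean(axis=0)
  lists = [np.where(assign == c)[0] for c in range(k)]

  # Query: probe only the nprobe closest lists instead of brute-forcing all 5k vectors.
  def search(q, nprobe=4, topk=5):
      near = ((centroids - q) ** 2).sum(-1).argsort()[:nprobe]
      cand = np.concatenate([lists[c] for c in near])
      return cand[((vecs[cand] - q) ** 2).sum(-1).argsort()[:topk]]

  print(search(rng.normal(size=64).astype(np.float32)))
  ```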
You have some heuristic for k, initialize the centers with another heuristic, and then (in the case of Pinecone) keep things immutable in slabs until compaction. 91 | - https://alex-jacobs.com/posts/the-case-against-pgvector/ 92 | - Vector indexing thought experiment 93 | - You can always do low-distortion embeddings, this is mathematically guaranteed to be fine (within epsilon~1%, etc.). It will reduce the constant factor (FLOPs time), just like what SIMD will do. Clustering is more sketchy though ("ignoring" some clusters), and that's when you need graph algorithms like HNSW (still randomized). 94 | - Quick explainer https://www.pinecone.io/learn/series/faiss/hnsw/ 95 | - Hmm, yeah this makes sense. Also the slabs / compaction system are optimized for backing by object storage, since they are immutable. (Just as LSM trees are optimized for coherent disk access.) 96 | - The vector indices themselves need to be tuned, and they have certain requirements / constraints. Everything else is a systems and engineering task over time. 97 | - Filtering needs to be done in stages as well. If you think about vector indices in a Postgres / pgvector integration for example, you EXPLAIN the query and see whether the vector index is used or not, then the filter needs to be done before / after. 98 | - It was brought up by Greg that in some ways, HNSW is kind of like iterated IVF, each level is its own kind of cluster index setup, and you stochastically build it. 99 | -------------------------------------------------------------------------------- /2025/02-gpu-kernel-programming.md: -------------------------------------------------------------------------------- 1 | - GPU kernel programming (Feb 2025) 2 | - CUDA programming cheatsheet (PMPP) 3 | - cudaMalloc, cudaMemcpy = device global memory 4 | - global & constant memory > block shared memory > thread registers & local memory 5 | - `__device__ float DeviceFunc` = device internal 6 | - `__global__ void KernelFunc` = host calls device 7 | - `__host__ float HostFunc` = ordinary function, you can use this along with device to get isomorphic libraries on both host + device 8 | - threadIdx.x, threadIdx.y = index of thread inside its block (Thread 17,26) — blockDim 9 | - blockIdx.x, blockIdx.y = index of block inside the grid (dispatch of kernel) — gridDim 10 | - `__syncthreads()` only synchronizes within a block, execution between different blocks is order-independent and cannot have data dependencies, CUDA can execute individual blocks in any order it chooses at runtime. 11 | - Warp = 32. Unit of thread scheduling in SMs, kind of like a cache line. Your blockDim should be a multiple of 32. 12 | - Avoid thread divergence within a warp. SIMT: every thread proceeds in lockstep. 13 | - Use favorable data access patterns. DRAM is not exactly random-access. Within a warp, access sequential data (row-major vs column-major = GMEM coalescing). 14 | - Sizing: You would have between 128-1024 threads per block, given block and thread per-SM limits. In practice, most people start with 128-256 threads per block. 15 | - blockDim should be on the order of a single SM. In T4-A10-A100-H100 GPUs, you get like maximum of 16 blocks per SM. But blockDim also can't be too big, since then it won't fit on a single SM and can't be scheduled at all. 16 | - ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2FYAugZxPZ_V.png?alt=media&token=263b5ba3-d40e-448b-a2dc-235b8d8cb6c7) 17 | - How to use shared memory? 
(64-228 KB) 18 | - Tiling for matrix multiplication: Every thread loads one element of the shared memory array. Then syncthreads(), do your computation on the whole tile, and syncthreads() one more time. 19 | - Also, prefetch tiles into registers during computation to do data load during independent compute. (Register size = shared memory size.) 20 | - Applications to scientific computing. Think in parallel, try to avoid data dependencies, in geometric applications you can schedule compute smarter. 21 | - Notes from siboehm — kernel optimization and tuning 22 | - The rough idea seems to be to start with a blueprint for this kernel (load stuff into SMEM, in the proper order with vectorization, then multiply). Then tune all the numeric parameters to suit a particular workload on hardware. 23 | - For example, global memory coalescing (2) is just changing the iteration order of the output from [x][y] to [y][x]. This translates an incoherent load on A into a coherent load on B. 24 | - Then, smem cache-blocking (3) shares the accessed memory across 32 threads. 25 | - Arithmetic intensity (AI) is computed as FLOPs / GMEM access. 26 | - But although smem cache-blocking increases AI by 16x (~1 -> 15), and thus fits under the GMEM bandwidth bound, it's still not close to reaching this limit. (And CUDA already does pretty good L1 caching automatically.) 27 | - There are many other parts to this calculation. Not just improving arithmetic intensity. 28 | - Occupancy = ratio of active/resident warps per SM, to total number of warps per SM. I think this is ~equivalent to counting threads since a warp = 32 threads. For example, there are 64 resident warps per SM on the A100, so you'd want to saturate that. 29 | - In theory then, 2 blocks is fine for this kernel, right? But the blog post says it isn't fine. I don't totally understand this — I think it might just be an illustrative example, but as they mention, SMEM isn't the bottleneck in this specific case. It's registers & threads. 30 | - Register count and SMEM capacity are the main limits to occupancy. You basically can't exceed the total L1 cache available to that SM. 31 | - Q: How to count resource demands (registers / thread)? Profiling tools? 32 | - Reading the profiler output 33 | - See mix of executed instructions by count: LDS, FFMA, IADD3. 34 | - See percentage of cycles in each warp state: Documented in [Nsight guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-reference). 35 | - So you see, if you have a lot of memory stalls in your kernel, and you also look at the PTX and see more latency in LD than FMA… it's time to start blocktiling. 36 | - This kind of optimization is just another way of increasing intensity. But the performance / profiling gives us higher confidence, and we can directly improve the kernel by reducing the memory stalling — less guesswork! 37 | - Basically, kernels 3->4 and 4->5 each increase the amount of work done per thread by 8x, and consequently decrease the number of GMEM accesses per element by 2x. The reason why is tricky. 38 | - Kernel 3->4 uses BM=BN=64 and BK=8. This means each block computes a 64x64 tile of output, but only uses 2x8x64 = 1024 input floats in shared memory. This would usually be too many threads! But since each thread computes an 8x1 submatrix of outputs (warp = 8x32), block size = 512. (Basically we add an inner loop `dotIdx` to each thread). 39 | - Kernel 4->5 increases AI more. Each thread computes an 8x8 tile (TM=TN=8). Also, BM=BN=128, wow! And BK=8 as before.
So we end up with block size of 16x16=256, but still less GMEM accesses. 40 | - At this point, SMEM access also can be optimized within block-tiling — this is where adding register caches helps. You can populate the relevant TN+TM entries into __registers__ before each iteration of BK. It's threadtiling! 41 | - Kernel 9 is autotuning of BM, BN, BK, TM, and TN. Straightforward. 42 | - Warptiling is yet another level that we can add on top of thread / blocktiling. I don't completely understand, but it seems to be just another loop based on having each __warp__ compute a given number of output elements in its dimension, same idea. 43 | - To summarize: Every hardware level divides up the compute a bit, with its own caching and scheduling primitives. The caches and vectorization mean that we need to change our iteration order to suit the hardware. Tile compute smartly to increase occupancy, reduce GMEM / SMEM accesses. Along the way, use profilers & tune parameters. 44 | - JAX: Pallas kernel language notes 45 | - `import jax.experimental.pallas as pl` 46 | - This is an embedded language in Python based on the concept of a "Ref". A bit higher-level, kind of like Triton in not needing to worry about the details of warptiling / blocktiling / threadtiling. Tries to maintain optimal control. 47 | - All the kernels so far use `[...]` to get the whole block. 48 | - You might want to write kernels differently on various devices. SMEM/GMEM loading. TPUs are like very wide SIMD registers, while GPUs are warp schedulers. 49 | - "Ref" = mutable buffers in SMEM. JAX has you chunk up the kernel into blocks. 50 | - `pl.BlockSpec` is the magic sauce that does the tiling and caching. Basically tells you how to carve up accessing each input for each output. In matmul, this is pretty simple, but you could implement blocktiling on it. Declarative instead of for-loops. 51 | - Fusing a tanh activation function onto a kernel is trivial. 52 | - Interesting: on TPUs, each operation has a cost. Addition, subtraction, multiplication are fast. But division/exp/tanh/pow are slower, and sin/cos are the slowest. 53 | - Tinygrad codebase notes 54 | - Small codebase, tries to be "everything" in a tiny package ~10K lines (it's dense). 55 | - Getting quite good Metal performance with Tinygrad, even when `BEAM=0` is disabled and just manual optimizations are turned on. 56 | - Interesting part is probably the backend, `get_kernel(Render, UOp) -> Kernel`. Applies optimizations in order and then does a Renderer pass. Kernel.opts is the "Renderer" that has all accelerator and dialect-specific stuff, like CUDA/Metal/C/PTX and so on. 57 | - required_optimizations() — seems minor, skip for now, one UPCAST. 58 | - apply_tensor_cores() — if device has TC, run optimizations that transform operations into tensor core arithmetic (__how?__) 59 | - hand_coded_optimizations() — if not TC, then there's about 100 lines of optimization passes, very short (__how does this work, and why is it so simple?__) 60 | - beam_search() — If beam search enabled (not by default), run beam search. This is slow but the results are cached after first evaluation. Runs the kernel many times. 61 | - On my M1 Pro, matmul is like 20x faster after hand_coded_optimizations() and another 2x faster after that with apply_tensor_cores(). So it's clear that the optimization magic is happening somewhere in both of these two functions. 62 | - (Also, the fact that it's device-independent is very impressive!) 63 | - UOp is the tree of executed operations, kind of like a trace. 
It's constructed lazily when you chain ops, and then `Tensor.realize()` actually executes the whole thing. 64 | - OptOps is TC, UPCAST, UNROLL, LOCAL, GROUP, GROUPTOP, NOLOCALS, PADTO, SWAP. 65 | - They can also be applied to a specific axis number or operation argument. 66 | - About the kernel itself: each tensor is multidimensional, and they have a coloring system to determine which parts of the kernel ("inner loops") correspond to each axis. 67 | - ```python 68 | # there's eight chunks of the shape 69 | # blue -- global dims 70 | # cyan -- local dims (warp ones first) 71 | # *** self.first_reduce 72 | # green -- reduce-local dims 73 | # red -- reduce loops 74 | # *** self.upcasted 75 | # purple -- reduce upcasted 76 | # yellow -- normal upcasted dimensions 77 | ``` 78 | - I'm guessing that each of the Opt classes does some recoloring of the axes in each operation, and then the colors are used to generalized an optimized kernel in codegen. 79 | - Okay, on further inspection of the tc and handrolled opts (exclude beam search): 80 | - Handrolled: Matvec multiply (upcast reduce), group and fuse reduction axes, image data types (stride=4), heuristics to also upcast non-reduce axes, and then local grouping. 81 | - **UPCAST** is mentioned in many places, what does this do? (yellow) 82 | - The other important ones are GROUP[TOP], LOCAL, UNROLL. 83 | - TC/PADTO are only used in tensor core opts, that's a separate thing. I can take a look at that later. Using a special instruction for matrix multiplication. 84 | - Note: Almost all operations can be represented in terms of the same shape algebra, this is why the code is so short! 85 | - At the last step of compiling a program, Renderer.render() then takes the linearized list of UOp (previously a tree) and lowers it into machine code for the backend. 86 | - A couple backends, most inherit from CStyleLanguage (Clang, OpenCL, Intel, Metal, CUDA, AMD, WGSL). 87 | - Some are special cases, like LLVM IR and PTX. 88 | - In all cases the code is __very short and dependency-free__. Just format strings. 89 | - Mysterious "swizzle" parameter in TensorCore. Looks like an odd permutation. It might have to do with how you arrange the first 10 axis numbers. Empirically determined? 90 | - More notes on Tinygrad (with Chenyu presenting!) 91 | - Biggest file is tensor.py, which basically all the high-level APIs on a tensor (matches Torch / JAX). Wrapper over an array of memory or something. 92 | - Different runtimes, `python -m tinygrad.device` to show supported devices. 93 | - Quick notes: inside `runtime/`, ops_gpu is OpenCL, disk is something to do with random-access memory, hip is AMD HIP, amd is AMD RDNA 3, Intel is OpenCL + XMX tensor cores, ops_nv is Nvidia without CUDA (PTX?), qcom is Qualcomm 94 | - DTypes supports ints, floats (no float8 yet), images. 95 | - tinygrad.nn: optimizers, mnist/cifar datasets, and model state (safetensor, tar, ggml) 96 | - tinygrad.gradient: has a `PatternMatcher` for each of the ~20 ops that maps them to the gradient of the op, very compact list of ops, not even a specialized convolution! 97 | - Philosophy of tinygrad is to simplify things down to as few ops as possible, and these are encoded in UOp. "If you get new hardware, you're not going to write a conv gate on it." 98 | - Tensor -> lazydata -> ScheduleItem -> ExecItem -> codegen -> source code -> compiled binary -> runtime 99 | - tinygrad.engine: everything before codegen, there's lots of stuff here! 
100 | - schedule.py: tensor graph, **grouping (kernel fusion)** and linearizer for topological sorting, as well as the memory planning 101 | - realize.py: ScheduleItem -> ExecItem (1 per kernel), can be pickled and sent to nodes 102 | - search.py: BEAM search and MCTS 103 | - jit.py: replay execution with new data? similar to ExecItem? — it can be used to speed up a repeated computation, can store a whole CUDA graph = multiple kernels 104 | - multi.py: NCCL and ring allreduce system 105 | - memory.py: "very straightforward" memory planning system 106 | - tinygrad.codegen: kernels, optimization actions 107 | - Kernel.to_program() -> ProgramSpec 108 | - Sticking to searches on real devices for now, rather than a "smart" simulator or something else. This helps you avoid problems. 109 | - linearize.py: order into blocks, any topological ordering of blocks is a valid program 110 | - lowerer.py: rewrites or permutes buffer indexes into tensor / array indexes 111 | - rewriter.py: general library for simplification rules on UOp trees; const folding, folding load/store, etc.; and every compile / lowering step has its own rewrite rules 112 | - transcendental.py: implement sin, exp2, log2, pow if device doesn't support it natively 113 | - __(currently doesn't support some common tricks like double-buffering, local memory copies)__ 114 | - tinygrad.shape: representing multi-dimensional arrays 115 | - view.py: lazy zero-copy views / permute on arrays 116 | - __trivia: determining the index to access in the buffer after a reshape() requires int division/modulo, which is slow on gpus! this is why modern gpus have dedicated integer units__ 117 | - shapetracker.py: shape-tracking algebra, stack of multiple views, tracking properties throughout the views (maybe locality is better? use OptOps.SWAP) 118 | - __Operation to merge two views / ShapeTracker.simplify() can be applied. But the algebra is surprising, you can have 3 views that are not piecewise merge-able but all 3 can be merged. Apparently it's NP-hard lol.__ 119 | - tinygrad.renderer: take code blocks from codegen, make them into real source code 120 | - Fun fact: for WGSL, browsers limit the size of your buffer, so you need to chunk the weights when running Llama. 121 | - Can learn and try some GPU programming with WebGPU! 122 | - tinygrad.runtime: copy memory between host/device, allocate RAM, & dispatch kernels 123 | - Some backends have RDMA. 124 | - These all just call into different ways to allocate a buffer, run a program ~100 LoC. 125 | - ops_cloud.py: send GPU backend operations over USB, HTTP (1 req per ExecItem) 126 | - ops_amd.py: they believe in AMD hardware 127 | - support/hcq.py: [Hardware Command Queue](https://docs.tinygrad.org/developer/hcq/) 128 | - graph/: Maybe you run llama, and it's like 300 kernels (ExecItem), or you train a big network and each step is 50,000 kernels — limited by queue dispatch time (each ~ns). This is CUDA Graph, you replay the queue of multiple kernels / save the dispatch for later. 129 | - tinygrad.viz: when `VIZ=1`, display a webapp of all the UOp graphs / rewrites 130 | - Convention for global buffers: input = 1 and output = 0 131 | - Environment variables 132 | - DEBUG=1..7 — show kernel dispatch, codegen, timing and shapes, gets more specific 133 | - TRACEMETA=3 — show line number of the code 134 | - BEAM=2..4 135 | - NOOPT=1 136 | - What is swizzle? It's apparently in the GPU docs, you need to swizzle your input into a specific register order, before calling into a tensor core instruction. 
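- Tying the pipeline together, a minimal end-to-end run (Tensor.rand / realize() are real tinygrad APIs; the comments are my summary of the steps above):
- ```python
  # Minimal end-to-end run: chaining ops builds the lazy UOp graph, realize() executes it.
  # Try DEBUG=2 (env vars above) to watch the schedule, codegen, compile, and dispatch steps.
  from tinygrad import Tensor

  a = Tensor.rand(256, 256)
  b = Tensor.rand(256, 256)

  c = (a @ b).relu().sum()   # nothing has run yet; this only grows the lazy graph
  c.realize()                # ScheduleItem -> ExecItem -> codegen -> compiled kernel(s)

  print(c.numpy())           # copy the result back to the host
  ```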
137 | - Currently looking for gains on better memory planning & kernel fusion. 138 | - OptOps 139 | - GROUP = […][…][…][…], an i++ loop for each thread 140 | - GROUPTOP = strided groups, like the "k" axis in blocktiling 141 | - UPCAST = float4 type, very fast 142 | - UNROLL = loop unrolling (similar to UPCAST in principle) 143 | - LOCAL = ??? 144 | -------------------------------------------------------------------------------- /2025/06-modern-vision-systems.md: -------------------------------------------------------------------------------- 1 | - Modern Vision Systems (Jun 2025) 2 | - Recap on generative models / [diffusion theory from scratch](https://iclr-blogposts.github.io/2024/blog/diffusion-theory-from-scratch/) — there are two dual ways of thinking about the DDPM work, one in terms of modeling score via Langevin dynamics / statistical mechanics, and one in terms of reversing a Gaussian diffusion process. 3 | - To estimate the score scalably, previously people used all kinds of complex objective functions. Key insight that made it work in DDPM was diffusing the sample data into Gaussians, and then only modeling the score function on the __parts that matter__ (i.e., the paths from the sample data to eventually reach their destination), not the whole space. 4 | - Last bit of the equation is the __schedule__ (beta): this basically tracks how much of the information you lose as you diffuse to Gaussians on each step / add noise (forward). It then affects the reverse equation / denoising process. 5 | - People use cosine schedules because they look good and empirically get better detail, but any schedule is theoretically grounded. 6 | - DDPM uses Fokker-Planck equation and Anderson's reversal to explain the math, versus the Langevin dynamics explanation (reverse-first) that builds up the theory chronologically. 7 | - Note that the score function depends on time t, this is an input to the learned model (U-Net for Stable Diffusion) via a sinusoidal timestamp embedding. 8 | - Practical diffusion models / advancements. 9 | - Some more steps are needed to go from the common math to actually usable results. This is why we had an explosion of papers after DDPM. 10 | - As for the actual neural network architecture: 11 | - Residual block: x -> conv1 -> (+time embedding) -> activation -> conv2 -> (+residual). 12 | - Attention layers: self-attention in middle block, and cross-attention with text or image conditioning in encoder and decoder stages. 13 | - Encoder blocks use 2x strided convolutions, decoder blocks use 2x upscaling. 14 | - `attention_resolutions` (say: 4, 8, 16) — at how much downsampling do you introduce attention blocks. These come between residual blocks, and they include both multi-head self-attention (like transformer encoder) and cross-attention. OK, easy enough. 15 | - Convolution is used by SD 1.5 / SDXL, other models like [Imagen](https://papers.nips.cc/paper_files/paper/2022/file/ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf) vary the U-Net architecture or make it more efficient. 16 | - Also [this paper](https://arxiv.org/pdf/2102.09672) "Improved DDPMs" (2021) looks like it was a pretty big practical advancement in efficiency of these models. It uses cosine schedule in forward pass and __learns__ the reverse variances via interpolation trick, model decides how confident it is in adjusting the variances. 17 | - Another [big paper](https://arxiv.org/abs/2010.02502) DDIMs (2020) that this built on, an efficiency improvement to see the code for. 
It's just a different "sampler" — and various models will either predict epsilon, x_0, or v (velocity). The sampler is kind of just a numerical integrator for the reverse SDE, basically how you actually go from scores to samples via the reverse Ornstein-Uhlenbeck (damped Langevin) process. 18 | - Most popular sampler today seem to be DPM++ or variants. Basically, all samplers converge to the same under infinite time steps, kind of like how RK4 and Euler do the same thing. But practical models today use 4-30 steps, and then the algorithm choice matters! 19 | - See [How Imagen Actually Works](https://www.assemblyai.com/blog/how-imagen-actually-works) for another blog post take on this. There's some more steps as you piece everything together: the U-Net diagram, STM/MTL super-resolution models, and classifier-free guidance (interpolating between guided=1 and unguided=0). 20 | - This is why the models are __systems__, they take the foundational math and create complex things out of it as they design model architectures. 21 | - DPM++ solvers are from [this paper](https://arxiv.org/pdf/2211.01095), including DPM++2M. 22 | - Alright, back to image backbones. Will read a bit about these. 23 | - EfficientViT calls itself an "image foundation model." Is this a compelling description of what it does, or is it more of a backbone? I guess it depends on how it can be used. 24 | - See this [recent comparison of Vision Transformers](https://arxiv.org/pdf/2308.09372) from February 2025. It breaks down the improvements into three categories: 1) faster attention or hybrid convolutional-attention algorithms / O(n^2) self-attention bottleneck, 2) removing / merging some of the token sequence, and 3) changing up the final MLP block. 25 | - Apple's Core ML 26 | - Apple's Core ML is surprisingly open-source and has a lot of details. What is the `.mlmodel` format, and how does inference actually work? Sounds like it's a lot faster in iOS 15+. 27 | - Their examples are older vision models (ResNet50, MobileNetV3-Small), sounds like this hasn't been updated too recently. 28 | - Utilities for quantizing model weights, other compression methods. The flow is pretty standard, you build out your model in PyTorch/TensorFlow or other frameworks, then trace and export it into a format compatible with Apple. 29 | - I guess that [OpenELM](https://apple.github.io/coremltools/docs-guides/source/convert-openelm.html) (2024) is a newer model from Apple, for Core ML deployment. 30 | - Unclear how the actual inference engine works, on-device. It's definitely a familiar mobile deployment story, and the company is in charge of the whole runtime engine. But I think, maybe the open-source ecosystem should be responsible for this. 31 | - Grounding DINO 32 | - Object detection model (DETR = detection transformer), built on [GLIP](https://github.com/microsoft/GLIP), which extends CLIP to object-level descriptions within an image. That's why it's grounded. 33 | - Previous object detection models might only let you detect predefined classes, but that doesn't generalize well. Rather than asking the model to do that, you can use text-image embeddings and then ask for captioning / detecting novel classes of objects. 34 | - Often plugged into robotics pipelines now. 35 | - Some people are using it for masked image generation with Stable Diffusion. Results seem somewhat questionable, but I get it. 36 | - SAM 2 37 | - Meta model for image segmentation from human descriptions. 38 | - What's the model architecture? 
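- Before moving on — to pin down the classifier-free guidance note from the Imagen discussion above, here's a schematic numpy sketch of how the conditional and unconditional predictions get combined at each sampling step. The `denoiser` here is a stand-in for the learned noise-prediction model, not any real library API:
- ```python
import numpy as np

def cfg_step(denoiser, x_t, t, cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. guidance_scale=1.0 recovers the conditional model."""
    eps_uncond = denoiser(x_t, t, cond=None)   # "unguided" pass (empty prompt)
    eps_cond = denoiser(x_t, t, cond=cond)     # prompt-conditioned pass
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# toy stand-in denoiser so the snippet runs end to end
toy = lambda x, t, cond: 0.1 * x + (0.0 if cond is None else 0.05)
x = np.random.default_rng(0).standard_normal((4, 4))
print(cfg_step(toy, x, t=0.5, cond="a photo of a cat", guidance_scale=7.5))
```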
39 | - Vision model foundations 40 | - Let's drill into ViT models and then alternatives that explore their computational efficiency and rationale behind the model architecture. 41 | - MobileViT (2021) https://arxiv.org/abs/2110.02178 42 | - EfficientFormer (2022 / 2023) https://github.com/snap-research/EfficientFormer 43 | - Question: What happened after 2023? Do people not study this direction anymore? (It sounds like they're still the best models out there, but maybe progress stalled. Or the ChatGPT launch led people to focus on other kinds of research.) 44 | - [CLIP-UP](https://machinelearning.apple.com/research/clip-up) (2025) — A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling 45 | - [MetaFormer](https://arxiv.org/pdf/2111.11418) (2022) 46 | - This is a criticism / analysis of the core idea that makes vision transformers work. Also mentions / compares to architectures like ViT, ConvNeXt. Got here by starting on the FastViT paper and learning about the architecture. 47 | - It's not necessarily the self-attention heads. What matters is a "token mixer" module, having something that combines information between spatial tokens. 48 | - The general idea is, image -> patch embedding -> blocks, where each block is [norm, token mixer, residual, norm, MLP (ratio 4), residual], like a transformer. 49 | - Except the token mixer doesn't need to be self-attention, it could even just be a high-pass filter with 0 parameters. Or as in the FNet paper, they replace attention with a Fourier transform that also does not have any parameters! 50 | - PoolFormer gets reasonable results, despite lacking self-attention. Patch stride 4, then /8, /16 and /32. 51 | - ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2FmX9vnQT6vC.png?alt=media&token=dc77e7a1-87ec-4f2b-b262-dfaf0e954fe9) 52 | - An interesting relationship here with ConvNeXt, which is a very popular pure-convolutional architecture (not a transformer) that has __depthwise convolution__ blocks and 1x1 convolutions. The "pool" operation is like a fixed-weight depthwise convolution, and the 1x1 convs are like per-token MLP blocks. 53 | - So, ConvNeXt has an almost identical block design compared to MetaFormer, I guess this is where the key idea comes from originally. 54 | - MetaFormer just takes it to the logical conclusion by eliminating attention. 55 | - Something to think about with all of these models — what's the __training recipe__? I often don't think about this too much since I'm just doing inference, but it's important to think about where the datasets are coming from, and what we can reproduce practically. Otherwise, it's just a bit out of reach for people outside of the big labs. 56 | - FastViT (2023) https://github.com/apple/ml-fastvit 57 | - Newer paper than EfficientFormer, ConvNeXt, Swin-T, CMT. 58 | - Interesting that they also still compare to EfficientNet-B4, and ResNet-50. Guess those traditional convolutional neural networks are still interesting to people. 59 | - __"novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network"__ 60 | - Nice, it's the top model on the [Core ML Models page](https://developer.apple.com/machine-learning/models/) on Apple right now. The smaller variant T8 is only 6.5 MB, headless, and runs in less than a millisecond! 61 | - They also replace all convolutions with depthwise + pointwise counterparts.
These convolutions are typically used in stem/patch embedding layers. 62 | - Note: patch embedding = ViT, stem embedding = ConvNeXt and FastViT. 63 | - Patch embeddings are rigid divisions of the image, while stem embeddings are just a bit more overlapped, perhaps stronger local features. 64 | - RepMixer(X) = DepthwiseConv(BN(X)) + X. At inference time, with batch size 1, this can just be reparameterized to a single DepthwiseConv! 65 | - Lol this is funny, they compared it to PoolFormer and found a 43% latency reduction due to the lack of skip connections! Less memory accesses. 66 | - ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2FG2glEvhPzy.png?alt=media&token=0ddb7d9c-1114-4c42-9e58-235242a6aa14) 67 | - For semantic segmentation, they use the [Semantic FPN](https://arxiv.org/abs/1901.02446) decoder (2019). 68 | - Project ideas: take your webcam, run some segmentation on it, apply cool filters like reflective light effects (Vercel-ish) or physics. 69 | - Rectified flow 70 | - I guess I'm returning to some of the diffusion model stuff, since it's the most fun and mathy part. Also I like geometry, so seeing these ODE / SDE models is cool! 71 | - History of models: 72 | - SD 1.1-1.4 = original by CompVis with the same model architecture and different training checkpoints (LAION), then SD 1.5 = Runway, SDXL / SD 3.x = Stability AI. 73 | - After Stability AI collapsed, the researchers moved on to Black Forest Labs. They are releasing FLUX.1 with different variants and sizes. Each of these models is pretty flexible (conditioning, output size, …), and Schnell is the smallest. It's what powers the image generation on the Modal.com site, which is very fast. 74 | - Looks like state-of-the-art has moved on from diffusion (DDPM / DDIM / …) into rectified flow transformers (SD 3.5, FLUX.1). Biggest models are expensive and not public, but they're also not __that__ much better than the open-source available ones. 75 | - https://rectifiedflow.github.io/ — rectified flow tutorial, series of good posts 76 | - Going to take some notes from reading / skimming the [flow book](https://www.cs.utexas.edu/~lqiang/PDF/flow_book.pdf) though. This one is about 100 pages long, and I think the complete exposition would help. 77 | - Ch 1. Rectified Flow 78 | - You write an ODE that takes $$Z_0 \sim \pi_0$$ and moves it along $$\dot Z_t = v_t(Z_t), t \in [0, 1]$$. 79 | - Note that the velocity curve is time-dependent. Can solve this using Euler's method or other solvers. Many flows but aim to have __straight transports__ for computational efficiency, since you accumulate less error in each step. 80 | - The "rectified flow" idea is to take an arbitrary coupling of $$\pi_0$$ (noise) and $$\pi_1$$ (model distribution), then just train the ODE model $$v_t$$ by interpolated paths from each pair in the coupling. __Hand-wavy part:__ The parts where the paths "cross over" naturally average out during training, and we're left with a rectified model. 81 | - Rectify() can be used on any time-differential stochastic process. It produces a more optimal transport, since there's less cost / Wasserstein metric? Also during inference you get a "straighter" path. 82 | - You can do it with any interpolation process, $$(\alpha_t, \beta_t)$$ where alpha: 0->1 and beta: 1->0. This encodes time scaling. When alpha+beta=1, you get linear interpolation, and alpha^2+beta^2=1 gives you DDPM/DDIM (spherical). 
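- To make the interpolation concrete, a minimal numpy sketch of the training objective this implies, assuming the straight-line schedule (alpha_t = t, beta_t = 1 - t). `v_model` is a placeholder for the learned velocity network, so this only shows the shape of the loss, not a usable trainer:
- ```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_model, x1_batch):
    """L2 flow-matching objective for linear interpolation:
    X_t = t*X_1 + (1-t)*X_0, so the regression target is dX_t/dt = X_1 - X_0."""
    x0 = rng.standard_normal(x1_batch.shape)      # noise endpoints from pi_0
    t = rng.uniform(size=(x1_batch.shape[0], 1))  # random timesteps in [0, 1]
    xt = t * x1_batch + (1.0 - t) * x0            # interpolated points
    target = x1_batch - x0                        # \dot X_t for this schedule
    pred = v_model(xt, t)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

v_toy = lambda xt, t: np.zeros_like(xt)           # stand-in "model"
data = rng.standard_normal((8, 2))                # pretend 2-D data samples
print(flow_matching_loss(v_toy, data))
```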
83 | - I think I can see how the loss term matches up between flow matching and DDPM-style score matching! In either case — you train by averaging over random couplings of data to noise, and random timesteps, taking L2 error. So they're kind of the same quantity, just interpreted differently though score and flow. 84 | - Things to vary: loss function, time-weighting by $$\eta_t$$. You can also improve the sampler by adding a bit of steady-state Langevin dynamics on the distribution of $$Z_t$$, this provides some negative feedback but can reduce detail if too much noise is added. You need the score $$\nabla \log \rho_t(Z_t)$$ to do this, but that's obtainable in closed form from the flow thanks to [Karras et al., 2022](https://arxiv.org/pdf/2206.00364). 85 | - Historically it developed as diffusion, then diffusion as ODEs (DDIM), then just learning flow directly — with possibly some diffusion in the sampler. 86 | - Ch 2. Marginals and Errors 87 | - There's some math here to prove that Rectify() preserves marginals for all t. If you have a time-differential stochastic process $$\{X_t\}$$, and $$\{Z_t\} = \mathrm{Rectify}(\{X_t\})$$, then we want $$X_t \sim Z_t$$ to match for all t. 88 | - Okay they did a couple manipulations with derivatives + Adam's law which amounted to basically, ending up at some integral equations of a test function over the distributions $$\pi_t$$, based on $$v_t^X(X_t) = \mathbb E[\dot X_t \mid X_t]$$ the normalizing flow. Then they cited a random theorem from analysis to finish it off, a uniqueness result for flow solutions via measure theory. 89 | - Another intuition to think of it: it's kind of like a continuity equation. If you have two flows in different directions, you can kind of average them out, and the same "quantity" of probability mass flows in aggregate. 90 | - Next, they study the KL divergence, Bergman, and semantic loss of these marginal distributions with ODE vs SDE. I didn't look at this too closely. 91 | - Ch 3. Interpolations and Equivalence 92 | - If you pass the $$X_t$$ process through a time-variant diffeomorphism $$\phi_t(x)$$, and/or a time schedule $$\tau_t$$, this commutes with rectification. (ok seems reasonable, the averaging is a local property) 93 | - In particular, this means all the different affine transformations are equivalent, with different alphas, betas, and time schedules. (…really? wat) 94 | - They also prove that straight and spherical are the same, up to a reparameterization. Their explanation is the identity $$\dot \alpha_t' \beta_t' - \alpha_t' \dot\beta_t' = \pi/2 = \text{constant}$$. (Note that the book has a typo.) Seems believable enough, cool — this means spherical vs linear doesn't matter in training if true. 95 | - Euler method can be viewed as piecewise linear approximation of the flow. But if you're doing a "curved interpolation scheme", you want __natural Euler samplers__ that is invariant/equivariant. Includes DDIM, DDPM, EDM, DMP solvers. 96 | - Natural Euler = instead of line segment, take the same interpolated curve at each point, but extended tangent to the velocity. This is same as Euler for linear interpolation, but for spherical (affine), you use some trig identities and get the same formula as DDIM. So DDIM is derived as a special case of natural Euler. 
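- For reference, the plain Euler sampler for the learned flow is only a few lines — a numpy sketch with `v_model` again standing in for the trained velocity field. A natural Euler sampler would replace the straight-line update below with a step along the interpolation curve's tangent:
- ```python
import numpy as np

def euler_sample(v_model, z0, num_steps=8):
    """Integrate dZ/dt = v(Z, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    z, dt = z0.copy(), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * v_model(z, t)   # straight-line step along the predicted velocity
    return z

rng = np.random.default_rng(0)
v_toy = lambda z, t: -z              # toy velocity field that contracts toward the origin
print(euler_sample(v_toy, rng.standard_normal((4, 2))))
```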
97 | - [Their blog post has a great visual of this, applied to DDIM.](https://rectifiedflow.github.io/blog/2024/DDIM-and-natural-euler/) 98 | - It's mentioned that you can derandomize __stochastic__ smooth interpolations by just Rao-Blackwellizing away the randomness, lol. But this is definitely a case where the theory doesn't quite hold in practice. 99 | - Ch 4. Identities 100 | - This is where they derive Tweedie's formula and score functions for straight interpolation to a normal distribution. Nice! This is what lets us add additional noise with Langevin dynamics during sampling, for flexibility. ("Karras", I think) 101 | - Some more identities with covariance, curvature, L2 error bounds. Fine. 102 | - Ch 5. Stochastic Solvers 103 | - Alright, here is where they get back to the main course. ODE can have a stochastic solver by re-injecting a bit of Langevin dynamics, tuned as needed. ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2FlktyvtxGKv.png?alt=media&token=98fa32ab-a066-4fde-877b-857f366bda37) 104 | - Tweedie's formula in Ch 4 is how we compute the score in the second term. 105 | - Why bother with Langevin dynamics? It's a __self-correcting mechanism__ or something, reduces outliers. Maybe a little counterintuitive, but __larger diffusion yields more concentrated samples__. Why? 106 | - We recap Euler–Maruyama / Brownian motion and Fokker–Planck Equation, which drifts particles away from high density (i.e., by negative score function). 107 | - "It is known that Langevin dynamics can be viewed as the gradient flow of the KL divergence … under the 2-Wasserstein metric … a gradient flow in the space of distributions" wtf 108 | - My attempt to translate to English: given any distribution, if you apply LD to it, you get closer to the target distribution by KL divergence. In fact, you get __maximally closer__ per unit of Wasserstein metric ("earth moved"). 109 | - So the intuition is that LD corrects distributional drift. Great! 110 | - Something about different noise schedules here, including derivations for the original DDPM paper. 111 | - Ch 6. Reward Tilting 112 | - Here "tilting" means that we weight the final distribution, in a Bayesian sense where we multiply it by a reward function $$r(x)$$ and then renormalize. I think an example is a censor that weights improper images with $$r(x)=0$$. 113 | - My first instinct is to just do importance sampling. But this is slow / not guaranteed. You'd much rather get a new flow, so you don't waste time. 114 | - "The question is how … preferably without retraining the model." 115 | - So they tilt it, and now we're trying to recover properties of the tilted process by doing some mathematical tricks. 116 | - Oh I see. They end up with "Gaussian tilting" results that tilt the model in terms of L2 distance away from a target point $$x^*$$, with linear/spherical schedules. Basically updating the velocity field to point more in that direction, but mathematically sound. 117 | - Algorithm also can be used with negative factor to steer away from a point. 118 | - Seems of questionable utility given existing text-conditioning. 119 | - Reflections 120 | - The math is nice, although they spend a lot of time just thinking about this abstract idea of "rectification" in theory, which may or may not actually align with what's going on with rectification in practice. 121 | - The "neural network as a learned function" is doing __a lot__ of heavy lifting in this description. 
The dynamics of training the neural network probably are equally as important as the flow itself, since it's due to the inductive biases / continuity that any rectification happens at all, with floating-point numbers being inexact. 122 | - Also the 2D example they have of the o=>o flow with crossing paths on a square is iconic. It might be the fundamental picture of what's going on. 123 | - They have a bunch of algebraic formulas. It must take quite a bit of practice to convert between the different time schedules, diffusion ODE/SDE terminology, and interpolations in your head. At least this book is relatively consistent in its notation, unlike papers. 124 | - What is actually in SD3? 125 | - [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf) 126 | - This paper starts with rectified flows ("not yet decisively established as standard practice") and then tries 61 different combinations of loss, schedule, on a few datasets. 127 | - Also parameterize based on model depth, run a bajillion scaling experiments. 128 | - The model architecture itself is a pretty straightforward extension of [Diffusion Transformers (DiT)](https://arxiv.org/pdf/2212.09748), where the latent image is split into patches. Except with two modalities for images and text, which are sent in parallel with different weights (but combined for self-attention). 129 | ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2F2N8XN3KkBU.png?alt=media&token=1fde0c20-7e1e-4c1b-80b8-b3f7567abc0a) 130 | - I guess one interesting thing is how they use CLIP-G/14, CLIP-L/14, and T5 XXL together — the former two are used in the adaptive layer norm weights (adaLN), which isn't like the cross-attention in prior U-Nets. But the latter is principally used as text tokens & passed through the DiT blocks directly. I wonder why this worked for them. 131 | - Alright, this seems pretty straightforward. One thing to remember here is that "straight line" refers to the forward / generative process for rectified flows, as opposed to DDPM. The sampling process is still always going to involve multiple ODE/SDE iterations. 132 | - Stepping through FLUX.1 Schnell code 133 | - Going to take some notes on this process here. Getting started, I downloaded the two big weights files, ae.safetensors and flux1-schnell.safetensors. These have some metadata and then a few hundred big array buffers in bfloat16 format. 134 | - First issue is loading them using the safetensors library. Unfortunately, this doesn't support bfloat16 for JAX. It only has that for torch. So that's an ecosystem annoyance. 135 | - When I run `t = f.get_tensor("decoder.conv_in.bias")` on the flax mode, it gives me float32 dtype. The shape is right at least, and the weights look reasonable. 136 | - I guess I can load them in numpy mode, then run jnp.array(…, dtype=jnp.bfloat16). 137 | - BTW, I think I won't have enough RAM on my laptop for this whole thing. Will start by developing the autoencoder though, and then maybe we can move to a Modal notebook (or function) on H100 for the rest of the implementation. 138 | - Flow might end up being, rapidly push changes to a Git repo, then `pip install` from that Git repo inside my notebook? Kind of jank TBH. 
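- One detail from the SD3/DiT notes above worth pinning down is the adaLN-style conditioning: instead of cross-attention, a pooled conditioning vector modulates each block's layer norm. A schematic numpy sketch — the shapes and projection matrices here are my own assumptions, not the SD3 code:
- ```python
import numpy as np

def adaln(x, cond, w_scale, w_shift, eps=1e-6):
    """Adaptive layer norm: normalize tokens, then scale/shift them using
    projections of a conditioning vector (e.g., a pooled CLIP text embedding)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)    # per-token layer norm, no learned affine
    scale = cond @ w_scale                   # (d_cond,) @ (d_cond, d_model) -> (d_model,)
    shift = cond @ w_shift
    return x_hat * (1.0 + scale) + shift     # modulation instead of cross-attention

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))       # 16 image-patch tokens, d_model = 64
cond = rng.standard_normal(32)               # pooled conditioning vector, d_cond = 32
w_scale, w_shift = rng.standard_normal((2, 32, 64)) * 0.02
print(adaln(tokens, cond, w_scale, w_shift).shape)   # (16, 64)
```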
139 | - [Penzai](https://penzai.readthedocs.io/en/stable/notebooks/how_to_think_in_penzai.html) is cool with its pretty-printing and visualization tools for research, too tricky to get used to right now though, I just want to get something running. 140 | - Ok hm, I got a bit further and think it's probably just going to be more productive for me to first copy/paste big chunks of code into a notebook and explore it by changing things up or profiling it, rather than reimplementing everything from scratch at first. 141 | - I'm midway through, got the VAE latents, also figured out that FLUX.1 dev/schnell and the other variants are actually just the same architecture, 12B parameters each. 142 | - Difference is training process, schnell takes fewer diffusion steps, maybe trained with a bias toward straight-line ODE flows? 143 | - RoPE = rotary position embeddings; the channels come in pairs. Each channel pair i at position j gets rotated by an angle like j / theta^(2i/dim). This makes sense — the rotation frequency falls off exponentially with the dimension number. 144 | - Alright, almost there. Copying was definitely the move, can skip over parts I'm not so interested in __at the moment__ like pretrained encoder submodules. Now I'm running into an issue where the model weights that I loaded are all Float32 instead of Bfloat16, which seems like a mistake when loading the safetensors files. 145 | - Ah, I'm dumb, I just didn't write `.to(torch.bfloat16)` during initialization. 146 | - Ok it works now. Glad I have LSP support and AI in this notebook environment, the delay isn't so bad honestly and it's very helpful when coding at 1 AM. 147 | - Need to restart kernel, oops. My messing around has used up all the GPU memory, and I think IPython is keeping around references to the variables that I printed out in cells, which prevents reclaiming the model memory. D: 148 | - There goes probably another 5-10 minutes, will come back. Back down to 33 GB / 80 GB. 149 | - Great, we did it! I understand a bit better where the surfaces for creative control here are. A lot of the workflows fit well into torch primitives, with some loose wrappers. It strikes me as a good balance between flexibility and structure — most modules in this codebase do what you expect, but they can also be modified from the outside, by autocast or no_grad or device casting and so on. Not too many model edits needed. 150 | - https://gist.github.com/ekzhang/663b6bccd1bfa1848c02a45afa171cef 151 | -------------------------------------------------------------------------------- /2025/07-performance-case-studies.md: -------------------------------------------------------------------------------- 1 | - Case studies on software performance (Jul 2025) 2 | - uv 3 | - Looks like we were able to generate a reasonable flame graph, targeting some good build options. Optimizations turned on, but also debug symbols.
4 | - ```shell 5 | #!/bin/bash 6 | 7 | # cd into directory of the script 8 | cd "$(dirname "$0")/" 9 | 10 | UV=$(realpath ../target/profiling/uv) 11 | 12 | sudo rm -rf .venv 13 | sudo rm -rf cargo-flamegraph.trace 14 | 15 | $UV venv 16 | $UV python install 3.13.3 17 | 18 | echo "uv managed python list:" 19 | ls -Gh ~/.local/share/uv/python 20 | 21 | sudo flamegraph --root -- \ 22 | $UV --managed-python --color always pip install --no-cache torch 23 | ``` 24 | - https://gist.github.com/ekzhang/01c72db3c76387918c4d95db7d6df0d3 25 | - Debug flamegraph has a lot more rectangles / nodes, probably because many function calls get inlined out when compiling in release mode. In particular, you can see the whole call stack for uv_client / uv_distribution and zlib while in debug mode, but most of the calls get turned into zero-cost abstractions when optimizations are enabled. 26 | - Creating files actually seems to be a nontrivial number of samples in release mode. I wonder why, that's kind of interesting. Probably an artifact of the event loop removing call stacks — actually file creation originates from cached requests & populating site-packages? 27 | - Reading some code now, I'm starting to get a sense for the shape of `uv pip install`. 28 | - There's a lot of code here, maybe some 6k lines per uv-client, uv-distribution crates, and then another 20k in uv/src/commands (the CLI) and so on. 29 | - A lot of the code seems to be unit tests though (probably at least 50%), for data structures with a whole bunch of manual serialization impls (why not use Serde? I guess custom logic for Arch, Os, Python version, etc.) and request constructors. It's a very organized way of implementing a complex protocol. 30 | - uv-python: data structures for environment preference, interpreter version, filenames, "missing version" errors, platform, etc. 31 | - uv-client / CachedClient: This seems to be in the hot path of `uv pip install`, and probably other requests. It wraps a registry client, and there's some implementation of the HTTP caching headers framework to store downloaded blobs on disk. 32 | - BurntSushi wrote a bunch of this code, you can tell from the extremely detailed comments haha. Really thought through every design choice in data layout. 33 | - Interesting that they're using the async-compression crate, guess that's just fast enough. It makes sense. Networking is probably going to be your bottleneck for most developer or CI setups; this isn't a reverse-proxy or network intensive service. 34 | - uv-distribution / DistributionDatabase: Also on the hot path. Converts a Dist (source or binary distribution) object, which references registry/git/path/url/…, into a wheel or wheel metadata. Does this via a `client` (see above) and a `builder`. 35 | - Charlie Marsh is also big on comments. They stand out a lot. Documenting historical and present behavior, finding hashes, following links, validating metadata and so on. I guess this is what it takes to bring order to the chaos of Python packaging. 36 | - This package does seem very complex though. There's a lot of branching factor in every step of the code, as it decides which path to take based on cache availability, archiving, and recency. 37 | - (I guess this is good since it handles all the possible cases like editable vs git vs remote vs registry. Curious how it evolved though. Is there a spec?) 
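- To make the CachedClient idea above concrete, a toy Python sketch of conditional-request caching with ETag / If-None-Match — generic HTTP-caching logic only, nothing to do with uv's actual Rust implementation or its on-disk layout:
- ```python
import urllib.error
import urllib.request

cache: dict[str, tuple[str, bytes]] = {}   # url -> (etag, body); a real client persists this on disk

def cached_get(url: str) -> bytes:
    """Fetch `url`, revalidating a cached copy with If-None-Match when possible."""
    headers = {}
    if url in cache:
        headers["If-None-Match"] = cache[url][0]   # ask the server to confirm freshness
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            etag = resp.headers.get("ETag")
            if etag:
                cache[url] = (etag, body)          # store the blob keyed by its validator
            return body
    except urllib.error.HTTPError as err:
        if err.code == 304:                        # Not Modified: reuse the cached blob
            return cache[url][1]
        raise
```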
38 | - uv-resolver: This is a beast, some 20k lines of code split up between package resolution, lockfile format (7.5k), [PubGrub algorithm](https://nex3.medium.com/pubgrub-2fb6470504f) (2.5k), the core resolver itself (4.5k), and the top-level modules which look to be some kind of configurable, high-level graph algorithm. (wtf is UniversalMarker?) 39 | - PubGrub is cool. A concrete algorithm based on ideas from constraint solving systems, ASP / SMT techniques, symbolic stuff. 40 | - Now I know why uv doesn't do that stupid poetry / pip thing where it goes through every O(n^2) versions of aiobotocore + aioboto3 checking for compatibility with a particular version of botocore, lol. 41 | - Ran an [old fork of tokei](https://github.com/nviennot/tokei) on the repository, looks like there is 126k lines of Rust and 165k lines of Rust tests, including cfg(test) blocks. Lots of code! 42 | - 143k lines of tests are in the `crates/uv/tests` folder, lol. I guess that's where a lot of the CLI-level tests live, which makes sense. It's fast enough as-is, and those tests are separate from the main build, so they don't slow down development. 43 | - The core uv-client, uv-distribution, uv-installer code is only about 4k, 5k, 2k lines though, including all of the parsing and struct boilerplate. 44 | - I guess that's pretty normal for implementing a packaging standard, with caches! 45 | - (But something like ty is likely much more complex — so many PL / compiler data structures tailored for static analysis.) 46 | - I guess, strategically, if I was focused on uv performance (assuming uv was somehow slow, which it's not atm but maybe was early on), I'd look at a couple axes: 47 | - Making sure stuff that can be cached gets cached. The bulk of the code in uv already focuses on this, afaict, lots of branching and casework to "do it right." 48 | - Reducing I/O overhead. Using the best zero-copy syscalls, making a really good HTTP client with retries, building a solid on-disk data structure. 49 | - Improving the resolution algorithm to behave predictably, install less duplicates, run faster for large projects with many dependencies. Also, prefetching or optimizing registry invocations during resolution to avoid blocking serial requests. 50 | - Non-focus would be on parsing, fs, or compression performance, since it seems like that's not the bottleneck anyway. Rust is plenty fast enough with -opt-level=3. 51 | - More profiling methods, just for fun: dtrace. Going to check out file I/O. 52 | - Hmm, dtruss isn't working unless I disable System Integrity Protection. 53 | ```shell 54 | STAMP=$(date +%Y%m%d-%H%M%S) 55 | TRACE="uv-fileio-${STAMP}.log" 56 | sudo dtruss -f -n uv \ 57 | -t open -t open_nocancel -t openat -t openat_nocancel \ 58 | -t read -t read_nocancel -t write -t write_nocancel \ 59 | -- "$UV" --managed-python --color always pip install --no-cache torch \ 60 | 2> "$TRACE" 61 | ``` 62 | - Not going to do that right now as it would take too long. Won't learn dtrace / dtruss then unless you're doing macOS-specific work and need to profile it there, with SIP disabled. Most cloud stuff is Linux anyway, just bpftrace. 63 | - Gleb was able to get dtrace working, but he disabled SIP. Visualized the flamegraph from it after figuring out commands. 64 | - It was more detailed than the sampling profiler, but showed the same basic idea. 65 | - Is anyone profiling with Instruments on macOS? 66 | - Seems like a bunch of people are using this. 
Also [speedscope](https://github.com/jlfwong/speedscope#usage) as a trace visualizer. Although honestly it doesn't seem to be that much more featureful than flamegraph SVGs, which have the advantage of being able to be "dropped" anywhere. 67 | - Honestly I kind of wish there was a desktop app for these. Browsers are fine I guess, but nothing beats the convenience of double-clicking on a file. 68 | - Another interesting link: https://github.com/s7nfo/Cirron 69 | - Nano-vLLM 70 | - Simpler implementation of vLLM, was able to get better performance (tok/s) on specific configurations of running on certain hardware and so on. 71 | - LLM inference engines use paged attention, dynamic batching, prefill + generation steps. 72 | - This project is pretty small, ~1k lines of code. Simple PyTorch. 73 | - Got it running on Nvidia, although flash-attn has some janky installation requirements that make it incompatible with uv pip install. Also opened an mlx-llm .gputrace within Xcode Instruments. 74 | - Imports take 4.3 seconds warm (mostly flash-attn / torch), creating LLM for 0.6b parameters takes 3.05 seconds. The generate() then takes 9.8 seconds. 75 | - ```plain text 76 | calling llm.generate() with prompts: ['<|im_start|>user\nintroduce yourself<|im_end|>\n<|im_start|>assistant\n', '<|im_start|>user\nlist all prime numbers within 100<|im_end|>\n<|im_start|>assistant\n'] 7.751103448000322 77 | Generating: 100%|████████████████████████████████████████████████████| 2/2 [00:13<00:00, 6.61s/it, Prefill=23tok/s, Decode=25tok/s] 78 | returned from llm.generate() 29.576559246000215 79 | ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ 80 | Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls 81 | ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ 82 | llm.generate() 24.72% 3.269s 100.00% 13.227s 13.227s 0.000us 0.00% 906.361ms 906.361ms 1 83 | _compile.compile_inner (dynamo_timed) 8.64% 1.143s 15.91% 2.104s 210.385ms 0.000us 0.00% 21.056us 2.106us 10 84 | aten::linear 0.75% 98.789ms 10.94% 1.446s 50.000us 0.000us 0.00% 756.007ms 26.134us 28928 85 | aten::matmul 0.62% 81.869ms 8.54% 1.129s 39.032us 0.000us 0.00% 756.007ms 26.134us 28928 86 | aten::mm 5.45% 720.837ms 7.92% 1.047s 36.202us 753.055ms 76.38% 756.007ms 26.134us 28928 87 | TorchDynamo Cache Lookup 7.82% 1.034s 7.82% 1.034s 23.897us 0.000us 0.00% 0.000us 0.000us 43264 88 | Torch-Compiled Region: 0/3 4.95% 655.215ms 6.52% 862.084ms 113.194us 0.000us 0.00% 17.075ms 2.242us 7616 89 | Torch-Compiled Region: 2/1 4.27% 564.879ms 6.08% 803.742ms 105.533us 0.000us 0.00% 15.067ms 1.978us 7616 90 | cuLaunchKernel 5.39% 712.399ms 5.40% 714.339ms 9.369us 3.016ms 0.31% 3.018ms 0.040us 76244 91 | Torch-Compiled Region: 0/5 3.42% 452.837ms 4.76% 629.890ms 93.734us 0.000us 0.00% 12.569ms 1.870us 6720 92 | ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ 93 | Self CPU time total: 13.227s 94 | Self CUDA time total: 985.904ms 95 | ``` 96 | - It took about 10 seconds to generate, but notably the CUDA execution time was only 900 ms, so there's a huge discrepancy. 
I think you need more prompts / batch size in order to make full use of the GPU. Probably it's low utilization at the moment. 97 | - Also, recording the profile with `with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:` was fine, it only had a minor impact (<20% for sure) on the performance, but it took like a minute to actually collect the profile at the end and print out the `prof.key_averages()` line. Wonder what work it's deferring to the end of profile collection. 98 | - Switching gears to look at the codebase, there's BlockManager which hashes token IDs and produces blocks of allocated memory (Paged attention), and ModelRunner, which moves torch.tensor() calls and dynamically tracks them in cache. 99 | - prepare_prefill(), prepare_decode(), etc., with pin_memory=True for faster copies. 100 | - Overall a pretty simple LLM engine algorithm! What this profile reveals is that almost everything is matmul, and we could improve utilization / FLOPs in this example. 101 | - ```python 102 | import time 103 | start_time = time.monotonic() 104 | print("starting imports") 105 | 106 | import os 107 | from nanovllm import LLM, SamplingParams 108 | from transformers import AutoTokenizer 109 | 110 | # ERIC 111 | from torch.profiler import profile, ProfilerActivity, record_function 112 | 113 | 114 | 115 | def main(): 116 | print("starting example.py", time.monotonic() - start_time) 117 | path = os.path.expanduser("~/huggingface/Qwen3-0.6B/") 118 | tokenizer = AutoTokenizer.from_pretrained(path) 119 | print("created autotokenizer instance", tokenizer, time.monotonic() - start_time) 120 | llm = LLM(path, enforce_eager=True, tensor_parallel_size=1) 121 | print("created LLM instance", llm, time.monotonic() - start_time) 122 | 123 | sampling_params = SamplingParams(temperature=0.6, max_tokens=256) 124 | prompts = [ 125 | "introduce yourself", 126 | "list all prime numbers within 100", 127 | ] 128 | prompts = [ 129 | tokenizer.apply_chat_template( 130 | [{"role": "user", "content": prompt}], 131 | tokenize=False, 132 | add_generation_prompt=True, 133 | enable_thinking=True 134 | ) 135 | for prompt in prompts 136 | ] 137 | print("calling llm.generate() with prompts:", prompts, time.monotonic() - start_time) 138 | with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof: 139 | with record_function("llm.generate()"): 140 | outputs = llm.generate(prompts, sampling_params) 141 | print("returned from llm.generate()", time.monotonic() - start_time) 142 | print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)) 143 | 144 | for prompt, output in zip(prompts, outputs): 145 | print("\n") 146 | print(f"Prompt: {prompt!r}") 147 | print(f"Completion: {output['text']!r}") 148 | 149 | 150 | if __name__ == "__main__": 151 | main() 152 | ``` 153 | - Probably better to use the graphical viewer later on. Just remember that there's a lot of profiling data, so you have to wait a while for the file to save / be created. 154 | - resvg 155 | - This is a minimal, reproducible and self-contained SVG renderer in Rust. It's about 2500 lines of code, and it's developed alongside the "usvg" parser, which is about 14000 lines of code. Compiles pretty quickly! 156 | - Unlike mainstream SVG implementations like in Chrome and so on, this one doesn't use system text rendering, it relies on rustybuzz / ttf-parser. So it's reproducible. 157 | - There's a codegen'd file, parser/svgtree/names.rs, seems to be a Lexer. 
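- Before getting into numbers, a small hypothetical timing harness (Python + subprocess) for sweeping render widths. The binary path and input file are placeholders; the `--width` flag is the one used below:
- ```python
import subprocess
import time

RESVG = "./target/release/resvg"   # placeholder path to the locally built binary
SVG = "paris.svg"                  # placeholder input file

for width in (500, 1000, 2000, 5000, 10000):
    start = time.perf_counter()
    subprocess.run(
        [RESVG, "--width", str(width), SVG, f"out-{width}.png"],
        check=True,
        capture_output=True,       # keep font warnings out of the timing output
    )
    elapsed = time.perf_counter() - start
    print(f"width={width:>6}  {elapsed * 1000:8.1f} ms")
```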
158 | - Alright, seems to be working. Here's how long it takes to render a [Paris map](https://upload.wikimedia.org/wikipedia/commons/7/74/Paris_department_land_cover_location_map.svg) in release vs debug mode, looks like a 20x difference in total actually. 159 | - ```plain text 160 | ubuntu@ip-10-1-4-234:~/resvg$ time ./eric-benchmark.sh 161 | Warning (in usvg::text:129): No match for '"DejaVu Sans Condensed"' font-family. 162 | Warning (in usvg::text:129): No match for '"DejaVu Sans Condensed"' font-family. 163 | Warning (in usvg::text:129): No match for '"DejaVu Sans Condensed"' font-family. 164 | 165 | real 0m0.408s 166 | user 0m0.375s 167 | sys 0m0.033s 168 | ubuntu@ip-10-1-4-234:~/resvg$ time ./eric-benchmark.sh 169 | Warning (in usvg::text:129): No match for '"DejaVu Sans Condensed"' font-family. 170 | Warning (in usvg::text:129): No match for '"DejaVu Sans Condensed"' font-family. 171 | Warning (in usvg::text:129): No match for '"DejaVu Sans Condensed"' font-family. 172 | 173 | real 0m8.005s 174 | user 0m7.962s 175 | sys 0m0.042s 176 | ``` 177 | - Going to edit the SVG to remove "DejaVu Sans Condensed" since we don't have that font on my Ubuntu system. Seems like it's working with just "DejaVu Sans". 178 | - Alright, going to start profiling! I set up a `bench` build with debuginfo and no stripping. 179 | - Having `--width 10000`, I can get it to take longer, like 1.8 seconds. Going to standardize on `--width 2000` for now, which is taking about 460 ms right now. 180 | - Interesting that the time taken scales sublinearly with image width, even as the image dimensions are increasing quadratically. I guess most of the compute is probably spent parsing and simplifying the SVG in that case, not rasterizing. 181 | - Which makes sense — this means SVGs can be zoomed performantly, since you construct the in-memory data structure and save it for future rendering. 182 | - Here's a flamegraph for the workload. Seems like about 60% of the time is spent on rasterization in tiny-skia, and the remaining 40% is parsing. 183 | - https://gist.github.com/ekzhang/3a5e671e6eafb3fbda9c63551d046976 184 | - Note that running flamegraph takes a while… seems like over 100 MB/s of data generated by the samples. So that's kind of interesting. I wonder how that breaks down. 185 | - If you increase the resolution to width=10000, then it looks like the parsing stage disappears from the flamegraph (only 7% of samples now? — seems proportional) and now it's all tiny_skia crate spans, as well as some PNG conversion time. 186 | - Of the parsing time, about 70% of it is in `usvg::parser::converter`. This recursively goes through all the SVG elements. You can actually see the structure of the SVG through the flamegraph's stack trace distribution, showing how many nested levels of groups there are. 187 | - Converter is actually only a single file, not too large, less than 1000 lines of code. The main function seems to be convert_element(). It takes SvgNode from usvg's svgtree data structure, then for instance, convert_group() calculates bounding boxes and produces a Group struct, from usvg's tree module. 188 | - Interesting that resvg is taking usvg svgtree and producing usvg tree. I would've thought only the input data structure belongs to usvg. 189 | - Most parsing time is taken by parsing paths (and numbers inside the paths) to be converted, and likewise path bbox's, from a geometry data structure in tiny-skia's PathStroker. 
This is distinct from tiny-skia's drawing library with [Pixmap](https://docs.rs/tiny-skia/latest/tiny_skia/struct.Pixmap.html). 190 | - Also fun that this exercises only "fill paths" and "hairline strokes" — "anti_hairline" means an anti-aliased line with 1px width no matter what your zoom level is. 191 | - Now that I understand the library (pretty straightforward, just parses and renders the SVG file format), let's try some other profiling strategies for fun. Curious if I can count page faults, since I see a significant number of them in the flame graph. 192 | - Going to use bpftrace for this. Looks like I can use `handle_mm_fault`, cool! 193 | - Nice, I see like 23k-27k page faults with width=2000. And it increases maybe to 240k with width=10000. This makes sense, although not super sure why the memory usage keeps going up as we construct a pixel buffer. 194 | - `perf stat` is probably better for this for getting a summary report and navigating it with paths / debug info. bpftrace might be better at network operations and so on, even if it can technically do all the same things with enough effort. 195 | - Can see some other metrics like branches, makes sense. 196 | - This library isn't particularly optimized, and neither is its dependencies. But it's naturally fast enough, being in Rust and so on, and rustybuzz. Lots of room for performance optimization with buffers if you really needed it (see Forma, for instance). 197 | - typescript-go 198 | - Quick intro, this is a 2025 project from Microsoft, compare with esbuild (2021) which was seminal and started the movement of JS language tooling to Go. 199 | - Recall the [esbuild FAQ](https://esbuild.github.io/faq/#why-is-esbuild-fast) on "Why is esbuild fast?" and other details. Some takeaways. 200 | - Go and native code, designed for parallelism, efficient memory usage since compilers are mostly O(n) and thus memory access between AST passes become a concern. 201 | - Won't include expensive features like TypeScript type-checking support. Which makes sense, that's why typescript-go is a fully separate project and rewrite, to be used in tandem with something like esbuild. Typechecking has become a separate part of the toolchain and workflow for frontend developers, at this point. 202 | - Also see [esbuild architecture.md](https://github.com/evanw/esbuild/blob/main/docs/architecture.md) for some high-level picture of the "scan phase" and "compile phase" — basically, local and global interactions + parallel chunking. Will be interesting to compare with typescript-go, since that may have more global interaction. 203 | - How do you parallelize when typechecking over a whole codebase, with a complex type system? Maybe a "query"-based structure, like Rust's Salsa? 204 | - Ok, onto typescript-go (TS 7 / Native TS). Will first try to build it. 205 | - The tooling seems a bit jank, they have TypeScript upstream as a submodule, and the task runner `hereby` has a thousand lines of glue code for developer tasks. Hmm, well let's not judge before we actually try building it. 206 | - Ah, I found why Rust is needed as a build dependency. [libsyncrpc](https://github.com/microsoft/libsyncrpc) is a small library that's used by the typescript-go JS API to communicate with its Go subprocess. 207 | - Strange that this tiny shim is what's written in Rust. Why not write it in Go as well, or just in JavaScript? Maybe it's performance-sensitive, and Rust supports napi better than Go? Anyway, we'll find out if it's actually important for perf soon. 
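- Toy sketch of the "query-based" idea mentioned above (in the spirit of Salsa): every derived fact is a memoized function of its inputs, so independent queries can be computed in parallel and only re-run when their inputs change. This is just a guess at the concept, not how typescript-go is actually structured:
- ```python
class QueryDb:
    """Minimal memoized-query store: sources are inputs, queries are derived facts."""

    def __init__(self, sources):
        self.sources = dict(sources)   # file name -> text
        self.cache = {}

    def invalidate(self, name, text):
        self.sources[name] = text
        self.cache.clear()             # a real system would only invalidate dependents

    def query(self, key, compute):
        if key not in self.cache:
            self.cache[key] = compute(self)
        return self.cache[key]

def parse(db, name):
    return db.query(("parse", name), lambda db: db.sources[name].split())

def check(db, name):
    # fake "type check" = count tokens; it depends only on parse(name), so files
    # can be checked independently (and in parallel) once parsing is memoized
    return db.query(("check", name), lambda db: len(parse(db, name)))

db = QueryDb({"a.ts": "let x = 1", "b.ts": "export const y = x + 2"})
print([check(db, f) for f in db.sources])
```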
208 | - Going to time `npx hereby build` now. Only a couple Go dependencies, nice. 209 | - real 2m41.685s 210 | user 5m46.611s 211 | - Not bad, honestly! Go compiler is fast and parallel. By far the largest folder in the codebase is `internal/` (240K LoC), which I assume has the implementation of the actual bulk of the compiler. Looks to be a 10-ish person project from commit history. 212 | - First time running built/local/tsgo, and the CLI looks very similar to tsc, down to every line of help text. I guess that makes sense, it's as much as possible a direct port. 213 | - What's a good benchmark for tsgo? Maybe I'll run it against date-fns first, that seems pretty straightforward and a simple codebase. 214 | - ubuntu@ip-10-1-4-234:~/date-fns$ time ../typescript-go/built/local/tsgo 215 | 216 | **real 0m1.332s** 217 | user 0m11.057s 218 | sys 0m1.433s 219 | 220 | ubuntu@ip-10-1-4-234:~/date-fns$ time npx tsc 221 | 222 | **real 0m5.613s** 223 | user 0m11.138s 224 | sys 0m0.484s 225 | - Alright, so that's pretty interesting. There's around the same total CPU time spent in both versions of TypeScript, but the native one is much more parallel. I can't reproduce quite as good numbers as we saw in the [blog post](https://devblogs.microsoft.com/typescript/typescript-native-port/) (6.5s vs 0.7s), at least on Ubuntu. 226 | - Going to just start with the `flamegraph` command from cargo-flamegraph as before, since it's convenient. But I don't know if built/local/tsgo has debuginfo or symbols. 227 | - Oh that produced a very awful flame graph, just Go runtime spans. Probably it doesn't follow the forked worker threads. Let's try using go pprof instead. 228 | - This is in the command somehow, see internal/pprof/pprof.go. Use `--pprofDir pprof`. 229 | - CPU profile: /home/ubuntu/date-fns/pprof/2669572-cpuprofile.pb.gz 230 | - Memory profile: /home/ubuntu/date-fns/pprof/2669572-memprofile.pb.gz 231 | - Nice, we can open these in [speedscope.app](https://www.speedscope.app/). 232 | - You can also open the pprof UI, it's my first time using that tool. Hmm, interesting http server for visualizing flame graph and graph. I find the graph a bit hard to use, flame graph seems to give a better overview, but maybe for some less-hierarchical codebases the former graph layout might be nice. 233 | - pprof UI also shows the source+binary disassembly and so on, I'm less interested. Lots of internal runtime spans, probably would be better for C/C++. 234 | - Memory profile – notably, spends tons of memory on package.json, over half of a gigabyte. That's pretty insane. I guess the file is like 7000+ lines of locales and other metadata. ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2F8RUPobZclR.png?alt=media&token=41eb3f8f-5345-4dd7-aa24-7827a79d0a64) 235 | - CPU profile – likewise has 55% of time spent on package.json parsing, okay this is probably why my benchmark doesn't match theirs. The interesting parts are parseSourceFile(), which uses internal/parser.go *Parser object, and checkSourceFile(), which uses internal/checker.go *Checker object. ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fekzhang%2FIrabGu3lKl.png?alt=media&token=1b7b43ab-1fad-4013-b700-87d34b519945) 236 | --------------------------------------------------------------------------------