$ curl {url.toString()}
Segmentation fault (core dumped)

$ # Looks like you wound up accessing a memory region — errr, page — that doesn't exist.
$ # Want to learn more about page faults? Start from the beginning!
├── vercel.json
├── tsconfig.json
├── public
│   ├── stats
│   │   └── js
│   │       └── script.js
│   ├── banner.png
│   ├── robots.txt
│   ├── favicon-on-dark.png
│   ├── images
│   │   ├── the-end.png
│   │   ├── cpu-pleading-face.png
│   │   ├── elf-data-section.png
│   │   ├── elf-file-structure.png
│   │   ├── init-process-tree.png
│   │   ├── binprm-buf-changelog.png
│   │   ├── fetch-execute-cycle.png
│   │   ├── instruction-pointer.png
│   │   ├── writing-this-article.png
│   │   ├── gnu-linux-elf-drawing.jpg
│   │   ├── hardware-interrupt-meme.png
│   │   ├── interrupt-vector-table.png
│   │   ├── elf-program-header-types.png
│   │   ├── kernel-mode-vs-user-mode.png
│   │   ├── linux-shebang-truncation.png
│   │   ├── static-vs-dynamic-linking.png
│   │   ├── keyboard-hardware-interrupt.png
│   │   ├── multilevel-paging-explainer.png
│   │   ├── page-table-entry-permissions.png
│   │   ├── virtual-memory-mmu-example.png
│   │   ├── 4kib-paging-address-breakdown.png
│   │   ├── higher-half-kernel-memory-map.png
│   │   ├── linux-scheduler-target-latency.png
│   │   ├── process-virtual-memory-mapping.png
│   │   ├── elf-section-header-table-diagram.png
│   │   ├── linux-program-execution-process.png
│   │   ├── syscall-architecture-differences.png
│   │   ├── assembly-to-machine-code-translation.png
│   │   └── demand-paging-with-page-faults-comic.png
│   ├── favicon-on-light.png
│   ├── editions
│   │   └── printable.pdf
│   ├── github-images
│   │   ├── banner-dark.png
│   │   └── banner-light.png
│   ├── orpheus-flag.svg
│   └── squiggles
│       └── bottom.svg
├── postcss.config.cjs
├── src
│   ├── env.d.ts
│   ├── metadata.ts
│   ├── content
│   │   ├── config.ts
│   │   └── chapters
│   │       ├── 0-intro.mdx
│   │       ├── 7-epilogue.mdx
│   │       ├── 2-slice-dat-time.mdx
│   │       ├── 4-becoming-an-elf-lord.mdx
│   │       ├── 1-the-basics.mdx
│   │       ├── 6-lets-talk-about-forks-and-cows.mdx
│   │       ├── 5-the-translator-in-your-computer.mdx
│   │       └── 3-how-to-run-a-program.mdx
│   ├── components
│   │   ├── EditButton.astro
│   │   ├── DowngradeHeadings.astro
│   │   ├── TOCList.astro
│   │   ├── ScrollPadding.astro
│   │   ├── OldNav.astro
│   │   ├── ColoredTitle.astro
│   │   ├── CodeBlock.astro
│   │   ├── ExternalNav.astro
│   │   └── SEO.astro
│   ├── styles
│   │   ├── 404.css
│   │   ├── home.css
│   │   ├── chapter.css
│   │   ├── one-pager.css
│   │   └── global.css
│   └── pages
│       ├── 404.astro
│       ├── editions
│       │   └── one-pager.astro
│       ├── index.astro
│       └── [...slug].astro
├── .gitignore
├── package.json
├── astro.config.mjs
├── LICENSE
├── .github
│   └── workflows
│       └── deploy.yml
├── README.md
└── scripts
    ├── pdfgen.js
    └── run-pdf-ci.mjs
/vercel.json:
--------------------------------------------------------------------------------
1 | {
2 | "trailingSlash": false
3 | }
4 |
--------------------------------------------------------------------------------
/tsconfig.json:
--------------------------------------------------------------------------------
1 | {
2 | "extends": "astro/tsconfigs/base"
3 | }
4 |
--------------------------------------------------------------------------------
/public/stats/js/script.js:
--------------------------------------------------------------------------------
1 | // This is a Vercel rewrite in production, see vercel.json.
--------------------------------------------------------------------------------
/postcss.config.cjs:
--------------------------------------------------------------------------------
1 | module.exports = {
2 | plugins: [ require('postcss-nesting') ]
3 | }
--------------------------------------------------------------------------------
/public/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Revisto/putting-the-you-in-cpu/main/public/banner.png
--------------------------------------------------------------------------------
/public/robots.txt:
--------------------------------------------------------------------------------
1 | User-agent: *
2 | Allow: /
3 |
4 | Sitemap: https://cpu.land/sitemap-index.xml
5 |
--------------------------------------------------------------------------------
/src/env.d.ts:
--------------------------------------------------------------------------------
1 | /// <reference types="astro/client" />
--------------------------------------------------------------------------------
/src/content/chapters/0-intro.mdx:
--------------------------------------------------------------------------------
12 |
13 | I cracked and started reading as much as I could. If you aren't going to college, there aren't many comprehensive resources for understanding how systems work, so I had to sift through a wide range of sources that sometimes even contradicted each other. After a few weeks of research and almost 40 pages of notes, I think I have a much better understanding of how computers work, from power-on to running programs. I seriously would've killed for one article that taught me everything I learned along the way, so I'm writing the article I wish someone had written for me.
14 |
15 | And you know what they say... you've only truly understood something when you can explain it to someone else.
16 |
17 | > In a hurry? Feel like you already know this stuff?
18 | >
19 | > Read chapter 3, and I promise you'll learn something new, unless you're someone like Linus Torvalds!
20 |
--------------------------------------------------------------------------------
/.github/workflows/deploy.yml:
--------------------------------------------------------------------------------
1 | name: Deploy to GitHub Pages
2 |
3 | on:
4 | push:
5 | branches:
6 | - main
7 | workflow_dispatch:
8 |
9 | concurrency:
10 | group: "pages"
11 | cancel-in-progress: false
12 |
13 | jobs:
14 | build:
15 | runs-on: ubuntu-latest
16 | permissions:
17 | contents: write
18 | steps:
19 | - name: Checkout repository
20 | uses: actions/checkout@v4
21 |
22 | - name: Set up Node.js
23 | uses: actions/setup-node@v4
24 | with:
25 | node-version: '18'
26 | cache: 'npm'
27 |
28 | - name: Install dependencies
29 | run: npm install
30 |
31 | - name: Build project
32 | run: npm run build
33 |
34 | - name: Generate PDF
35 | run: npm run generate-pdf:ci
36 |
37 | - name: Copy PDF to dist directory
38 | run: cp public/editions/printable.pdf dist/editions/printable.pdf
39 |
40 | - name: Commit and push PDF
41 | run: |
42 | git config --global user.name 'github-actions[bot]'
43 | git config --global user.email 'github-actions[bot]@users.noreply.github.com'
44 | git add public/editions/printable.pdf
45 | # Check if there are changes to commit
46 | if ! git diff --staged --quiet; then
47 | git commit -m "docs: update printable.pdf [skip ci]"
48 | git push
49 | else
50 | echo "No changes to printable.pdf to commit."
51 | fi
52 |
53 | - name: Upload artifact
54 | uses: actions/upload-pages-artifact@v3
55 | with:
56 | path: ./dist
57 |
58 | deploy:
59 | needs: build
60 | runs-on: ubuntu-latest
61 | permissions:
62 | pages: write
63 | id-token: write
64 | environment:
65 | name: github-pages
66 | url: ${{ steps.deployment.outputs.page_url }}
67 | steps:
68 | - name: Deploy to GitHub Pages
69 | id: deployment
70 | uses: actions/deploy-pages@v4
71 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
4 | A technical explainer of how your computer runs programs, from start to finish.
8 |
9 |
16 |
17 | I cracked and started figuring as much out as possible. There aren't many comprehensive systems resources if you aren't going to college, so I had to sift through tons of different sources of varying quality and sometimes conflicting information. A couple weeks of research and almost 40 pages of notes later, I think I have a much better idea of how computers work from startup to program execution. I would've killed for one solid article explaining what I learned, so I'm writing the article that I wished I had.
18 |
19 | And you know what they say... you only truly understand something if you can explain it to someone else.
20 |
21 | > In a hurry? Feel like you know this stuff already?
22 | >
23 | > [Read chapter 3](https://cpu.land/how-to-run-a-program) and I guarantee you will learn something new. Unless you're like, Linus Torvalds himself.
24 |
25 | Continue to Chapter 1: The "Basics" »
(cpu.land)
Curious what exactly happens when you run a program on your computer? Read this article to learn how multiprocessing works, what system calls really are, how computers manage memory with hardware interrupts, and how Linux loads executables.
51 |
52 | Written by
53 | Lexi Mattick
54 | &
55 | Hack Club
56 | ·
57 | January 2023
58 |
59 |
--------------------------------------------------------------------------------
/src/content/chapters/7-epilogue.mdx:
--------------------------------------------------------------------------------
16 |
17 | ... but wait, there's more!
18 |
19 | ## Bonus: Translating C Concepts
20 |
21 | If you've done some low-level programming yourself, you probably know what the stack and the heap are and you've probably used `malloc`. You might not have thought a lot about how they're implemented!
22 |
23 | First of all, a thread's stack is a fixed amount of memory that's mapped to somewhere high up in virtual memory. On most (although [not all](https://stackoverflow.com/a/664779)) architectures, the stack pointer starts at the top of the stack memory and moves downward as it increments. Physical memory is not allocated up-front for the entire mapped stack space; instead, demand paging is used to lazily allocate memory as frames of the stack are reached.
24 |
25 | It might be surprising to hear that heap allocation functions like `malloc` are not system calls. Instead, heap memory management is provided by the libc implementation! `malloc`, `free`, et al. are complex procedures, and the libc keeps track of memory mapping details itself. Under the hood, the userland heap allocator uses syscalls including `mmap` (which can map more than just files) and `sbrk`.
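As a quick illustration of that split, here's a tiny experiment you can run on Linux. It's a sketch; the exact behavior is libc-specific (glibc, for instance, serves sufficiently large allocations with `mmap` instead of growing the heap), and running it under `strace` makes the underlying syscalls visible:

```c
// Sketch: peek at what's underneath malloc. With glibc, small allocations
// usually grow the heap via brk/sbrk, while large ones get their own mmap mapping.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *break_before = sbrk(0);              // current top of the heap (the "program break")
    void *small = malloc(64);                  // likely satisfied by moving the break
    void *large = malloc(16 * 1024 * 1024);    // likely satisfied by a fresh mmap mapping
    void *break_after = sbrk(0);

    printf("program break moved by %ld bytes\n",
           (long)((char *)break_after - (char *)break_before));
    printf("small allocation at %p, large allocation at %p\n", small, large);

    free(large);
    free(small);
    return 0;
}
```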
26 |
27 | ## Bonus: Tidbits
28 |
29 | I couldn't find anywhere coherent to put these, but found them amusing, so here you go.
30 |
31 | > *Most Linux users probably have a sufficiently interesting life that they spend little time imagining how page tables are represented in the kernel.*
32 | >
33 | > *[Jonathan Corbet, LWN](https://lwn.net/Articles/106177/)*
34 |
35 | An alternate visualization of hardware interrupts:
36 |
37 |
38 |
39 | A note that some system calls use a technique called vDSOs instead of jumping into kernel space. I didn't have time to talk about this, but it's quite interesting and I recommend [reading](https://en.wikipedia.org/wiki/VDSO) [into](https://man7.org/linux/man-pages/man7/vdso.7.html) [it](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html).
40 |
41 | And finally, addressing the Unix allegations: I do feel bad that a lot of the execution-specific stuff is very Unix-specific. If you're a macOS or Linux user this is fine, but it won't bring you too much closer to how Windows executes programs or handles system calls, although the CPU architecture stuff is all the same. In the future I would love to write an article that covers the Windows world.
42 |
43 | ## Acknowledgements
44 |
45 | I talked to GPT-3.5 and GPT-4 a decent amount while writing this article. While they lied to me a lot and most of the information was useless, they were sometimes very helpful for working through problems. LLM assistance can be net positive if you're aware of their limitations and are extremely skeptical of everything they say. That said, they're terrible at writing. Don't let them write for you.
46 |
47 | More importantly, thank you to all the humans who proofread me, encouraged me, and helped me brainstorm — especially Ani, B, Ben, Caleb, Kara, polypixeldev, Pradyun, Spencer, Nicky (who drew the wonderful elf in [chapter 4](/becoming-an-elf-lord)), and my lovely parents.
48 |
49 | If you are a teenager and you like computers and you are not already in the [Hack Club Slack](https://hackclub.com/slack), you should join right now. I would not have written this article if I didn't have a community of awesome people to share my thoughts and progress with. If you are not a teenager, you should [give us money](https://hackclub.com/philanthropy/) so we can keep doing cool things.
50 |
51 | All of the mediocre art in this article was drawn in [Figma](https://figma.com/). I used [Obsidian](https://obsidian.md/) for editing, and sometimes [Vale](https://vale.sh/) for linting. The Markdown source for this article is [available on GitHub](https://github.com/hackclub/putting-the-you-in-cpu/) and open to future nitpicks, and all art is published on a [Figma community page](https://www.figma.com/community/file/1260699047973407903).
52 |
53 |
54 |
--------------------------------------------------------------------------------
/src/pages/[...slug].astro:
--------------------------------------------------------------------------------
1 | ---
2 | import '../styles/global.css'
3 | import '../styles/chapter.css'
4 |
5 | import { CollectionEntry, getCollection, getEntryBySlug } from 'astro:content'
6 | import TOCList from '../components/TOCList.astro'
7 | import SEO from '../components/SEO.astro'
8 | import ExternalNav from '../components/ExternalNav.astro'
9 | import ColoredTitle from '../components/ColoredTitle.astro'
10 | import EditButton from '../components/EditButton.astro'
11 | import ScrollPadding from '../components/ScrollPadding.astro'
12 | import OldNav from '../components/OldNav.astro'
13 |
14 | const baseUrl = import.meta.env.BASE_URL && import.meta.env.BASE_URL !== '/' ? `${import.meta.env.BASE_URL}/` : '/'
15 |
16 | export interface Params {
17 | slug: CollectionEntry<'chapters'>['slug']
18 | }
19 |
20 | const chapter = await getEntryBySlug('chapters', Astro.params.slug)
21 | const { Content, headings } = await chapter.render()
22 |
23 | const allChapters = await getCollection('chapters')
24 | const prevChapter = allChapters
25 | .find(otherChapter => otherChapter.data.chapter === chapter.data.chapter - 1)
26 | const nextChapter = allChapters
27 | .find(otherChapter => otherChapter.data.chapter === chapter.data.chapter + 1)
28 |
29 | export async function getStaticPaths() {
30 | const chapters = await getCollection('chapters')
31 | return chapters
32 | .filter(chapter => chapter.data.chapter !== 0) // Skip the intro
33 | .map(chapter => ({ params: { slug: chapter.slug } }))
34 | }
35 | ---
36 |
37 |
38 |
39 |
40 |
41 |
--------------------------------------------------------------------------------
/src/content/chapters/2-slice-dat-time.mdx:
--------------------------------------------------------------------------------
20 |
21 | OS schedulers use *timer chips* like [PITs](https://en.wikipedia.org/wiki/Programmable_interval_timer) to trigger hardware interrupts for multitasking:
22 |
23 | 1. Before jumping to program code, the OS sets the timer chip to trigger an interrupt after some period of time.
24 | 2. The OS switches to user mode and jumps to the next instruction of the program.
25 | 3. When the timer elapses, it triggers a hardware interrupt to switch to kernel mode and jump to OS code.
26 | 4. The OS can now save where the program left off, load a different program, and repeat the process.
27 |
28 | This is called *preemptive multitasking*; the interruption of a process is called [*preemption*](https://en.wikipedia.org/wiki/Preemption_(computing)). If you’re, say, reading this article on a browser and listening to music on the same machine, your very own computer is probably following this exact cycle thousands of times a second.
29 |
30 | ## Timeslice Calculation
31 |
32 | A *timeslice* is the duration an OS scheduler allows a process to run before preempting it. The simplest way to pick timeslices is to give every process the same timeslice, perhaps in the 10 ms range, and cycle through tasks in order. This is called *fixed timeslice round-robin* scheduling.
33 |
34 | > **Aside: fun jargon facts!**
35 | >
36 | > Did you know that timeslices are often called "quantums?" Now you do, and you can impress all your tech friends. I think I deserve heaps of praise for not saying quantum in every other sentence in this article.
37 | >
38 | > Speaking of timeslice jargon, Linux kernel devs use the [jiffy](https://github.com/torvalds/linux/blob/22b8cc3e78f5448b4c5df00303817a9137cd663f/include/linux/jiffies.h) time unit to count fixed frequency timer ticks. Among other things, jiffies are used for measuring the lengths of timeslices. Linux's jiffy frequency is typically 1000 Hz but can be configured when compiling the kernel.
39 |
40 | A slight improvement to fixed timeslice scheduling is to pick a *target latency* — the ideal longest time for a process to respond. The target latency is the time it takes for a process to resume execution after being preempted, assuming a reasonable number of processes. *This is pretty hard to visualize! Don't worry, a diagram is coming soon.*
41 |
42 | Timeslices are calculated by dividing the target latency by the total number of tasks; this is better than fixed timeslice scheduling because it eliminates wasteful task switching with fewer processes. With a target latency of 15 ms and 10 processes, each process would get 15/10 or 1.5 ms to run. With only 3 processes, each process gets a longer 5 ms timeslice while still hitting the target latency.
43 |
44 | Process switching is computationally expensive because it requires saving the entire state of the current program and restoring a different one. Past a certain point, too small a timeslice can result in performance problems with processes switching too rapidly. It's common to give the timeslice duration a lower bound (*minimum granularity*). This does mean that the target latency is exceeded when there are enough processes for the minimum granularity to take effect.
45 |
46 | At the time of writing this article, Linux's scheduler uses a target latency of 6 ms and a minimum granularity of 0.75 ms.
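To make the arithmetic concrete, here's a toy version of that calculation using those two numbers. This is my own sketch, not the kernel's actual code:

```c
// Toy timeslice math: divide the target latency among runnable tasks, but never
// hand out less than the minimum granularity. (Illustrative only; CFS is far
// more sophisticated than this.)
#include <stdio.h>

double timeslice_ms(double target_latency_ms, double min_granularity_ms, int runnable_tasks) {
    double slice = target_latency_ms / runnable_tasks;
    return slice < min_granularity_ms ? min_granularity_ms : slice;
}

int main(void) {
    // Linux's values at the time of writing: 6 ms target latency, 0.75 ms minimum granularity.
    for (int n = 1; n <= 16; n *= 2)
        printf("%2d runnable tasks -> %.2f ms each\n", n, timeslice_ms(6.0, 0.75, n));
    // Past 8 tasks the minimum granularity wins, and the 6 ms target latency is exceeded.
    return 0;
}
```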
47 |
48 |
49 |
50 | Round-robin scheduling with this basic timeslice calculation is close to what most computers do nowadays. It's still a bit naive; most operating systems tend to have more complex schedulers which take process priorities and deadlines into account. Since 2007, Linux has used a scheduler called [Completely Fair Scheduler](https://docs.kernel.org/scheduler/sched-design-CFS.html). CFS does a bunch of very fancy computer science things to prioritize tasks and divvy up CPU time.
51 |
52 | Every time the OS preempts a process it needs to load the new program's saved execution context, including its memory environment. This is accomplished by telling the CPU to use a different *page table*, the mapping from "virtual" to physical addresses. This is also the system that prevents programs from accessing each other's memory; we'll go down this rabbit hole in chapters [5](/the-translator-in-your-computer) and [6](/lets-talk-about-forks-and-cows) of this article.
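Here's a very rough model of what happens on each of those preemptions. Everything in it (the struct fields, the function, the whole shape) is made up for illustration and looks nothing like the real kernel code:

```c
// A toy model of preemption. Real kernels save full register state and swap
// page tables in architecture-specific assembly; this just shows the idea.
#include <stdio.h>

typedef struct {
    const char *name;
    long saved_instruction_pointer;   // where this task gets resumed next time
    long page_table_address;          // which virtual memory mapping to hand to the MMU
} task;

void context_switch(task *current, task *next, long interrupted_at) {
    current->saved_instruction_pointer = interrupted_at;  // remember where we preempted it
    printf("switching page table to %#lx, resuming %s at %#lx\n",
           next->page_table_address, next->name, next->saved_instruction_pointer);
    // ...then the kernel would reload next's registers and drop back to user mode.
}

int main(void) {
    task music   = { "music player", 0x401000, 0x1000 };
    task browser = { "browser",      0x7ff123, 0x2000 };
    context_switch(&music, &browser, 0x401234);   // timer tick: browser's turn
    context_switch(&browser, &music, 0x7ff456);   // next tick: back to the music player
    return 0;
}
```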
53 |
54 | ## Note #1: Kernel Preemptability
55 |
56 | So far, we've been only talking about the preemption and scheduling of userland processes. Kernel code might make programs feel laggy if it took too long handling a syscall or executing driver code.
57 |
58 | Modern kernels, including Linux, are [preemptive kernels](https://en.wikipedia.org/wiki/Kernel_preemption). This means they're programmed in a way that allows kernel code itself to be interrupted and scheduled just like userland processes.
59 |
60 | This isn't very important to know about unless you're writing a kernel or something, but basically every article I've read has mentioned it so I thought I would too! Extra knowledge is rarely a bad thing.
61 |
62 | ## Note #2: A History Lesson
63 |
64 | Ancient operating systems, including classic Mac OS and versions of Windows long before NT, used a predecessor to preemptive multitasking. Rather than the OS deciding when to preempt programs, the programs themselves would choose to yield to the OS. They would trigger a software interrupt to say, "hey, you can let another program run now." These explicit yields were the only way for the OS to regain control and switch to the next scheduled process.
65 |
66 | This is called [*cooperative multitasking*](https://en.wikipedia.org/wiki/Cooperative_multitasking). It has a couple major flaws: malicious or just poorly designed programs can easily freeze the entire operating system, and it's nigh impossible to ensure temporal consistency for realtime/time-sensitive tasks. For these reasons, the tech world switched to preemptive multitasking a long time ago and never looked back.
67 |
--------------------------------------------------------------------------------
/src/content/chapters/4-becoming-an-elf-lord.mdx:
--------------------------------------------------------------------------------
1 | ---
2 | chapter: 4
3 | title: Becoming an Elf-Lord
4 | shortname: ELF
5 | slug: becoming-an-elf-lord
6 | updatedAt: 2023-07-17T17:16:18.079Z
7 | ---
8 |
9 | import CodeBlock from '../../components/CodeBlock.astro'
10 |
11 | We pretty thoroughly understand `execve` now. At the end of most paths, the kernel will reach a final program containing machine code for it to launch. Typically, a setup process is required before actually jumping to the code — for example, different parts of the program have to be loaded into the right places in memory. Each program needs different amounts of memory for different things, so we have standard file formats that specify how to set up a program for execution. While Linux supports many such formats, the most common format by far is *ELF* (executable and linkable format).
12 |
13 |
14 |
15 |
18 | (Thank you to Nicky Case for the adorable drawing.)
19 |
20 |
47 |
48 | ### ELF Header
49 |
50 | Every ELF file has an [ELF header](https://refspecs.linuxfoundation.org/elf/gabi4+/ch4.eheader.html). It has the very important job of conveying basic information about the binary such as:
51 |
52 | - What processor it's designed to run on. ELF files can contain machine code for different processor types, like ARM and x86.
53 | - Whether the binary is meant to be run on its own as an executable, or whether it's meant to be loaded by other programs as a "dynamically linked library." We'll go into details about what dynamic linking is soon.
54 | - The entry point of the executable. Later sections specify exactly where to load data contained in the ELF file into memory. The entry point is a memory address pointing to where the first machine code instruction is in memory after the entire process has been loaded.
55 |
56 | The ELF header is always at the start of the file. It specifies the locations of the program header table and section header table, which can be anywhere within the file. Those tables, in turn, point to data stored elsewhere in the file.
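For reference, this is what that header looks like in C. It's the 64-bit layout from glibc's `<elf.h>`, with the ELF typedefs written out as fixed-width integers and comments added:

```c
#include <stdint.h>

typedef struct {
    unsigned char e_ident[16];  // magic bytes ("\x7fELF"), 32/64-bit, endianness, ...
    uint16_t      e_type;       // executable, shared object, relocatable, core dump
    uint16_t      e_machine;    // target architecture (x86-64, ARM, RISC-V, ...)
    uint32_t      e_version;
    uint64_t      e_entry;      // entry point: virtual address of the first instruction
    uint64_t      e_phoff;      // file offset of the program header table
    uint64_t      e_shoff;      // file offset of the section header table
    uint32_t      e_flags;
    uint16_t      e_ehsize;     // size of this header
    uint16_t      e_phentsize;  // size of each program header entry...
    uint16_t      e_phnum;      // ...and how many there are
    uint16_t      e_shentsize;  // same for section header entries
    uint16_t      e_shnum;
    uint16_t      e_shstrndx;   // which section holds section names (.shstrtab)
} Elf64_Ehdr;
```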
57 |
58 | ### Program Header Table
59 |
60 | The [program header table](https://refspecs.linuxbase.org/elf/gabi4+/ch5.pheader.html) is a series of entries containing specific details for how to load and execute the binary at runtime. Each entry has a type field that says what detail it's specifying — for example, `PT_LOAD` means it contains data that should be loaded into memory, but `PT_NOTE` means the segment contains informational text that shouldn't necessarily be loaded anywhere.
61 |
62 |
63 |
64 | Each entry specifies information about where its data is in the file and, sometimes, how to load the data into memory:
65 |
66 | - It points to the position of its data within the ELF file.
67 | - It can specify what virtual memory address the data should be loaded into memory at. This is typically left blank if the segment isn't meant to be loaded into memory.
68 | - Two fields specify the length of the data: one for the length of the data in the file, and one for the length of the memory region to be created. If the memory region length is longer than the length in the file, the extra memory will be filled with zeroes. This is beneficial for programs that might want a static segment of memory to use at runtime; these empty segments of memory are typically called [BSS](https://en.wikipedia.org/wiki/.bss) segments.
69 | - Finally, a flags field specifies what operations should be permitted if it's loaded into memory: `PF_R` makes it readable, `PF_W` makes it writable, and `PF_X` means it's code that should be allowed to execute on the CPU.
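Put together, each program header entry is a small fixed-size record. This is the 64-bit layout from `<elf.h>`, again with the typedefs expanded and comments added:

```c
#include <stdint.h>

typedef struct {
    uint32_t p_type;    // PT_LOAD, PT_NOTE, PT_INTERP, ...
    uint32_t p_flags;   // permissions: PF_R | PF_W | PF_X
    uint64_t p_offset;  // where the segment's data starts within the ELF file
    uint64_t p_vaddr;   // virtual address to map it at
    uint64_t p_paddr;   // physical address (unused on most modern systems)
    uint64_t p_filesz;  // how many bytes of data are in the file
    uint64_t p_memsz;   // how big the memory region should be (extra space is zero-filled)
    uint64_t p_align;   // required alignment, typically the page size
} Elf64_Phdr;
```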
70 |
71 | ### Section Header Table
72 |
73 | The [section header table](https://refspecs.linuxbase.org/elf/gabi4+/ch4.sheader.html) is a series of entries containing information about *sections*. This section information is like a map, charting the data inside the ELF file. It makes it easy for [programs like debuggers](https://www.sourceware.org/gdb/) to understand the intended uses of different portions of the data.
74 |
75 |
76 |
77 | For example, the program header table can specify a large swath of data to be loaded into memory together. That single `PT_LOAD` block might contain both code and global variables! There's no reason those have to be specified separately to *run* the program; the CPU just starts at the entry point and steps forward, accessing data when and where the program requests it. However, software like a debugger for *analyzing* the program needs to know exactly where each area starts and ends, otherwise it might try to decode some text that says "hello" as code (and since that isn't valid code, explode). This information is stored in the section header table.
78 |
79 | While it's usually included, the section header table is actually optional. ELF files can run perfectly well with the section header table completely removed, and developers who want to hide what their code does will sometimes intentionally strip or mangle the section header table from their ELF binaries to [make them harder to decode](https://binaryresearch.github.io/2019/09/17/Analyzing-ELF-Binaries-with-Malformed-Headers-Part-1-Emulating-Tiny-Programs.html).
80 |
81 | Each section has a name, a type, and some flags that specify how it's intended to be used and decoded. Standard names usually start with a dot by convention. The most common sections are:
82 |
83 | - `.text`: machine code to be loaded into memory and executed on the CPU. `SHT_PROGBITS` type with the `SHF_EXECINSTR` flag to mark it as executable, and the `SHF_ALLOC` flag which means it's loaded into memory for execution. (Don't get confused by the name, it's still just binary machine code! I always found it somewhat strange that it's called `.text` despite not being readable "text.")
84 | - `.data`: initialized data hardcoded in the executable to be loaded into memory. For example, a global variable containing some text might be in this section. If you write low-level code, this is the section where statics go. This also has the type `SHT_PROGBITS`, which just means the section contains "information for the program." Its flags are `SHF_ALLOC` and `SHF_WRITE` to mark it as writable memory.
85 | - `.bss`: I mentioned earlier that it's common to have some allocated memory that starts out zeroed. It would be a waste to include a bunch of empty bytes in the ELF file, so a special segment type called BSS is used. It's helpful to know about BSS segments during debugging, so there's also a section header table entry that specifies the length of the memory to be allocated. It's of type `SHT_NOBITS`, and is flagged `SHF_ALLOC` and `SHF_WRITE`.
86 | - `.rodata`: this is like `.data` except it's read-only. In a very basic C program that runs `printf("Hello, world!")`, the string "Hello world!" would be in a `.rodata` section, while the actual printing code would be in a `.text` section.
87 | - `.shstrtab`: this is a fun implementation detail! The names of sections themselves (like `.text` and `.shstrtab`) aren't included directly in the section header table. Instead, each entry contains an offset to a location in the ELF file that contains its name. This way, each entry in the section header table can be the same size, making them easier to parse — an offset to the name is a fixed-size number, whereas including the name in the table would use a variable-size string. All of this name data is stored in its own section called `.shstrtab`, of type `SHT_STRTAB`.
88 |
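As a rough guide to where things land (compilers have some latitude here, so treat this as a sketch rather than a guarantee), here's a tiny C file annotated with the section each piece typically ends up in. Running `readelf -S` or `objdump -h` on the compiled object will show the real layout:

```c
int counter = 5;                 // .data: initialized global
int scratch[4096];               // .bss: zero-initialized, takes no space in the file
const char *greeting = "hello";  // the string bytes land in .rodata
                                 // (the pointer itself is initialized data)

int add(int a, int b) {          // .text: machine code
    static int calls;            // .bss: statics behave like globals here
    calls++;
    return a + b;
}
```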
89 | ### Data
90 |
91 | The program and section header table entries all point to blocks of data within the ELF file, whether to load them into memory, to specify where program code is, or just to name sections. All of these different pieces of data are contained in the data section of the ELF file.
92 |
93 |
94 |
95 | ## A Brief Explanation of Linking
96 |
97 | Back to the `binfmt_elf` code: the kernel cares about two types of entries in the program header table.
98 |
99 | `PT_LOAD` segments specify where all the program data, like the `.text` and `.data` sections, need to be loaded into memory. The kernel reads these entries from the ELF file to load the data into memory so the program can be executed by the CPU.
100 |
101 | The other type of program header table entry that the kernel cares about is `PT_INTERP`, which specifies a "dynamic linking runtime."
102 |
103 | Before we talk about what dynamic linking is, let's talk about "linking" in general. Programmers tend to build their programs on top of libraries of reusable code — for example, libc, which we talked about earlier. When turning your source code into an executable binary, a program called a linker resolves all these references by finding the library code and copying it into the binary. This process is called *static linking*, which means external code is included directly in the file that's distributed.
104 |
105 | However, some libraries are super common. You'll find libc is used by basically every program under the sun, since it's the canonical interface for interacting with the OS through syscalls. It would be a terrible use of space to include a separate copy of libc in every single program on your computer. Also, it might be nice if bugs in libraries could be fixed in one place rather than having to wait for each program that uses the library to be updated. Dynamic linking is the solution to these problems.
106 |
107 | If a statically linked program needs a function `foo` from a library called `bar`, the program would include a copy of the entirety of `foo`. However, if it's dynamically linked it would only include a reference saying "I need `foo` from library `bar`." When the program is run, `bar` is hopefully installed on the computer and the `foo` function's machine code can be loaded into memory on-demand. If the computer's installation of the `bar` library is updated, the new code will be loaded the next time the program runs without needing any change in the program itself.
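You can even do this lookup by hand at runtime. Here's a sketch using the real `dlopen`/`dlsym` API (the `libbar.so` and `foo` names are made up for the example); this is essentially what the dynamic linker automates for normal dynamically linked programs:

```c
// Compile with: cc demo.c -ldl
#include <stdio.h>
#include <dlfcn.h>

int main(void) {
    void *bar = dlopen("libbar.so", RTLD_NOW);          // find and map the library
    if (!bar) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    int (*foo)(int) = (int (*)(int))dlsym(bar, "foo");  // look up the symbol by name
    if (foo) printf("foo(2) = %d\n", foo(2));

    dlclose(bar);
    return 0;
}
```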
108 |
109 |
110 |
111 | ## Dynamic Linking in the Wild
112 |
113 | On Linux, dynamically linkable libraries like `bar` are typically packaged into files with the .so (Shared Object) extension. These .so files are ELF files just like programs — you may recall that the ELF header includes a field to specify whether the file is an executable or a library. In addition, shared objects have a `.dynsym` section in the section header table which contains information on what symbols are exported from the file and can be dynamically linked to.
114 |
115 | On Windows, libraries like `bar` are packaged into .dll (**d**ynamic **l**ink **l**ibrary) files. macOS uses the .dylib (**dy**namically linked **lib**rary) extension. Just like macOS apps and Windows .exe files, these are formatted slightly differently from ELF files but are the same concept and technique.
116 |
117 | An interesting distinction between the two types of linking is that with static linking, only the portions of the library that are used are included in the executable and thus loaded into memory. With dynamic linking, the *entire library* is loaded into memory. This might initially sound less efficient, but it actually allows modern operating systems to save *more* space by loading a library into memory once and then sharing that code between processes. Only code can be shared as the library needs different state for different programs, but the savings can still be on the order of tens to hundreds of megabytes of RAM.
118 |
119 | ## Execution
120 |
121 | Let's hop on back to the kernel running ELF files: if the binary it's executing is dynamically linked, the OS can't just jump to the binary's code right away because there would be missing code — remember, dynamically linked programs only have references to the library functions they need!
122 |
123 | To run the binary, the OS needs to figure out what libraries are needed, load them, replace all the named pointers with actual jump instructions, and *then* start the actual program code. This is very complex code that interacts deeply with the ELF format, so it's usually a standalone program rather than part of the kernel. ELF files specify the path to the program they want to use (typically something like `/lib64/ld-linux-x86-64.so.2`) in a `PT_INTERP` entry in the program header table.
124 |
125 | After reading the ELF header and scanning through the program header table, the kernel can set up the memory structure for the new program. It starts by loading all `PT_LOAD` segments into memory, populating the program's static data, BSS space, and machine code. If the program is dynamically linked, the kernel will have to execute the [ELF interpreter](https://unix.stackexchange.com/questions/400621/what-is-lib64-ld-linux-x86-64-so-2-and-why-can-it-be-used-to-execute-file) (`PT_INTERP`), so it also loads the interpreter's data, BSS, and code into memory.
126 |
127 | Now the kernel needs to set the instruction pointer for the CPU to restore when returning to userland. If the executable is dynamically linked, the kernel sets the instruction pointer to the start of the ELF interpreter's code in memory. Otherwise, the kernel sets it to the start of the executable.
128 |
129 | The kernel is almost ready to return from the syscall (remember, we're still in `execve`). It pushes the `argc`, `argv`, and environment variables to the stack for the program to read when it begins.
130 |
131 | The registers are now cleared. Before handling a syscall, the kernel stores the current value of registers to the stack to be restored when switching back to user space. Before returning to user space, the kernel zeroes this part of the stack.
132 |
133 | Finally, the syscall is over and the kernel returns to userland. It restores the registers, which are now zeroed, and jumps to the stored instruction pointer. That instruction pointer is now the starting point of the new program (or the ELF interpreter) and the current process has been replaced!
134 |
--------------------------------------------------------------------------------
/src/content/chapters/1-the-basics.mdx:
--------------------------------------------------------------------------------
1 | ---
2 | chapter: 1
3 | title: The Basics
4 | shortname: Basics
5 | slug: the-basics
6 | updatedAt: 2023-07-19T18:57:54.630Z
7 | ---
8 |
9 | One thing that surprised me over and over while writing this was just how simple computers are. It's still hard for me not to get in my own head and expect more complexity and abstraction than there really is. If there's one thing you should burn into your brain before continuing, it's that everything that looks simple really is that simple. That simplicity is beautiful, and sometimes very, very annoying.
10 |
11 | Let's start with the most fundamental part of how a computer works, at its very core.
12 |
13 | ## How Are Computers Designed?
14 |
15 | The *central processing unit* (CPU) in a computer is in charge of all computation. It's the big boss, the grand wizard. It starts working the moment you turn your computer on, executing instruction after instruction, one by one.
16 |
17 | The first mass-produced CPU was the Intel 4004, designed in the late '60s by the Italian physicist and engineer Federico Faggin. Unlike the 64-bit systems we use today, it had a 4-bit architecture and was far less complex than modern processors, but a lot of that early design simplicity still shows up in today's CPUs.
18 |
19 | The "instructions" that CPUs execute are just binary data: a byte or two to specify which instruction to run (the opcode), followed by whatever data is needed to execute that instruction. What we call machine code is just a series of these binary instructions in a row. Assembly is a helper language that makes machine code easier for humans to read and write than raw bits; when you write assembly, it always ends up assembled into the same binary code the CPU knows how to read.
20 |
21 |
22 |
23 | > Note: instructions aren't always mapped one-to-one to machine code as in the example above. For example, the assembly instruction `add eax, 512` translates to `05 00 02 00 00`.
24 | >
25 | > The first byte (`05`) is an opcode that specifically means adding a 32-bit number to the EAX register. The remaining bytes are 512 (`0x200`) stored little-endian.
26 | >
27 | > Defuse Security has made a handy tool for converting between assembly and machine code that you can play around with.
28 |
29 | RAM is your computer's main memory bank, a large multi-purpose space which stores all the data used by programs running on your computer. That includes the program code itself as well as the code at the core of the operating system. The CPU always reads machine code directly from RAM, and code can't be run if it isn't loaded into RAM.
30 |
31 | The CPU stores an *instruction pointer* which points to the location in RAM where it's going to fetch the next instruction. After executing each instruction, the CPU moves the pointer and repeats. This is the *fetch-execute cycle*.
32 |
33 |
34 |
35 | After executing an instruction, the pointer moves forward to immediately after the instruction in RAM so that it now points to the next instruction. That's why code runs! The instruction pointer just keeps chugging forward, executing machine code in the order in which it has been stored in memory. Some instructions can tell the instruction pointer to jump somewhere else instead, or jump different places depending on a certain condition; this makes reusable code and conditional logic possible.
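To make the cycle concrete, here's a toy fetch-execute loop for an imaginary two-values-per-instruction machine. This is purely illustrative; real CPUs do this in hardware, not in C:

```c
#include <stdio.h>

enum { OP_HALT, OP_LOAD, OP_ADD, OP_JUMP };

int main(void) {
    // "RAM": an opcode followed by one operand, over and over.
    int ram[] = { OP_LOAD, 7, OP_ADD, 5, OP_HALT, 0 };
    int ip  = 0;   // instruction pointer: where the next instruction lives
    int acc = 0;   // a single register

    for (;;) {
        int opcode  = ram[ip];     // fetch
        int operand = ram[ip + 1];
        ip += 2;                   // move the pointer forward...
        switch (opcode) {          // ...and execute
            case OP_LOAD: acc = operand; break;
            case OP_ADD:  acc += operand; break;
            case OP_JUMP: ip = operand; break;  // a jump just overwrites the pointer
            case OP_HALT: printf("acc = %d\n", acc); return 0;
        }
    }
}
```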
36 |
37 | This instruction pointer is stored in a [*register*](https://en.wikipedia.org/wiki/Processor_register). Registers are small storage buckets that are extremely fast for the CPU to read and write to. Each CPU architecture has a fixed set of registers, used for everything from storing temporary values during computations to configuring the processor.
38 |
39 | Some registers are directly accessible from machine code, like `ebx` in the earlier diagram.
40 |
41 | Other registers are only used internally by the CPU, but can often be updated or read using specialized instructions. One example is the instruction pointer, which can't be read directly but can be updated with, for example, a jump instruction.
42 |
43 | ## Processors Are Naive
44 |
45 | Let's go back to the original question: what happens when you run an executable program on your computer? First, a bunch of magic happens to get ready to run it — we’ll work through all of this later — but at the end of the process there’s machine code in a file somewhere. The operating system loads this into RAM and instructs the CPU to jump the instruction pointer to that position in RAM. The CPU continues running its fetch-execute cycle as usual, so the program begins executing!
46 |
47 | (This was one of those psyching-myself-out moments for me — seriously, this is how the program you are using to read this article is running! Your CPU is fetching your browser's instructions from RAM in sequence and directly executing them, and they're rendering this article.)
48 |
49 |
50 |
51 | It turns out CPUs have a super basic worldview; they only see the current instruction pointer and a bit of internal state. Processes are entirely operating system abstractions, not something CPUs natively understand or keep track of.
52 |
53 | *\*waves hands\* processes are abstractions made up by ~~os devs~~ big byte to sell more computers*
54 |
55 | For me, this raises more questions than it answers:
56 |
57 | 1. If the CPU doesn’t know about multiprocessing and just executes instructions sequentially, why doesn’t it get stuck inside whatever program it’s running? How can multiple programs run at once?
58 | 2. If programs run directly on the CPU, and the CPU can directly access RAM, why can't code access memory from other processes, or, god forbid, the kernel?
59 | 3. Speaking of which, what's the mechanism that prevents every process from running any instruction and doing anything to your computer? AND WHAT'S A DAMN SYSCALL?
60 |
61 | The question about memory deserves its own section and is covered in [chapter 5](/the-translator-in-your-computer) — the TL;DR is that most memory accesses actually go through a layer of misdirection that remaps the entire address space. For now, we're going to pretend that programs can access all RAM directly and computers can only run one process at once. We'll explain away both of these assumptions in time.
62 |
63 | It's time to leap through our first rabbit hole into a land filled with syscalls and security rings.
64 |
65 | > **Aside: what is a kernel, btw?**
66 | >
67 | > Your computer's operating system, like macOS, Windows, or Linux, is the collection of software that runs on your computer and makes all the basic stuff work. "Basic stuff" is a really general term, and so is "operating system" — depending on who you ask, it can include such things as the apps, fonts, and icons that come with your computer by default.
68 | >
69 | > The kernel, however, is the core of the operating system. When you boot up your computer, the instruction pointer starts at a program somewhere. That program is the kernel. The kernel has near-full access to your computer's memory, peripherals, and other resources, and is in charge of running software installed on your computer (known as userland programs). We'll learn about how the kernel has this access — and how userland programs don't — over the course of this article.
70 | >
71 | > Linux is just a kernel and needs plenty of userland software like shells and display servers to be usable. The kernel in macOS is called [XNU](https://en.wikipedia.org/wiki/XNU) and is Unix-like, and the modern Windows kernel is called the [NT Kernel](https://en.wikipedia.org/wiki/Architecture_of_Windows_NT).
72 |
73 | ## Two Rings to Rule Them All
74 |
75 | The *mode* (sometimes called privilege level or ring) a processor is in controls what it's allowed to do. Modern architectures have at least two options: kernel/supervisor mode and user mode. While an architecture might support more than two modes, only kernel mode and user mode are commonly used these days.
76 |
77 | In kernel mode, anything goes: the CPU is allowed to execute any supported instruction and access any memory. In user mode, only a subset of instructions is allowed, I/O and memory access is limited, and many CPU settings are locked. Generally, the kernel and drivers run in kernel mode while applications run in user mode.
78 |
79 | Processors start in kernel mode. Before executing a program, the kernel initiates the switch to user mode.
80 |
81 |
82 |
83 | An example of how processor modes manifest in a real architecture: on x86-64, the current privilege level (CPL) can be read from a register called `cs` (code segment). Specifically, the CPL is contained in the two [least significant bits](https://en.wikipedia.org/wiki/Bit_numbering) of the `cs` register. Those two bits can store x86-64's four possible rings: ring 0 is kernel mode and ring 3 is user mode. Rings 1 and 2 are designed for running drivers but are only used by a handful of older niche operating systems. If the CPL bits are `11`, for example, the CPU is running in ring 3: user mode.
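If you want to see this for yourself, here's a small check using x86-64 GCC/Clang inline assembly. A userland program should always print ring 3:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t cs;
    __asm__ volatile ("mov %%cs, %0" : "=r"(cs));       // read the code segment register
    printf("current privilege level: ring %d\n", cs & 3); // the CPL is the two low bits
    return 0;
}
```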
84 |
85 | ## What Even is a Syscall?
86 |
87 | Programs run in user mode because they can't be trusted with full access to the computer. User mode does its job, preventing access to most of the computer — but programs need to be able to access I/O, allocate memory, and interact with the operating system *somehow*! To do so, software running in user mode has to ask the operating system kernel for help. The OS can then implement its own security protections to prevent programs from doing anything malicious.
88 |
89 | If you've ever written code that interacts with the OS, you'll probably recognize functions like `open`, `read`, `fork`, and `exit`. Below a couple of layers of abstraction, these functions all use *system calls* to ask the OS for help. A system call is a special procedure that lets a program start a transition from user space to kernel space, jumping from the program's code into OS code.
90 |
91 | User space to kernel space control transfers are accomplished using a processor feature called [*software interrupts*](https://en.wikipedia.org/wiki/Interrupt#Software_interrupts):
92 |
93 | 1. During the boot process, the operating system stores a table called an [*interrupt vector table*](https://en.wikipedia.org/wiki/Interrupt_vector_table) (IVT; x86-64 calls this the [interrupt descriptor table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)) in RAM and registers it with the CPU. The IVT maps interrupt numbers to handler code pointers.
94 |
95 |
96 |
97 | 2. Then, userland programs can use an instruction like [INT](https://www.felixcloutier.com/x86/intn:into:int3:int1) which tells the processor to look up the given interrupt number in the IVT, switch to kernel mode, and then jump the instruction pointer to the memory address stored in the IVT.
98 |
99 | When this kernel code finishes, it uses an instruction like [IRET](https://www.felixcloutier.com/x86/iret:iretd:iretq) to tell the CPU to switch back to user mode and return the instruction pointer to where it was when the interrupt was triggered.
100 |
101 | (If you were curious, the interrupt ID used for system calls on Linux is `0x80`. You can read a list of Linux system calls on [Michael Kerrisk's online manpage directory](https://man7.org/linux/man-pages/man2/syscalls.2.html).)
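Just for fun, here's what triggering that interrupt by hand looks like from C with inline assembly. This sketch uses the legacy 32-bit `int 0x80` convention, where the syscall number goes in `eax` and 20 is `getpid` in the 32-bit table; real programs should just go through libc:

```c
#include <stdio.h>

int main(void) {
    long pid;
    // eax = syscall number (20 = getpid in the 32-bit table); the result comes back in eax.
    // Works on typical x86 Linux builds; 64-bit kernels need 32-bit emulation enabled.
    __asm__ volatile ("int $0x80" : "=a"(pid) : "a"(20));
    printf("getpid() via int 0x80 returned %ld\n", pid);
    return 0;
}
```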
102 |
103 | ### Wrapper APIs: Abstracting Away Interrupts
104 |
105 | Here's what we know so far about system calls:
106 |
107 | - User mode programs can't access I/O or memory directly. They have to ask the OS for help interacting with the outside world.
108 | - Programs can delegate control to the OS with special machine code instructions like INT and IRET.
109 | - Programs can't directly switch privilege levels; software interrupts are safe because the processor has been preconfigured *by the OS* with where in the OS code to jump to. The interrupt vector table can only be configured from kernel mode.
110 |
111 | Programs need to pass data to the operating system when triggering a syscall; the OS needs to know which specific system call to execute alongside any data the syscall itself needs, for example, what filename to open. The mechanism for passing this data varies by operating system and architecture, but it's usually done by placing data in certain registers or on the stack before triggering the interrupt.
112 |
113 | The variance in how system calls are called across devices means it would be wildly impractical for programmers to implement system calls themselves for every program. This would also mean operating systems couldn't change their interrupt handling for fear of breaking every program that was written to use the old system. Finally, we typically don't write programs in raw assembly anymore — programmers can't be expected to drop down to assembly any time they want to read a file or allocate memory.
114 |
115 |
116 |
117 | So, operating systems provide an abstraction layer on top of these interrupts. Reusable higher-level library functions that wrap the necessary assembly instructions are provided by [libc](https://www.gnu.org/software/libc/) on Unix-like systems and part of a library called [ntdll.dll](https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/libraries-and-headers) on Windows. Calls to these library functions themselves don't cause switches to kernel mode, they're just standard function calls. Inside the libraries, assembly code does actually transfer control to the kernel, and is a lot more platform-dependent than the wrapping library subroutine.
118 |
119 | When you call `exit(1)` from C running on a Unix-like system, that function is internally running machine code to trigger an interrupt, after placing the system call's opcode and arguments in the right registers/stack/whatever. Computers are so cool!
120 |
121 | ## The Need for Speed / Let's Get CISC-y
122 |
123 | Many [CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer) architectures like x86-64 contain instructions designed for system calls, created due to the prevalence of the system call paradigm.
124 |
125 | Intel and AMD managed not to coordinate very well on x86-64; it actually has *two* sets of optimized system call instructions. [SYSCALL](https://www.felixcloutier.com/x86/syscall.html) and [SYSENTER](https://www.felixcloutier.com/x86/sysenter) are optimized alternatives to instructions like `INT 0x80`. Their corresponding return instructions, [SYSRET](https://www.felixcloutier.com/x86/sysret.html) and [SYSEXIT](https://www.felixcloutier.com/x86/sysexit), are designed to transition quickly back to user space and resume program code.
126 |
127 | (AMD and Intel processors have slightly different compatibility with these instructions. `SYSCALL` is generally the best option for 64-bit programs, while `SYSENTER` has better support with 32-bit programs.)
128 |
129 | Representative of the style, [RISC](https://en.wikipedia.org/wiki/Reduced_instruction_set_computer) architectures tend not to have such special instructions. AArch64, the RISC architecture Apple Silicon is based on, uses only [one interrupt instruction](https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/SVC--Supervisor-Call-) for syscalls and software interrupts alike. I think Mac users are doing fine :)
130 |
131 | ---
132 |
133 | Whew, that was a lot! Let's do a brief recap:
134 |
135 | - Processors execute instructions in an infinite fetch-execute loop and don't have any concept of operating systems or programs. The processor's mode, usually stored in a register, determines what instructions may be executed. Operating system code runs in kernel mode and switches to user mode to run programs.
136 | - To run a binary, the operating system switches to user mode and points the processor to the code's entry point in RAM. Because they only have the privileges of user mode, programs that want to interact with the world need to jump to OS code for help. System calls are a standardized way for programs to switch from user mode to kernel mode and into OS code.
137 | - Programs typically use these syscalls by calling shared library functions. These wrap machine code for either software interrupts or architecture-specific syscall instructions that transfer control to the OS kernel and switch rings. The kernel does its business and switches back to user mode and returns to the program code.
138 |
139 | Let’s figure out how to answer my first question from earlier:
140 |
141 | > If the CPU doesn't keep track of more than one process and just executes instruction after instruction, why doesn't it get stuck inside whatever program it's running? How can multiple programs run at once?
142 |
143 | The answer to this, my dear friend, is also the answer to why Coldplay is so popular... clocks! (Well, technically timers. I just wanted to shoehorn that joke in.)
144 |
--------------------------------------------------------------------------------
/src/content/chapters/6-lets-talk-about-forks-and-cows.mdx:
--------------------------------------------------------------------------------
1 | ---
2 | chapter: 6
3 | title: Let's Talk About Forks and Cows
4 | shortname: Fork-Exec
5 | slug: lets-talk-about-forks-and-cows
6 | updatedAt: 2023-07-17T17:16:18.079Z
7 | ---
8 |
9 | import CodeBlock from '../../components/CodeBlock.astro'
10 |
11 | The final question: how did we get here? Where do the first processes come from?
12 |
13 | This article is almost done. We're on the final stretch. About to hit a home run. Moving on to greener pastures. And various other terrible idioms that mean you are a single *Length of Chapter 6* away from touching grass or whatever you do with your time when you aren't reading 15,000 word articles about CPU architecture.
14 |
15 | If `execve` starts a new program by replacing the current process, how do you start a new program separately, in a new process? This is a pretty important ability if you want to do multiple things on your computer; when you double-click an app to start it, the app opens separately while the program you were previously on continues running.
16 |
17 | The answer is another system call: `fork`, the system call fundamental to all multiprocessing. `fork` is quite simple, actually — it clones the current process and its memory, leaving the saved instruction pointer exactly where it is, and then allows both processes to proceed as usual. Without intervention, the programs continue to run independently from each other and all computation is doubled.
18 |
19 | The newly running process is referred to as the "child," with the process originally calling `fork` the "parent." Processes can call `fork` multiple times, thus having multiple children. Each child is numbered with a *process ID* (PID), starting with 1.
20 |
21 | Cluelessly doubling the same code is pretty useless, so `fork` returns a different value on the parent vs the child. On the parent, it returns the PID of the new child process, while on the child it returns 0. This makes it possible to do different work on the new process so that forking is actually helpful.
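In C, the standard pattern for acting on that return value looks like this (vanilla POSIX usage, not code from the article):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();              // one process goes in, two come out

    if (pid == 0) {
        // fork() returned 0: this is the child
        printf("child:  my PID is %d\n", getpid());
    } else if (pid > 0) {
        // fork() returned the child's PID: this is still the parent
        printf("parent: spawned child %d\n", pid);
        wait(NULL);                  // reap the child so it doesn't linger as a zombie
    } else {
        perror("fork");              // fork can fail, e.g. if the process table is full
    }
    return 0;
}
```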
22 |
23 |
101 |
102 | Killing the init process kills all of its children and all of their children, shutting down your OS environment.
103 |
104 | ## Back to the Kernel
105 |
106 | We had a lot of fun looking at Linux kernel code [back in chapter 3](/how-to-run-a-program), so we're gonna do some more of that! This time we'll start with a look at how the kernel starts the init process.
107 |
108 | Your computer boots up in a sequence like the following:
109 |
110 | 1. The motherboard is bundled with a tiny piece of software that searches your connected disks for a program called a *bootloader*. It picks a bootloader, loads its machine code into RAM, and executes it.
111 |
112 | Keep in mind that we are not yet in the world of a running OS. Until the OS kernel starts an init process, multiprocessing and syscalls don’t really exist. In the pre-init context, "executing" a program means directly jumping to its machine code in RAM without expectation of return.
113 | 2. The bootloader is responsible for finding a kernel, loading it into RAM, and executing it. Some bootloaders, like [GRUB](https://www.gnu.org/software/grub/), are configurable and/or let you select between multiple operating systems. BootX and Windows Boot Manager are the built-in bootloaders of macOS and Windows, respectively.
114 | 3. The kernel is now running and begins a large routine of initialization tasks including setting up interrupt handlers, loading drivers, and creating the initial memory mapping. Finally, the kernel switches the privilege level to user mode and starts the init program.
115 | 4. We're finally in userland in an operating system! The init program begins running init scripts, starting services, and executing programs like the shell/UI.
116 |
117 | ### Initializing Linux
118 |
119 | On Linux, the bulk of step 3 (kernel initialization) occurs in the `start_kernel` function in [init/main.c](https://github.com/torvalds/linux/blob/22b8cc3e78f5448b4c5df00303817a9137cd663f/init/main.c). This function is over 200 lines of calls to various other init functions, so I won't include [the whole thing](https://github.com/torvalds/linux/blob/22b8cc3e78f5448b4c5df00303817a9137cd663f/init/main.c#L880-L1091) in this article, but I do recommend scanning through it! At the end of `start_kernel` a function named `arch_call_rest_init` is called:
120 |
121 |
--------------------------------------------------------------------------------
/src/content/chapters/5-the-translator-in-your-computer.mdx:
--------------------------------------------------------------------------------
24 |
25 | When the computer first boots up, memory accesses go directly to physical RAM. Immediately after startup, the OS creates the translation dictionary and tells the CPU to start using the MMU.
26 |
27 | This dictionary is actually called a *page table*, and this system of translating every memory access is called *paging*. Entries in the page table are called *pages* and each one represents how a certain chunk of virtual memory maps to RAM. These chunks are always a fixed size, and each processor architecture has a different page size. x86-64 has a default 4 KiB page size, meaning each page specifies the mapping for a block of memory 4,096 bytes long.
28 |
29 | In other words, with 4 KiB pages the bottom 12 bits of an address will always be the same before and after MMU translation — 12, because that's the amount of bits needed to index the 4,096-byte page you get post-translation.
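Here's that split as plain arithmetic (a worked sketch; the frame number is made up, and the page table lookup itself is just pretended here):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t vaddr  = 0x4005D6;          // some virtual address
    uint64_t vpage  = vaddr >> 12;       // virtual page number: what the page table translates
    uint64_t offset = vaddr & 0xFFF;     // low 12 bits: position inside the 4,096-byte page

    uint64_t frame = 0x1234;             // pretend the page table maps vpage -> physical frame 0x1234
    uint64_t paddr = (frame << 12) | offset;  // the offset passes through untouched

    printf("virtual %#lx = page %#lx + offset %#lx -> physical %#lx\n",
           (unsigned long)vaddr, (unsigned long)vpage,
           (unsigned long)offset, (unsigned long)paddr);
    return 0;
}
```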
30 |
31 | x86-64 also allows operating systems to enable larger 2 MiB or 1 GiB pages, which can improve address translation speed but increase memory fragmentation and waste. The larger the page size, the smaller the portion of the address that's translated by the MMU.
32 |
33 |
34 |
35 | The page table itself just resides in RAM. While it can contain millions of entries, each entry is only a few bytes long (8 bytes on x86-64), so the page table doesn't take up too much space.
36 |
37 | To enable paging at boot, the kernel first constructs the page table in RAM. Then, it stores the physical address of the start of the page table in a register called the page table base register (PTBR). Finally, the kernel enables paging to translate all memory accesses with the MMU. On x86-64, control register 3 (CR3) acts as the PTBR, holding the page-aligned physical address of the top-level page table. Bit 31 of CR0, designated PG for Paging, is set to 1 to enable paging.
38 |
39 | The magic of the paging system is that the page table can be edited while the computer is running. This is how each process can have its own isolated memory space — when the OS switches context from one process to another, an important task is remapping the virtual memory space to a different area in physical memory. Let's say you have two processes: process A can have its code and data (likely loaded from an ELF file!) at `0x0000000000400000`, and process B can access its code and data from the very same address. Those two processes can even be instances of the same program, because they aren't actually fighting over that address range! The data for process A is somewhere far from process B in physical memory, and is mapped to `0x0000000000400000` by the kernel when switching to the process.
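Here's a small sketch you can run to see that in action. It uses `fork` (covered in detail in the next chapter) so that both processes see a variable at the exact same virtual address, even though each one's write lands in its own physical memory:

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int number = 0;  // same virtual address in parent and child, different physical memory

int main(void) {
    if (fork() == 0) {
        number = 1;  // the child's write goes to its own copy
    } else {
        number = 2;  // ...and so does the parent's
        wait(NULL);  // let the child print first
    }
    printf("pid %d sees number = %d at address %p\n", getpid(), number, (void *)&number);
}
```

Both processes print the same address but different values.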
40 |
41 |
42 |
43 | > **Aside: cursed ELF fact**
44 | >
45 | > In certain situations, `binfmt_elf` has to map the first page of memory to zeroes. Some programs written for UNIX System V Release 4.0 (SVr4), an OS from 1988 that was the first to support ELF, rely on null pointers being readable. And somehow, some programs still rely on that behavior.
46 | >
47 | > It seems like the Linux kernel dev implementing this was [a little disgruntled](https://github.com/torvalds/linux/blob/22b8cc3e78f5448b4c5df00303817a9137cd663f/fs/binfmt_elf.c#L1322-L1329):
48 | >
49 | > *"Why this, you ask??? Well SVr4 maps page 0 as read-only, and some applications 'depend' upon this behavior. Since we do not have the power to recompile these, we emulate the SVr4 behavior. Sigh."*
50 | >
51 | > Sigh.
52 |
53 | ## Security with Paging
54 |
55 | The process isolation enabled by memory paging improves code ergonomics (processes don't need to be aware of other processes to use memory), but it also creates a level of security: processes cannot access memory from other processes. This half answers one of the original questions from the start of this article:
56 |
57 | > If programs run directly on the CPU, and the CPU can directly access RAM, why can't code access memory from other processes, or, god forbid, the kernel?
58 |
59 | *Remember that? It feels like so long ago...*
60 |
61 | What about that kernel memory, though? First things first: the kernel obviously needs to store plenty of data of its own to keep track of all the processes running and even the page table itself. Every time a hardware interrupt, software interrupt, or system call is triggered and the CPU enters kernel mode, the kernel code needs to access that memory somehow.
62 |
63 | Linux's solution is to always allocate the top half of the virtual memory space to the kernel, so Linux is called a [*higher half kernel*](https://wiki.osdev.org/Higher_Half_Kernel). Windows employs a [similar](https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/overview-of-windows-memory-space) technique, while macOS is... [slightly](https://www.researchgate.net/figure/Overview-of-the-Mac-OS-X-virtual-memory-system-which-resides-inside-the-Mach-portion-of_fig1_264086271) [more](https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/AboutMemory.html) [complicated](https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/vm/vm.html) and caused my brain to ooze out of my ears reading about it. \~(++)\~
64 |
65 |
66 |
67 | It would be terrible for security if userland processes could read or write kernel memory though, so paging enables a second layer of security: each page must specify permission flags. One flag determines whether the region is writable or only readable. Another flag tells the CPU that only kernel mode is allowed to access the region's memory. This latter flag is used to protect the entire higher half kernel space — the entire kernel memory space is actually available in the virtual memory mapping for user space programs, they just don't have the permissions to access it.
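As a rough illustration, using the x86-64 bit positions for these flags (present, writable, user-accessible, and no-execute), you can picture each page table entry as a 64-bit value with permission bits packed in alongside the physical address:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT  (1ull << 0)   // page is mapped
#define PTE_WRITABLE (1ull << 1)   // writes allowed
#define PTE_USER     (1ull << 2)   // user mode may access; kernel-only if clear
#define PTE_NX       (1ull << 63)  // no-execute

int main(void) {
    // Hypothetical entry: a present, writable, kernel-only, non-executable data page.
    uint64_t pte = PTE_PRESENT | PTE_WRITABLE | PTE_NX | (0x1a2b3ull << 12);

    bool user_ok = (pte & PTE_PRESENT) && (pte & PTE_USER);
    printf("user mode may touch this page: %s\n", user_ok ? "yes" : "no");
}
```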
68 |
69 |
70 |
71 | The page table itself is actually contained within the kernel memory space! When the timer chip triggers a hardware interrupt for process switching, the CPU switches the privilege level to kernel mode and jumps to Linux kernel code. Being in kernel mode (Intel ring 0) allows the CPU to access the kernel-protected memory region. The kernel can then write to the page table (residing somewhere in that upper half of memory) to remap the lower half of virtual memory for the new process. When the kernel switches to the new process and the CPU enters user mode, it can no longer access any of the kernel memory.
72 |
73 | Just about every memory access goes through the MMU. Interrupt descriptor table handler pointers? Those address the kernel's virtual memory space as well.
74 |
75 | ## Hierarchical Paging and Other Optimizations
76 |
77 | 64-bit systems have memory addresses that are 64 bits long, meaning the 64-bit virtual memory space is a whopping 16 [exbibytes](https://en.wiktionary.org/wiki/exbibyte) in size. That is incredibly large, far larger than any computer that exists today or will exist any time soon. As far as I can tell, the most RAM in any computer ever was in the [Blue Waters supercomputer](https://en.wikipedia.org/wiki/Blue_Waters), with over 1.5 petabytes of RAM. That's still less than 0.01% of 16 EiB.
78 |
79 | If an entry in the page table was required for every 4 KiB section of virtual memory space, you would need 4,503,599,627,370,496 page table entries. With 8-byte-long page table entries, you would need 32 pebibytes of RAM just to store the page table alone. You may notice that's still larger than the world record for the most RAM in a computer.
80 |
81 | > **Aside: why the weird units?**
82 | >
83 | > I know it's uncommon and really ugly, but I find it important to clearly differentiate between binary byte size units (powers of 2) and metric ones (powers of 10). A kilobyte, kB, is an SI unit that means 1,000 bytes. A kibibyte, KiB, is an IEC-recommended unit that means 1,024 bytes. In terms of CPUs and memory addresses, byte counts are usually powers of two because computers are binary systems. Using KB (or worse, kB) to mean 1,024 would be more ambiguous.
84 |
85 | Since it would be impossible (or at least incredibly impractical) to have sequential page table entries for the entire possible virtual memory space, CPU architectures implement *hierarchical paging*. In hierarchical paging systems, there are multiple levels of page tables of increasingly small granularity. The top level entries cover large blocks of memory and point to page tables of smaller blocks, creating a tree structure. The individual entries for blocks of 4 KiB or whatever the page size is are the leaves of the tree.
86 |
87 | x86-64 historically uses 4-level hierarchical paging. In this system, each page table entry is found by offsetting the start of the containing table by a portion of the address. This portion starts with the most significant bits, which work as a prefix so the entry covers all addresses starting with those bits. The entry points to the start of the next level of table containing the subtrees for that block of memory, which are again indexed with the next collection of bits.
88 |
89 | The designers of x86-64's 4-level paging also chose to ignore the top 16 bits of all virtual pointers to save page table space. 48 bits gets you a 256 TiB virtual address space (canonically split into a 128 TiB lower half for user space and a 128 TiB upper half for the kernel), which was deemed to be large enough. (The full 64 bits would get you 16 EiB, which is kind of a lot.)
90 |
91 | Since the top 16 bits are skipped, the "most significant bits" for indexing the first level of the page table actually start at bit 47 rather than 63. This also means the higher half kernel diagram from earlier in this chapter was technically inaccurate; the kernel space start address should've been depicted as the midpoint of an address space smaller than 64 bits.
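For instance, here's how a 48-bit virtual address breaks down into the four 9-bit table indices plus the 12-bit page offset (a sketch with a made-up address):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t vaddr = 0x00007f1234567890;  // hypothetical canonical user-space address

    unsigned level4 = (vaddr >> 39) & 0x1ff;  // bits 47-39: index into the top-level table
    unsigned level3 = (vaddr >> 30) & 0x1ff;  // bits 38-30
    unsigned level2 = (vaddr >> 21) & 0x1ff;  // bits 29-21
    unsigned level1 = (vaddr >> 12) & 0x1ff;  // bits 20-12: index into the leaf table
    unsigned offset = vaddr & 0xfff;          // bits 11-0: offset within the 4 KiB page

    printf("%u -> %u -> %u -> %u, offset %#x\n", level4, level3, level2, level1, offset);
}
```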
92 |
93 |
94 |
95 | Hierarchical paging solves the space problem because at any level of the tree, the pointer to the next entry can be null (`0x0`). This allows entire subtrees of the page table to be elided, meaning unmapped areas of the virtual memory space don't take up any space in RAM. Lookups at unmapped memory addresses can fail quickly because the CPU can error as soon as it sees an empty entry higher up in the tree. Page table entries also have a presence flag that can be used to mark them as unusable even if the address appears valid.
96 |
97 | Another benefit of hierarchical paging is the ability to efficiently switch out large sections of the virtual memory space. A large swath of virtual memory might be mapped to one area of physical memory for one process, and a different area for another process. The kernel can store both mappings in memory and simply update the pointers at the top level of the tree when switching processes. If the entire memory space mapping was stored as a flat array of entries, the kernel would have to update a lot of entries, which would be slow and still require independently keeping track of the memory mappings for each process.
98 |
99 | I said x86-64 "historically" uses 4-level paging because recent processors implement [5-level paging](https://en.wikipedia.org/wiki/Intel_5-level_paging). 5-level paging adds another level of indirection as well as 9 more addressing bits to expand the address space to 128 PiB with 57-bit addresses. 5-level paging is supported by operating systems including Linux [since 2017](https://lwn.net/Articles/717293/) as well as recent Windows 10 and 11 server versions.
100 |
101 | > **Aside: physical address space limits**
102 | >
103 | > Just as operating systems don't use all 64 bits for virtual addresses, processors don't use entire 64-bit physical addresses. When 4-level paging was the standard, x86-64 CPUs didn't use more than 46 bits, meaning the physical address space was limited to only 64 TiB. With 5-level paging, support has been extended to 52 bits, supporting a 4 PiB physical address space.
104 | >
105 | > On the OS level, it's advantageous for the virtual address space to be larger than the physical address space. As Linus Torvalds [said](https://www.realworldtech.com/forum/?threadid=76912&curpostid=76973), "[i]t needs to be bigger, by a factor of _at least_ two, and that's quite frankly pushing it, and you're much better off having a factor of ten or more. Anybody who doesn't get that is a moron. End of discussion."
106 |
107 | ## Swapping and Demand Paging
108 |
109 | A memory access might fail for a few reasons: the address might be out of range, it might not be mapped by the page table, or it might have an entry that's marked as not present. In any of these cases, the MMU will trigger a hardware interrupt called a *page fault* to let the kernel handle the problem.
110 |
111 | Sometimes the access really was invalid or prohibited; in those cases, the kernel will probably terminate the program with a [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) error.
112 |
113 |
128 |
129 | For one, this allows syscalls like [mmap](https://man7.org/linux/man-pages/man2/mmap.2.html) that lazily map entire files from disk to virtual memory to exist. If you're familiar with LLaMa.cpp, a runtime for a leaked Facebook language model, Justine Tunney recently significantly optimized it by [making all the loading logic use mmap](https://justine.lol/mmap/). (If you haven't heard of her before, [check her stuff out](https://justine.lol/)! Cosmopolitan Libc and APE are really cool and might be interesting if you've been enjoying this article.)
130 |
131 | > *Apparently there's a [lot](https://rentry.org/Jarted) [of](https://news.ycombinator.com/item?id=35413289) [drama](https://news.ycombinator.com/item?id=35458004) about Justine's involvement in this change. Just pointing this out so I don't get screamed at by random internet users. I must confess that I haven't read most of the drama, and everything I said about Justine's stuff being cool is still very true.*
132 |
133 | When you execute a program and load its libraries, the kernel doesn't actually read anything into memory up front. It only creates an mmap of the file — when the CPU tries to execute the code, the page immediately faults and the kernel replaces the page with a real block of memory.
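You can see the same laziness from userspace with a tiny sketch (the file path is just an arbitrary example, and error handling is omitted):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);  // any readable file will do
    struct stat st;
    fstat(fd, &st);

    // This call returns almost instantly: the kernel only records the mapping,
    // it doesn't read any file contents yet.
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    // The first access to the mapped memory page faults, and the kernel
    // reads that page in from disk on demand.
    printf("first byte: %c\n", data[0]);

    munmap(data, st.st_size);
    close(fd);
}
```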
134 |
135 | Demand paging also enables the technique that you've probably seen under the name "swapping" or "paging." Operating systems can free up physical memory by writing memory pages to disk and then removing them from physical memory but keeping them in virtual memory with the present flag set to 0. If that virtual memory is read, the OS can then restore the memory from disk to RAM and set the present flag back to 1. The OS may have to swap a different section of RAM to make space for the memory being loaded from disk. Disk reads and writes are slow, so operating systems try to make swapping happen as little as possible with [efficient page replacement algorithms](https://en.wikipedia.org/wiki/Page_replacement_algorithm).
136 |
137 | An interesting hack is to reuse the physical address bits of a non-present page table entry to store where the page's contents live on disk. Since the MMU page faults as soon as it sees that the present flag is cleared, it doesn't matter that the stored value isn't a valid memory address. This isn't practical in all cases, but it's amusing to think about.
138 |
--------------------------------------------------------------------------------
/src/content/chapters/3-how-to-run-a-program.mdx:
--------------------------------------------------------------------------------
1 | ---
2 | chapter: 3
3 | title: How to Run a Program
4 | shortname: Exec
5 | slug: how-to-run-a-program
6 | updatedAt: 2023-07-24T15:57:08.044Z
7 | ---
8 |
9 | import CodeBlock from '../../components/CodeBlock.astro'
10 |
11 | So far, we've covered how CPUs execute machine code loaded from executables, what ring-based security is, and how syscalls work. In this section, we'll dive deep into the Linux kernel to figure out how programs are loaded and run in the first place.
12 |
13 | We're specifically going to look at Linux on x86-64. Why?
14 |
15 | - Linux is a fully featured production OS for desktop, mobile, and server use cases. Linux is open source, so it's super easy to research just by reading its source code. I will be directly referencing some kernel code in this article!
16 | - x86-64 is the architecture that most modern desktop computers use, and the target architecture of a lot of code. The subset of behavior I mention that is x86-64-specific will generalize well.
17 |
18 | Most of what we learn will generalize well to other operating systems and architectures, even if they differ in various specific ways.
19 |
20 | ## Basic Behavior of Exec Syscalls
21 |
22 |
23 |
24 | Let's start with a very important system call: `execve`. It loads a program and, if successful, replaces the current process with that program. A couple of other variants exist (`execlp`, `execvpe`, etc.), but they're C library functions that all layer on top of `execve` in various fashions.
25 |
26 | > **Aside: `execveat`**
27 | >
28 | > `execve` is *actually* built on top of `execveat`, a more general syscall that runs a program with some configuration options. For simplicity, we'll mostly talk about `execve`; the only difference is that it provides some defaults to `execveat`.
29 | >
30 | > Curious what `ve` stands for? The `v` means one parameter is the vector (list) of arguments (`argv`), and the `e` means another parameter is the vector of environment variables (`envp`). Various other exec syscalls have different suffixes to designate different call signatures. The `at` in `execveat` is just "at", because it specifies the location to run `execve` at.
31 |
32 | The call signature of `execve` is:
33 |
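Here's the prototype as documented in the `execve(2)` man page:

```c
int execve(const char *pathname, char *const argv[], char *const envp[]);
```

Note how it matches the suffixes from the aside above: a vector of arguments and a vector of environment variables.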
34 | if (fd == AT_FDCWD) \{ /\* special codepath \*/ \}.
115 |
116 | ### Step 1: Setup
117 |
118 | We've now reached `do_execveat_common`, the core function handling program execution. We're going to take a brief step back from staring at code to get a bigger picture view of what this function does.
119 |
120 | The first major job of `do_execveat_common` is setting up a struct called `linux_binprm`. I won't include a copy of [the whole struct definition](https://github.com/torvalds/linux/blob/22b8cc3e78f5448b4c5df00303817a9137cd663f/include/linux/binfmts.h#L15-L65), but there are several important fields to go over:
121 |
122 | - Data structures like `mm_struct` and `vm_area_struct` are set up to prepare virtual memory management for the new program.
123 | - `argc` and `envc` are calculated and stored to be passed to the program.
124 | - `filename` and `interp` store the filename of the program and its interpreter, respectively. These start out equal to each other, but can change in some cases: one such case is when running interpreted scripts with a [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)). When executing a Python program, for example, `filename` points to the source file but `interp` is the path to the Python interpreter.
125 | - `buf` is an array filled with the first 256 bytes of the file to be executed. It's used to detect the format of the file and load script shebangs.
126 |
127 | (TIL: binprm stands for **bin**ary **pr**ogra**m**.)
128 |
129 | Let's take a closer look at this buffer `buf`:
130 |
131 |
199 |
200 | COMPUTERS ARE SO COOL!
201 |
202 | Since shebangs are handled by the kernel, and pull from `buf` instead of loading the whole file, they're *always* truncated to the length of `buf`. Apparently, 4 years ago, someone got annoyed by the kernel truncating their >128-character paths, and their solution was to double the truncation point by doubling the buffer size! Today, on your very own Linux machine, if you have a shebang line more than 256 characters long, everything past 256 characters will be *completely lost*.
203 |
204 |
205 |
206 | Imagine having a bug because of this. Imagine trying to figure out the root cause of what's breaking your code. Imagine how it would feel, discovering that the problem is deep within the Linux kernel. Woe to the next IT person at a massive enterprise who discovers that part of a path has mysteriously gone missing.
207 |
208 | **The second wonky thing:** remember how it's only *convention* for `argv[0]` to be the program name, how the caller can pass any `argv` they want to an exec syscall and it will pass through unmoderated?
209 |
210 | It just so happens that `binfmt_script` is one of those places that *assumes* `argv[0]` is the program name. It always removes `argv[0]`, and then adds the following to the start of `argv`:
211 |
212 | - Path to the interpreter
213 | - Arguments to the interpreter
214 | - Filename of the script
215 |
216 | 217 | **Example: Argument Modification** 218 | 219 | Let's look at a sample `execve` call: 220 | 221 |244 | 245 | After updating `argv`, the handler finishes preparing the file for execution by setting `linux_binprm.interp` to the interpreter path (in this case, the Node binary). Finally, it returns 0 to indicate success preparing the program for execution. 246 | 247 | ### Format Highlight: Miscellaneous Interpreters 248 | 249 | Another interesting handler is `binfmt_misc`. It opens up the ability to add some limited formats through userland configuration, by mounting a special file system at `/proc/sys/fs/binfmt_misc/`. Programs can perform [specially formatted](https://docs.kernel.org/admin-guide/binfmt-misc.html) writes to files in this directory to add their own handlers. Each configuration entry specifies: 250 | 251 | - How to detect their file format. This can specify either a magic number at a certain offset or a file extension to look for. 252 | - The path to an interpreter executable. There's no way to specify interpreter arguments, so a wrapper script is needed if those are desired. 253 | - Some configuration flags, including one specifying how `binfmt_misc` updates `argv`. 254 | 255 | This `binfmt_misc` system is often used by Java installations, configured to detect class files by their `0xCAFEBABE` magic bytes and JAR files by their extension. On my particular system, a handler is configured that detects Python bytecode by its .pyc extension and passes it to the appropriate handler. 256 | 257 | This is a pretty cool way to let program installers add support for their own formats without needing to write highly privileged kernel code. 258 | 259 | ## In the End (Not the Linkin Park Song) 260 | 261 | An exec syscall will always end up in one of two paths: 262 | 263 | - It will eventually reach an executable binary format that it understands, perhaps after several layers of script interpreters, and run that code. At this point, the old code has been replaced. 264 | - ... or it will exhaust all its options and return an error code to the calling program, tail between its legs. 265 | 266 | If you've ever used a Unix-like system, you might've noticed that shell scripts run from a terminal still execute if they don't have a shebang line or `.sh` extension. You can test this out right now if you have a non-Windows terminal handy: 267 | 268 |222 | ```c 223 | // Arguments: filename, argv, envp 224 | execve("./script", [ "A", "B", "C" ], []); 225 | ``` 226 | 227 | 228 | This hypothetical `script` file has the following shebang as its first line: 229 | 230 |231 | ```js 232 | #!/usr/bin/node --experimental-module 233 | ``` 234 | 235 | 236 | The modified `argv` finally passed to the Node interpreter will be: 237 | 238 |239 | ```c 240 | [ "/usr/bin/node", "--experimental-module", "./script", "B", "C" ] 241 | ``` 242 | 243 |