├── A-Workflow
│   └── workflow.md
├── B-Sockets-ServerTesting
│   ├── example_html
│   │   ├── images
│   │   │   ├── crew.jpg
│   │   │   └── ship.jpg
│   │   └── index.html
│   ├── img
│   │   ├── client-server.png
│   │   └── listening-vs-connecting.png
│   ├── overview-sockets-http.pdf
│   └── sockets-servertesting.md
├── C-Linux-Kernel-Dev
│   └── linux-kernel-dev.md
├── D-Fridge
│   └── waitqueue.pdf
├── E-Freezer
│   ├── allclass.png
│   ├── final_simple_runqueue.png
│   ├── freezer.md
│   ├── freezer.png
│   ├── freezer_sched_class.md
│   ├── naive.png
│   └── task_struct.png
├── F-Farfetchd
│   ├── Page_table.png
│   ├── index.md
│   ├── x86_address_structure.png
│   ├── x86_address_structure.svg
│   ├── x86_paging_64bit.png
│   └── x86_paging_64bit.svg
├── G-Pantry
│   ├── page-cache-overview.pdf
│   ├── pantryfs-fill-super-diagram.svg
│   └── pantryfs-spec.pdf
└── README.md
/A-Workflow/workflow.md:
--------------------------------------------------------------------------------
1 | VM/Kernel Workflow
2 | ==================
3 | Before we get into the heavier assignments in this course, you should invest
4 | some time into setting up a workflow and developing good coding habits.
5 |
6 | Why Use a VM?
7 | -------------
8 |
9 | - Hypervisor: Software responsible for managing VMs (e.g. creation, deletion,
10 | resource allocation). In this class, we use VMware.
11 | - Host machine: The hardware running the hypervisor (e.g. your personal laptop)
12 | - Virtual machine (VM): A computer system created by the hypervisor that shares
13 | resources with the host machine. In this class, we virtualize Debian 11 (a
14 | flavor of Linux). VMs provide sandbox properties:
15 | - Isolate the work you'll be doing from the rest of the system
16 | - Be able to take snapshots of your development environment so that if it does
17 | get corrupted, you can revert to a clean state
18 |
19 | Working on a VM can be preferable to working on your host machine (i.e.
20 | "bare-metal"). You can virtualize other operating systems to test your
21 | applications. You can work on potentially buggy/dangerous code without
22 | compromising your host machine.
23 |
24 |
25 | - Snapshot: A feature most hypervisors support. It captures the current running
26 | state of the VM, stores it, and allows you to revert to that state
27 | some time later. You should snapshot your VM before executing something that
28 | can compromise the system.
29 |
30 | In this class, we will often be "hacking" the Linux kernel and attempting to
31 | boot into those hacked kernels. We need to use VMs since not all of us have a
32 | Linux host machine. Even if we did, we wouldn't want to ruin our host machines
33 | with our potentially buggy kernels.
34 |
35 | VM Setup/Workflow
36 | -----------------
37 | We've already written a series of guides for working on your VM.
38 |
39 | First and foremost, you should have a Debian VM installed already (see [Debian
40 | VM Setup](https://cs4118.github.io/dev-guides/debian-vm-setup.html) for a guide
41 | on how to do this).
42 |
43 | ### Working on your VM
44 | You have more options than simply working on the VM's graphical interface, which
45 | often feels clunky.
46 |
47 | The most common workflow involves SSHing into your VM, which we've written a
48 | [guide](https://cs4118.github.io/dev-guides/vm-ssh.html#ssh-into-your-local-vm)
49 | for. This is a good option if you want to conserve processing power on your host
50 | machine and disable the VM's graphical interface.
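
To make this even more convenient, you can add an entry to `~/.ssh/config` on
your host machine so that `ssh vm` just works. The alias, address, and username
below are placeholders -- substitute whatever your VM actually uses:

```
Host vm
    HostName 192.168.64.5
    User debian
```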
51 |
52 | Alternatively, you can set up an IDE to SSH into your VM. One option is using
53 | Visual Studio Code, which we've written up some notes (below) on how to use.
54 | This is a nice alternative to command-line editors like vim/emacs if you're
55 | not familiar with them.
56 |
57 | ### Additional Tools
58 | If you do decide to work in a command-line environment, here are some tools
59 | we've used to streamline our workflow:
60 |
61 | - `bat`: A better version of `cat`
62 | [(installation)](https://github.com/sharkdp/bat#on-ubuntu-using-most-recent-deb-packages)
63 | - `grep`: Pattern-match files (learn how to use regexes effectively to improve
64 | search results). Even better, `ripgrep` -- see the sketch after this list.
65 | - Reverse-i search: Efficiently search through bash history instead of retyping
66 | long commands.
67 | - `tmux`: Terminal multiplexer. (e.g. open 2 panes for vim, 1 for `dmesg`, 1 for
68 | running commands).
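
As an example of the kind of search you'll be doing constantly during the kernel
assignments (the symbol and output are omitted/illustrative), `ripgrep` can find a
definition anywhere in a large source tree very quickly:

```console
$ rg -nF 'struct task_struct {' include/linux/
```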
69 |
70 | Kernel Workflow
71 | ---------------
72 |
73 | ### Editor enhancements
74 |
75 | Before getting into hw4/supermom, please read through one of our kernel developer
76 | workflow guides. These explain how to set up either `vim` or VSCode for kernel
77 | development.
78 |
79 | - [Vim workflow](https://cs4118.github.io/dev-guides/vim-workflow.html): Here,
80 | we offer a bunch of cool `vim` additions/tricks to help you develop more
81 | efficiently while working on the kernel assignments. Note that this guide is
82 | only relevant if you intend to work on the command-line using `vim`. One notable
83 | mention here is `cscope`. This is a kernel-source navigator that works directly
84 | in your terminal/vim session. This is far more powerful than using `grep` to
85 | look for a symbol definition/use in the source-tree.
86 | - [VSCode workflow](https://cs4118.github.io/dev-guides/vscode-workflow.html):
87 | In this guide, we explain how to configure autocomplete, formatting, and
88 | various other features for working on the kernel while using VSCode. VSCode
89 | is a very powerful editor, and being able to take advantage of its functionality
90 | will make your life much easier.
91 |
92 | ### Web Tools
93 |
94 | If you don't want to use `cscope`, there's a popular online kernel-source
95 | navigator: [bootlin](https://elixir.bootlin.com/linux/v5.10.158/source). Note
96 | that kernel version matters when you're navigating code – be sure you select the
97 | correct version.
98 |
99 | Like `cscope`, bootlin lets you search the kernel source for symbols and find
100 | where they are defined and used. However, bootlin doesn't index all
101 | symbols, so you might have better luck searching with `cscope`.
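
If you do go the `cscope` route, the kernel's build system can generate the index
for you. Something along these lines (run from the top of the kernel source tree)
should work, though details may vary with your setup:

```console
$ make cscope    # generate the cscope database for the whole source tree
$ cscope -d      # browse it without regenerating the index
```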
102 |
--------------------------------------------------------------------------------
/B-Sockets-ServerTesting/example_html/images/crew.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/example_html/images/crew.jpg
--------------------------------------------------------------------------------
/B-Sockets-ServerTesting/example_html/images/ship.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/example_html/images/ship.jpg
--------------------------------------------------------------------------------
/B-Sockets-ServerTesting/example_html/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | Star Trek: The Next Generation, the best TV show ever!
5 |
6 |
7 |
8 |
9 | The ship:
10 |
11 |
12 |
13 | And the crew:
14 |
15 |
16 |
17 |
18 |
19 |
--------------------------------------------------------------------------------
/B-Sockets-ServerTesting/img/client-server.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/img/client-server.png
--------------------------------------------------------------------------------
/B-Sockets-ServerTesting/img/listening-vs-connecting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/img/listening-vs-connecting.png
--------------------------------------------------------------------------------
/B-Sockets-ServerTesting/overview-sockets-http.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/overview-sockets-http.pdf
--------------------------------------------------------------------------------
/B-Sockets-ServerTesting/sockets-servertesting.md:
--------------------------------------------------------------------------------
1 | # Sockets and Server Testing
2 |
3 | ## Sockets and HTTP
4 |
5 | ### What is a socket?
6 | A socket is a software construct used for many modes of communication between
7 | processes. The mode of communication that this recitation will focus on is
8 | network communication. In particular, stream sockets represent an endpoint for
9 | reliable, bidirectional connections such as TCP connections. This allows for two
10 | processes, on separate computers, to communicate over a TCP/IP network
11 | connection.
12 |
13 | Sockets have:
14 | - an IP address, to (typically) identify the computer that the socket endpoint
15 | belongs to
16 | - a port number, to identify which process running on the computer the socket
17 | endpoint belongs to
18 | - a protocol, such as TCP (reliable) or UDP (unreliable). Stream sockets use TCP
19 |
20 | An IP address and port number are both required in order for a computer to
21 | communicate with a specific process on a remote computer.
22 |
23 | ### The client-server model
24 | The two endpoints in a socket connection serve different roles. One end acts as
25 | a *server*:
26 | - It tells the operating system that it should receive incoming connections on a
27 | port number
28 | - It waits for incoming connections
29 | - When it receives a connection, it creates a *new socket* for each client,
30 | which will then be used to communicate with that client
31 |
32 | The other end is a *client*:
33 | - It "connects" to the server using the server’s IP address and the port number
34 |
35 | After a client connects to a server, there is bidirectional communication
36 | between the two processes, often with I/O system calls such as `read()` and
37 | `write()`, or their socket-specific variants `recv()` and `send()`.
38 |
39 | ### Sockets with netcat
40 | A simple way to demonstrate the bidirectional and network-based communication of
41 | sockets is with `netcat`. `netcat` is a bare-bones program to send streams of
42 | binary data over the network.
43 |
44 | Imagine we have two computers that can communicate over the internet, with the
45 | IP addresses `clac.cs.columbia.edu` and `clorp.cs.nyu.edu`.
46 |
47 | Because of the client-server model, connecting two socket endpoints to each
48 | other is not a symmetrical process. One socket needs to act as the server, while
49 | the other needs to act as a client. You tell `netcat` to act as a server with
50 | the `-l` flag:
51 |
52 | ```console
53 | joy@clac.cs.columbia.edu:~$ nc -l 10000
54 | ```
55 |
56 | The `netcat` program on `clac.cs.columbia.edu` will create a socket and wait for
57 | connections on port 10000. To tell `netcat` to act as a client, you supply the
58 | IP address of the server and the port number of the socket listening on that
59 | server:
60 |
61 | ```console
62 | jeremy@clorp.cs.nyu.edu:~$ nc clac.cs.columbia.edu 10000
63 | ```
64 |
65 | Notice the differences between these two commands. The first command only
66 | requires a port number, and doesn't require the IP address of the other
67 | computer. The second command requires knowledge of both the IP address (what
68 | computer to connect to) and the port number (which process to connect to on that
69 | computer). This asymmetry is the client-server model.
70 |
71 | After the client connects to the server, the server `netcat` process creates a
72 | new socket for bidirectional communication. After the two processes connect,
73 | there is no functional difference between client and server. What you type on
74 | one end should be visible on the other -- a full duplex stream of data.
75 |
76 | ### Sockets API Summary
77 | 
78 |
79 | **`socket()`**
80 | - Called by both the client and the server
81 | - On the server-side, a listening socket is created; a connected socket will be
82 | created later by `accept()`
83 |
84 | **`bind()`**
85 | - Usually called only by the server
86 | - Binds the listening socket to a specific port that should be known to the
87 | client
88 |
89 | **`listen()`**
90 | - Called only by the server
91 | - Sets up the listening socket to accept connections
92 |
93 | **`accept()`**
94 | - Called only by the server
95 | - By default blocks until a connection request arrives
96 | - Creates and returns a new socket for each client
97 |
98 | **`connect()`**
99 | - Called only by the client
100 | - Requires an IP address and port number to connect to
101 | - Attempts to establish a connection by reaching out to the server
102 |
103 | **`send()` and `recv()`**
104 | - Called by both the client and server
105 | - Reads and writes to the other side
106 | - Message boundaries may not be preserved
107 | - Nearly the same as `write()` and `read()`, but with socket-specific options
108 |
109 | A TCP client may use these functions as such:
110 | ```c
111 | int fd = socket(...);
112 | connect(fd, ... /* server address */);
113 |
114 | // Communicate with the server by send()ing to and recv()ing from fd.
115 |
116 | close(fd);
117 | ```
118 |
119 | And a TCP server:
120 |
121 | ```c
122 | int serv_fd = socket(...);
123 | bind(serv_fd, ... /* server address */);
124 | listen(serv_fd, ... /* max pending connections */);
125 |
126 | // use an infinite loop, to continue accepting new incoming clients
127 | for (;;) {
128 | int clnt_fd = accept(serv_fd, ...);
129 |
130 | // Communicate with the client by send()ing to and recv()ing from
131 | // clnt_fd, NOT serv_fd.
132 |
133 | close(clnt_fd);
134 | }
135 | ```
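
The `...` arguments in these skeletons are intentionally elided. As a rough sketch
of what a server might pass -- assuming IPv4 and a hard-coded port, with error
checking omitted for brevity (this is not the reference implementation for HW3):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	int serv_fd = socket(AF_INET, SOCK_STREAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* any local interface */
	addr.sin_port = htons(10000);              /* same port we gave netcat */

	bind(serv_fd, (struct sockaddr *)&addr, sizeof(addr));
	listen(serv_fd, 16);

	for (;;) {
		char buf[4096];
		int clnt_fd = accept(serv_fd, NULL, NULL);
		ssize_t n = recv(clnt_fd, buf, sizeof(buf), 0);

		if (n > 0)
			send(clnt_fd, buf, n, 0);  /* echo back whatever arrived */
		close(clnt_fd);
	}
}
```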
136 |
137 | ### Listening socket vs connected socket
138 | 
139 |
140 | To form a bidirectional channel between client and server, three sockets are used:
141 | - The server uses two sockets
142 | - The listening socket, to accept incoming connections from a client
143 | - The client socket, which is created when an incoming connection has been
144 | `accept()`ed.
145 | - The client uses one socket
146 | - The `connect()`ing socket, which reaches out to the server. Once the
147 | connection has been made, communication can be done between the server's client
148 | socket and the client's connecting socket.
149 |
150 | ### HTTP 1.0
151 | HTTP 1.0 is a protocol between a client, typically a web browser, and a server,
152 | typically a web server hosting files such as HTML. It is an outdated version of
153 | the HTTP protocol and simpler than newer versions.
154 |
155 | When visiting a website, a URL is specified in the following format:
156 |
157 | ```
158 | http://example.com:80/index.html
159 | ^^^^   ^^^^^^^^^^^ ^^ ^^^^^^^^^^
160 |  |          |      |      |
161 |  |          |      |      URI = /index.html
162 |  |          |      port number = 80
163 |  |          domain name = example.com
164 |  protocol = HTTP
165 | ```
166 |
167 | Based on the information provided by the user in the URL, a web client will
168 | establish a socket connection with the IP address of the domain name. After
169 | establishing the connection, the two computers exchange text in the form of HTTP
170 | requests:
171 |
172 | - The client sends an **HTTP request** for a resource on the server
173 | - The server sends an **HTTP response**
174 |
175 | **HTTP request**
176 | - First line: method, request URI, version
177 | - Ex: "GET /index.html HTTP/1.0\r\n"
178 | - Followed by 0 or more headers
179 | - Ex: "Host: www.google.com\r\n"
180 | - Followed by an empty line
181 | - "\r\n"
182 |
183 | **HTTP response**
184 | - First line: response status
185 | - Success: "HTTP/1.0 200 OK\r\n"
186 | - Failure: "HTTP/1.0 404 Not Found\r\n"
187 | - Followed by 0 or more response headers
188 | - Followed by an empty line
189 | - "\r\n"
190 | - Followed by the content of the response
191 | - Ex: image file or HTML file
192 |
193 | We can see the contents of real HTTP requests using `netcat` by pretending to be
194 | either a web client or server. Our client and server won't actually work, since
195 | they simply receive the incoming request but do nothing to process the request
196 | or reply.
197 |
198 | Let's first act as a web server. We tell `netcat` to open a server connection
199 | with `nc -l 10000`, and then in a web browser navigate to the URL with the
200 | domain name of this server. We can use the domain name `localhost` to specify
201 | the local computer rather than connecting to a remote computer over the
202 | internet. In Chrome, we'll navigate to the URL
203 | `http://localhost:10000/index.html`. `netcat` outputs this:
204 |
205 | ```console
206 | $ nc -l 10000
207 | GET /index.html HTTP/1.1 # GET == method; /index.html == request URI; HTTP/1.1 == version
208 | Host: localhost:10000 # header
209 | Connection: keep-alive # more headers...
210 | -removed for brevity-
211 | # blank newline to indicate end of headers/request
212 | ```
213 |
214 | To act as a client, we can type our HTTP request manually into netcat rather
215 | than doing it through the web browser. Here, we try to send an HTTP request to
216 | the domain name `example.com` on port `80` (the default for HTTP web servers)
217 | for the URI `/index.html`. Note that we specify the `-C` with `netcat` so that
218 | newlines are `\r\n` rather than `\n` -- a requirement of the HTTP protocol. This
219 | flag may vary depending on `netcat` version -- check `man nc`.
220 |
221 | ```console
222 | $ nc -C example.com 80
223 | GET /index.html HTTP/1.0 # GET == method; /index.html == request URI; HTTP/1.0 == version
224 | # blank line to specify end of request
225 | HTTP/1.0 200 OK # start of HTTP response. HTTP/1.0 == version; 200 OK == response status
226 | Accept-Ranges: bytes # header
227 | Content-Type: text/html # more headers...
228 | -removed for brevity-
229 | # blank newline to indicate end of headers and start of file contents
230 | # HTML contents
231 |
232 |
233 | Example Domain
234 | -removed for brevity-
235 | ```
236 |
237 | ## Testing your multi-server
238 |
239 | ### Siege
240 | Siege is a command-line tool that allows you to benchmark your webserver using
241 | load testing. Given a few parameters, Siege gives you information about the
242 | number of successful transactions to your website, percent availability, the
243 | latency/throughput of your server, and more.
244 |
245 | To install siege, run the following command:
246 |
247 | ```
248 | sudo apt install siege
249 | ```
250 |
251 | To use siege with your webserver in HW3, run your server and test with the
252 | following command:
253 |
254 | ```
255 | siege http://<server-ip>:<port>/
256 | ```
257 |
258 | This will run indefinitely. When you
259 | Ctrl-C out of the command, a list of statistics will be
260 | printed to your terminal.
261 |
262 | A better way to test with siege is using its options. The `-c` and `-r` options
263 | are particularly useful, as they allow you to specify the number of concurrent
264 | "users" and repetitions per user, respectively. For example, the following
265 | command will create 25 concurrent users that will each attempt to hit the server
266 | 50 times, resulting in 1250 hit attempts:
267 |
268 | ```
269 | siege -c 25 -r 50 http://<server-ip>:<port>/
270 | ```
271 |
272 | There are many other options, specified in the siege man page. These include
273 | `-t`, which specifies how long each user should run (as opposed to how many
274 | times), and `-f`, which specifies a file path that contains a list of URLs to
275 | test.
276 |
277 | ### Additional guidance on testing/benchmarking
278 | When grading, we're going to test your implementation using a mix of manual
279 | connections (e.g. using tools like netcat) and stress testers like siege.
280 |
281 | You should use `netcat` to make sure that basic functionality works and that you
282 | can indeed have more than 1 connection to the server at any given time. `netcat`
283 | is nice because it allows you to establish a connection and then prompts you for
284 | the data to send. You should also use `netcat` to test that your cleanup logic
285 | is correct, as you can control exactly when connections start/terminate.
286 |
287 | Your server should be resilient to any client failure. `netcat` is a useful tool
288 | to test these kinds of failures, as you can simulate bad requests or
289 | disconnections at various points during the transaction. Your server should be
290 | able to gracefully handle these scenarios -- under no condition should your
291 | server crash because of a client failure.
292 |
293 | Once you've tested the basic functionality, use a stress tester to make sure
294 | that your server handles concurrent hordes of requests in a reasonable amount of
295 | time. Since we're all on VMs running on different host machines, we can't really
296 | say "X requests must finish in Y seconds". We're just looking to make sure that
297 | your server doesn't take years (e.g. because it is actually serializing
298 | requests).
299 |
300 | Our grading scripts make heavy use of siege URL files: siege makes
301 | requests using the URLs specified in such a file. Use this to make sure your
302 | server can concurrently handle all kinds of requests and correctly respond to
303 | all of them (2xx, 4xx, 5xx, multi-server never responds with 3xx).
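
For reference, a siege URL file is just a plain-text list of URLs, one per line.
A hypothetical `urls.txt` for a local server (the port and paths below are made
up) might look like:

```
http://localhost:8080/index.html
http://localhost:8080/images/ship.jpg
http://localhost:8080/does/not/exist.html
```

You would then point siege at it with something like `siege -c 10 -r 10 -f urls.txt`.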
304 |
305 | Regarding benchmarking, the assignment prompt occasionally instructs you to
306 | compare the performance of one part's implementation with another's. However,
307 | since you are testing multi-server in a virtual machine, the later part isn’t
308 | guaranteed to perform significantly better. As such, don’t worry too much about the
309 | benchmarking instructions - it’s not a hard and fast requirement.
310 |
311 |
312 | ## Acknowledgements
313 | - Some examples were taken from John Hui's [Advanced
314 | Programming](https://cs3157.github.io/www/2022-9/) lecture notes. We recommend
315 | reading them on top of these recitation notes.
316 | - [Lecture 13 - TCP/IP Networking](https://cs3157.github.io/www/2022-9/lecture-notes/13-tcp-ip.pdf)
317 | - [Lecture 14 - HTTP](https://cs3157.github.io/www/2022-9/lecture-notes/14-http.pdf)
318 |
--------------------------------------------------------------------------------
/C-Linux-Kernel-Dev/linux-kernel-dev.md:
--------------------------------------------------------------------------------
1 | Linux Kernel Development
2 | ========================
3 | This document is meant to supplement our [compilation
4 | guide](https://cs4118.github.io/dev-guides/debian-kernel-compilation.html),
5 | which goes over how to install a new kernel on your machine.
6 |
7 | Kernel Module Makefile
8 | ----------------------
9 | More info in our [kernel module
10 | guide](https://cs4118.github.io/dev-guides/linux-modules.html), but for the sake
11 | of comparison, here it is again:
12 |
13 | ```make
14 | obj-m += hello.o
15 | all:
16 | make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
17 | clean:
18 | make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
19 | ```
20 |
21 | This Makefile is separate from the top-level kernel source Makefile we're about
22 | to work with. Here, we're adding to the `obj-m` "goal definition" (build object
23 | as module) and using a specialized Makefile located at `/lib/modules/$(shell
24 | uname -r)/build` to link our module into the kernel.
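
For context, the `hello.o` target above would be built from a `hello.c` along
these lines -- a minimal sketch of a module, not anything specific to this
course's assignments:

```c
#include <linux/init.h>
#include <linux/module.h>

static int __init hello_init(void)
{
	pr_info("hello: module loaded\n");
	return 0;
}

static void __exit hello_exit(void)
{
	pr_info("hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");
```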
25 |
26 | Top-level Kernel Makefile
27 | -------------------------
28 |
29 | `linux/Makefile` is the only Makefile you'll be executing when developing with
30 | the kernel. At a high level, this Makefile is responsible for:
31 |
32 | - recursively calling Makefiles in subdirectories
33 | - generating kernel configurations
34 | - building and installing kernel images
35 |
36 | Here's an overview of the common operations you'll be doing with this Makefile.
37 |
38 | ### Preparing for development
39 |
40 | It's good practice to ensure you have a clean kernel source before beginning
41 | development. Here are the cleaning options that this Makefile provides, as
42 | reported by `make help`:
43 |
44 | ```
45 | Cleaning targets:
46 | clean - Remove most generated files but keep the config and
47 | enough build support to build external modules
48 | mrproper - Remove all generated files + config + various backup files
49 | distclean - mrproper + remove editor backup and patch files
50 | ```
51 |
52 | `make mrproper` is usually sufficient for our purposes. Be warned that you'll
53 | have to completely rebuild the kernel after running this!
54 |
55 | ### Configuration and compilation
56 |
57 | `linux/.config` is the kernel's configuration file. It lists a bunch of options
58 | that determine the properties of the kernel you're about to build. It also
59 | determines what code will be compiled and linked into the final kernel image.
60 | For example, if `CONFIG_SMP` is set, you're letting the kernel know that you
61 | have more than one processor so it can provide multi-processing functionality.
62 |
63 | There's a bunch of different ways to generate this configuration file. Here's
64 | the relevant options we have, as reported by `make help` in the top-level kernel
65 | directory:
66 |
67 | ```
68 | Configuration targets:
69 | config - Update current config utilising a line-oriented program
70 | menuconfig - Update current config utilising a menu based program
71 | oldconfig - Update current config utilising a provided .config as base
72 | localmodconfig - Update current config disabling modules not loaded
73 | defconfig - New config with default from ARCH supplied defconfig
74 | olddefconfig - Same as oldconfig but sets new symbols to their
75 | default value without prompting
76 | ```
77 |
78 | Summarizing from our [compilation
79 | guide](https://cs4118.github.io/dev-guides/debian-kernel-compilation.html), we
80 | set up our kernel config in a couple of steps:
81 |
82 | ```
83 | make olddefconfig # Use current kernel's .config + ARCH defaults
84 | make menuconfig # Manually edit some configs
85 | ```
86 |
87 | If you want to **significantly** reduce your build time, you can also set your
88 | config to skip unloaded modules during compilation:
89 |
90 | ```sh
91 | yes '' | make localmodconfig
92 | ```
93 |
94 | As you're developing in the kernel, you might add additional source files that
95 | need to be compiled and linked in. Open up the directory's Makefile and add your
96 | desired object file to the `obj-y` goal definition, which is for "built-in"
97 | functionality (as opposed to kernel modules). For example, here's the relevant
98 | portion of `linux/kernel/Makefile`:
99 |
100 | ```make
101 | obj-y = fork.o exec_domain.o panic.o \
102 | cpu.o exit.o softirq.o resource.o \
103 | sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
104 | signal.o sys.o umh.o workqueue.o pid.o task_work.o \
105 | extable.o params.o \
106 | kthread.o sys_ni.o nsproxy.o \
107 | notifier.o ksysfs.o cred.o reboot.o \
108 | async.o range.o smpboot.o ucount.o
109 | ```
110 |
111 | If you were adding a source file to `linux/kernel`, you'd add the object file
112 | target here.
113 |
114 | Once you're ready to compile your kernel, you run the following in the top-level
115 | source directory:
116 |
117 | ```sh
118 | make -j $(nproc)
119 | ```
120 |
121 | This will compile your kernel across all available CPUs, as reported by `nproc`.
122 | Note that this top-level Makefile will recursively build source code in
123 | subdirectories for you!
124 |
125 | ### Kernel installation
126 |
127 | Like in our [compilation
128 | guide](https://cs4118.github.io/dev-guides/debian-kernel-compilation.html), you
129 | run the following command once the compilation finishes:
130 |
131 | ```sh
132 | sudo make modules_install && sudo make install
133 | ```
134 |
135 | The first time you install a kernel, you must run the `modules_install` target. On
136 | subsequent installs (if you haven't changed any modules), `install` alone is sufficient.
137 |
138 | Finally, reboot and select the kernel version you just installed via the GRUB
139 | menu!
140 |
141 | Kernel Code Style
142 | -----------------
143 |
144 | The kernel source comes with a linter written in Perl (located at
145 | `linux/scripts/checkpatch.pl`).
146 |
147 | checkpatch will let you know about any stylistic errors and warnings that it
148 | finds in the codebase. You should get into the habit of running checkpatch and
149 | fixing what it suggests before major checkpoints.
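
checkpatch is normally run on patches, but the `-f` flag lets you point it at a
source file directly. The file path here is just an example -- run it from the
top of your kernel source tree against whichever files you've touched:

```console
$ ./scripts/checkpatch.pl -f kernel/sched/core.c
```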
150 |
151 | If you want a general overview of kernel code style, here's
152 | [one](https://www.kernel.org/doc/html/v5.10/process/coding-style.html). You can
153 | also find this in `linux/Documentation/process/coding-style.rst`.
154 |
155 | Debugging Techniques
156 | --------------------
157 |
158 | - Take
159 | [snapshots](https://docs.vmware.com/en/VMware-Fusion/12/com.vmware.fusion.using.doc/GUID-4C90933D-A31F-4A56-B5CA-58D3AE6E93CF.html)
160 | of your VM before installing software that may corrupt it.
161 | - Use `printk`/`pr_info` to log messages to the kernel log buffer (viewable in
162 | userspace by running `sudo dmesg`); see the sketch after this list
163 | - Set up a [serial port](http://cs4118.github.io/freezer/#tips) to redirect the log
164 | buffer to your host machine (in case the VM crashes before you can check it with
165 | `dmesg`).
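
As a quick sketch of the `pr_info` approach, here is a hypothetical helper you
might drop into whatever kernel code you're modifying (the prefix and message
format are arbitrary):

```c
#include <linux/printk.h>
#include <linux/sched.h>

/* Hypothetical helper: log which task some code path just touched. */
static void log_task_event(struct task_struct *p, const char *what)
{
	pr_info("hw: %s pid %d (%s)\n", what, p->pid, p->comm);
}
```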
166 |
--------------------------------------------------------------------------------
/D-Fridge/waitqueue.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/D-Fridge/waitqueue.pdf
--------------------------------------------------------------------------------
/E-Freezer/allclass.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/allclass.png
--------------------------------------------------------------------------------
/E-Freezer/final_simple_runqueue.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/final_simple_runqueue.png
--------------------------------------------------------------------------------
/E-Freezer/freezer.md:
--------------------------------------------------------------------------------
1 | # Implementing a scheduling class
2 |
3 | ## Introduction
4 |
5 | One of the reasons students struggle with this assignment boils down to a lack
6 | of understanding of how the scheduler works. In this guide, I hope to provide
7 | you with a clear understanding of how the Linux scheduler pieces fit together. I
8 | hope to paint a picture that you can use to implement the freezer scheduler.
9 |
10 | ## The `task_struct`
11 |
12 | Recall from HW1 that in Linux, every process is defined by its `struct
13 | task_struct`. When you have multiple tasks forked off a common parent, they are
14 | linked together in a doubly linked-list `struct list_head sibling` embedded
15 | within the `task_struct`. For example, if you had four processes running on your
16 | system, each forked off one parent, it would look something like this (the
17 | parent is not shown):
18 |
19 |
20 |
21 |
22 |
23 | However, at this stage, none of these processes are actually running on a CPU.
24 | In order to get them onto a CPU, I need to introduce you to the `struct rq`.
25 |
26 | ## The `struct rq`
27 |
28 | The `struct rq` is a per-cpu run queue data structure. I like to think of it as
29 | the virtual CPU. It contains a lot of information (most of which goes way over
30 | my head), but it also includes the list of tasks that will (eventually) run on
31 | that CPU.
32 |
33 | A naive implementation would be to embed a `struct list_head runqueue_head` (for
34 | example) into the `struct rq`, and embed a `struct list_head node` into every
35 | `task_struct`.
36 |
37 |
38 |
39 | This is a BAD implementation.
40 |
41 |
42 | The main problem with this implementation is that it does not extend well. At
43 | this point, you know Linux has more than one scheduling class. Linux ships
44 | with a deadline class, a real-time class, and the primary CFS.
45 | `list_head` embedded directly into the `struct rq` for each scheduling class is
46 | not feasible.
47 |
48 | The solution is to create a new structure containing the `list_head` and any
49 | bookkeeping variables. Then, we can include just the wrapper structure in the
50 | `struct rq`. Linux includes these structures in `linux/kernel/sched/sched.h`.
51 |
52 | By convention, Linux scheduler-specific wrapper structures are named `struct
53 | <class>_rq`. For example, the CFS class defines a `struct cfs_rq`, which is
54 | then declared inside of `struct rq` as `struct cfs_rq cfs`.
55 |
56 | The following snippet is taken from `linux/kernel/sched/sched.h`:
57 |
58 | ```c
59 | struct cfs_rq {
60 | struct load_weight load;
61 | unsigned int nr_running;
62 | unsigned int h_nr_running;
63 | unsigned int idle_h_nr_running;
64 | /* code omitted */
65 | };
66 |
67 | struct rq {
68 | /* code omitted */
69 | struct cfs_rq cfs;
70 | struct rt_rq rt;
71 | struct dl_rq dl;
72 | /* code omitted */
73 | };
74 | ```
75 |
76 | ## The `freezer_rq`
77 |
78 | At this point, you've probably guessed that you will need to do the same thing
79 | for freezer. You are right. The `freezer_rq` should include the head of the
80 | freezer runqueue. Additionally, you may need to include some bookkeeping
81 | variables. Think of what you would actually need and don't add anything extra
82 | (it should be pretty simple).
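
To make this concrete, here is one possible shape for the struct; the field names
are illustrative, not something the assignment prescribes:

```c
/* e.g. alongside the other per-class runqueues in kernel/sched/sched.h */
struct freezer_rq {
	struct list_head queue;		/* head of the freezer runqueue */
	unsigned int	 nr_running;	/* bookkeeping: how many tasks are queued */
};
```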
83 |
84 | ## The `sched_freezer_entity`
85 |
86 | Now that you have the `struct rq` set up, you need some mechanism to link
87 | your `task_struct`s into the queue. Here, too, you can't just include a
88 | `list_head node` to add a task onto the scheduler-specific runqueue because
89 | you'll need additional bookkeeping. As you have probably guessed, we are going
90 | to wrap the `list_head` and all the bookkeeping variables into their own struct.
91 |
92 | In Linux, we name these structs `sched_{class}_entity` (one exception is that
93 | CFS names this `sched_entity`). For example, the real-time scheduling class
94 | calls it `sched_rt_entity`. We will name ours `struct sched_freezer_entity`.
95 | Again, make sure you only include what you need in this struct.
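
Again purely as an illustration (your bookkeeping needs may differ), the entity --
embedded in `task_struct` the same way `sched_rt_entity` is -- could be as small
as:

```c
/* e.g. in include/linux/sched.h, embedded as a member of struct task_struct */
struct sched_freezer_entity {
	struct list_head run_list;	/* links this task into a freezer_rq */
	unsigned int	 time_slice;	/* bookkeeping, if your policy needs it */
};
```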
96 |
97 | With all this setup, here is what the final picture looks like:
98 |
99 |
100 |
101 |
102 |
103 | In the picture above, the two structs on the far left represent a system with
104 | two CPUs. I colored these blue and green to distinguish them from each other,
105 | and to show that different `task_structs` linked on one `siblings` linked-list
106 | can run on separate CPUs.
107 |
108 | ## The `sched_class`
109 |
110 | At this point, we have set up the data structures, but we are still not done. We
111 | now need to implement the freezer functionality to let the kernel use freezer as
112 | a scheduler.
113 |
114 | Think about this situation: Say we have a CFS task about to return from main().
115 | The OS needs to call CFS `dequeue_task()` to remove it from the CFS queue. How
116 | can we ensure that the OS will call the CFS implementation of `dequeue_task()`?
117 | The answer is `struct sched_class`, defined in `linux/kernel/sched/sched.h`.
118 | Here is what the structure looks like:
119 |
120 | ```c
121 | struct sched_class {
122 |
123 | #ifdef CONFIG_UCLAMP_TASK
124 | int uclamp_enabled;
125 | #endif
126 |
127 | void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
128 | void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
129 | void (*yield_task) (struct rq *rq);
130 | bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
131 |
132 | void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
133 |
134 | struct task_struct *(*pick_next_task)(struct rq *rq);
135 |
136 | void (*put_prev_task)(struct rq *rq, struct task_struct *p);
137 | void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
138 |
139 | #ifdef CONFIG_SMP
140 | int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
141 | int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
142 | void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
143 |
144 | void (*task_woken)(struct rq *this_rq, struct task_struct *task);
145 |
146 | void (*set_cpus_allowed)(struct task_struct *p,
147 | const struct cpumask *newmask);
148 |
149 | void (*rq_online)(struct rq *rq);
150 | void (*rq_offline)(struct rq *rq);
151 | #endif
152 |
153 | void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
154 | void (*task_fork)(struct task_struct *p);
155 | void (*task_dead)(struct task_struct *p);
156 |
157 | /*
158 | * The switched_from() call is allowed to drop rq->lock, therefore we
159 | * cannot assume the switched_from/switched_to pair is serliazed by
160 | * rq->lock. They are however serialized by p->pi_lock.
161 | */
162 | void (*switched_from)(struct rq *this_rq, struct task_struct *task);
163 | void (*switched_to) (struct rq *this_rq, struct task_struct *task);
164 | void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
165 | int oldprio);
166 |
167 | unsigned int (*get_rr_interval)(struct rq *rq,
168 | struct task_struct *task);
169 |
170 | void (*update_curr)(struct rq *rq);
171 |
172 | #define TASK_SET_GROUP 0
173 | #define TASK_MOVE_GROUP 1
174 |
175 | #ifdef CONFIG_FAIR_GROUP_SCHED
176 | void (*task_change_group)(struct task_struct *p, int type);
177 | #endif
178 | } __aligned(STRUCT_ALIGNMENT); /* STRUCT_ALIGN(), vmlinux.lds.h */
179 | ```
180 |
181 | As you can see, `struct sched_class` contains many function pointers. When we
182 | add a new scheduling class, we create an instance of `struct sched_class` and
183 | set the function pointers to point to our implementation of these functions. If
184 | we look in the file `linux/kernel/sched/fair.c`, we see how CFS does it:
185 |
186 | ```c
187 | const struct sched_class fair_sched_class
188 | __section("__fair_sched_class") = {
189 | .enqueue_task = enqueue_task_fair,
190 | .dequeue_task = dequeue_task_fair,
191 | .yield_task = yield_task_fair,
192 | .yield_to_task = yield_to_task_fair,
193 |
194 | .check_preempt_curr = check_preempt_wakeup,
195 |
196 | .pick_next_task = __pick_next_task_fair,
197 | .put_prev_task = put_prev_task_fair,
198 | .set_next_task = set_next_task_fair,
199 |
200 | #ifdef CONFIG_SMP
201 | .balance = balance_fair,
202 | .select_task_rq = select_task_rq_fair,
203 | .migrate_task_rq = migrate_task_rq_fair,
204 |
205 | .rq_online = rq_online_fair,
206 | .rq_offline = rq_offline_fair,
207 |
208 | .task_dead = task_dead_fair,
209 | .set_cpus_allowed = set_cpus_allowed_common,
210 | #endif
211 |
212 | .task_tick = task_tick_fair,
213 | .task_fork = task_fork_fair,
214 |
215 | .prio_changed = prio_changed_fair,
216 | .switched_from = switched_from_fair,
217 | .switched_to = switched_to_fair,
218 |
219 | .get_rr_interval = get_rr_interval_fair,
220 |
221 | .update_curr = update_curr_fair,
222 |
223 | #ifdef CONFIG_FAIR_GROUP_SCHED
224 | .task_change_group = task_change_group_fair,
225 | #endif
226 |
227 | #ifdef CONFIG_UCLAMP_TASK
228 | .uclamp_enabled = 1,
229 | #endif
230 | };
231 | ```
232 |
233 | Note that the dot notation is a C99 feature that allows you to set specific
234 | fields of the struct by name in an initializer. This notation is also called
235 | [designated
236 | initializers](http://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). Also,
237 | not every function needs to be implemented. You will need to figure out what is
238 | and is not necessary. To see an example of a bare minimum scheduler, see the
239 | [idle_sched_class](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/idle.c#L487),
240 | which is the scheduling policy used when no other tasks are ready to be
241 | executed.
242 |
243 | As you can see, CFS initializes the `struct sched_class` function pointers to
244 | the CFS implementation. Two things of note here. First, the convention is to
245 | name the struct `<class>_sched_class`, so CFS names it `fair_sched_class`.
246 | Second, we name a particular class's functions as
247 | `<function>_<class>`. For example, the CFS implementation of
248 | `enqueue_task` is named `enqueue_task_fair`. Now, every time the kernel needs to call
249 | a function, it can simply call `p->sched_class-><function>`. Here, `p` is of
250 | the type `task_struct *`, `sched_class` is a pointer within the `task_struct`
251 | pointing to an instance of `struct sched_class`, and `<function>` points
252 | to the specific implementation of the function to be called.
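
Putting those conventions together, a freezer class declaration would follow the
same pattern as `fair_sched_class`. Which hooks you actually need is for you to
determine, so treat this as a sketch rather than a checklist (and note that the
`__section()` name, discussed next, would also have to be added to the linker
script):

```c
/* Hypothetical declaration; the *_freezer functions are your implementations. */
const struct sched_class freezer_sched_class
	__section("__freezer_sched_class") = {
	.enqueue_task	= enqueue_task_freezer,
	.dequeue_task	= dequeue_task_freezer,
	.pick_next_task	= pick_next_task_freezer,
	.put_prev_task	= put_prev_task_freezer,
	.set_next_task	= set_next_task_freezer,
	.task_tick	= task_tick_freezer,
	/* ... whatever else your policy requires ... */
};
```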
253 |
254 | One final thing: you may have noticed the `__section("__fair_sched_class")`
255 | macro in the declaration of `struct sched_class fair_sched_class`. When building
256 | the kernel, this allows the linker to place the `sched_class` structs contiguously in
257 | memory through the use of a linker script. A linker script describes how various
258 | sections in the input (source) files should be mapped into the output
259 | (binary/object) file, and controls the memory layout of the output file.
260 |
261 | We can see this in `linux/include/asm-generic/vmlinux.lds.h`:
262 |
263 | ```c
264 | /*
265 | * The order of the sched class addresses are important, as they are
266 | * used to determine the order of the priority of each sched class in
267 | * relation to each other.
268 | */
269 | #define SCHED_DATA \
270 | STRUCT_ALIGN(); \
271 | __begin_sched_classes = .; \
272 | *(__idle_sched_class) \
273 | *(__fair_sched_class) \
274 | *(__rt_sched_class) \
275 | *(__dl_sched_class) \
276 | *(__stop_sched_class) \
277 | __end_sched_classes = .;
278 | ```
279 |
280 | This effectively allows the kernel to treat the `sched_class` structs as part of
281 | an array of `sched_class`es. Each class in the array has lower priority
282 | than the one after it. In other words, `dl_sched_class` has a higher priority than
283 | `rt_sched_class`. Now, every time a new process needs to be scheduled, the
284 | kernel can simply go through the class array and check if there is a process of
285 | that class that needs to run. Let's take a look at this as implemented in
286 | `linux/kernel/sched/core.c`.
287 |
288 | ```c
289 | static inline struct task_struct *
290 | pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
291 | {
292 | const struct sched_class *class;
293 | struct task_struct *p;
294 |
295 | /* code omitted */
296 |
297 | for_each_class(class) {
298 | p = class->pick_next_task(rq);
299 | if (p)
300 | return p;
301 | }
302 |
303 | /* The idle class should always have a runnable task: */
304 | BUG();
305 | }
306 | ```
307 |
308 | This makes use of the `for_each_class()` macro, which takes advantage of the
309 | array structure of the `sched_class`'s. We can see this implementation in
310 | `linux/kernel/sched/sched.h`:
311 |
312 | ```c
313 | /* Defined in include/asm-generic/vmlinux.lds.h */
314 | extern struct sched_class __begin_sched_classes[];
315 | extern struct sched_class __end_sched_classes[];
316 |
317 | #define sched_class_highest (__end_sched_classes - 1)
318 | #define sched_class_lowest (__begin_sched_classes - 1)
319 |
320 | #define for_class_range(class, _from, _to) \
321 | for (class = (_from); class != (_to); class--)
322 |
323 | #define for_each_class(class) \
324 | for_class_range(class, sched_class_highest, sched_class_lowest)
325 | ```
326 |
327 | Essentially, when a process wants to relinquish its time on a CPU, `schedule()`
328 | gets called. Following the chain of calls in the kernel, `pick_next_task()`
329 | eventually gets called, and the OS will loop through each scheduling class by
330 | calling `for_each_class(class)`. Here, we call the `pick_next_task()` function
331 | of a particular instance of `struct sched_class`. If `pick_next_task()` returns
332 | `NULL`, the kernel will simply move on to the next class. If the kernel reaches
333 | the lowest priority class on the list (i.e. `idle_sched_class`) then there are
334 | no tasks to run and the CPU will go into idle mode.
335 |
--------------------------------------------------------------------------------
/E-Freezer/freezer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/freezer.png
--------------------------------------------------------------------------------
/E-Freezer/freezer_sched_class.md:
--------------------------------------------------------------------------------
1 | # Linux Scheduler Series
2 |
3 | > This guide is adapted from a blog series written by Mitchell Gouzenko, a
4 | > former OS TA. The original posts can be found
5 | > [here](https://mgouzenko.github.io/jekyll/update/2016/11/04/the-linux-process-scheduler.html).
6 | >
7 | > The code snippets and links in this post correspond to Linux v5.10.158.
8 |
9 | ## Introduction
10 |
11 | I'm writing this series after TA-ing an operating systems class for two
12 | semesters. Each year, tears begin to flow by the time we get to the infamous
13 | Scheduler Assignment - where students are asked to implement a round-robin
14 | scheduler in the Linux kernel. The assignment is known to leave relatively
15 | competent programmers in shambles. I don't blame them; the seemingly simple task
16 | of writing a round robin scheduler is complicated by two confounding factors:
17 |
18 | * The Linux scheduler is cryptic as hell and on top of that, very poorly
19 | documented.
20 | * Bugs in scheduler code will often trigger a kernel panic, freezing the system
21 | without providing any logs or meaningful error messages.
22 |
23 | I hope to ease students' suffering by addressing the first bullet point. In this
24 | series, I will explain how the scheduler's infrastructure is set up, emphasizing
25 | how one may leverage its modularity to plug in their own scheduler.
26 |
27 | We'll begin by examining the basic role of the core scheduler and how the rest
28 | of the kernel interfaces with it. Then, we'll look at `sched_class`, the modular
29 | data structure that permits various scheduling algorithms to live and operate
30 | side-by-side in the kernel.
31 |
32 | ### A Top Down Approach
33 |
34 | Initially, I'll treat the scheduler as a black box. I will make gross
35 | over-simplifications but note very clearly when I do so. Little by little, we
36 | will delve into the scheduler's internals, unfolding the truth behind these
37 | simplifications. By the end of this series, you should be able to start tackling
38 | the problem of writing your own working scheduler.
39 |
40 | ### Disclaimer
41 |
42 | I'm not an expert kernel hacker. I'm just a student who has spent a modest
43 | number of hours reading, screaming at, and sobbing over kernel code. If I make a
44 | mistake, please point it out in the comments section, and I'll do my best to
45 | correct it.
46 |
47 | ## What is the Linux Scheduler?
48 |
49 | Linux is a multi-tasking system. At any instant, there are many processes active
50 | at once, but a single CPU can only perform work on behalf of one process at a
51 | time. At a high level, Linux context switches from process to process, letting
52 | the CPU perform work on behalf of each one in turn. This switching occurs
53 | quickly enough to create the illusion that all processes are running
54 | simultaneously. **The scheduler is in charge of coordinating all of this
55 | switching.**
56 |
57 | ### Where Does the Scheduler Fit?
58 |
59 | You can find most scheduler-related code under `kernel/sched`. Now, the
60 | scheduler has a distinct and non-trivial job. The rest of the kernel doesn't
61 | know or care how the scheduler performs its magic, as long as it can be called
62 | upon to schedule tasks. So, to hide the complexity of the scheduler, it is
63 | invoked with a simple and well-defined API. The scheduler - from the perspective
64 | of the rest of the kernel - has two main responsibilities:
65 |
66 | * **Responsibility I**: Provide an interface to halt the currently running
67 | process and switch to a new one. To do so, the scheduler must pick the next
68 | process to run, which is a nontrivial problem.
69 |
70 | * **Responsibility II**: Indicate to the rest of the OS when a new process
71 | should be run.
72 |
73 | ### Responsibility I: Switching to the Next Process
74 |
75 | To fulfill its first responsibility, the scheduler must somehow keep track of
76 | all the running processes.
77 |
78 | ### The Runqueue
79 |
80 | Here's the first over-simplification: you can think of the scheduler as a system
81 | that maintains a simple queue of processes in the form of a linked list. The
82 | process at the head of the queue is allowed to run for some "time slice" - say,
83 | 10 milliseconds. After this time slice expires, the process is moved to the back
84 | of the queue, and the next process gets to run on the CPU for the same time
85 | slice. When a running process is forcibly stopped and taken off the CPU in this
86 | way, we say that it has been **preempted**. The linked list of processes waiting
87 | to have a go on the CPU is called the runqueue. Each CPU has its own runqueue,
88 | and a given process may appear on only one CPU's runqueue at a time. Processes
89 | CAN migrate between various CPUs' runqueues, but we'll save this discussion for
90 | later.
91 |
92 |
93 |
94 | Figure 1: An over-simplification of the runqueue
95 |
96 |
97 | The scheduler is not really this simple; the runqueue is defined in the kernel
98 | as `struct rq`, and you can take a peek at its definition
99 | [here](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/sched.h#L897).
100 | Spoiler alert: it's not a linked list! To be fair, the explanation that I gave
101 | above more or less describes the very first Linux runqueue. But over the years,
102 | the scheduler evolved to incorporate multiple scheduling algorithms. These
103 | include:
104 |
105 | * Completely Fair Scheduler (CFS)
106 | * Real-Time Scheduler
107 | * Deadline Scheduler
108 |
109 | The modern-day runqueue is no longer a linked list but actually a collection of
110 | algorithm-specific runqueues corresponding to the list above. Indeed,
111 | `struct rq` has the following members:
112 |
113 | ```c
114 | struct cfs_rq cfs; // CFS scheduler runqueue
115 | struct rt_rq rt; // Real-time scheduler runqueue
116 | struct dl_rq dl; // Deadline scheduler runqueue
117 | ```
118 |
119 | For example, CFS, which is the default scheduler in modern Linux kernels, uses a
120 | red-black tree data structure to keep track of processes, with each process
121 | assigned a "virtual runtime" that determines its priority in the scheduling
122 | queue. The scheduler then selects the process with the lowest virtual runtime to
123 | run next, ensuring that each process gets a fair share of CPU time over the long
124 | term.
125 |
126 | Keep these details in the back of your mind so that you don't get bogged down.
127 | Remember: the goal here is to understand how the scheduler interoperates with
128 | the rest of the kernel. The main takeaway is that a process is allowed to run
129 | for some time, and when that time expires, it gets preempted so that the next
130 | process can run.
131 |
132 | ### Preemption vs Yielding
133 |
134 | Preemption is not always the reason a process is taken off the CPU. For example,
135 | a process might voluntarily go to sleep, waiting for an IO event or lock. To do
136 | this, the process puts itself on a "wait queue" and takes itself off the
137 | runqueue. In this case, the process has **yielded** the CPU. In summary:
138 |
139 | * "preemption" is when a process is forcibly kicked off the CPU.
140 |
141 | * "yielding" is when a process voluntarily gives up the CPU.
142 |
143 | In addition to an expired timeslice, there are several other reasons that
144 | preemption may occur. For example, when an interrupt occurs, the CPU may be
145 | preempted to handle the interrupt. Additionally, a real-time process may have a
146 | higher priority than some other process and may preempt lower-priority processes
147 | to ensure that it meets its deadline.
148 |
149 | ### `schedule()`
150 |
151 | With a conceptual understanding of the runqueue, we now have the background to
152 | understand how Responsibility I is carried out by the scheduler. The
153 | `schedule()` function is the crux of Responsibility I: it halts the currently
154 | running process and runs the next one on the CPU. This function is referred to
155 | by many texts as "the entrypoint into the scheduler". `schedule()` invokes
156 | `__schedule()` to do most of the real work. Here is the portion relevant to us:
157 |
158 | ```c
159 | static void __sched notrace __schedule(bool preempt)
160 | {
161 | struct task_struct *prev, *next;
162 | unsigned long *switch_count;
163 | struct rq *rq;
164 |
165 | /* CODE OMITTED */
166 |
167 | next = pick_next_task(rq, prev, &rf);
168 | clear_tsk_need_resched(prev);
169 | clear_preempt_need_resched();
170 |
171 | if (likely(prev != next)) {
172 | rq->nr_switches++;
173 | RCU_INIT_POINTER(rq->curr, next);
174 | ++*switch_count;
175 |
176 | psi_sched_switch(prev, next, !task_on_rq_queued(prev));
177 | trace_sched_switch(preempt, prev, next);
178 | rq = context_switch(rq, prev, next, &rf);
179 | }
180 |
181 | /* CODE OMITTED */
182 | }
183 | ```
184 |
185 | `pick_next_task()` looks at the runqueue `rq` and returns the `task_struct`
186 | associated with the process that should be run next. If we consider t=10 in
187 | Figure 1, `pick_next_task()` would return the `task_struct` for Process 2. Then,
188 | `context_switch()` switches the CPU's state to that of the returned
189 | `task_struct`. This fulfills Responsibility I.
190 |
191 | ## Responsibility II: When Should the Next Process Run?
192 |
193 | Great, so we've seen that `schedule()` is used to context switch to the next
194 | task. But when does this *actually* happen?
195 |
196 | As mentioned previously, a user-space program might voluntarily go to sleep
197 | waiting for an IO event or a lock. In this case, the kernel will call
198 | `schedule()` on behalf of the process that needs to sleep. But what if the
199 | user-space program never sleeps? Here's one such program:
200 |
201 | ```c
202 | int main()
203 | {
204 | while(1);
205 | }
206 | ```
207 |
208 | If `schedule()` were only called when a user-space program voluntarily sleeps,
209 | then programs like the one above would use up the processor indefinitely. Thus,
210 | we need a mechanism to preempt processes that have exhausted their time slice!
211 |
212 | This preemption is accomplished via the timer interrupt. The timer interrupt
213 | fires periodically, allowing control to jump to the timer interrupt handler in
214 | the kernel. This handler calls the function `update_process_times()`, shown
215 | below.
216 |
217 | ```c
218 | /*
219 | * Called from the timer interrupt handler to charge one tick to the current
220 | * process. user_tick is 1 if the tick is user time, 0 for system.
221 | */
222 | void update_process_times(int user_tick)
223 | {
224 | struct task_struct *p = current;
225 |
226 | PRANDOM_ADD_NOISE(jiffies, user_tick, p, 0);
227 |
228 | /* Note: this timer irq context must be accounted for as well. */
229 | account_process_tick(p, user_tick);
230 | run_local_timers();
231 | rcu_sched_clock_irq(user_tick);
232 | #ifdef CONFIG_IRQ_WORK
233 | if (in_irq())
234 | irq_work_tick();
235 | #endif
236 | scheduler_tick();
237 | if (IS_ENABLED(CONFIG_POSIX_TIMERS))
238 | run_posix_cpu_timers();
239 | }
240 | ```
241 |
242 | Notice how `update_process_times()` invokes `scheduler_tick()`. In
243 | `scheduler_tick()`, the scheduler checks to see if the running process's time
244 | has expired. If so, it sets a (over-simplification alert) per-CPU flag called
245 | `need_resched`. This indicates to the rest of the kernel that `schedule()`
246 | should be called. In our simplified example, `scheduler_tick()` would set this
247 | flag when the current process has been running for 10 milliseconds or more.
248 |
249 | But wait, why the heck can't `scheduler_tick()` just call `schedule()` by
250 | itself, from within the timer interrupt? After all, if the scheduler knows that
251 | a process's time has expired, shouldn't it just context switch right away?
252 |
253 | As it turns out, it is not always safe to call `schedule()`. In particular, if
254 | the currently running process is holding a spinlock in the kernel, it cannot be
255 | put to sleep from the interrupt handler. (Let me repeat that one more time
256 | because people always forget: **you cannot sleep with a spinlock.** Sleeping
257 | with a spinlock may cause the kernel to deadlock, and will bring you anguish
258 | for many hours when you can't figure out why your system has hung.)
259 |
260 | When the scheduler sets the `need_resched` flag, it's really saying, "please
261 | dearest kernel, invoke `schedule()` at your earliest convenience." The kernel
262 | keeps a count of how many spinlocks the currently running process has acquired.
263 | When that count goes down to 0, the kernel knows that it's okay to put the
264 | process to sleep. The kernel checks the `need_resched` flag in two main places:
265 |
266 | * when returning from an interrupt handler
267 |
268 | * when returning to user-space from a system call
269 |
270 | If `need_resched` is set and the spinlock count is 0, then the kernel calls
271 | `schedule()`. With our simple linked-list runqueue, this delayed invocation of
272 | `schedule()` implies that a process can possibly run for a bit longer than its
273 | timeslice. We're cool with that: it's always safe to call `schedule()` when the
274 | kernel is about to return to user-space, because user-space code is always
275 | allowed to sleep, and by that point the kernel must have released every spinlock
276 | it acquired. This means that there won't be a large delay between when
277 | `need_resched` is set and when `schedule()` gets called.
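
To make this concrete, here's a rough sketch (not actual kernel code) of the
check that happens on those return paths. The real logic lives in
architecture-specific entry code and in helpers such as `preempt_schedule_irq()`:

```c
/* Illustrative sketch only -- the real checks are spread across the entry code. */
if (need_resched() && !preempt_count()) {
	/*
	 * need_resched() tests the TIF_NEED_RESCHED flag; preempt_count() is
	 * nonzero while spinlocks are held, so we only reschedule once the
	 * count has dropped back to zero.
	 */
	schedule();
}
```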
279 |
280 | ## Understanding `sched_class`
281 |
282 | In this section, I will analyze `struct sched_class` and talk briefly about what
283 | most of the functions do. I've reproduced `struct sched_class` below.
284 |
285 | ```c
286 | struct sched_class {
287 |
288 | #ifdef CONFIG_UCLAMP_TASK
289 | int uclamp_enabled;
290 | #endif
291 |
292 | void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
293 | void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
294 | void (*yield_task) (struct rq *rq);
295 | bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
296 |
297 | void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
298 |
299 | struct task_struct *(*pick_next_task)(struct rq *rq);
300 |
301 | void (*put_prev_task)(struct rq *rq, struct task_struct *p);
302 | void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
303 |
304 | #ifdef CONFIG_SMP
305 | int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
306 | int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
307 | void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
308 |
309 | void (*task_woken)(struct rq *this_rq, struct task_struct *task);
310 |
311 | void (*set_cpus_allowed)(struct task_struct *p,
312 | const struct cpumask *newmask);
313 |
314 | void (*rq_online)(struct rq *rq);
315 | void (*rq_offline)(struct rq *rq);
316 | #endif
317 |
318 | void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
319 | void (*task_fork)(struct task_struct *p);
320 | void (*task_dead)(struct task_struct *p);
321 |
322 | /*
323 | * The switched_from() call is allowed to drop rq->lock, therefore we
324 | * cannot assume the switched_from/switched_to pair is serliazed by
325 | * rq->lock. They are however serialized by p->pi_lock.
326 | */
327 | void (*switched_from)(struct rq *this_rq, struct task_struct *task);
328 | void (*switched_to) (struct rq *this_rq, struct task_struct *task);
329 | void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
330 | int oldprio);
331 |
332 | unsigned int (*get_rr_interval)(struct rq *rq,
333 | struct task_struct *task);
334 |
335 | void (*update_curr)(struct rq *rq);
336 |
337 | #define TASK_SET_GROUP 0
338 | #define TASK_MOVE_GROUP 1
339 |
340 | #ifdef CONFIG_FAIR_GROUP_SCHED
341 | void (*task_change_group)(struct task_struct *p, int type);
342 | #endif
343 | } __aligned(STRUCT_ALIGNMENT); /* STRUCT_ALIGN(), vmlinux.lds.h */
344 | ```
345 |
346 | ### `enqueue_task()` and `dequeue_task()`
347 |
348 | ```c
349 | /* Called to enqueue task_struct p on runqueue rq. */
350 | void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
351 |
352 | /* Called to dequeue task_struct p from runqueue rq. */
353 | void dequeue_task(struct rq *rq, struct task_struct *p, int flags);
354 | ```
355 |
356 | `enqueue_task()` and `dequeue_task()` are used to put a task on the runqueue and
357 | remove a task from the runqueue, respectively.
358 |
359 | These functions are called for a variety of reasons:
360 |
361 | * When a child process is first forked, `enqueue_task()` is called to put it on
362 | a runqueue. When a process exits, `dequeue_task()` takes it off the runqueue.
363 |
364 | * When a process goes to sleep, `dequeue_task()` takes it off the runqueue. For
365 | example, this happens when the process needs to wait for a lock or IO event.
366 | When the IO event occurs, or the lock becomes available, the process wakes up.
367 | It must then be re-enqueued with `enqueue_task()`.
368 |
369 | * Process migration - if a process must be migrated from one CPU's runqueue to
370 | another, it's dequeued from its old runqueue and enqueued on a different one
371 | using these functions.
372 |
373 | * When `set_cpus_allowed()` is called to change the task's processor affinity,
374 | it may need to be enqueued on a different CPU's runqueue.
375 |
376 | * When the priority of a process is boosted to avoid priority inversion. In this
377 | case, the task used to have a low-priority `sched_class`, but is being
378 | promoted to a `sched_class` with high priority. This action occurs in
379 | `rt_mutex_setprio()`.
380 |
381 | * From `__sched_setscheduler`. If a task's `sched_class` has changed, it's
382 | dequeued using its old `sched_class` and enqueued with the new one.
383 |
384 | Each of these functions is passed the task to be enqueued/dequeued, as well as
385 | the runqueue it should be added to/removed from. In addition, these functions
386 | are given a bit vector of flags that describe *why* enqueue or dequeue is being
387 | called. Here are the various flags, which are described in
388 | [sched.h](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/sched.h#L1743):
389 |
390 | ```c
391 | /*
392 | * {de,en}queue flags:
393 | *
394 | * DEQUEUE_SLEEP - task is no longer runnable
395 | * ENQUEUE_WAKEUP - task just became runnable
396 | *
397 | * SAVE/RESTORE - an otherwise spurious dequeue/enqueue, done to ensure tasks
398 | * are in a known state which allows modification. Such pairs
399 | * should preserve as much state as possible.
400 | *
401 | * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
402 | * in the runqueue.
403 | *
404 | * ENQUEUE_HEAD - place at front of runqueue (tail if not specified)
405 | * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
406 | * ENQUEUE_MIGRATED - the task was migrated during wakeup
407 | *
408 | */
409 | ```
410 |
411 | The `flags` argument can be tested using the bitwise `&` operator. For example,
412 | if the task was just migrated from another CPU, `flags & ENQUEUE_MIGRATED`
413 | evaluates to a nonzero value.
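
For example, a hypothetical scheduling class built around the simple linked-list
runqueue from earlier in these notes might implement `enqueue_task()` roughly
like this (`my_rq`, `my_node`, and `my_nr_running` are made-up members, not real
kernel fields):

```c
/* Hypothetical sketch for a simple list-based runqueue. */
static void enqueue_task_myclass(struct rq *rq, struct task_struct *p, int flags)
{
	if (flags & ENQUEUE_HEAD)
		list_add(&p->my_node, &rq->my_rq.head);		/* front of the queue */
	else
		list_add_tail(&p->my_node, &rq->my_rq.head);	/* default: tail */

	rq->my_rq.my_nr_running++;
	add_nr_running(rq, 1);	/* keep the core scheduler's nr_running in sync */
}
```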
414 |
415 | ### `pick_next_task()`
416 |
417 | ```c
418 | /* Pick the task that should be currently running. */
419 | struct task_struct *pick_next_task(struct rq *rq);
420 | ```
421 |
422 | `pick_next_task()` is called by the core scheduler to determine which of `rq`'s
423 | tasks should be running. The name is a bit misleading: This function is not
424 | supposed to return the task that should run *after* the currently running task;
425 | instead, it's supposed to return the `task_struct` that should be running now,
426 | **in this instant.**
427 |
428 | The kernel will context switch from the currently running task to the task
429 | returned by `pick_next_task()`.
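
Continuing the hypothetical linked-list class from above, a minimal
`pick_next_task()` might look like this (again, `my_rq` and `my_node` are
made-up members):

```c
/* Hypothetical sketch: the head of the list is what should run right now. */
static struct task_struct *pick_next_task_myclass(struct rq *rq)
{
	if (list_empty(&rq->my_rq.head))
		return NULL;	/* nothing runnable here; a lower class gets a chance */

	return list_first_entry(&rq->my_rq.head, struct task_struct, my_node);
}
```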
430 |
431 | ### `put_prev_task()`
432 |
433 | ```c
434 | /* Called right before p is going to be taken off the CPU. */
435 | void put_prev_task(struct rq *rq, struct task_struct *p);
436 | ```
437 |
438 | `put_prev_task()` is called whenever a task is to be taken off the CPU. The
439 | behavior of this function is up to the specific `sched_class`. Some schedulers
440 | do very little in this function. For example, the realtime scheduler uses this
441 | function as an opportunity to perform simple bookkeeping. On the other hand,
442 | CFS's `put_prev_task_fair()` needs to do a bit more work. As an optimization,
443 | CFS keeps the currently running task out of its RB tree. It uses the
444 | `put_prev_task` hook as an opportunity to put the currently running task (that
445 | is, the task specified by `p`) back in the RB tree.
446 |
447 | The sched_class's `put_prev_task` is called by the function `put_prev_task()`,
448 | which is
449 | [defined](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/sched.h#L1841)
450 | in `sched.h`. `put_prev_task()` gets called in the core scheduler's
451 | `pick_next_task()`, after the policy-specific `pick_next_task()` implementation
452 | is called, but before any context switch is performed. This gives us an
453 | opportunity to perform any operations we need to do to move on from the
454 | previously running task in our scheduler implementations.
455 |
456 | Note that this was not the case in older kernels: the `sched_class`'s
457 | `pick_next_task()` was expected to call `put_prev_task()` by itself! This is
458 | documented in the following
459 | [comment](https://elixir.bootlin.com/linux/v4.9.330/source/kernel/sched/sched.h#L1241)
460 | in an earlier Linux version (4.9). Before that (3.11), `put_prev_task` actually
461 | [used to be called](https://elixir.bootlin.com/linux/v3.11/source/kernel/sched/core.c#L2445)
462 | by the core scheduler before it called `pick_next_task`.
463 |
464 | ### `task_tick()`
465 |
466 | ```c
467 | /* Called from the timer interrupt handler. p is the currently running task
468 | * and rq is the runqueue that it's on.
469 | */
470 | void task_tick(struct rq *rq, struct task_struct *p, int queued);
471 | ```
472 |
473 | This is one of the most important scheduler functions. It is called whenever a
474 | timer interrupt happens, and its job is to perform bookkeeping and set the
475 | `need_resched` flag if the currently-running process needs to be preempted.
476 |
477 | The `need_resched` flag can be set by the function `resched_curr()`,
478 | [found](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/core.c#L608)
479 | in core.c:
480 |
481 | ```c
482 | /* Mark rq's currently-running task 'to be rescheduled now'. */
483 | void resched_curr(struct rq *rq)
484 | ```
485 |
486 | With SMP, there's a `need_resched` flag for every CPU. Thus, `resched_curr()`
487 | might involve sending an APIC inter-processor interrupt to another processor
488 | (you don't want to go here). The takeaway is that you should just use
489 | `resched_curr()` to set `need_resched`, and don't try to do this yourself.
490 |
491 | Note: in prior kernel versions, `resched_curr()` used to be called
492 | `resched_task()`.
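
As a sketch, a round-robin policy with a 10 ms time slice (like the simplified
example from earlier) might implement `task_tick()` along these lines.
`my_slice_end` and `update_curr_myclass()` are hypothetical:

```c
/* Hypothetical sketch: preempt the current task once its slice is used up. */
static void task_tick_myclass(struct rq *rq, struct task_struct *p, int queued)
{
	update_curr_myclass(rq);	/* hypothetical stats helper; see update_curr() below */

	if (time_after_eq(jiffies, p->my_slice_end)) {
		p->my_slice_end = jiffies + msecs_to_jiffies(10);
		resched_curr(rq);	/* ask the kernel to call schedule() soon */
	}
}
```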
493 |
494 | ### `select_task_rq()`
495 |
496 | ```c
497 | /* Returns an integer corresponding to the CPU that this task should run on */
498 | int select_task_rq(struct task_struct *p, int task_cpu, int sd_flag, int flags);
499 | ```
500 |
501 | The core scheduler invokes this function to figure out which CPU to assign a
502 | task to. This is used for distributing processes across multiple CPUs; the core
503 | scheduler will call `enqueue_task()`, passing the runqueue corresponding to the
504 | CPU that is returned by this function. CPU assignment obviously occurs when a
505 | process is first forked, but CPU reassignment can happen for a large variety of
506 | reasons. Here are some instances where `select_task_rq()` is called:
507 |
508 | * When a process is first forked.
509 |
510 | * When a task is woken up after having gone to sleep.
511 |
512 | * In response to any of the syscalls in the exec family. This is an
513 | optimization, since it doesn't hurt the cache to migrate a process that's
514 | about to call exec.
515 |
516 | * And many more places...
517 |
518 | You can check *why* `select_task_rq` was called by looking at `sd_flag`.
519 |
520 | For instance, `sd_flag == SD_BALANCE_FORK` whenever `select_task_rq()` is called
521 | to determine the CPU of a newly forked task. You can find all possible values of
522 | `sd_flag`
523 | [here](https://elixir.bootlin.com/linux/v5.10.158/source/include/linux/sched/sd_flags.h).
524 |
525 | Note that `select_task_rq()` should return a CPU that `p` is allowed to run on.
526 | Each `task_struct` has a
527 | [member](https://elixir.bootlin.com/linux/v5.10.158/source/include/linux/sched.h#L728)
528 | called `cpus_mask`, of type `cpumask_t`. This member represents the task's CPU
529 | affinity - i.e. which CPUs it can run on. It's possible to iterate over these
530 | CPUs with the macro `for_each_cpu()`, defined
531 | [here](https://elixir.bootlin.com/linux/v5.10.158/source/include/linux/cpumask.h#L263).
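
For instance, a hypothetical implementation that keeps a task where it is when
possible, and otherwise falls back to any CPU the task's affinity mask allows,
could look like this:

```c
/* Hypothetical sketch; real policies also consider load and cache topology. */
static int select_task_rq_myclass(struct task_struct *p, int cpu, int sd_flag, int flags)
{
	/* Stay put if the task may still run on its current CPU. */
	if (cpumask_test_cpu(cpu, &p->cpus_mask))
		return cpu;

	/* Otherwise, pick some CPU from the task's affinity mask. */
	return cpumask_any(&p->cpus_mask);
}
```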
532 |
533 | ### `set_next_task()`
534 |
535 | ```c
536 | /* Called when a task changes its scheduling class or changes its task group. */
537 | void set_next_task(struct rq *rq, struct task_struct *p, bool first);
538 | ```
539 |
540 | This function is called in the following instances:
541 |
542 | * When the current task's CPU affinity changes.
543 |
544 | * When the current task's priority, nice value, or scheduling policy changes.
545 |
546 | * When the current task's task group changes.
547 |
548 | This function was previously called `set_curr_task()`, but was changed to better
549 | match `put_prev_task()`. Several scheduling policies also call `set_next_task()`
550 | in their implementations of `pick_next_task()`. An [old kernel
551 | commit](https://lore.kernel.org/all/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com/T/#m2632708495575d24c1a5c54f7295836a907d3d53)
552 | claims that `pick_next_task()` implies `set_next_task()`, but `pick_next_task()`
553 | technically shouldn't modify any state. In practice, this means that
554 | `set_next_task()` ends up just updating some of the task's metadata.
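
For a simple class, that metadata update can be as small as restarting the
runtime accounting clock. A hedged sketch, modeled loosely on what the realtime
class does:

```c
/* Sketch: mark when the newly chosen current task started running. */
static void set_next_task_myclass(struct rq *rq, struct task_struct *p, bool first)
{
	p->se.exec_start = rq_clock_task(rq);
}
```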
555 |
556 | ### `yield_task()`
557 |
558 | ```c
559 | /* Called when the current task yields the cpu */
560 | void yield_task(struct rq *rq);
561 | ```
562 |
563 | `yield_task()` is used when the current process voluntarily yields its remaining
564 | time on the CPU. Its implementation is usually very simple, as you can see in
565 | [rt](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/rt.c#L1434),
566 | which simply requeues the current task.
567 |
568 | This function is called when a process calls the `sched_yield()` syscall to
569 | relinquish the control of the processor voluntarily. `schedule()` is then
570 | called.
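
A hypothetical list-based class could implement it by rotating the current task
to the back of the queue (`my_node` and `my_rq` are made-up members):

```c
/* Hypothetical sketch: give up the rest of the slice by moving to the tail. */
static void yield_task_myclass(struct rq *rq)
{
	list_move_tail(&rq->curr->my_node, &rq->my_rq.head);
}
```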
571 |
572 | ### `check_preempt_curr()`
573 |
574 | ```c
575 | /* Preempt the current task with a newly woken task if needed */
576 | void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
577 | ```
578 |
579 | When a new task enters a runnable state, this function is called to check if
580 | that task should preempt the currently running task. For instance, when a new
581 | task is created, it is initially woken up with `wake_up_new_task()`, which
582 | (among other things) places the task on the runqueue, calls the generic
583 | `check_preempt_curr()`, and calls the `sched_class->task_woken()` function if it
584 | exists.
585 |
586 | The generic `check_preempt_curr()` function does the following:
587 |
588 | ```c
589 | void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
590 | {
591 | if (p->sched_class == rq->curr->sched_class)
592 | rq->curr->sched_class->check_preempt_curr(rq, p, flags);
593 | else if (p->sched_class > rq->curr->sched_class)
594 | resched_curr(rq);
595 |
596 | /* CODE OMITTED */
597 | }
598 | ```
599 |
600 | This handles both the case where the new task is in the same scheduling class
601 | as the current task (by delegating to that class's callback) and the case where
602 | the new task belongs to a higher-priority scheduling class.
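
Within a class, the callback typically compares the woken task against the
current one and calls `resched_curr()` if the newcomer should win. A sketch
using a hypothetical per-task priority field `my_prio` (lower value = higher
priority):

```c
/* Hypothetical sketch; my_prio is not a real task_struct field. */
static void check_preempt_curr_myclass(struct rq *rq, struct task_struct *p, int flags)
{
	if (p->my_prio < rq->curr->my_prio)
		resched_curr(rq);	/* the current task should be preempted soon */
}
```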
603 |
604 | ### `balance()`
605 |
606 | ```c
607 | int balance(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
608 | ```
609 |
610 | `balance()` implements various scheduler load-balancing mechanisms, which are
611 | meant to distribute the load across processors more evenly using policy-specific
612 | heuristics. It returns `1` if there is a runnable task of that `sched_class`'s
613 | priority or higher after load balancing occurs, and `0` otherwise.
614 |
615 | `balance()` is called in `put_prev_task_balance()` (which is called in
616 | `pick_next_task()`) as follows:
617 |
618 | ```c
619 | static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
620 | struct rq_flags *rf)
621 | {
622 | #ifdef CONFIG_SMP
623 | const struct sched_class *class;
624 | /*
625 | * We must do the balancing pass before put_prev_task(), such
626 | * that when we release the rq->lock the task is in the same
627 | * state as before we took rq->lock.
628 | *
629 | * We can terminate the balance pass as soon as we know there is
630 | * a runnable task of @class priority or higher.
631 | */
632 | for_class_range(class, prev->sched_class, &idle_sched_class) {
633 | if (class->balance(rq, prev, rf))
634 | break;
635 | }
636 | #endif
637 |
638 | put_prev_task(rq, prev);
639 | }
640 | ```
641 |
642 | The main idea is to prevent any runqueue from becoming empty, as this is a waste
643 | of resources. This loop starts with the currently running task's `sched_class`
644 | and uses the `balance()` callbacks to check if there are runnable tasks of that
645 | `sched_class`'s priority _or higher_. Notably, each class's implementation of
646 | `balance()` also reports whether any higher-priority class has runnable tasks.
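
As a rough sketch, a class that does no migration of its own could simply report
whether this runqueue already has something runnable at its priority or above.
Helpers like `sched_rt_runnable()` live in `kernel/sched/sched.h`;
`my_rq_runnable()` is hypothetical:

```c
/* Hypothetical sketch for a class that sits just below rt in priority. */
static int balance_myclass(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	return sched_stop_runnable(rq) || sched_dl_runnable(rq) ||
	       sched_rt_runnable(rq) || my_rq_runnable(rq);
}
```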
647 |
648 | ### `update_curr()`
649 |
650 | ```c
651 | /* Update the current task's runtime statistics. */
652 | void update_curr(struct rq *rq);
653 | ```
654 |
655 | This function updates the current task's stats such as the total execution time.
656 | Implementing this function allows commands like `ps` and `htop` to display
657 | accurate statistics. The implementations of this function typically share a
658 | common segment across the different scheduling classes. This function is
659 | typically called in other `sched_class` functions to facilitate accurate
660 | reporting of statistics.
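
A simplified sketch, based loosely on what the rt and dl classes do (the real
versions have more checks and update more counters):

```c
/* Sketch: charge the elapsed time to the current task and restart the clock. */
static void update_curr_myclass(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	u64 now = rq_clock_task(rq);
	s64 delta_exec = now - curr->se.exec_start;

	if (unlikely(delta_exec <= 0))
		return;

	curr->se.sum_exec_runtime += delta_exec;
	curr->se.exec_start = now;
}
```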
661 |
662 | ### `prio_changed()`
663 |
664 | ```c
665 | /* Called when the task's priority has changed. */
666 | void prio_changed(struct rq *rq, struct task_struct *p, int oldprio)
667 | ```
668 |
669 | This function is called whenever a task's priority changes, but the
670 | `sched_class` remains the same (you can verify this by checking where the
671 | function pointer is called). This can occur through various syscalls which
672 | modify the `nice` value, the priority, or other scheduler attributes.
673 |
674 | In a scheduler class with priorities, this function will typically check if the
675 | task whose priority changed needs to preempt the currently running task (or,
676 | if it *is* the currently running task, whether it should now be preempted).
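
A sketch modeled loosely on what CFS does (whether this is right for your class
depends on its policy):

```c
/* Sketch: react to a priority change for a task in this class. */
static void prio_changed_myclass(struct rq *rq, struct task_struct *p, int oldprio)
{
	if (p == rq->curr) {
		/* The running task's priority got worse; let others compete. */
		if (p->prio > oldprio)
			resched_curr(rq);
	} else {
		/* A waiting task's priority changed; maybe it should run now. */
		check_preempt_curr(rq, p, 0);
	}
}
```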
677 |
678 | ### `switched_to()`
679 |
680 | ```c
681 | /* Called when a task gets switched to this scheduling class. */
682 | void switched_to(struct rq *rq, struct task_struct *p);
683 | ```
684 |
685 | `switched_to()` (and its optional counterpart, `switched_from()`) are called
686 | from `check_class_changed()`:
687 |
688 | ```c
689 | static inline void check_class_changed(struct rq *rq, struct task_struct *p,
690 | const struct sched_class *prev_class,
691 | int oldprio)
692 | {
693 | if (prev_class != p->sched_class) {
694 | if (prev_class->switched_from)
695 | prev_class->switched_from(rq, p);
696 |
697 | p->sched_class->switched_to(rq, p);
698 | } else if (oldprio != p->prio || dl_task(p))
699 | p->sched_class->prio_changed(rq, p, oldprio);
700 | }
701 | ```
702 |
703 | `check_class_changed()` gets called from syscalls that modify scheduler
704 | parameters.
705 |
706 | For scheduler classes like
707 | [rt](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/rt.c#L2303)
708 | and
709 | [dl](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/deadline.c#L2456),
710 | the main concern when a task switches over to their policy is that their
711 | runqueue could become overloaded. In that case, they try to push some tasks to
712 | other runqueues.
713 |
714 | However, for lower priority scheduler classes, like CFS, where overloading is
715 | not an issue, `switched_to()` just ensures that the task gets to run, and
716 | preempts the current task (which may be of a lower-priority policy) if
717 | necessary.
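
A sketch of that lower-priority behavior, using the helpers discussed above (the
real CFS version also deals with throttling and group scheduling):

```c
/* Sketch: a task just joined this class; make sure it gets considered. */
static void switched_to_myclass(struct rq *rq, struct task_struct *p)
{
	if (!task_on_rq_queued(p))
		return;

	if (rq->curr == p)
		resched_curr(rq);	/* re-evaluate against the other classes */
	else
		check_preempt_curr(rq, p, 0);
}
```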
718 |
--------------------------------------------------------------------------------
/E-Freezer/naive.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/naive.png
--------------------------------------------------------------------------------
/E-Freezer/task_struct.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/task_struct.png
--------------------------------------------------------------------------------
/F-Farfetchd/Page_table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/F-Farfetchd/Page_table.png
--------------------------------------------------------------------------------
/F-Farfetchd/index.md:
--------------------------------------------------------------------------------
1 | # Understanding Page Tables in Linux
2 | The goal of this recitation is to provide a high-level overview of x86/arm64 paging as
3 | well as the data structures and functions the Linux kernel uses to manipulate
4 | page tables.
5 |
6 | Before we dive into the details, here's a quick refresher on what a page table
7 | does and how it does it. Page tables allow the operating system to virtualize
8 | the entire address space for each process by providing a mapping between virtual
9 | addresses and physical addresses. What is a virtual address? Every address that
10 | application code references is a virtual address. For the most part this is true
11 | of kernel code as well, except for code that executes before paging is enabled
12 | (think bootloader). Once paging is enabled in the CPU it cannot be disabled.
13 | Every address referenced by an application is translated transparently by the
14 | hardware via the page table.
15 |
16 | A virtual address is usually different from the physical address it maps to, but
17 | it is also possible for virtual addresses to be "identity" mapped to their
18 | corresponding physical address or mapped to no address at all in cases where the
19 | kernel swapped out a page or never allocated one in the first place.
20 |
21 | ## The shape of a page table
22 | Recall that the structure of the page table is rigidly specified by the CPU
23 | architecture. This is necessary since the CPU hardware directly traverses the
24 | page table to transparently map virtual addresses to physical addresses. The
25 | hardware does this by using the virtual address as a set of indices into the
26 | page table.
27 |
28 | ![Structure of a 64-bit x86 virtual address](x86_address_structure.png)
29 |
32 | [Source](https://os.phil-opp.com/page-tables)
33 |
34 | This diagram shows how the bits of a 64-bit virtual address specify the indices
35 | into a 4-level x86 page table (you can expect something similar in arm64).
36 | With 4-level paging only bits 0 through 47 are used. Bits 48 through 63 are sign
37 | extension bits that must match bit 47; this prevents clever developers from
38 | stuffing extra information into addresses that might interfere with future addressing
39 | schemes, like 5-level page tables.
40 |
41 | As you can see, there are 9 bits specifying the index into each page table and a
42 | 12-bit offset into the physical page frame. Since 2^9 = 512, each level of the
43 | page table has 512 64-bit (8-byte) entries, which fit perfectly in one 4096-byte frame.
44 | The last 12 bits allow the virtual address to specify a specific byte offset
45 | within the page frame.
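
To make the split concrete, here is how the indices could be extracted from a
virtual address by hand (illustration only; kernel code should use the
`*_index()`/`*_offset()` macros described later instead):

```C
unsigned long vaddr = 0x00007f12345678abUL;	/* some example user address */

unsigned long p4_index = (vaddr >> 39) & 0x1ff;	/* bits 39-47 */
unsigned long p3_index = (vaddr >> 30) & 0x1ff;	/* bits 30-38 */
unsigned long p2_index = (vaddr >> 21) & 0x1ff;	/* bits 21-29 */
unsigned long p1_index = (vaddr >> 12) & 0x1ff;	/* bits 12-20 */
unsigned long offset   =  vaddr        & 0xfff;	/* bits 0-11  */
```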
46 |
47 | ![64-bit x86 4-level page table translation](x86_paging_64bit.png)
48 |
51 | For clarity, we're using the naming scheme in the diagram (P4, P3,...), which
52 | differs slightly from the names used in Linux. Linux also implements 5-level
53 | paging, which we describe in the next section. For reference, here is how the
54 | names above mapped to Linux before it added 5-level paging:
55 |
56 | ```
57 | Diagram: P4 -> P3 -> P2 -> P1 -> page frame
58 | Linux: PGD -> PUD -> PMD -> PTE -> page frame
59 | ```
60 |
61 | Each entry in the P4 table is the address of a *different* P3 table such that
62 | each page table can have up to 512 different P3 tables. In turn, each entry in a
63 | P3 table points to a different P2 table, so there can be up to 512 * 512 = 262,144
64 | P2 tables. Since most of the virtual address space is unused for a given
65 | process, the kernel does not allocate frames for most of the intermediary tables
66 | comprising the page table.
67 |
68 | I've been using the word frame since each page table entry (in any table, P4,
69 | P3, etc) is the address of a physical 4096 byte memory frame (except for huge
70 | pages). These addresses cannot be virtual addresses since the hardware accesses
71 | them directly to translate virtual addresses to physical addresses. Furthermore,
72 | since frames are page aligned, the last 12 bits of a frame address are all
73 | zeros. The hardware takes advantage of this by using these bits to store
74 | information about the frame in its page table entry.
75 |
76 | ![Page table entry flag bits](Page_table.png)
77 |
80 | [Source](https://wiki.osdev.org/Paging)
81 |
82 | This diagram shows the bit flags for a page table entry, which in our diagram
83 | above corresponds to an entry in the P1 table. A similar but slightly different
84 | set of flags exists for the intermediary entries as well.
85 |
86 | ## Working with page tables in Linux
87 | In this section we'll take a look at the data structures and functions the Linux
88 | kernel defines to manage page tables.
89 |
90 | ### The fifth level of Dante's page table
91 | We just finished discussing 4-level page tables, but Linux actually implements
92 | 5-level paging and exposes a 5-level paging interface, even when the kernel is
93 | built with 5-level paging disabled. Luckily, 5-level paging is similar to the
94 | 4-level scheme we discussed: it simply adds another level of indirection and
95 | uses 9 previously unused bits of the 64-bit virtual address to index it.
96 |
97 | Here's what the page table hierarchy looks like in Linux:
98 |
99 | ```
100 | 4-Level Paging (circa 2016)
101 | Diagram Above
102 | P4 -> P3 -> P2 -> P1 -> page frame
103 | Linux
104 | PGD -> PUD -> PMD -> PTE -> page frame
105 |
106 | 5-Level Paging (current)
107 | Linux
108 | PGD -> P4D -> PUD -> PMD -> PTE -> page frame
109 | ```
110 |
111 | What does this mean for us in light of the fact that our kernel config specifies
112 | 4-level paging?
113 |
114 | ```
115 | CONFIG_PGTABLE_LEVELS=4
116 | ```
117 |
118 | If we look at the output from an example program that reports the physical
119 | addresses of the process' page frame and of the intermediate page tables,
120 | it shows that the `p4d_paddr` and `pud_paddr` are identical.
121 |
122 | ```
123 | [405] inspect_physical_address():
124 | paddr: 0x115a3c069
125 | pf_paddr: 0x115a3c000
126 | pte_paddr: 0x10d2c7000
127 | pmd_paddr: 0x10d623000
128 | pud_paddr: 0x10c215000
129 | p4d_paddr: 0x10c215000
130 | pgd_paddr: 0x10c90a000
131 | dirty: no
132 | refcount: 57
133 | ```
134 |
135 | Digging into `arch/x86/include/asm/pgtable_types.h`, we see the following:
136 |
137 | ```C
138 | #if CONFIG_PGTABLE_LEVELS > 4
139 | typedef struct { p4dval_t p4d; } p4d_t;
140 |
141 | static inline p4d_t native_make_p4d(pudval_t val)
142 | {
143 | return (p4d_t) { val };
144 | }
145 |
146 | static inline p4dval_t native_p4d_val(p4d_t p4d)
147 | {
148 | return p4d.p4d;
149 | }
150 | #else
151 | #include <asm-generic/pgtable-nop4d.h>
152 |
153 | static inline p4d_t native_make_p4d(pudval_t val)
154 | {
155 | return (p4d_t) { .pgd = native_make_pgd((pgdval_t)val) };
156 | }
157 |
158 | static inline p4dval_t native_p4d_val(p4d_t p4d)
159 | {
160 | return native_pgd_val(p4d.pgd);
161 | }
162 | #endif
163 | ```
164 | [x86 Source](https://elixir.bootlin.com/linux/v5.10.158/source/arch/x86/include/asm/pgtable_types.h#L332)
165 |
166 | Interesting. Looking at `pgtable-nop4d.h` we find that `p4d_t` is defined as
167 | ```
168 | typedef struct { pgd_t pgd; } p4d_t;
169 | ```
170 |
171 | With 4-level paging, the p4d folds into the pgd: `p4d_t`, the type that
172 | represents p4d entries, essentially becomes a wrapper around `pgd_t`. The kernel
173 | does this so that it has a standard 5-level page table interface to program
174 | against regardless of how many levels of page tables actually exist.
175 |
176 | As of writing, arm64 (for linux 5.10.158) directly includes `pgtable-nop4d.h`.
177 |
178 | To summarize, with 4-level paging there are no "real" p4d tables. Instead, pgd
179 | entries contain the addresses of pud tables, and the kernel "pretends" the p4d
180 | exists by making it appear that the p4d is a mirror copy of the pgd.
181 |
182 | If you read on in `arch/x86/include/asm/pgtable_types.h`, you'll see that the kernel uses the same
183 | scheme for 3- and 2-level page table configurations as well. arm64 follows a similar scheme in `arch/arm64/include/asm/pgtable-types.h`.
184 |
185 | NOTE that you cannot make use of this equivalence directly. Your Farfetch'd
186 | implementation must work correctly for any page table configuration and
187 | therefore must use the macros defined by the kernel.
188 |
189 | ### Data structures, functions, and macros
190 | In this section we'll take a step back and discuss the data structures
191 | and functions the kernel uses to manage page tables in more detail.
192 |
193 | To encapsulate memory management information for each task, `struct task_struct` contains
194 | a `struct mm_struct*`.
195 |
196 | ```C
197 | struct mm_struct *mm;
198 | struct mm_struct *active_mm;
199 | ```
200 |
201 | We won't go into the details of `active_mm`, which is used for kernel threads that do not
202 | have their own `struct mm_struct`. Check out `context_switch()` in core.c if you want to read
203 | more.
204 |
205 | Looking in `struct mm_struct`, we find the member `pgd_t *pgd;`. This is a
206 | pointer to the first entry in the pgd for this `mm_struct`. Do you think that
207 | this is a virtual address or a physical address? Remember that all memory
208 | references are translated from virtual addresses to physical addresses by the
209 | CPU, so any address the kernel code uses directly must be a virtual address.
210 |
211 | However, it's easy to recover the physical address, since addresses like this
212 | one live in the kernel's direct (linear) mapping of physical memory.
213 | `virt_to_phys()` recovers the physical address using this linear mapping.
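
For example (a sketch, assuming `mm` is a valid `struct mm_struct *`):

```C
/* mm->pgd is a kernel virtual address; this recovers the physical address of
 * the top-level table, comparable to pgd_paddr in the output above. */
phys_addr_t pgd_paddr = virt_to_phys(mm->pgd);
```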
214 |
215 |
216 | Section 3.3 in Gorman's [chapter on page table
217 | management](https://www.kernel.org/doc/gorman/html/understand/understand006.html)
218 | provides a good overview of the functions / macros used to navigate the page table.
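
To see how these macros fit together, here is a hedged sketch of a walk down the
full 5-level interface for a user virtual address. It ignores huge pages and
does only minimal error checking, so treat it as a starting point, not a
finished implementation:

```C
#include <linux/mm.h>
#include <linux/pgtable.h>

/* Sketch: return a pointer to the PTE mapping vaddr, or NULL if any level is
 * missing. The caller should pte_unmap() the result when done. */
static pte_t *walk_page_table(struct mm_struct *mm, unsigned long vaddr)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;

	pgd = pgd_offset(mm, vaddr);
	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return NULL;

	p4d = p4d_offset(pgd, vaddr);
	if (p4d_none(*p4d) || p4d_bad(*p4d))
		return NULL;

	pud = pud_offset(p4d, vaddr);
	if (pud_none(*pud) || pud_bad(*pud))
		return NULL;

	pmd = pmd_offset(pud, vaddr);
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		return NULL;

	return pte_offset_map(pmd, vaddr);
}
```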
219 |
220 | A common source of confusion arises from a misunderstanding of what macros like `pud_offset`
221 | return.
222 |
223 | ```C
224 | /* Find an entry in the third-level page table.. */
225 | // From include/linux/pgtable.h. Note that this definition is shared between x86 and arm64.
226 | static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
227 | {
228 | return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
229 | }
230 |
231 | // x86: arch/x86/include/asm/pgtable_types.h
232 | // arm64: arch/arm64/include/asm/pgtable-types.h
233 | typedef struct { pudval_t pud; } pud_t;
234 |
235 | // x86: arch/x86/include/asm/pgtable_64_types.h
236 | typedef unsigned long pudval_t;
237 |
238 | // arm64: arch/arm64/include/asm/pgtable-types.h
239 | typedef u64 pudval_t;
240 | ```
241 |
242 | We touched on this briefly above. A `pud_t` is just a struct containing an
243 | unsigned long, which means it compiles down to an unsigned long. Recall from our
244 | earlier discussion that each entry is the address of a physical page frame and
245 | that the last 12 bits are reserved for flags since each page frame is page
246 | aligned. The macros that Gorman discusses, like `pte_none()` and
247 | `pmd_present()`, check these flag bits to determine information about the entry.
248 |
249 | If you want to recover the actual value of the entry, the type-casting macros
250 | Gorman discusses in section 3.2 are useful. Although the Gorman reading is x86-specific,
251 | arm64 defines similar, if not identical, macros. Keep in mind that if you want the
252 | physical address the entry points to, you'll need to bitwise-AND the value with the
253 | correct mask.
254 |
255 | Both x86 and arm64 either define functions/macros for the mask (so you can
256 | perform the bitwise-AND yourself) or define functions/macros that do the
257 | correct bitwise-AND for you. Either way, recovering the physical address the
258 | entry points to is possible on both architectures; it just may look slightly
259 | different depending on which architecture you're on and which functions/macros you choose.
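
For example, at the last level one portable sketch goes through the page frame
number (`pte` and `vaddr` stand in for the entry and virtual address you're
inspecting):

```C
/* Sketch: pte_pfn() strips the flag bits and returns the page frame number. */
unsigned long pf_paddr = pte_pfn(pte) << PAGE_SHIFT;

/* The byte that vaddr refers to then lives at: */
unsigned long paddr = pf_paddr + (vaddr & ~PAGE_MASK);
```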
260 |
261 | ### Page dirty and refcount
262 | Recall from before that a flag in the page table entry indicates whether the
263 | page frame is dirty or not. Do not read the flag directly; the kernel provides
264 | a macro for this purpose.
265 |
266 | You will find section 3.4 of Gorman useful for figuring out how to retrieve
267 | the refcount of a page frame. Hint: every physical frame has a `struct page` in
268 | the kernel, which is defined
269 | [here](https://elixir.bootlin.com/linux/v5.10.158/source/include/linux/mm_types.h#L70).
270 | Be sure to use the correct kernel functions / macros to access any information
271 | in `struct page`.
272 |
273 |
--------------------------------------------------------------------------------
/F-Farfetchd/x86_address_structure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/F-Farfetchd/x86_address_structure.png
--------------------------------------------------------------------------------
/F-Farfetchd/x86_address_structure.svg:
--------------------------------------------------------------------------------
1 |
2 |
918 |
--------------------------------------------------------------------------------
/F-Farfetchd/x86_paging_64bit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/F-Farfetchd/x86_paging_64bit.png
--------------------------------------------------------------------------------
/F-Farfetchd/x86_paging_64bit.svg:
--------------------------------------------------------------------------------
1 |
2 |
930 |
--------------------------------------------------------------------------------
/G-Pantry/page-cache-overview.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/G-Pantry/page-cache-overview.pdf
--------------------------------------------------------------------------------
/G-Pantry/pantryfs-spec.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/G-Pantry/pantryfs-spec.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | cs4118 Recitations
2 | ==================
3 |
4 | This repository contains the recitation notes for Columbia's Operating Systems I
5 | class, COMSW4118, as taught by Jae Woo Lee. For information about the class,
6 | visit the [course homepage](http://www.cs.columbia.edu/~jae/4118/).
7 |
8 | These recitations are held every other week by the various TAs, generally using
9 | these notes/linked resources as the basis for their sections.
10 |
11 | Issues, patches, and comments, especially by current and former students, are
12 | welcome.
13 |
14 | ### Contents
15 | - [Note A](A-Workflow/workflow.md): VM/kernel workflow setup, Linux source code
16 | navigators
17 | - [Note B](B-Sockets-ServerTesting): Sockets/TCP programming, server testing
18 | strategies
19 | - [Note C](C-Linux-Kernel-Dev/linux-kernel-dev.md): Kernel configuration,
20 | compilation, and style
21 | - [Note D](D-Fridge/waitqueue.pdf): Linux wait queue (hw5)
22 | - [Note E](E-Freezer/freezer.md): Linux scheduler data structures, implementing
23 | a scheduler (hw6)
24 | - [Note F](F-Farfetchd/index.md): Linux page table data structures,
25 | macros/functions (hw7)
26 | - [Note G](G-Pantry): Linux page cache, implementing a file system (hw8)
27 |
28 |
--------------------------------------------------------------------------------