├── A-Workflow └── workflow.md ├── B-Sockets-ServerTesting ├── example_html │ ├── images │ │ ├── crew.jpg │ │ └── ship.jpg │ └── index.html ├── img │ ├── client-server.png │ └── listening-vs-connecting.png ├── overview-sockets-http.pdf └── sockets-servertesting.md ├── C-Linux-Kernel-Dev └── linux-kernel-dev.md ├── D-Fridge └── waitqueue.pdf ├── E-Freezer ├── allclass.png ├── final_simple_runqueue.png ├── freezer.md ├── freezer.png ├── freezer_sched_class.md ├── naive.png └── task_struct.png ├── F-Farfetchd ├── Page_table.png ├── index.md ├── x86_address_structure.png ├── x86_address_structure.svg ├── x86_paging_64bit.png └── x86_paging_64bit.svg ├── G-Pantry ├── page-cache-overview.pdf ├── pantryfs-fill-super-diagram.svg └── pantryfs-spec.pdf └── README.md /A-Workflow/workflow.md: -------------------------------------------------------------------------------- 1 | VM/Kernel Workflow 2 | ================== 3 | Before we get into the heavier assignments in this course, you should invest 4 | sometime into setting up a workflow and develop good coding habits. 5 | 6 | Why Use a VM? 7 | ------------- 8 | 9 | - Hypervisor: Software responsible for managing VMs (e.g. creation, deletion, 10 | resource allocation). In this class, we use VMware. 11 | - Host machine: The hardware running the hypervisor (e.g. your personal laptop) 12 | - Virtual machine (VM): A computer system created by the hypervisor that shares 13 | resources with the host machine. In this class, we virtualize Debian 11 (a 14 | flavor of Linux). VMs adhere to sandbox properties: 15 | - Isolate the work you'll be doing from the rest of the system 16 | - Be able to take snapshots of your development environment so that if it does 17 | get corrupted, you can revert to a clean state 18 | 19 | Working on a VM can be preferable to working on your host machine (i.e. 20 | "bare-metal"). You can virtualize other operating systems to test your 21 | applications. You can work on potentially buggy/dangerous code without 22 | compromising your host machine. 23 | 24 | 25 | - Snapshot: A feature most hypervisors support. Capture the current running 26 | state of the VM, stores it, and then allows you to revert to that state 27 | sometime later. You should snapshot your VM before executing something that 28 | can compromise the system. 29 | 30 | In this class, we will often be "hacking" the Linux kernel and attempting to 31 | boot into those hacked kernels. We need to use VMs since not all of us have a 32 | Linux host machine. Even if we did, we wouldn't want to ruin our host machines 33 | with our potentially buggy kernels. 34 | 35 | VM Setup/Workflow 36 | ----------------- 37 | We've already written a series of guides for working on your VM. 38 | 39 | First and foremost, you should have a Debian VM installed already (see [Debian 40 | VM Setup](https://cs4118.github.io/dev-guides/debian-vm-setup.html) for a guide 41 | on how to do this). 42 | 43 | ### Working on your VM 44 | You have more options than simply working on the VM's graphical interface, which 45 | often feels clunky. 46 | 47 | The most common workflow involves SSHing into your VM, which we've written a 48 | [guide](https://cs4118.github.io/dev-guides/vm-ssh.html#ssh-into-your-local-vm) 49 | for. This is a good option if you want to conserve processing power on your host 50 | machine and disable the VM's graphical interface. 51 | 52 | Alternatively, you can setup an IDE to SSH into your VM. 
One option is using 53 | Visual Studio Code, which we've written up some notes (below) on how to use. 54 | This is a nice alternative to command-line editors like vim/emacs if you're 55 | not familiar with them. 56 | 57 | ### Additional Tools 58 | If you do decide to work in a command-line environment, here are some tools 59 | we've used to streamline our workflow: 60 | 61 | - `bat`: A better version of `cat` 62 | [(installation)](https://github.com/sharkdp/bat#on-ubuntu-using-most-recent-deb-packages) 63 | - `grep`: Pattern-match files (learn how to effectively use regexs to improve 64 | search results). Even better, `ripgrep`. 65 | - Reverse-i search: Efficiently search through bash history instead of retyping 66 | long commands. 67 | - `tmux`: Terminal multiplexer. (e.g. open 2 panes for vim, 1 for `dmesg`, 1 for 68 | running commands). 69 | 70 | Kernel Workflow 71 | --------------- 72 | 73 | ### Editor enhancements 74 | 75 | Before getting into hw4/supermom, please read through one of our kernel developer 76 | workflow guides. These explain how to set up either `vim` or VSCode for kernel 77 | development. 78 | 79 | - [Vim workflow](https://cs4118.github.io/dev-guides/vim-workflow.html): Here, 80 | we offer a bunch of cool `vim` additions/tricks to help you develop more 81 | efficiently while working on the kernel assignments. Note that this guide is 82 | only relevant if you intend to work on the command-line using `vim`. One notable 83 | mention here is `cscope`. This is a kernel-source navigator that works directly 84 | in your terminal/vim session. This is far more powerful than using `grep` to 85 | look for a symbol definition/use in the source-tree. 86 | - [VSCode workflow](https://cs4118.github.io/dev-guides/vscode-workflow.html): 87 | In this guide, we explain how to configure autocomplete, formatting, and 88 | various other features for working on the kernel while using VSCode. VSCode 89 | is a very powerful editor, and being able to take advantage of its functionality 90 | will make your life much easier. 91 | 92 | ### Web Tools 93 | 94 | If you don't want to use `cscope`, there's an popular online kernel-source 95 | navigator: [bootlin](https://elixir.bootlin.com/linux/v5.10.158/source). Note 96 | that kernel version matters when you're navigating code – be sure you select the 97 | correct version. 98 | 99 | Like bootlin, you can look for symbols in the kernel-source and look for 100 | instances of where they are defined and used. However, bootlin doesn't index all 101 | symbols, so you might have better luck searching with `cscope`. 102 | -------------------------------------------------------------------------------- /B-Sockets-ServerTesting/example_html/images/crew.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/example_html/images/crew.jpg -------------------------------------------------------------------------------- /B-Sockets-ServerTesting/example_html/images/ship.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/example_html/images/ship.jpg -------------------------------------------------------------------------------- /B-Sockets-ServerTesting/example_html/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 |

4 | Star Trek: The Next Generation, the best TV show ever! [page heading] 5 | 6 | The ship: [images/ship.jpg] 7 | And the crew: [images/crew.jpg] 16 |
17 | 18 | 19 | -------------------------------------------------------------------------------- /B-Sockets-ServerTesting/img/client-server.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/img/client-server.png -------------------------------------------------------------------------------- /B-Sockets-ServerTesting/img/listening-vs-connecting.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/img/listening-vs-connecting.png -------------------------------------------------------------------------------- /B-Sockets-ServerTesting/overview-sockets-http.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/B-Sockets-ServerTesting/overview-sockets-http.pdf -------------------------------------------------------------------------------- /B-Sockets-ServerTesting/sockets-servertesting.md: -------------------------------------------------------------------------------- 1 | # Sockets and Server Testing 2 | 3 | ## Sockets and HTTP 4 | 5 | ### What is a socket? 6 | A socket is a software construct used for many modes of communication between 7 | processes. The mode of communication that this recitation will focus on is 8 | network communication. In particular, stream sockets represent an endpoint for 9 | reliable, bidirectional connections such as TCP connections. This allows for two 10 | processes, on separate computers, to communicate over a TCP/IP network 11 | connection. 12 | 13 | Sockets have: 14 | - an IP address, to (typically) identify the computer that the socket endpoint 15 | belongs to 16 | - a port number, to identify which process running on the computer the socket 17 | endpoint belongs to 18 | - a protocol, such as TCP (reliable) or UDP (unreliable). Stream sockets use TCP 19 | 20 | An IP address and port number are both required in order for a computer to 21 | communicate with a specific process on a remote computer. 22 | 23 | ### The client-server model 24 | The two endpoints in a socket connection serve different roles. One end acts as 25 | a *server*: 26 | - It tells the operating system that it should receive incoming connections on a 27 | port number 28 | - It waits for incoming connections 29 | - When it receives a connection, it creates a *new socket* for each client, 30 | which will then be used to communicate with that client 31 | 32 | The other end is a *client*: 33 | - It "connects" to the server using the server’s IP address and the port number 34 | 35 | After a client connects to a server, there is bidirectional communication 36 | between the two processes, often with I/O system calls such as `read()` and 37 | `write()`, or their socket-specific variants `recv()` and `send()`. 38 | 39 | ### Sockets with netcat 40 | A simple way to demonstrate the bidirectional and network-based communcation of 41 | sockets is with `netcat`. `netcat` is a bare-bones program to send streams of 42 | binary data over the network. 43 | 44 | Imagine we have two computers that can communicate over the internet, with the 45 | IP addresses `clac.cs.columbia.edu` and `clorp.cs.nyu.edu`. 
46 | 47 | Because of the client-server model, connecting two socket endpoints to each 48 | other is not a symmetrical process. One socket needs to act as the server, while 49 | the other needs to act as a client. You tell `netcat` to act as a server with 50 | the `-l` flag: 51 | 52 | ```console 53 | joy@clac.cs.columbia.edu:~$ nc -l 10000 54 | ``` 55 | 56 | The `netcat` program on `clac.cs.columbia.edu` will create a socket and wait for 57 | connections on port 10000. To tell `netcat` to act as a client, you supply the 58 | IP address of the server and the port number of the socket listening on that 59 | server: 60 | 61 | ```console 62 | jeremy@clorp.cs.nyu.edu:~$ nc clac.cs.columbia.edu 10000 63 | ``` 64 | 65 | Notice the differences between these two commands. The first command only 66 | requires a port number, and doesn't require the IP address of the other 67 | computer. The second command requires knowledge of both the IP address (what 68 | computer to connect to) and the port number (which process to connect to on that 69 | computer). This asymmetry is the client-server model. 70 | 71 | After the client connects to the server, the server `netcat` process creates a 72 | new socket for bidirectional communicaiton. After the two processes connect 73 | there is no functional difference between client and server. What you type on 74 | one end should be visible on the other -- a full duplex stream of data. 75 | 76 | ### Sockets API Summary 77 | ![](img/client-server.png) 78 | 79 | **`socket()`** 80 | - Called by both the client and the server 81 | - On the server-side, a listening socket is created; a connected socket will be 82 | created later by `accept()` 83 | 84 | **`bind()`** 85 | - Usually called only by the server 86 | - Binds the listening socket to a specific port that should be known to the 87 | client 88 | 89 | **`listen()`** 90 | - Called only by the server 91 | - Sets up the listening socket to accept connections 92 | 93 | **`accept()`** 94 | - Called only by the server 95 | - By default blocks until a connection request arrives 96 | - Creates and returns a new socket for each client 97 | 98 | **`connect()`** 99 | - Called only by the client 100 | - Requires an IP address and port number to connect to 101 | - Attempt to establish connection by reaching out to server 102 | 103 | **`send()` and `recv()`** 104 | - Called by both the client and server 105 | - Reads and writes to the other side 106 | - Message boundaries may not be preserved 107 | - nearly the same as `write()` and `read()`, but with socket-specific options 108 | 109 | A TCP client may use these functions as such: 110 | ```c 111 | int fd = socket(...); 112 | connect(fd, ... /* server address */); 113 | 114 | // Communicate with the server by send()ing from and recv()ing to fd. 115 | 116 | close(fd); 117 | ``` 118 | 119 | And a TCP server: 120 | 121 | ```c 122 | int serv_fd = socket(...); 123 | bind(serv_fd, ... /* server address */); 124 | listen(serv_fd, ... /* max pending connections */); 125 | 126 | // use an infinite loop, to continue accepting new incoming clients 127 | for (;;) { 128 | int clnt_fd = accept(serv_fd, ...); 129 | 130 | // Communicate with the client by send()ing from and recv()ing to 131 | // clnt_fd, NOT serv_fd. 
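        // For example (a sketch only, error handling omitted), a server that
        // echoes one message back to the client might do:
        //
        //     char buf[4096];
        //     ssize_t n = recv(clnt_fd, buf, sizeof(buf), 0);
        //     if (n > 0)
        //             send(clnt_fd, buf, n, 0);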
132 | 133 | close(clnt_fd); 134 | } 135 | ``` 136 | 137 | ### Listening socket vs connected socket 138 | ![](img/listening-vs-connecting.png) 139 | 140 | To form a bidirectional channel between client and server, three sockets are used: 141 | - The server uses two sockets 142 | - The listening socket, to accept incoming connections from a client 143 | - The client socket, which is created when an incoming connection has been 144 | `accept()`ed. 145 | - The client uses one socket 146 | - The `connect()`ing socket, which reaches out to the server. Once the 147 | connection has been made, communication can be done between the server's client 148 | socket and the client's connecting socket. 149 | 150 | ### HTTP 1.0 151 | HTTP 1.0 is a protocol between a client, typically a web browser, and a server, 152 | typically a web server hosting files such as HTML. It is an outdated version of 153 | the HTTP protocol and simpler than newer versions. 154 | 155 | When visiting a website, a URL is specified in the following format: 156 | 157 | ``` 158 | http://example.com:80/index.html 159 | ^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^^^ 160 | | | | | 161 | | | | URI = /index.html 162 | | | port number = 80 163 | | domain name = example.com 164 | protocol = HTTP 165 | ``` 166 | 167 | Based on the information provided by the user in the URL, a web client will 168 | establish a socket connection with the IP address of the domain name. After 169 | establishing the connection, the two computers exchange text in the form of HTTP 170 | requests: 171 | 172 | - The client sends an **HTTP request** for a resource on the server 173 | - The server sends an **HTTP response** 174 | 175 | **HTTP request** 176 | - First line: method, request URI, version 177 | - Ex: "GET /index.html HTTP/1.0\r\n" 178 | - Followed by 0 or more headers 179 | - Ex: "Host: www.google.com\r\n" 180 | - Followed by an empty line 181 | - "\r\n" 182 | 183 | **HTTP response** 184 | - First line: response status 185 | - Success: "HTTP/1.0 200 OK\r\n" 186 | - Failure: "HTTP/1.0 404 Not Found\r\n" 187 | - Followed by 0 or more response headers 188 | - Followed by an empty line 189 | - "\r\n" 190 | - Followed by the content of the response 191 | - Ex: image file or HTML file 192 | 193 | We can see the contents of real HTTP requests using `netcat` by pretending to be 194 | either a web client or server. Our client and server won't actually work, since 195 | they simply recieve the incoming request but do nothing to process the request 196 | or reply. 197 | 198 | Let's first act as a web server. We tell `netcat` to open a server connection 199 | with `nc -l 10000`, and then in a web browser navigate to the URL with the 200 | domain name of this server. We can use the domain name `localhost` to specify 201 | the local computer rather than connecting to a remote computer over the 202 | internet. In Chrome, we'll navigate to the URL 203 | `http://localhost:10000/index.html`. `netcat` outputs this: 204 | 205 | ```console 206 | $ nc -l 10000 207 | GET /index.html HTTP/1.1 # GET == method; /index.html == request URI; HTTP/1.1 == version 208 | Host: localhost:10000 # header 209 | Connection: keep-alive # more headers... 210 | -removed for brevity- 211 | # blank newline to indicate end of headers/request 212 | ``` 213 | 214 | To act as a client, we can type our HTTP request manually into netcat rather 215 | than doing it through the web browser. 
Here, we try to send an HTTP request to 216 | the domain name `example.com` on port `80` (the default for HTTP web servers) 217 | for the URI `/index.html`. Note that we specify the `-C` with `netcat` so that 218 | newlines are `\r\n` rather than `\n` -- a requirement of the HTTP protocol. This 219 | flag may vary depending on `netcat` version -- check `man nc`. 220 | 221 | ```console 222 | $ nc -C example.com 80 223 | GET /index.html HTTP/1.0 # GET == method; /index.html == request URI; HTTP/1.1 == version 224 | # blank line to specify end of request 225 | HTTP/1.0 200 OK # start of HTTP response. HTTP/1.0 == version; 200 OK == response status 226 | Accept-Ranges: bytes # header 227 | Content-Type: text/html # more headers... 228 | -removed for brevity- 229 | # blank newline to indicate end of headers and start of file contents 230 | # HTML contents 231 | 232 | 233 | Example Domain 234 | -removed for brevity- 235 | ``` 236 | 237 | ## Testing your multi-server 238 | 239 | ### Siege 240 | Siege is a command-line tool that allows you to benchmark your webserver using 241 | load testing. Given a few parameters, Siege gives you information about the 242 | number of successful transactions to your website, percent availability, the 243 | latency/throughput of your server, and more. 244 | 245 | To install siege, run the following command: 246 | 247 | ``` 248 | sudo apt install siege 249 | ``` 250 | 251 | To use siege with your webserver in HW3, run your server and test with the 252 | following command: 253 | 254 | ``` 255 | siege http://:/ 256 | ``` 257 | 258 | This will run for an infinite amount of time. When you 259 | Ctrl-C out of the command, a list of statistics will be 260 | outputted on your terminal. 261 | 262 | A better way to test with siege is using its options. The `-c` and `-r` options 263 | are particularly useful, as they allow you to specify the number of concurrent 264 | "users" and repetitions per user, respectively. For example, the following 265 | command will create 25 concurrent users that will each attempt to hit the server 266 | 50 times, resulting in 1250 hit attempts: 267 | 268 | ``` 269 | siege -c 25 -r 50 http://:/ 270 | ``` 271 | 272 | There are many other options, specified in the siege man page. These include 273 | `-t`, which specifies how long each user should run (as opposed to how many 274 | times), and `-f`, which specifies a file path that contains a list of URLs to 275 | test. 276 | 277 | ### Additional guidance on testing/benchmarking 278 | When grading, we're going to test your implementation using a mix of manual 279 | connections (e.g. using tools like netcat) and stress testers like siege. 280 | 281 | You should use `netcat` to make sure that basic functionality works and that you 282 | can indeed have more than 1 connection to the server at any given time. `netcat` 283 | is nice because it allows you to establish a connection and then prompts you for 284 | the data to send. You should also use `netcat` to test that your cleanup logic 285 | is correct, as you can control exactly when connections start/terminate. 286 | 287 | Your server should be resilient to any client failure. `netcat` is a useful tool 288 | to test these kinds of failures, as you can simulate bad requests or 289 | disconnections at various points during the transaction. Your server should be 290 | able to gracefully handle these scenarios -- under no condition should your 291 | server crash because of a client failure. 
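For example, here is one quick manual robustness check along these lines (a sketch; `8080` stands in for whatever port your server actually listens on): first send a garbage request line, then open a second connection and kill the client partway through a request.

```console
$ printf 'BOGUS /index.html HTTP/1.0\r\n\r\n' | nc localhost 8080
$ nc localhost 8080      # start typing a request, then hit Ctrl-C mid-line
```

In both cases, your server should respond with an error status or simply drop the connection, and then keep serving other clients as if nothing happened.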
292 | 293 | Once you've tested the basic functionality, use a stress tester to make sure 294 | that your server handles concurrent hoards of requests in a reasonable amount of 295 | time. Since we're all on VMs running on different host machines, we can't really 296 | say "X requests must finish in Y seconds". We're just looking to make sure that 297 | your server doesn't take years (e.g. because it is actually serializing 298 | requests). 299 | 300 | Our grading scripts make heavy use of siege url files. siege will basically make 301 | requests using the URLs specified in this file. Use this to make sure your 302 | server can concurrently handle all kinds of requests and correctly respond to 303 | all of them (2xx, 4xx, 5xx, multi-server never responds with 3xx). 304 | 305 | Regarding benchmarking, the assignment prompt occasionally instructs you to 306 | compare the performance of the implementation of one part with another. However, 307 | since you are testing multi-server in a virtual machine, the performance isn’t 308 | guaranteed to be significantly better. As such, don’t worry too much about the 309 | benchmarking instructions - it’s not a hard and fast requirement. 310 | 311 | 312 | ## Acknowledgements 313 | - Some examples were taken from John Hui's [Advanced 314 | Programming](https://cs3157.github.io/www/2022-9/) lecture notes. We recommend 315 | reading them on top of these recitation notes. 316 | - [Lecture 13 - TCP/IP Networking](https://cs3157.github.io/www/2022-9/lecture-notes/13-tcp-ip.pdf) 317 | - [Lecture 14 - HTTP](https://cs3157.github.io/www/2022-9/lecture-notes/14-http.pdf) 318 | -------------------------------------------------------------------------------- /C-Linux-Kernel-Dev/linux-kernel-dev.md: -------------------------------------------------------------------------------- 1 | Linux Kernel Development 2 | ======================== 3 | This document is meant to supplement our [compilation 4 | guide](https://cs4118.github.io/dev-guides/debian-kernel-compilation.html), 5 | which goes over how to install a new kernel on your machine. 6 | 7 | Kernel Module Makefile 8 | ---------------------- 9 | More info in our [kernel module 10 | guide](https://cs4118.github.io/dev-guides/linux-modules.html), but for the sake 11 | of comparison, here it is again: 12 | 13 | ```make 14 | obj-m += hello.o 15 | all: 16 | make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules 17 | clean: 18 | make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean 19 | ``` 20 | 21 | This Makefile is separate from the top-level kernel source Makefile we're about 22 | to work with. Here, we're adding to the `obj-m` "goal definition" (build object 23 | as module) and using a specialized Makefile located at `/lib/modules/$(shell 24 | uname -r)/build` to link our module into the kernel. 25 | 26 | Top-level Kernel Makefile 27 | ------------------------- 28 | 29 | `linux/Makefile` is the only Makefile you'll be executing when developing with 30 | the kernel. At a high level, this Makefile is responsible for: 31 | 32 | - recursively calling Makefiles in subdirectories 33 | - generating kernel configurations 34 | - building and installing kernel images 35 | 36 | Here's an overview of the common operations you'll be doing with this Makefile. 37 | 38 | ### Preparing for development 39 | 40 | It's good practice to ensure you have a clean kernel source before beginning to 41 | development. 
Here's the cleaning options that this Makefile provides, as 42 | reported by `make help`: 43 | 44 | ``` 45 | Cleaning targets: 46 | clean - Remove most generated files but keep the config and 47 | enough build support to build external modules 48 | mrproper - Remove all generated files + config + various backup files 49 | distclean. - mrproper + remove editor backup and patch files 50 | ``` 51 | 52 | `make mrproper` is usually sufficient for our purposes. Be warned that you'll 53 | have to completely rebuild the kernel after running this! 54 | 55 | ### Configuration and compilation 56 | 57 | `linux/.config` is the kernel's configuration file. It lists a bunch of options 58 | that determine the properties of the kernel you're about to build. It also 59 | determines what code will be compiled and linked into the final kernel image. 60 | For example, if `CONFIG_SMP` is set, you're letting the kernel know that you 61 | have more than one processor so it can provide multi-processing functionality. 62 | 63 | There's a bunch of different ways to generate this configuration file. Here's 64 | the relevant options we have, as reported by `make help` in the top-level kernel 65 | directory: 66 | 67 | ``` 68 | Configuration targets: 69 | config - Update current config utilising a line-oriented program 70 | menuconfig - Update current config utilising a menu based program 71 | oldconfig - Update current config utilising a provided .config as base 72 | localmodconfig - Update current config disabling modules not loaded 73 | defconfig - New config with default from ARCH supplied defconfig 74 | olddefconfig - Same as oldconfig but sets new symbols to their 75 | default value without prompting 76 | ``` 77 | 78 | Summarizing from our [compilation 79 | guide](https://cs4118.github.io/dev-guides/debian-kernel-compilation.html), we 80 | set up our kernel config in a couple of steps: 81 | 82 | ``` 83 | make olddefconfig # Use current kernel's .config + ARCH defaults 84 | make menuconfig # Manually edit some configs 85 | ``` 86 | 87 | If you want to **significantly** reduce your build time, you can also set your 88 | config to skip unloaded modules during compilation: 89 | 90 | ```sh 91 | yes '' | make localmodconfig 92 | ``` 93 | 94 | As you're developing in the kernel, you might add additional source files that 95 | need to be compiled and linked in. Open up the directory's Makefile and add your 96 | desired object file to the `obj-y` goal definition, which is for "built-in" 97 | functionality (as opposed to kernel modules). For example, here's the relevant 98 | portion of `linux/kernel/Makefile`: 99 | 100 | ```make 101 | obj-y = fork.o exec_domain.o panic.o \ 102 | cpu.o exit.o softirq.o resource.o \ 103 | sysctl.o sysctl_binary.o capability.o ptrace.o user.o \ 104 | signal.o sys.o umh.o workqueue.o pid.o task_work.o \ 105 | extable.o params.o \ 106 | kthread.o sys_ni.o nsproxy.o \ 107 | notifier.o ksysfs.o cred.o reboot.o \ 108 | async.o range.o smpboot.o ucount.o 109 | ``` 110 | 111 | If you were adding a source file to `linux/kernel`, you'd add the object file 112 | target here. 113 | 114 | Once you're ready to compile your kernel, you run the following in the top-level 115 | source directory: 116 | 117 | ```sh 118 | make -j $(nproc) 119 | ``` 120 | 121 | This will compile your kernel across all available CPUs, as reported by `nproc`. 122 | Note that this top-level Makefile will recursively build source code in 123 | subdirectories for you! 
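For example, if you added a (hypothetical) source file `linux/kernel/demo.c`, you would append its object file to the `obj-y` list shown above and then rebuild from the top of the tree:

```make
# linux/kernel/Makefile (sketch): demo.o is a hypothetical addition
obj-y = fork.o exec_domain.o panic.o \
	...
	async.o range.o smpboot.o ucount.o demo.o
```

Re-running `make -j $(nproc)` will then compile `demo.c` and link it into the kernel image; objects that haven't changed are not rebuilt.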
124 | 125 | ### Kernel installation 126 | 127 | Like in our [compilation 128 | guide](https://cs4118.github.io/dev-guides/debian-kernel-compilation.html), you 129 | run the following command once the compilation finishes: 130 | 131 | ```sh 132 | sudo make modules_install && sudo make install 133 | ``` 134 | 135 | The first time you install a kernel, you must build `modules_install`. All 136 | subsequent times, `install` is sufficient. 137 | 138 | Finally, reboot and select the kernel version you just installed via the GRUB 139 | menu! 140 | 141 | Kernel Code Style 142 | ----------------- 143 | 144 | The kernel source comes with a linter written in Perl (located at 145 | `linux/scripts/checkpatch.pl`). 146 | 147 | checkpatch will let you know about any stylistic errors and warnings that it 148 | finds in the codebase. You should get into the habit of running checkpatch and 149 | fixing what it suggests before major checkpoints. 150 | 151 | If you want a general overview of kernel code style, here's 152 | [one](https://www.kernel.org/doc/html/v5.10/process/coding-style.html). You can 153 | also find this in `linux/Documentation/process/coding-style.rst`. 154 | 155 | Debugging Techniques 156 | -------------------- 157 | 158 | - Take 159 | [snapshots](https://docs.vmware.com/en/VMware-Fusion/12/com.vmware.fusion.using.doc/GUID-4C90933D-A31F-4A56-B5CA-58D3AE6E93CF.html) 160 | of your VM before installing software that may corrupt it. 161 | - Use `printk/pr_info` to log messages to the kernel log buffer (viewable in 162 | userspace with by running `sudo dmesg`) 163 | - Set up a [serial port](http://cs4118.github.io/freezer/#tips) to redirect log 164 | buffer to your host machine (in case VM crashes before you can check it with 165 | `dmesg`). 166 | -------------------------------------------------------------------------------- /D-Fridge/waitqueue.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/D-Fridge/waitqueue.pdf -------------------------------------------------------------------------------- /E-Freezer/allclass.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/allclass.png -------------------------------------------------------------------------------- /E-Freezer/final_simple_runqueue.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/final_simple_runqueue.png -------------------------------------------------------------------------------- /E-Freezer/freezer.md: -------------------------------------------------------------------------------- 1 | # Implementing a scheduling class 2 | 3 | ## Introduction 4 | 5 | One of the reasons students struggle with this assignment boils down to a lack 6 | of understanding of how the scheduler works. In this guide, I hope to provide 7 | you with a clear understanding of how the Linux scheduler pieces fit together. I 8 | hope to paint a picture that you can use to implement the freezer scheduler. 9 | 10 | ## The `task_struct` 11 | 12 | Recall from HW1 that in Linux, every process is defined by its `struct 13 | task_struct`. 
When you have multiple tasks forked off a common parent, they are 14 | linked together in a doubly linked list through the `struct list_head sibling` field embedded 15 | within the `task_struct`. For example, if you had four processes running on your 16 | system, each forked off one parent, it would look something like this (the 17 | parent is not shown): 18 | 19 | *(figure: four task_structs chained together through their `sibling` list_heads)* 20 | 21 |
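As an aside, this is the same `list_head` machinery you used in HW1. Here is a minimal sketch that walks one parent's `children` list (`print_children` is just an illustrative helper; real kernel code would also hold `read_lock(&tasklist_lock)` or use RCU while walking the list):

```c
#include <linux/list.h>
#include <linux/printk.h>
#include <linux/sched.h>

/* Sketch: print the PID of every child of @parent. */
static void print_children(struct task_struct *parent)
{
	struct task_struct *child;

	/* Each child's `sibling` list_head is chained into parent->children. */
	list_for_each_entry(child, &parent->children, sibling)
		pr_info("child pid: %d\n", child->pid);
}
```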
22 | 23 | However, at this stage, none of these processes are actually running on a CPU. 24 | In order to get them onto a CPU, I need to introduce you to the `struct rq`. 25 | 26 | ## The `struct rq` 27 | 28 | The `struct rq` is a per-CPU run queue data structure. I like to think of it as 29 | the virtual CPU. It contains a lot of information (most of which goes way over 30 | my head), but it also includes the list of tasks that will (eventually) run on 31 | that CPU. 32 | 33 | A naive implementation would be to embed a `struct list_head runqueue_head` (for 34 | example) into the `struct rq`, and embed a `struct list_head node` into every 35 | `task_struct`. 36 | 37 | *(figure)* 38 | 39 | This is a BAD implementation. 40 |
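In code, the naive layout in the figure would amount to something like this (a sketch of the bad idea, shown only for contrast; the `runqueue_head` and `node` fields are made up for illustration and do not exist in the real kernel):

```c
/* Naive sketch -- NOT how Linux actually structures its runqueues. */
struct rq {
	/* ... */
	struct list_head runqueue_head;	/* head of this CPU's list of runnable tasks */
};

struct task_struct {
	/* ... */
	struct list_head node;		/* links the task into rq->runqueue_head */
};
```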
41 | 42 | The main problem with this implementation is that it does not extend well. At 43 | this point, you know Linux has more than one scheduling class. Linux comes built 44 | with a deadline class, a real-time class, and the primary CFS. Having a 45 | `list_head` embedded directly into the `struct rq` for each scheduling class is 46 | not feasible. 47 | 48 | The solution is to create a new structure containing the `list_head` and any 49 | bookkeeping variables. Then, we can include just the wrapper structure in the 50 | `struct rq`. Linux includes these structures in `linux/kernel/sched/sched.h`. 51 | 52 | By convention, Linux scheduler-specific wrapper structures are named `struct 53 | {class}_rq`. For example, the CFS class defines a `struct cfs_rq`, which is 54 | then declared inside `struct rq` as `struct cfs_rq cfs`. 55 | 56 | The following snippet is taken from `linux/kernel/sched/sched.h`: 57 | 58 | ```c 59 | struct cfs_rq { 60 | struct load_weight load; 61 | unsigned int nr_running; 62 | unsigned int h_nr_running; 63 | unsigned int idle_h_nr_running; 64 | /* code omitted */ 65 | }; 66 | 67 | struct rq { 68 | /* code omitted */ 69 | struct cfs_rq cfs; 70 | struct rt_rq rt; 71 | struct dl_rq dl; 72 | /* code omitted */ 73 | }; 74 | ``` 75 | 76 | ## The `freezer_rq` 77 | 78 | At this point, you've probably guessed that you will need to do the same thing 79 | for freezer. You are right. The `freezer_rq` should include the head of the 80 | freezer runqueue. Additionally, you may need to include some bookkeeping 81 | variables. Think of what you would actually need and don't add anything extra 82 | (it should be pretty simple). 83 | 84 | ## The `sched_freezer_entity` 85 | 86 | Now that you have the `struct rq` set up, you need some mechanism to link 87 | your `task_struct`s into the queue. Here, too, you can't just include a 88 | `list_head node` to add a task onto the scheduler-specific runqueue, because 89 | you'll need additional bookkeeping. As you have probably guessed, we are going 90 | to wrap the `list_head` and all the bookkeeping variables into their own struct. 91 | 92 | In Linux, we name these structs `sched_{class}_entity` (one exception is that 93 | CFS names this `sched_entity`). For example, the real-time scheduling class 94 | calls it `sched_rt_entity`. We will name ours `struct sched_freezer_entity`. 95 | Again, make sure you only include what you need in this struct. 96 | 97 | With all this setup, here is what the final picture looks like: 98 | 99 | *(figure: two per-CPU rq structs, each with a freezer runqueue linking task_structs through their embedded scheduling entities)* 100 | 101 |
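Concretely, the final picture corresponds to declarations along these lines. This is a sketch only; the exact bookkeeping fields (and their names) are for you to decide:

```c
/* Sketch: wrapper runqueue embedded in each per-CPU struct rq. */
struct freezer_rq {
	struct list_head queue;		/* head of this CPU's freezer runqueue */
	unsigned int nr_running;	/* example bookkeeping */
};

/* Sketch: scheduling entity embedded in each task_struct. */
struct sched_freezer_entity {
	struct list_head run_list;	/* links the task into freezer_rq->queue */
};
```

`struct rq` would then gain a `struct freezer_rq freezer;` member alongside `cfs`, `rt`, and `dl`, and `struct task_struct` would gain a `struct sched_freezer_entity` member, mirroring how the existing classes are wired up.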
102 | 103 | In the picture above, the two structs on the far left represent a system with 104 | two CPUs. I colored these blue and green to distinguish them from each other, 105 | and to show that different `task_structs` linked on one `siblings` linked-list 106 | can run on separate CPUs. 107 | 108 | ## The `sched_class` 109 | 110 | At this point, we have set up the data structures, but we are still not done. We 111 | now need to implement the freezer functionality to let the kernel use freezer as 112 | a scheduler. 113 | 114 | Think about this situation: Say we have a CFS task about to return from main(). 115 | The OS needs to call CFS `dequeue_task()` to remove it from the CFS queue. How 116 | can we ensure that the OS will call the CFS implementation of `dequeue_task()`? 117 | The answer is `struct sched_class`, defined in `linux/kernel/sched/sched.h`. 118 | Here is what the structure looks like: 119 | 120 | ```c 121 | struct sched_class { 122 | 123 | #ifdef CONFIG_UCLAMP_TASK 124 | int uclamp_enabled; 125 | #endif 126 | 127 | void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); 128 | void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); 129 | void (*yield_task) (struct rq *rq); 130 | bool (*yield_to_task)(struct rq *rq, struct task_struct *p); 131 | 132 | void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); 133 | 134 | struct task_struct *(*pick_next_task)(struct rq *rq); 135 | 136 | void (*put_prev_task)(struct rq *rq, struct task_struct *p); 137 | void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); 138 | 139 | #ifdef CONFIG_SMP 140 | int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); 141 | int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags); 142 | void (*migrate_task_rq)(struct task_struct *p, int new_cpu); 143 | 144 | void (*task_woken)(struct rq *this_rq, struct task_struct *task); 145 | 146 | void (*set_cpus_allowed)(struct task_struct *p, 147 | const struct cpumask *newmask); 148 | 149 | void (*rq_online)(struct rq *rq); 150 | void (*rq_offline)(struct rq *rq); 151 | #endif 152 | 153 | void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); 154 | void (*task_fork)(struct task_struct *p); 155 | void (*task_dead)(struct task_struct *p); 156 | 157 | /* 158 | * The switched_from() call is allowed to drop rq->lock, therefore we 159 | * cannot assume the switched_from/switched_to pair is serliazed by 160 | * rq->lock. They are however serialized by p->pi_lock. 161 | */ 162 | void (*switched_from)(struct rq *this_rq, struct task_struct *task); 163 | void (*switched_to) (struct rq *this_rq, struct task_struct *task); 164 | void (*prio_changed) (struct rq *this_rq, struct task_struct *task, 165 | int oldprio); 166 | 167 | unsigned int (*get_rr_interval)(struct rq *rq, 168 | struct task_struct *task); 169 | 170 | void (*update_curr)(struct rq *rq); 171 | 172 | #define TASK_SET_GROUP 0 173 | #define TASK_MOVE_GROUP 1 174 | 175 | #ifdef CONFIG_FAIR_GROUP_SCHED 176 | void (*task_change_group)(struct task_struct *p, int type); 177 | #endif 178 | } __aligned(STRUCT_ALIGNMENT); /* STRUCT_ALIGN(), vmlinux.lds.h */ 179 | ``` 180 | 181 | As you can see, `struct sched_class` contains many function pointers. When we 182 | add a new scheduling class, we create an instance of `struct sched_class` and 183 | set the function pointers to point to our implementation of these functions. 
If 184 | we look in the file `linux/kernel/sched/fair.c`, we see how CFS does it: 185 | 186 | ```c 187 | const struct sched_class fair_sched_class 188 | __section("__fair_sched_class") = { 189 | .enqueue_task = enqueue_task_fair, 190 | .dequeue_task = dequeue_task_fair, 191 | .yield_task = yield_task_fair, 192 | .yield_to_task = yield_to_task_fair, 193 | 194 | .check_preempt_curr = check_preempt_wakeup, 195 | 196 | .pick_next_task = __pick_next_task_fair, 197 | .put_prev_task = put_prev_task_fair, 198 | .set_next_task = set_next_task_fair, 199 | 200 | #ifdef CONFIG_SMP 201 | .balance = balance_fair, 202 | .select_task_rq = select_task_rq_fair, 203 | .migrate_task_rq = migrate_task_rq_fair, 204 | 205 | .rq_online = rq_online_fair, 206 | .rq_offline = rq_offline_fair, 207 | 208 | .task_dead = task_dead_fair, 209 | .set_cpus_allowed = set_cpus_allowed_common, 210 | #endif 211 | 212 | .task_tick = task_tick_fair, 213 | .task_fork = task_fork_fair, 214 | 215 | .prio_changed = prio_changed_fair, 216 | .switched_from = switched_from_fair, 217 | .switched_to = switched_to_fair, 218 | 219 | .get_rr_interval = get_rr_interval_fair, 220 | 221 | .update_curr = update_curr_fair, 222 | 223 | #ifdef CONFIG_FAIR_GROUP_SCHED 224 | .task_change_group = task_change_group_fair, 225 | #endif 226 | 227 | #ifdef CONFIG_UCLAMP_TASK 228 | .uclamp_enabled = 1, 229 | #endif 230 | }; 231 | ``` 232 | 233 | Note that the dot notation is a C99 feature that allows you to set specific 234 | fields of the struct by name in an initializer. This notation is also called 235 | [designated 236 | initializers](http://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). Also, 237 | not every function needs to be implemented. You will need to figure out what is 238 | and is not necessary. To see an example of a bare minimum scheduler, see the 239 | [idle_sched_class](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/idle.c#L487), 240 | which is the scheduling policy used when no other tasks are ready to be 241 | executed. 242 | 243 | As you can see, CFS initializes the `struct sched_class` function pointers to 244 | the CFS implementation. Two things of note here. First, the convention is to 245 | name the struct `_sched_class`, so CFS names it `fair_sched_class`. 246 | Second, we name a particular class's functions as 247 | `_`. For example, the CFS implementation of 248 | `enqueue_task` as `enqueue_task_fair`. Now, every time the kernel needs to call 249 | a function, it can simply call `p->sched_class->`. Here, `p` is of 250 | the type `task_struct *`, `sched_class` is a pointer within the `task_struct` 251 | pointing to an instance of `struct sched_class`, and the `` points 252 | to the specific implementaion of the the function to be called. 253 | 254 | One final thing: you may have noticed the `__section("__fair_sched_class")` 255 | macro in the declaration of`struct sched_class fair_sched_class`. When building 256 | the kernel, this allows the linker to align the `sched_class`'s contiguously in 257 | memory through the use of a linker script. A linker script describes how various 258 | sections in the input (source) files should be mapped into the output 259 | (binary/object) file, and to control the memory layout of the output file. 260 | 261 | We can see this in `linux/include/asm-generic/vmlinux.lds.h`: 262 | 263 | ```c 264 | /* 265 | * The order of the sched class addresses are important, as they are 266 | * used to determine the order of the priority of each sched class in 267 | * relation to each other. 
268 | */ 269 | #define SCHED_DATA \ 270 | STRUCT_ALIGN(); \ 271 | __begin_sched_classes = .; \ 272 | *(__idle_sched_class) \ 273 | *(__fair_sched_class) \ 274 | *(__rt_sched_class) \ 275 | *(__dl_sched_class) \ 276 | *(__stop_sched_class) \ 277 | __end_sched_classes = .; 278 | ``` 279 | 280 | This effectively allows the kernel to treat the `sched_class` structs as part of 281 | an array of `sched_class`'s. The first class in the array is of lower priority 282 | than the second. In other words, `sched_class_dl` has a higher priority than 283 | `sched_class_rt`. Now, every time a new process needs to be scheduled, the 284 | kernel can simply go through the class array and check if there is a process of 285 | that class that needs to run. Let's take a look at this as implemented in 286 | `linux\kernel\sched\core.c`. 287 | 288 | ```c 289 | static inline struct task_struct * 290 | pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 291 | { 292 | const struct sched_class *class; 293 | struct task_struct *p; 294 | 295 | /* code omitted */ 296 | 297 | for_each_class(class) { 298 | p = class->pick_next_task(rq); 299 | if (p) 300 | return p; 301 | } 302 | 303 | /* The idle class should always have a runnable task: */ 304 | BUG(); 305 | } 306 | ``` 307 | 308 | This makes use of the `for_each_class()` macro, which takes advantage of the 309 | array structure of the `sched_class`'s. We can see this implementation in 310 | `linux/kernel/sched/sched.h`: 311 | 312 | ```c 313 | /* Defined in include/asm-generic/vmlinux.lds.h */ 314 | extern struct sched_class __begin_sched_classes[]; 315 | extern struct sched_class __end_sched_classes[]; 316 | 317 | #define sched_class_highest (__end_sched_classes - 1) 318 | #define sched_class_lowest (__begin_sched_classes - 1) 319 | 320 | #define for_class_range(class, _from, _to) \ 321 | for (class = (_from); class != (_to); class--) 322 | 323 | #define for_each_class(class) \ 324 | for_class_range(class, sched_class_highest, sched_class_lowest) 325 | ``` 326 | 327 | Essentially, when a process wants to relinquish its time on a CPU, `schedule()` 328 | gets called. Following the chain of calls in the kernel, `pick_next_task()` 329 | eventually gets called, and the OS will loop through each scheduling class by 330 | calling `for_each_class(class)`. Here, we call the `pick_next_task()` function 331 | of a particular instance of `struct sched_class`. If `pick_next_task()` returns 332 | `NULL`, the kernel will simply move on to the next class. If the kernel reaches 333 | the lowest priority class on the list (i.e. `idle_sched_class`) then there are 334 | no tasks to run and the CPU will go into idle mode. 335 | -------------------------------------------------------------------------------- /E-Freezer/freezer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/freezer.png -------------------------------------------------------------------------------- /E-Freezer/freezer_sched_class.md: -------------------------------------------------------------------------------- 1 | # Linux Scheduler Series 2 | 3 | > This guide is adapted from a blog series written by Mitchell Gouzenko, a 4 | > former OS TA. The original posts can be found 5 | > [here](https://mgouzenko.github.io/jekyll/update/2016/11/04/the-linux-process-scheduler.html). 6 | > 7 | > The code snippets and links in this post correspond to Linux v5.10.158. 
8 | 9 | ## Introduction 10 | 11 | I'm writing this series after TA-ing an operating systems class for two 12 | semesters. Each year, tears begin to flow by the time we get to the infamous 13 | Scheduler Assignment - where students are asked to implement a round-robin 14 | scheduler in the Linux kernel. The assignment is known to leave relatively 15 | competent programmers in shambles. I don't blame them; the seemingly simple task 16 | of writing a round robin scheduler is complicated by two confounding factors: 17 | 18 | * The Linux scheduler is cryptic as hell and on top of that, very poorly 19 | documented. 20 | * Bugs in scheduler code will often trigger a kernel panic, freezing the system 21 | without providing any logs or meaningful error messages. 22 | 23 | I hope to ease students' suffering by addressing the first bullet point. In this 24 | series, I will explain how the scheduler's infrastructure is set up, emphasizing 25 | how one may leverage its modularity to plug in their own scheduler. 26 | 27 | We'll begin by examining the basic role of the core scheduler and how the rest 28 | of the kernel interfaces with it. Then, we'll look at `sched_class`, the modular 29 | data structure that permits various scheduling algorithms to live and operate 30 | side-by-side in the kernel. 31 | 32 | ### A Top Down Approach 33 | 34 | Initially, I'll treat the scheduler as a black box. I will make gross 35 | over-simplifications but note very clearly when I do so. Little by little, we 36 | will delve into the scheduler's internals, unfolding the truth behind these 37 | simplifications. By the end of this series, you should be able to start tackling 38 | the problem of writing your own working scheduler. 39 | 40 | ### Disclaimer 41 | 42 | I'm not an expert kernel hacker. I'm just a student who has spent a modest 43 | number of hours reading, screaming at, and sobbing over kernel code. If I make a 44 | mistake, please point it out in the comments section, and I'll do my best to 45 | correct it. 46 | 47 | ## What is the Linux Scheduler? 48 | 49 | Linux is a multi-tasking system. At any instant, there are many processes active 50 | at once, but a single CPU can only perform work on behalf of one process at a 51 | time. At a high level, Linux context switches from process to process, letting 52 | the CPU perform work on behalf of each one in turn. This switching occurs 53 | quickly enough to create the illusion that all processes are running 54 | simultaneously. **The scheduler is in charge of coordinating all of this 55 | switching.** 56 | 57 | ### Where Does the Scheduler Fit? 58 | 59 | You can find most scheduler-related code under `kernel/sched`. Now, the 60 | scheduler has a distinct and non-trivial job. The rest of the kernel doesn't 61 | know or care how the scheduler performs its magic, as long as it can be called 62 | upon to schedule tasks. So, to hide the complexity of the scheduler, it is 63 | invoked with a simple and well-defined API. The scheduler - from the perspective 64 | of the rest of the kernel - has two main responsibilities: 65 | 66 | * **Responsibility I**: Provide an interface to halt the currently running 67 | process and switch to a new one. To do so, the scheduler must pick the next 68 | process to run, which is a nontrivial problem. 69 | 70 | * **Responsibility II**: Indicate to the rest of the OS when a new process 71 | should be run. 
72 | 73 | ### Responsibility I: Switching to the Next Process 74 | 75 | To fulfill its first responsibility, the scheduler must somehow keep track of 76 | all the running processes. 77 | 78 | ### The Runqueue 79 | 80 | Here's the first over-simplification: you can think of the scheduler as a system 81 | that maintains a simple queue of processes in the form of a linked list. The 82 | process at the head of the queue is allowed to run for some "time slice" - say, 83 | 10 milliseconds. After this time slice expires, the process is moved to the back 84 | of the queue, and the next process gets to run on the CPU for the same time 85 | slice. When a running process is forcibly stopped and taken off the CPU in this 86 | way, we say that it has been **preempted**. The linked list of processes waiting 87 | to have a go on the CPU is called the runqueue. Each CPU has its own runqueue, 88 | and a given process may appear on only one CPU's runqueue at a time. Processes 89 | CAN migrate between various CPUs' runqueues, but we'll save this discussion for 90 | later. 91 | 92 |
93 | *(figure)* 94 | Figure 1: An over-simplification of the runqueue 95 |
96 | 97 | The scheduler is not really this simple; the runqueue is defined in the kernel 98 | as `struct rq`, and you can take a peek at its definition 99 | [here](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/sched.h#L897). 100 | Spoiler alert: it's not a linked list! To be fair, the explanation that I gave 101 | above more or less describes the very first Linux runqueue. But over the years, 102 | the scheduler evolved to incorporate multiple scheduling algorithms. These 103 | include: 104 | 105 | * Completely Fair Scheduler (CFS) 106 | * Real-Time Scheduler 107 | * Deadline Scheduler 108 | 109 | The modern-day runqueue is no longer a linked list but actually a collection of 110 | algorithm-specific runqueues corresponding to the list above. Indeed, 111 | `struct rq` has the following members: 112 | 113 | ```c 114 | struct cfs_rq cfs; // CFS scheduler runqueue 115 | struct rt_rq rt; // Real-time scheduler runqueue 116 | struct dl_rq dl; // Deadline scheduler runqueue 117 | ``` 118 | 119 | For example, CFS, which is the default scheduler in modern Linux kernels, uses a 120 | red-black tree data structure to keep track of processes, with each process 121 | assigned a "virtual runtime" that determines its priority in the scheduling 122 | queue. The scheduler then selects the process with the lowest virtual runtime to 123 | run next, ensuring that each process gets a fair share of CPU time over the long 124 | term. 125 | 126 | Keep these details in the back of your mind so that you don't get bogged down. 127 | Remember: the goal here is to understand how the scheduler interoperates with 128 | the rest of the kernel. The main takeaway is that a process is allowed to run 129 | for some time, and when that time expires, it gets preempted so that the next 130 | process can run. 131 | 132 | ### Preemption vs Yielding 133 | 134 | Preemption is not always the reason a process is taken off the CPU. For example, 135 | a process might voluntarily go to sleep, waiting for an IO event or lock. To do 136 | this, the process puts itself on a "wait queue" and takes itself off the 137 | runqueue. In this case, the process has **yielded** the CPU. In summary: 138 | 139 | * "preemption" is when a process is forcibly kicked off the CPU. 140 | 141 | * "yielding" is when a process voluntarily gives up the CPU. 142 | 143 | In addition to an expired timeslice, there are several other reasons that 144 | preemption may occur. For example, when an interrupt occurs, the CPU may be 145 | preempted to handle the interrupt. Additionally, a real-time process may have a 146 | higher priority than some other process and may preempt lower-priority processes 147 | to ensure that it meets its deadline. 148 | 149 | ### `schedule()` 150 | 151 | With a conceptual understanding of the runqueue, we now have the background to 152 | understand how Responsibility I is carried out by the scheduler. The 153 | `schedule()` function is the crux of Responsibility I: it halts the currently 154 | running process and runs the next one on the CPU. This function is referred to 155 | by many texts as "the entrypoint into the scheduler". `schedule()` invokes 156 | `__schedule()` to do most of the real work. 
Here is the portion relevant to us: 157 | 158 | ```c 159 | static void __sched notrace __schedule(bool preempt) 160 | { 161 | struct task_struct *prev, *next; 162 | unsigned long *switch_count; 163 | struct rq *rq; 164 | 165 | /* CODE OMITTED */ 166 | 167 | next = pick_next_task(rq, prev, &rf); 168 | clear_tsk_need_resched(prev); 169 | clear_preempt_need_resched(); 170 | 171 | if (likely(prev != next)) { 172 | rq->nr_switches++; 173 | RCU_INIT_POINTER(rq->curr, next); 174 | ++*switch_count; 175 | 176 | psi_sched_switch(prev, next, !task_on_rq_queued(prev)); 177 | trace_sched_switch(preempt, prev, next); 178 | rq = context_switch(rq, prev, next, &rf); 179 | } 180 | 181 | /* CODE OMITTED */ 182 | } 183 | ``` 184 | 185 | `pick_next_task()` looks at the runqueue `rq` and returns the `task_struct` 186 | associated with the process that should be run next. If we consider t=10 in 187 | Figure 1, `pick_next_task()` would return the `task_struct` for Process 2. Then, 188 | `context_switch()` switches the CPU's state to that of the returned 189 | `task_struct`. This fullfills Responsibility I. 190 | 191 | ## Responsibility II: When Should the Next Process Run? 192 | 193 | Great, so we've seen that `schedule()` is used to context switch to the next 194 | task. But when does this *actually* happen? 195 | 196 | As mentioned previously, a user-space program might voluntarily go to sleep 197 | waiting for an IO event or a lock. In this case, the kernel will call 198 | `schedule()` on behalf of the process that needs to sleep. But what if the 199 | user-space program never sleeps? Here's one such program: 200 | 201 | ```c 202 | int main() 203 | { 204 | while(1); 205 | } 206 | ``` 207 | 208 | If `schedule()` were only called when a user-space program voluntarily sleeps, 209 | then programs like the one above would use up the processor indefinitely. Thus, 210 | we need a mechanism to preempt processes that have exhausted their time slice! 211 | 212 | This preemption is accomplished via the timer interrupt. The timer interrupt 213 | fires periodically, allowing control to jump to the timer interrupt handler in 214 | the kernel. This handler calls the function `update_process_times()`, shown 215 | below. 216 | 217 | ```c 218 | /* 219 | * Called from the timer interrupt handler to charge one tick to the current 220 | * process. user_tick is 1 if the tick is user time, 0 for system. 221 | */ 222 | void update_process_times(int user_tick) 223 | { 224 | struct task_struct *p = current; 225 | 226 | PRANDOM_ADD_NOISE(jiffies, user_tick, p, 0); 227 | 228 | /* Note: this timer irq context must be accounted for as well. */ 229 | account_process_tick(p, user_tick); 230 | run_local_timers(); 231 | rcu_sched_clock_irq(user_tick); 232 | #ifdef CONFIG_IRQ_WORK 233 | if (in_irq()) 234 | irq_work_tick(); 235 | #endif 236 | scheduler_tick(); 237 | if (IS_ENABLED(CONFIG_POSIX_TIMERS)) 238 | run_posix_cpu_timers(); 239 | } 240 | ``` 241 | 242 | Notice how `update_process_times()` invokes `scheduler_tick()`. In 243 | `scheduler_tick()`, the scheduler checks to see if the running process's time 244 | has expired. If so, it sets a (over-simplification alert) per-CPU flag called 245 | `need_resched`. This indicates to the rest of the kernel that `schedule()` 246 | should be called. In our simplified example, `scheduler_tick()` would set this 247 | flag when the current process has been running for 10 milliseconds or more. 
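To make that concrete, a per-class tick handler in the style of our simplified model might look like the following. This is a sketch, not real kernel code: the `demo` entity and `DEMO_TIMESLICE` are made up, but `resched_curr()` is the real helper that sets the need-resched flag.

```c
#define DEMO_TIMESLICE	(10 * HZ / 1000)	/* roughly 10 ms worth of ticks */

/* Sketch: called from scheduler_tick() for the currently running task. */
static void task_tick_demo(struct rq *rq, struct task_struct *p, int queued)
{
	if (--p->demo.time_slice > 0)
		return;

	/* Time slice expired: refill it and ask for a reschedule. */
	p->demo.time_slice = DEMO_TIMESLICE;
	resched_curr(rq);	/* marks the current task as needing to be rescheduled */
}
```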
248 | 249 | But wait, why the heck can't `scheduler_tick()` just call `schedule()` by 250 | itself, from within the timer interrupt? After all, if the scheduler knows that 251 | a process's time has expired, shouldn't it just context switch right away? 252 | 253 | As it turns out, it is not always safe to call `schedule()`. In particular, if 254 | the currently running process is holding a spinlock in the kernel, it cannot be 255 | put to sleep from the interrupt handler. (Let me repeat that one more time 256 | because people always forget: **you cannot sleep with a spinlock.** Sleeping 257 | with a spinlock may cause the kernel to deadlock, and will bring you anguish 258 | for many hours when you can't figure out why your system has hung.) 259 | 260 | When the scheduler sets the `need_resched` flag, it's really saying, "please 261 | dearest kernel, invoke `schedule()` at your earliest convenience." The kernel 262 | keeps a count of how many spinlocks the currently running process has acquired. 263 | When that count goes down to 0, the kernel knows that it's okay to put the 264 | process to sleep. The kernel checks the `need_resched` flag in two main places: 265 | 266 | * when returning from an interrupt handler 267 | 268 | * when returning to user-space from a system call 269 | 270 | If `need_resched` is `True` and the spinlock count is 0, then the kernel calls 271 | `schedule()`. With our simple linked-list runqueue, this delayed invocation of 272 | `schedule()` implies that a process can possibly run for a bit longer than its 273 | timeslice. We're cool with that because it's always safe to call `schedule()` 274 | when the kernel is about to return to user-space. That's because user-space 275 | programs are allowed to sleep. So, by the time the kernel is about to return to 276 | user-space, it cannot be holding any spinlocks. This means that there won't be a 277 | large delay between when `need_resched` is set, and when `schedule()` gets 278 | called. 279 | 280 | ## Understanding `sched_class` 281 | 282 | In this section, I will analyze `struct sched_class` and talk briefly about what 283 | most of the functions do. I've reproduced `struct sched_class` below. 
284 | 285 | ```c 286 | struct sched_class { 287 | 288 | #ifdef CONFIG_UCLAMP_TASK 289 | int uclamp_enabled; 290 | #endif 291 | 292 | void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); 293 | void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); 294 | void (*yield_task) (struct rq *rq); 295 | bool (*yield_to_task)(struct rq *rq, struct task_struct *p); 296 | 297 | void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); 298 | 299 | struct task_struct *(*pick_next_task)(struct rq *rq); 300 | 301 | void (*put_prev_task)(struct rq *rq, struct task_struct *p); 302 | void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); 303 | 304 | #ifdef CONFIG_SMP 305 | int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); 306 | int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags); 307 | void (*migrate_task_rq)(struct task_struct *p, int new_cpu); 308 | 309 | void (*task_woken)(struct rq *this_rq, struct task_struct *task); 310 | 311 | void (*set_cpus_allowed)(struct task_struct *p, 312 | const struct cpumask *newmask); 313 | 314 | void (*rq_online)(struct rq *rq); 315 | void (*rq_offline)(struct rq *rq); 316 | #endif 317 | 318 | void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); 319 | void (*task_fork)(struct task_struct *p); 320 | void (*task_dead)(struct task_struct *p); 321 | 322 | /* 323 | * The switched_from() call is allowed to drop rq->lock, therefore we 324 | * cannot assume the switched_from/switched_to pair is serliazed by 325 | * rq->lock. They are however serialized by p->pi_lock. 326 | */ 327 | void (*switched_from)(struct rq *this_rq, struct task_struct *task); 328 | void (*switched_to) (struct rq *this_rq, struct task_struct *task); 329 | void (*prio_changed) (struct rq *this_rq, struct task_struct *task, 330 | int oldprio); 331 | 332 | unsigned int (*get_rr_interval)(struct rq *rq, 333 | struct task_struct *task); 334 | 335 | void (*update_curr)(struct rq *rq); 336 | 337 | #define TASK_SET_GROUP 0 338 | #define TASK_MOVE_GROUP 1 339 | 340 | #ifdef CONFIG_FAIR_GROUP_SCHED 341 | void (*task_change_group)(struct task_struct *p, int type); 342 | #endif 343 | } __aligned(STRUCT_ALIGNMENT); /* STRUCT_ALIGN(), vmlinux.lds.h */ 344 | ``` 345 | 346 | ### `enqueue_task()` and `dequeue_task()` 347 | 348 | ```c 349 | /* Called to enqueue task_struct p on runqueue rq. */ 350 | void enqueue_task(struct rq *rq, struct task_struct *p, int flags); 351 | 352 | /* Called to dequeue task_struct p from runqueue rq. */ 353 | void dequeue_task(struct rq *rq, struct task_struct *p, int flags); 354 | ``` 355 | 356 | `enqueue_task()` and `dequeue_task()` are used to put a task on the runqueue and 357 | remove a task from the runqueue, respectively. 358 | 359 | These functions are called for a variety of reasons: 360 | 361 | * When a child process is first forked, `enqueue_task()` is called to put it on 362 | a runqueue. When a process exits, `dequeue_task()` takes it off the runqueue. 363 | 364 | * When a process goes to sleep, `dequeue_task()` takes it off the runqueue. For 365 | example, this happens when the process needs to wait for a lock or IO event. 366 | When the IO event occurs, or the lock becomes available, the process wakes up. 367 | It must then be re-enqueued with `enqueue_task()`. 
368 | 369 | * Process migration - if a process must be migrated from one CPU's runqueue to 370 | another, it's dequeued from its old runqueue and enqueued on a different one 371 | using these functions. 372 | 373 | * When `set_cpus_allowed()` is called to change the task's processor affinity, 374 | it may need to be enqueued on a different CPU's runqueue. 375 | 376 | * When the priority of a process is boosted to avoid priority inversion. In this 377 | case, the task used to have a low-priority `sched_class`, but is being 378 | promoted to a `sched_class` with high priority. This action occurs in 379 | `rt_mutex_setprio()`. 380 | 381 | * From `__sched_setscheduler`. If a task's `sched_class` has changed, it's 382 | dequeued using its old `sched_class` and enqueued with the new one. 383 | 384 | Each of these functions is passed the task to be enqueued/dequeued, as well as 385 | the runqueue it should be added to/removed from. In addition, these functions 386 | are given a bit vector of flags that describe *why* enqueue or dequeue is being 387 | called. Here are the various flags, which are described in 388 | [sched.h](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/sched.h#L1743): 389 | 390 | ```c 391 | /* 392 | * {de,en}queue flags: 393 | * 394 | * DEQUEUE_SLEEP - task is no longer runnable 395 | * ENQUEUE_WAKEUP - task just became runnable 396 | * 397 | * SAVE/RESTORE - an otherwise spurious dequeue/enqueue, done to ensure tasks 398 | * are in a known state which allows modification. Such pairs 399 | * should preserve as much state as possible. 400 | * 401 | * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location 402 | * in the runqueue. 403 | * 404 | * ENQUEUE_HEAD - place at front of runqueue (tail if not specified) 405 | * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline) 406 | * ENQUEUE_MIGRATED - the task was migrated during wakeup 407 | * 408 | */ 409 | ``` 410 | 411 | The `flags` argument can be tested using the bitwise `&` operation. For example, 412 | if the task was just migrated from another CPU, `flags & ENQUEUE_MIGRATED` 413 | evaluates to a nonzero value. 414 | 415 | ### `pick_next_task()` 416 | 417 | ```c 418 | /* Pick the task that should be currently running. */ 419 | struct task_struct *pick_next_task(struct rq *rq); 420 | ``` 421 | 422 | `pick_next_task()` is called by the core scheduler to determine which of `rq`'s 423 | tasks should be running. The name is a bit misleading: This function is not 424 | supposed to return the task that should run *after* the currently running task; 425 | instead, it's supposed to return the `task_struct` that should be running now, 426 | **in this instant.** 427 | 428 | The kernel will context switch from the currently running task to the task 429 | returned by `pick_next_task()`. 430 | 431 | ### `put_prev_task()` 432 | 433 | ```c 434 | /* Called right before p is going to be taken off the CPU. */ 435 | void put_prev_task(struct rq *rq, struct task_struct *p); 436 | ``` 437 | 438 | `put_prev_task()` is called whenever a task is to be taken off the CPU. The 439 | behavior of this function is up to the specific `sched_class`. Some schedulers 440 | do very little in this function. For example, the realtime scheduler uses this 441 | function as an opportunity to perform simple bookkeeping. On the other hand, 442 | CFS's `put_prev_task_fair()` needs to do a bit more work. As an optimization, 443 | CFS keeps the currently running task out of its RB tree.
It uses the 444 | `put_prev_task` hook as an opportunity to put the currently running task (that 445 | is, the task specified by `p`) back in the RB tree. 446 | 447 | The sched_class's `put_prev_task` is called by the function `put_prev_task()`, 448 | which is 449 | [defined](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/sched.h#L1841) 450 | in `sched.h`. `put_prev_task()` gets called in the core scheduler's 451 | `pick_next_task()`, after the policy-specific `pick_next_task()` implementation 452 | is called, but before any context switch is performed. This gives us an 453 | opportunity to perform any operations we need to do to move on from the 454 | previously running task in our scheduler implementations. 455 | 456 | Note that this was not the case in older kernels: the `sched_class`'s 457 | `pick_next_task()` was expected to call `put_prev_task()` by itself! This is 458 | documented in the following 459 | [comment](https://elixir.bootlin.com/linux/v4.9.330/source/kernel/sched/sched.h#L1241) 460 | in an earlier Linux version (4.9). Before that (3.11), `put_prev_task` actually 461 | [used to be called](https://elixir.bootlin.com/linux/v3.11/source/kernel/sched/core.c#L2445) 462 | by the core scheduler before it called `pick_next_task`. 463 | 464 | ### `task_tick()` 465 | 466 | ```c 467 | /* Called from the timer interrupt handler. p is the currently running task 468 | * and rq is the runqueue that it's on. 469 | */ 470 | void task_tick(struct rq *rq, struct task_struct *p, int queued); 471 | ``` 472 | 473 | This is one of the most important scheduler functions. It is called whenever a 474 | timer interrupt happens, and its job is to perform bookkeeping and set the 475 | `need_resched` flag if the currently-running process needs to be preempted. 476 | 477 | The `need_resched` flag can be set by the function `resched_curr()`, 478 | [found](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/core.c#L608) 479 | in core.c: 480 | 481 | ```c 482 | /* Mark rq's currently-running task 'to be rescheduled now'. */ 483 | void resched_curr(struct rq *rq) 484 | ``` 485 | 486 | With SMP, there's a `need_resched` flag for every CPU. Thus, `resched_curr()` 487 | might involve sending an APIC inter-processor interrupt to another processor 488 | (you don't want to go here). The takeaway is that you should just use 489 | `resched_curr()` to set `need_resched`, and don't try to do this yourself. 490 | 491 | Note: in prior kernel versions, `resched_curr()` used to be called 492 | `resched_task()`. 493 | 494 | ### `select_task_rq()` 495 | 496 | ```c 497 | /* Returns an integer corresponding to the CPU that this task should run on */ 498 | int select_task_rq(struct task_struct *p, int task_cpu, int sd_flag, int flags); 499 | ``` 500 | 501 | The core scheduler invokes this function to figure out which CPU to assign a 502 | task to. This is used for distributing processes across multiple CPUs; the core 503 | scheduler will call `enqueue_task()`, passing the runqueue corresponding to the 504 | CPU that is returned by this function. CPU assignment obviously occurs when a 505 | process is first forked, but CPU reassignment can happen for a large variety of 506 | reasons. Here are some instances where `select_task_rq()` is called: 507 | 508 | * When a process is first forked. 509 | 510 | * When a task is woken up after having gone to sleep. 511 | 512 | * In response to any of the syscalls in the exec family.
This is an 513 | optimization, since it doesn't hurt the cache to migrate a process that's 514 | about to call exec. 515 | 516 | * And many more places... 517 | 518 | You can check *why* `select_task_rq` was called by looking at `sd_flag`. 519 | 520 | For instance, `sd_flag == SD_BALANCE_FORK` whenever `select_task_rq()` is called 521 | to determine the CPU of a newly forked task. You can find all possible values of 522 | `sd_flag` 523 | [here](https://elixir.bootlin.com/linux/v5.10.158/source/include/linux/sched/sd_flags.h). 524 | 525 | Note that `select_task_rq()` should return a CPU that `p` is allowed to run on. 526 | Each `task_struct` has a 527 | [member](https://elixir.bootlin.com/linux/v5.10.158/source/include/linux/sched.h#L728) 528 | called `cpus_mask`, of type `cpumask_t`. This member represents the task's CPU 529 | affinity - i.e. which CPUs it can run on. It's possible to iterate over these 530 | CPUs with the macro `for_each_cpu()`, defined 531 | [here](https://elixir.bootlin.com/linux/v5.10.158/source/include/linux/cpumask.h#L263). 532 | 533 | ### `set_next_task()` 534 | 535 | ```c 536 | /* Called when a task changes its scheduling class or changes its task group. */ 537 | void set_next_task(struct rq *rq, struct task_struct *p, bool first); 538 | ``` 539 | 540 | This function is called in the following instances: 541 | 542 | * When the current task's CPU affinity changes. 543 | 544 | * When the current task's priority, nice value, or scheduling policy changes. 545 | 546 | * When the current task's task group changes. 547 | 548 | This function was previously called `set_curr_task()`, but was changed to better 549 | match `put_prev_task()`. Several scheduling policies also call `set_next_task()` 550 | in their implementations of `pick_next_task()`. An [old kernel 551 | commit](https://lore.kernel.org/all/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com/T/#m2632708495575d24c1a5c54f7295836a907d3d53) 552 | claims that `pick_next_task()` implies `set_next_task()`, but `pick_next_task()` 553 | technically shouldn't modify any state. In practice, this means that 554 | `set_next_task()` ends up just updating some of the task's metadata. 555 | 556 | ### `yield_task()` 557 | 558 | ```c 559 | /* Called when the current task yields the cpu */ 560 | void yield_task(struct rq *rq); 561 | ``` 562 | 563 | `yield_task()` is used when the current process voluntarily yields its remaining 564 | time on the CPU. Its implementation is usually very simple, as you can see in 565 | [rt](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/rt.c#L1434), 566 | which simply requeues the current task. 567 | 568 | This function is called when a process calls the `sched_yield()` syscall to 569 | relinquish the control of the processor voluntarily. `schedule()` is then 570 | called. 571 | 572 | ### `check_preempt_curr()` 573 | 574 | ```c 575 | /* Preempt the current task with a newly woken task if needed */ 576 | void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) 577 | ``` 578 | 579 | When a new task enters a runnable state, this function is called to check if 580 | that task should preempt the currently running task. For instance, when a new 581 | task is created, it is initially woken up with `wake_up_new_task()`, which 582 | (among other things) places the task on the runqueue, calls the generic 583 | `check_preempt_curr()`, and calls the `sched_class->task_woken()` function if it 584 | exists. 
585 | 586 | The generic `check_preempt_curr()` function does the following: 587 | 588 | ```c 589 | void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) 590 | { 591 | if (p->sched_class == rq->curr->sched_class) 592 | rq->curr->sched_class->check_preempt_curr(rq, p, flags); 593 | else if (p->sched_class > rq->curr->sched_class) 594 | resched_curr(rq); 595 | 596 | /* CODE OMITTED */ 597 | } 598 | ``` 599 | 600 | This handles both the case where the new task has a higher priority within a 601 | scheduling class (using the callback pointer) or a higher priority scheduling 602 | class. 603 | 604 | ### `balance()` 605 | 606 | ```c 607 | int balance(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); 608 | ``` 609 | 610 | `balance()` implements various scheduler load-balancing mechanisms, which are 611 | meant to distribute the load across processors more evenly using various 612 | heuristics. It returns `1` if there is a runnable task of that `sched_class`'s 613 | priority or higher after load balancing occurs, and `0` otherwise. 614 | 615 | `balance()` is called in `put_prev_task_balance()` (which is called in 616 | `pick_next_task()`) as follows: 617 | 618 | ```c 619 | static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, 620 | struct rq_flags *rf) 621 | { 622 | #ifdef CONFIG_SMP 623 | const struct sched_class *class; 624 | /* 625 | * We must do the balancing pass before put_prev_task(), such 626 | * that when we release the rq->lock the task is in the same 627 | * state as before we took rq->lock. 628 | * 629 | * We can terminate the balance pass as soon as we know there is 630 | * a runnable task of @class priority or higher. 631 | */ 632 | for_class_range(class, prev->sched_class, &idle_sched_class) { 633 | if (class->balance(rq, prev, rf)) 634 | break; 635 | } 636 | #endif 637 | 638 | put_prev_task(rq, prev); 639 | } 640 | ``` 641 | 642 | The main idea is to prevent any runqueue from becoming empty, as this is a waste 643 | of resources. This loop starts with the currently running task's `sched_class` 644 | and uses the `balance()` callbacks to check if there are runnable tasks of that 645 | `sched_class`'s priority _or higher_. Notably, `sched_class`'s implementation of 646 | `balance()` checks if `sched_class`s of higher priority also have runnable tasks. 647 | 648 | ### `update_curr()` 649 | 650 | ```c 651 | /* Update the current task's runtime statistics. */ 652 | void update_curr(struct rq *rq); 653 | ``` 654 | 655 | This function updates the current task's stats such as the total execution time. 656 | Implementing this function allows commands like `ps` and `htop` to display 657 | accurate statistics. The implementations of this function typically share a 658 | common segment across the different scheduling classes. This function is 659 | typically called in other `sched_class` functions to facilitate accurate 660 | reporting of statistics. 661 | 662 | ### `prio_changed()` 663 | 664 | ```c 665 | /* Called when the task's priority has changed. */ 666 | void prio_changed(struct rq *rq, struct task_struct *p, int oldprio) 667 | ``` 668 | 669 | This function is called whenever a task's priority changes, but the 670 | `sched_class` remains the same (you can verify this by checking where the 671 | function pointer is called). This can occur through various syscalls which 672 | modify the `nice` value, the priority, or other scheduler attributes. 
673 | 674 | In a scheduler class with priorities, this function will typically check if the 675 | task whose priority changed needs to preempt the currently running task (or if 676 | it is the currently running task, if it should be preempted). 677 | 678 | ### `switched_to()` 679 | 680 | ```c 681 | /* Called when a task gets switched to this scheduling class. */ 682 | void switched_to(struct rq *rq, struct task_struct *p); 683 | ``` 684 | 685 | `switched_to()` (and its optional counterpart, `switched_from()`) are called 686 | from `check_class_changed()`: 687 | 688 | ```c 689 | static inline void check_class_changed(struct rq *rq, struct task_struct *p, 690 | const struct sched_class *prev_class, 691 | int oldprio) 692 | { 693 | if (prev_class != p->sched_class) { 694 | if (prev_class->switched_from) 695 | prev_class->switched_from(rq, p); 696 | 697 | p->sched_class->switched_to(rq, p); 698 | } else if (oldprio != p->prio || dl_task(p)) 699 | p->sched_class->prio_changed(rq, p, oldprio); 700 | } 701 | ``` 702 | 703 | `check_class_changed()` gets called from syscalls that modify scheduler 704 | parameters. 705 | 706 | For scheduler classes like 707 | [rt](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/rt.c#L2303) 708 | and 709 | [dl](https://elixir.bootlin.com/linux/v5.10.158/source/kernel/sched/deadline.c#L2456), 710 | the main consideration when a task's policy changes to their policy is that it 711 | could overload their runqueue. They then try to push some tasks to other 712 | runqueues. 713 | 714 | However, for lower priority scheduler classes, like CFS, where overloading is 715 | not an issue, `switched_to()` just ensures that the task gets to run, and 716 | preempts the current task (which may be of a lower-priority policy) if 717 | necessary. 718 | -------------------------------------------------------------------------------- /E-Freezer/naive.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/naive.png -------------------------------------------------------------------------------- /E-Freezer/task_struct.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/E-Freezer/task_struct.png -------------------------------------------------------------------------------- /F-Farfetchd/Page_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/F-Farfetchd/Page_table.png -------------------------------------------------------------------------------- /F-Farfetchd/index.md: -------------------------------------------------------------------------------- 1 | # Understanding Page Tables in Linux 2 | The goal of this recitation is to provide a high-level overview of x86/arm64 paging as 3 | well as the data structures and functions the Linux kernel uses to manipulate 4 | page tables. 5 | 6 | Before we dive into the details, here's a quick refresher on what a page table 7 | does and how it does it. Page tables allow the operating system to virtualize 8 | the entire address space for each process by providing a mapping between virtual 9 | addresses and physical addresses. What is a virtual address? Every address that 10 | application code references is a virtual address. 
For the most part this is true 11 | of kernel code as well, except for code that executes before paging is enabled 12 | (think bootloader). Once paging is enabled in the CPU it cannot be disabled. 13 | Every address referenced by an application is translated transparently by the 14 | hardware via the page table. 15 | 16 | A virtual address is usually different than the physical address it maps to, but 17 | it is also possible for virtual addresses to be "identity" mapped to their 18 | corresponding physical address or mapped to no address at all in cases where the 19 | kernel swapped out a page or never allocated one in the first place. 20 | 21 | ## The shape of a page table 22 | Recall that the structure of the page table is rigidly specified by the CPU 23 | architecture. This is necessary since the CPU hardware directly traverses the 24 | page table to transparently map virtual addresses to physical addresses. The 25 | hardware does this by using the virtual address as a set of indices into the 26 | page table. 27 | 28 |
29 | ![x86-64 virtual address structure (page table indices and offset)](x86_address_structure.png) 30 |
31 | 32 | [Source](https://os.phil-opp.com/page-tables) 33 | 34 | This diagram shows how the bits of a 64 bit virtual address specify the indices 35 | into a 4-level x86 page table (you can expect something similar in arm64). 36 | With 4-level paging only bits 0 through 47 are used. Bits 48 through 63 are sign 37 | extension bits that must match bit 47; this prevents clever developers from 38 | stuffing extra information into addresses that might interfere with future addressing 39 | schemes, like 5-level page tables. 40 | 41 | As you can see, there are 9 bits specifying the index into each page table and a 42 | 12 bit offset into the physical page frame. Since 2^9 = 512, each level of the 43 | page table has 512 64 bit entries, which fits perfectly in one 4096 byte frame. 44 | The last 12 bits allow the virtual address to specify a specific byte offset 45 | within the page frame. 46 | 47 |
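To see the arithmetic behind those fields, here is a small, self-contained user-space sketch (not kernel code) that pulls the four 9-bit indices and the 12-bit offset out of a canonical 48-bit virtual address. The shift amounts follow the address diagram above; the example address is arbitrary, and real kernel code should use the architecture's macros rather than hand-rolled masks.

```c
#include <stdint.h>
#include <stdio.h>

#define IDX_MASK 0x1ffULL	/* 9 bits per page-table index */
#define OFF_MASK 0xfffULL	/* 12-bit offset into the 4096-byte frame */

int main(void)
{
	uint64_t vaddr = 0x00007f1234567abcULL;	/* arbitrary example address */

	printf("P4 index: %llu\n", (unsigned long long)((vaddr >> 39) & IDX_MASK));
	printf("P3 index: %llu\n", (unsigned long long)((vaddr >> 30) & IDX_MASK));
	printf("P2 index: %llu\n", (unsigned long long)((vaddr >> 21) & IDX_MASK));
	printf("P1 index: %llu\n", (unsigned long long)((vaddr >> 12) & IDX_MASK));
	printf("offset:   0x%llx\n", (unsigned long long)(vaddr & OFF_MASK));
	return 0;
}
```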
48 | ![x86-64 4-level page table walk from CR3 to the page frame](x86_paging_64bit.png) 49 |
50 | 51 | For clarity, we're using the naming scheme in the diagram (P4, P3,...), which 52 | differs slightly from the names used in Linux. Linux also implements 5-level 53 | paging, which we describe in the next section. For reference, here is how the 54 | names above mapped to Linux before it added 5-level paging: 55 | 56 | ``` 57 | Diagram: P4 -> P3 -> P2 -> P1 -> page frame 58 | Linux: PGD -> PUD -> PMD -> PTE -> page frame 59 | ``` 60 | 61 | Each entry in the P4 table is the address of a *different* P3 table such that 62 | each page table can have up to 512 different P3 tables. In turn, each entry in a 63 | P3 table points to a different P2 table such that there are 512 * 512 = 262,144 64 | P2 tables. Since most of the virtual address space is unused for a given 65 | process, the kernel does not allocate frames for most of the intermediary tables 66 | comprising the page table. 67 | 68 | I've been using the word frame since each page table entry (in any table, P4, 69 | P3, etc) is the address of a physical 4096 byte memory frame (except for huge 70 | pages). These addresses cannot be virtual addresses since the hardware accesses 71 | them directly to translate virtual addresses to physical addresses. Furthermore, 72 | since frames are page aligned, the last 12 bits of a frame address are all 73 | zeros. The hardware takes advantage of this by using these bits to store 74 | information about the frame in its page table entry. 75 | 76 |
77 | ![Bit flags in a page table entry](Page_table.png) 78 |
79 | 80 | [Source](https://wiki.osdev.org/Paging) 81 | 82 | This diagram shows the bit flags for a page table entry, which in our diagram 83 | above corresponds to an entry in the P1 table. A similar but slightly different 84 | set of flags exists for the intermediary entries as well. 85 | 86 | ## Working with page tables in Linux 87 | In this section we'll take a look at the data structures and functions the Linux 88 | kernel defines to manage page tables. 89 | 90 | ### The fifth level of Dante's page table 91 | We just finished discussing 4-level page tables, but Linux actually implements 92 | 5-level paging and exposes a 5-level paging interface, even when the kernel is 93 | built with 5-level paging disabled. Luckily 5-level paging is similar to the 94 | 4-level scheme we discussed, is simply adds another level of indirection and 95 | uses 9 previously unused bits of the 64 bit virtual address to index it. 96 | 97 | Here's what the page table hierarchy looks like in Linux: 98 | 99 | ``` 100 | 4-Level Paging (circa 2016) 101 | Diagram Above 102 | P4 -> P3 -> P2 -> P1 -> page frame 103 | Linux 104 | PGD -> PUD -> PMD -> PTE -> page frame 105 | 106 | 5-Level Paging (current) 107 | Linux 108 | PGD -> P4D -> PUD -> PMD -> PTE -> page frame 109 | ``` 110 | 111 | What does this mean for us in light of the fact that our kernel config specifies 112 | 4 level paging? 113 | 114 | ``` 115 | CONFIG_PGTABLE_LEVELS=4 116 | ``` 117 | 118 | If we look at the output from an example program that reports the physical 119 | addresses of the process' page frame and of the intermediate page tables, 120 | it shows that the `p4d_paddr` and `pud_paddr` are identical. 121 | 122 | ``` 123 | [405] inspect_physical_address(): 124 | paddr: 0x115a3c069 125 | pf_paddr: 0x115a3c000 126 | pte_paddr: 0x10d2c7000 127 | pmd_paddr: 0x10d623000 128 | pud_paddr: 0x10c215000 129 | p4d_paddr: 0x10c215000 130 | pgd_paddr: 0x10c90a000 131 | dirty: no 132 | refcount: 57 133 | ``` 134 | 135 | Digging into `arch/x86/include/asm/pgtable_types.h`, we see the following: 136 | 137 | ```C 138 | #if CONFIG_PGTABLE_LEVELS > 4 139 | typedef struct { p4dval_t p4d; } p4d_t; 140 | 141 | static inline p4d_t native_make_p4d(pudval_t val) 142 | { 143 | return (p4d_t) { val }; 144 | } 145 | 146 | static inline p4dval_t native_p4d_val(p4d_t p4d) 147 | { 148 | return p4d.p4d; 149 | } 150 | #else 151 | #include 152 | 153 | static inline p4d_t native_make_p4d(pudval_t val) 154 | { 155 | return (p4d_t) { .pgd = native_make_pgd((pgdval_t)val) }; 156 | } 157 | 158 | static inline p4dval_t native_p4d_val(p4d_t p4d) 159 | { 160 | return native_pgd_val(p4d.pgd); 161 | } 162 | #endif 163 | ``` 164 | [x86 Source](https://elixir.bootlin.com/linux/v5.10.158/source/arch/x86/include/asm/pgtable_types.h#L332) 165 | 166 | Interesting. Looking at `pgtable-nop4d.h` we find that `p4d_t` is defined as 167 | ``` 168 | typedef struct { pgd_t pgd; } p4d_t; 169 | ``` 170 | 171 | With 4-level paging the p4d folds into the pgd. p4d entries, which are 172 | represented by `p4d_t`, essentially become a type alias for `pgd_t`. The kernel 173 | does this so that it has a standard 5-level page table interface to program 174 | against regardless of how many levels of page tables actually exist. 175 | 176 | As of writing, arm64 (for linux 5.10.158) directly includes `pgtable-nop4d.h`. 177 | 178 | To summarize, with 4-level paging there are no "real" p4d tables. 
Instead, pgd 179 | entries contain the addresses of pud tables, and the kernel "pretends" the p4d 180 | exists by making it appear that the p4d is a mirror copy of the pgd. 181 | 182 | If you read on in `arch/x86/include/asm/pgtable_types.h` you'll see that the kernel uses the same 183 | scheme for 3 and 2 level page table configurations as well. arm64 follows a similar scheme in `arch/arm64/include/asm/pgtable-types.h`. 184 | 185 | NOTE that you cannot make use of this equivalence directly. Your Farfetch'd 186 | implementation must work correctly for any page table configuration and 187 | therefore must use the macros defined by the kernel. 188 | 189 | ### Data structures, functions, and macros 190 | In this section we'll take a step back and discuss the data structures 191 | and functions the kernel uses to manage page tables in more detail. 192 | 193 | To encapsulate memory management information for each task, `struct task_struct` contains 194 | a `struct mm_struct*`. 195 | 196 | ```C 197 | struct mm_struct *mm; 198 | struct mm_struct *active_mm; 199 | ``` 200 | 201 | We won't go into the details of `active_mm`, which is used for kernel threads that do not 202 | have their own `struct mm_struct`. Check out `context_switch()` in core.c if you want to read 203 | more. 204 | 205 | Looking in `struct mm_struct`, we find the member `pgd_t *pgd;`. This is a 206 | pointer to the first entry in the pgd for this `mm_struct`. Do you think that 207 | this is a virtual address or a physical address? Remember that all memory 208 | references are translated from virtual addresses to physical addresses by the 209 | CPU, so any address the kernel code uses directly must be a virtual address. 210 | 211 | However, it's easy to recover the physical address since all kernel addresses 212 | are linearly mapped to physical addresses. `virt_to_phys` recovers the physical 213 | address using this linear mapping. 214 | 215 | 216 | Section 3.3 in Gorman's [chapter on page table 217 | management](https://www.kernel.org/doc/gorman/html/understand/understand006.html) 218 | provides a good overview of the functions / macros used to navigate the page table. 219 | 220 | A common source of confusion arises from a misunderstanding of what macros like `pud_offset` 221 | return. 222 | 223 | ```C 224 | /* Find an entry in the third-level page table.. */ 225 | // From include/linux/pgtable.h. Note that this definition is shared between x86 and arm64. 226 | static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address) 227 | { 228 | return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address); 229 | } 230 | 231 | // x86: arch/x86/include/asm/pgtable_types.h 232 | // arm64: arch/arm64/include/asm/pgtable-types.h 233 | typedef struct { pudval_t pud; } pud_t; 234 | 235 | // x86: arch/x86/include/asm/pgtable_64_types.h 236 | typedef unsigned long pudval_t; 237 | 238 | // arm64: arch/arm64/include/asm/pgtable-types.h 239 | typedef u64 pudval_t; 240 | ``` 241 | 242 | We touched on this briefly above. A `pud_t` is just a struct containing an 243 | unsigned long, which means it compiles down to an unsigned long. Recall from our 244 | earlier discussion that each entry is the address of a physical page frame and 245 | that the last 12 bits are reserved for flags since each page frame is page 246 | aligned. The macros that Gorman discusses, like `pte_none()` and 247 | `pmd_present()`, check these flag bits to determine information about the entry.
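Putting these macros together, here is a simplified sketch of how a walk from an `mm_struct` and a user virtual address down to a `pte_t` might look using the kernel's 5-level interface. This is not the Farfetch'd reference solution: locking (e.g. taking `mmap_lock`) and huge-page handling are deliberately omitted, and error handling is minimal. The helpers used (`pgd_offset()`, `p4d_offset()`, `pud_offset()`, `pmd_offset()`, `pte_offset_map()`/`pte_unmap()`, and the `*_none()`/`*_bad()` checks) are the real kernel functions/macros that Gorman's chapter describes.

```c
/*
 * Simplified sketch: walk mm's page table for a user virtual address.
 * Returns 0 and fills *ptep on success, -1 if any level is not present.
 * Locking and huge-page handling are intentionally omitted.
 */
static int toy_walk(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	pgd = pgd_offset(mm, vaddr);	/* entry in the top-level table */
	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return -1;

	p4d = p4d_offset(pgd, vaddr);	/* folds into the pgd with 4-level paging */
	if (p4d_none(*p4d) || p4d_bad(*p4d))
		return -1;

	pud = pud_offset(p4d, vaddr);
	if (pud_none(*pud) || pud_bad(*pud))
		return -1;

	pmd = pmd_offset(pud, vaddr);
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		return -1;

	pte = pte_offset_map(pmd, vaddr);	/* pte_t for this address */
	if (pte_none(*pte)) {
		pte_unmap(pte);
		return -1;
	}

	*ptep = *pte;
	pte_unmap(pte);
	return 0;
}
```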
248 | 249 | If you want to recover the actual value of the entry, the type casting macros 250 | Gordman discussed in section 3.2 are useful. Although the Gordman reading is x86-specific, 251 | arm64 defines similar, if not indentical, macros. Keep in mind that if you want the 252 | physical address the entry points to you'll need to bitwise-and the value with the 253 | correct mask. 254 | 255 | x86 and arm64 either define functions/macros for the mask so you can 256 | manually perform the bitwise-and or define function/macros that outright do the correct 257 | bitwise-and for you. Either way, recovering the physical address the entry points to is possible in 258 | both x86 and arm64, it just may look slightly different depending which architecture you're on and 259 | which function/macros you choose. 260 | 261 | ### Page dirty and refcount 262 | Recall from before that a flag in the page table entry indicates whether the 263 | page frame is dirty or not. Do not read the flag directly; the kernel provides 264 | a macro for this purpose. 265 | 266 | You will find section 3.4 of Gordman useful for figuring out how to retrieve 267 | the refcount of a page frame. Hint: every physical frame has a `struct page` in 268 | the kernel, which is defined 269 | [here](https://elixir.bootlin.com/linux/v5.10.158/source/include/linux/mm_types.h#L70). 270 | Be sure to use the correct kernel functions / macros to access any information 271 | in `struct page`. 272 | 273 | -------------------------------------------------------------------------------- /F-Farfetchd/x86_address_structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/F-Farfetchd/x86_address_structure.png -------------------------------------------------------------------------------- /F-Farfetchd/x86_address_structure.svg: -------------------------------------------------------------------------------- 1 | 2 | 17 | 19 | 20 | 22 | image/svg+xml 23 | 25 | 26 | 27 | 28 | 29 | 31 | 56 | 60 | 65 | 70 | 75 | 80 | 85 | 90 | 95 | 100 | 105 | 110 | 115 | 120 | 125 | 130 | 135 | 140 | 145 | 150 | 155 | 160 | 165 | 170 | 175 | 180 | 185 | 190 | 195 | 200 | 205 | 210 | 215 | 220 | 225 | 230 | 235 | 240 | 245 | 250 | 255 | 260 | 265 | 270 | 275 | 280 | 285 | 290 | 295 | 300 | 305 | 310 | 315 | 320 | 325 | 330 | 335 | 340 | 345 | 350 | 355 | 360 | 365 | 370 | 375 | 378 | 32 389 | 39 400 | 40 411 | 48 422 | 431 | 440 | 449 | 458 | 0 469 | 8 480 | 16 491 | 24 502 | 31 513 | 15 524 | 7 535 | 23 546 | 555 | 564 | 572 | 582 | P4 index 585 | 588 | 589 | 9 598 | 9 607 | 615 | 9 624 | 9 633 | 12 642 | 650 | 47 661 | 55 672 | 56 683 | 63 694 | 699 | 16 708 | 713 | 718 | 723 | 728 | 733 | 738 | 748 | Sign extension 751 | 754 | 755 | 765 | P2 index 768 | 771 | 772 | 782 | P3 index 785 | 788 | 789 | 799 | P1 index 802 | 805 | 806 | 816 | Offset 819 | 820 | 821 | 826 | 831 | 836 | 841 | 846 | 851 | 856 | 861 | 866 | 871 | 876 | 881 | 886 | 891 | 896 | 901 | 906 | 911 | 916 | 917 | 918 | -------------------------------------------------------------------------------- /F-Farfetchd/x86_paging_64bit.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/F-Farfetchd/x86_paging_64bit.png -------------------------------------------------------------------------------- /F-Farfetchd/x86_paging_64bit.svg: 
-------------------------------------------------------------------------------- 1 | 2 | 17 | 19 | 20 | 22 | image/svg+xml 23 | 25 | 26 | 27 | 28 | 29 | 31 | 55 | 59 | 67 | 70 | 73 | 76 | 79 | 82 | 85 | 88 | 91 | 94 | 97 | 100 | 103 | 106 | 109 | 112 | 115 | 118 | 122 | 126 | 129 | 132 | 135 | 138 | 141 | 144 | 147 | 150 | 153 | 156 | 159 | 162 | 165 | 168 | 171 | 174 | 177 | 180 | 183 | 186 | 189 | 192 | 195 | 198 | 201 | 204 | 207 | 210 | 213 | 216 | 219 | 222 | 225 | 229 | 237 | 245 | 253 | 261 | 268 | 276 | 283 | 290 | 297 | 304 | 310 | 316 | 322 | 328 | 332 | 336 | 340 | 344 | 348 | 354 | 358 | 362 | 366 | 370 | 374 | 378 | 382 | 386 | 390 | 394 | 398 | 402 | 406 | 410 | 414 | 418 | 422 | 426 | 430 | 434 | 438 | 442 | 446 | 450 | 454 | 457 | CR3 register 466 | 32 475 | 39 484 | 40 493 | 47 502 | 511 | 520 | 529 | 538 | 0 547 | 8 556 | 16 565 | 24 574 | 31 583 | 15 592 | 7 601 | 23 610 | 613 | ... 622 | 623 | 626 | ... 635 | 636 | 639 | 4K memory page 650 | 651 | 660 | P2 entry 671 | 674 | ... 683 | 684 | 687 | ... 696 | 697 | P2 table 708 | 711 | ... 720 | 721 | 724 | ... 733 | 734 | P3 entry 745 | 754 | P3 table 765 | P1 entry 776 | 779 | ... 788 | 789 | 792 | ... 801 | 802 | P1 table 813 | 816 | ... 825 | 826 | 829 | ... 838 | 839 | 847 | P4 entry 850 | 851 | 859 | 867 | P4 table 870 | 871 | 9 879 | 9 887 | 895 | 9 903 | 9 911 | 12 919 | 927 | 928 | 929 | 930 | -------------------------------------------------------------------------------- /G-Pantry/page-cache-overview.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/G-Pantry/page-cache-overview.pdf -------------------------------------------------------------------------------- /G-Pantry/pantryfs-spec.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs4118/recitations/b809b6edf86dbe12c9f698ab3abaaef64893f532/G-Pantry/pantryfs-spec.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | cs4118 Recitations 2 | ================== 3 | 4 | This repository contains the recitation notes for Columbia's Operating Systems I 5 | class, COMSW4118, as taught by Jae Woo Lee. For information about the class, 6 | visit the [course homepage](http://www.cs.columbia.edu/~jae/4118/). 7 | 8 | These recitations are held every other week by the various TAs, generally using 9 | these notes/linked resources as the basis for their sections. 10 | 11 | Issues, patches, and comments, especially by current and former students, are 12 | welcome. 13 | 14 | ### Contents 15 | - [Note A](A-Workflow/workflow.md): VM/kernel workflow setup, Linux source code 16 | navigators 17 | - [Note B](B-Sockets-ServerTesting): Sockets/TCP programming, server testing 18 | strategies 19 | - [Note C](C-Linux-Kernel-Dev/linux-kernel-dev.md): Kernel configuration, 20 | compilation, and style 21 | - [Note D](D-Fridge/waitqueue.pdf): Linux wait queue (hw5) 22 | - [Note E](E-Freezer/freezer.md): Linux scheduler data structures, implementing 23 | a scheduler (hw6) 24 | - [Note F](F-Farfetchd/index.md): Linux page table data structures, 25 | macros/functions (hw7) 26 | - [Note G](G-Pantry): Linux page cache, implementing a file system (hw8) 27 | 28 | --------------------------------------------------------------------------------