├── .github └── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── .gitignore ├── Assets └── linux-kernel.png ├── Booting ├── README.md ├── images │ ├── bss.png │ ├── kernel_first_address.png │ ├── linear_address.png │ ├── minimal_stack.png │ ├── simple_bootloader.png │ ├── stack1.png │ ├── stack2.png │ ├── try_vmlinuz_in_qemu.png │ └── video_mode_setup_menu.png ├── linux-bootstrap-1.md ├── linux-bootstrap-2.md ├── linux-bootstrap-3.md ├── linux-bootstrap-4.md ├── linux-bootstrap-5.md └── linux-bootstrap-6.md ├── CODEOWNERS ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Cgroups ├── README.md ├── images │ └── menuconfig.png └── linux-cgroups-1.md ├── Concepts ├── README.md ├── linux-cpu-1.md ├── linux-cpu-2.md ├── linux-cpu-3.md └── linux-cpu-4.md ├── DataStructures ├── README.md ├── linux-datastructures-1.md ├── linux-datastructures-2.md └── linux-datastructures-3.md ├── Dockerfile ├── Initialization ├── README.md ├── images │ ├── CONFIG_NR_CPUS.png │ ├── NX.png │ ├── brk_area.png │ └── kernel_command_line.png ├── linux-initialization-1.md ├── linux-initialization-10.md ├── linux-initialization-2.md ├── linux-initialization-3.md ├── linux-initialization-4.md ├── linux-initialization-5.md ├── linux-initialization-6.md ├── linux-initialization-7.md ├── linux-initialization-8.md └── linux-initialization-9.md ├── Interrupts ├── README.md ├── images │ └── kernel.png ├── linux-interrupts-1.md ├── linux-interrupts-10.md ├── linux-interrupts-2.md ├── linux-interrupts-3.md ├── linux-interrupts-4.md ├── linux-interrupts-5.md ├── linux-interrupts-6.md ├── linux-interrupts-7.md ├── linux-interrupts-8.md └── linux-interrupts-9.md ├── KernelStructures ├── .gitkeep ├── README.md └── linux-kernelstructure-1.md ├── LICENSE ├── LINKS.md ├── MM ├── README.md ├── images │ ├── kernel_configuration_menu1.png │ ├── kernel_configuration_menu2.png │ └── memblock.png ├── linux-mm-1.md ├── linux-mm-2.md └── linux-mm-3.md ├── Makefile ├── Misc ├── README.md ├── images │ ├── 
dgap_menu.png │ ├── git_diff.png │ ├── github.png │ ├── google_linux.png │ ├── menuconfig.png │ ├── nconfig.png │ └── qemu.png ├── linux-misc-1.md ├── linux-misc-2.md ├── linux-misc-3.md └── linux-misc-4.md ├── README.md ├── SUMMARY.md ├── Scripts ├── LinuxKernelInsides.pdf ├── README.md ├── get_all_links.py └── latex.sh ├── SyncPrim ├── README.md ├── linux-sync-1.md ├── linux-sync-2.md ├── linux-sync-3.md ├── linux-sync-4.md ├── linux-sync-5.md └── linux-sync-6.md ├── SysCall ├── README.md ├── images │ └── ls_shell.png ├── linux-syscall-1.md ├── linux-syscall-2.md ├── linux-syscall-3.md ├── linux-syscall-4.md ├── linux-syscall-5.md └── linux-syscall-6.md ├── Theory ├── README.md ├── images │ └── 4_level_paging.png ├── linux-theory-1.md ├── linux-theory-2.md └── linux-theory-3.md ├── Timers ├── README.md ├── images │ ├── HZ.png │ └── base_small.png ├── linux-timers-1.md ├── linux-timers-2.md ├── linux-timers-3.md ├── linux-timers-4.md ├── linux-timers-5.md ├── linux-timers-6.md └── linux-timers-7.md ├── contributors.md └── cover.jpg /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '[BUG] Write a suitable title' 5 | labels: 'bug' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Additional context** 27 | Add any other context about the problem here. 
-------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '[FEATURE] Write a suitable title' 5 | labels: 'enhancement' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.tex 2 | build 3 | -------------------------------------------------------------------------------- /Assets/linux-kernel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Assets/linux-kernel.png -------------------------------------------------------------------------------- /Booting/README.md: -------------------------------------------------------------------------------- 1 | # Kernel Boot Process 2 | 3 | This chapter describes the Linux kernel boot process. Here you will see a series of posts which describes the full cycle of the kernel loading process: 4 | 5 | * [From the bootloader to kernel](linux-bootstrap-1.md) - describes all stages from turning on the computer to running the first instruction of the kernel. 
6 | * [First steps in the kernel setup code](linux-bootstrap-2.md) - describes first steps in the kernel setup code. You will see heap initialization and queries for parameters such as EDD and IST. 7 | * [Video mode initialization and transition to protected mode](linux-bootstrap-3.md) - describes video mode initialization in the kernel setup code and transition to protected mode. 8 | * [Transition to 64-bit mode](linux-bootstrap-4.md) - describes preparation for transition into 64-bit mode and details of transition. 9 | * [Kernel Decompression](linux-bootstrap-5.md) - describes preparation before kernel decompression and details of direct decompression. 10 | * [Kernel load address randomization](linux-bootstrap-6.md) - describes randomization of the Linux kernel load address. 11 | 12 | This chapter coincides with `Linux kernel v4.17`. 13 | -------------------------------------------------------------------------------- /Booting/images/bss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/bss.png -------------------------------------------------------------------------------- /Booting/images/kernel_first_address.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/kernel_first_address.png -------------------------------------------------------------------------------- /Booting/images/linear_address.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/linear_address.png -------------------------------------------------------------------------------- /Booting/images/minimal_stack.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/minimal_stack.png -------------------------------------------------------------------------------- /Booting/images/simple_bootloader.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/simple_bootloader.png -------------------------------------------------------------------------------- /Booting/images/stack1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/stack1.png -------------------------------------------------------------------------------- /Booting/images/stack2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/stack2.png -------------------------------------------------------------------------------- /Booting/images/try_vmlinuz_in_qemu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/try_vmlinuz_in_qemu.png -------------------------------------------------------------------------------- /Booting/images/video_mode_setup_menu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Booting/images/video_mode_setup_menu.png -------------------------------------------------------------------------------- /Booting/linux-bootstrap-5.md: 
-------------------------------------------------------------------------------- 1 | Kernel booting process. Part 5. 2 | ================================================================================ 3 | 4 | Kernel Decompression 5 | -------------------------------------------------------------------------------- 6 | 7 | This is the fifth part of the `Kernel booting process` series. We went over the transition to 64-bit mode in the previous [part](https://github.com/0xAX/linux-insides/blob/v4.16/Booting/linux-bootstrap-4.md#transition-to-the-long-mode) and we will continue where we left off in this part. We will study the steps taken to prepare for kernel decompression, relocation and the process of kernel decompression itself. So... let's dive into the kernel code again. 8 | 9 | Preparing to Decompress the Kernel 10 | -------------------------------------------------------------------------------- 11 | 12 | We stopped right before the jump to the `64-bit` entry point - `startup_64` which is located in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S) source code file. We already covered the jump to `startup_64` from `startup_32` in the previous part: 13 | 14 | ```assembly 15 | pushl $__KERNEL_CS 16 | leal startup_64(%ebp), %eax 17 | ... 18 | ... 19 | ... 20 | pushl %eax 21 | ... 22 | ... 23 | ... 24 | lret 25 | ``` 26 | 27 | Since we have loaded a new `Global Descriptor Table` and the CPU has transitioned to a new mode (`64-bit` mode in our case), we set up the segment registers again at the beginning of the `startup_64` function: 28 | 29 | ```assembly 30 | .code64 31 | .org 0x200 32 | ENTRY(startup_64) 33 | xorl %eax, %eax 34 | movl %eax, %ds 35 | movl %eax, %es 36 | movl %eax, %ss 37 | movl %eax, %fs 38 | movl %eax, %gs 39 | ``` 40 | 41 | All segment registers besides the `cs` register are now reset in `long mode`. 
42 | 43 | The next step is to compute where the kernel will be decompressed, based on the address it was actually loaded at rather than the address it was compiled to be loaded at: 44 | 45 | ```assembly 46 | #ifdef CONFIG_RELOCATABLE 47 | leaq startup_32(%rip), %rbp 48 | movl BP_kernel_alignment(%rsi), %eax 49 | decl %eax 50 | addq %rax, %rbp 51 | notq %rax 52 | andq %rax, %rbp 53 | cmpq $LOAD_PHYSICAL_ADDR, %rbp 54 | jge 1f 55 | #endif 56 | movq $LOAD_PHYSICAL_ADDR, %rbp 57 | 1: 58 | movl BP_init_size(%rsi), %ebx 59 | subl $_end, %ebx 60 | addq %rbp, %rbx 61 | ``` 62 | 63 | The `rbp` register contains the decompressed kernel's start address. After this code executes, the `rbx` register will contain the address where the kernel code will be relocated to for decompression. We've already done this before in the `startup_32` function (you can read about this in the previous part, [Calculate relocation address](https://github.com/0xAX/linux-insides/blob/v4.16/Booting/linux-bootstrap-4.md#calculate-relocation-address)), but we need to do this calculation again because a bootloader that uses the 64-bit boot protocol jumps straight to `startup_64`, so `startup_32` is never executed in that case. 64 | 65 | In the next step we set up the stack pointer, reset the flags register and set up the `GDT` again to overwrite the `32-bit` specific values with those from the `64-bit` protocol: 66 | 67 | ```assembly 68 | leaq boot_stack_end(%rbx), %rsp 69 | 70 | leaq gdt(%rip), %rax 71 | movq %rax, gdt64+2(%rip) 72 | lgdt gdt64(%rip) 73 | 74 | pushq $0 75 | popfq 76 | ``` 77 | 78 | If you take a look at the code after the `lgdt gdt64(%rip)` instruction, you will see that there is some additional code. This code builds the trampoline to enable [5-level paging](https://lwn.net/Articles/708526/) if needed. We will only consider 4-level paging in this book, so this code will be omitted. 
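The rounding and relocation arithmetic in the `CONFIG_RELOCATABLE` block above may be easier to follow in C. The following is only a sketch of the same calculation, not kernel code. The function names `align_up`, `decompressed_start` and `relocation_target` are made up for illustration, and the value given to `LOAD_PHYSICAL_ADDR` is just an example, since the real one depends on the kernel configuration. The sketch also uses unsigned 64-bit arithmetic throughout, while the assembly uses a signed comparison (`jge`) and a 32-bit subtraction.

```C
#include <assert.h>
#include <stdint.h>

/* Example value only; the real one comes from the kernel configuration. */
#define LOAD_PHYSICAL_ADDR 0x1000000ULL

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);

	uint64_t mask = align - 1;       /* decl %eax               */
	return (addr + mask) & ~mask;    /* addq, then notq + andq  */
}

/*
 * Mirrors the computation of rbp: the load address rounded up to the
 * kernel alignment, but never below LOAD_PHYSICAL_ADDR.
 */
static uint64_t decompressed_start(uint64_t loaded_at, uint64_t kernel_alignment)
{
	uint64_t rbp = align_up(loaded_at, kernel_alignment);

	if (rbp < LOAD_PHYSICAL_ADDR)    /* cmpq / jge, unsigned in this sketch */
		rbp = LOAD_PHYSICAL_ADDR;
	return rbp;
}

/* Mirrors the computation of rbx: rbp + (init_size - _end). */
static uint64_t relocation_target(uint64_t rbp, uint64_t init_size, uint64_t end_offset)
{
	return rbp + (init_size - end_offset);
}
```

For example, with a 2 MB alignment a load address of `0x1234567` is rounded up to `0x1400000`, while a load address of `0x100000` falls below the example `LOAD_PHYSICAL_ADDR` and is replaced by it.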
79 | 80 | As you can see above, the `rbx` register contains the start address of the kernel decompressor code and we just put this address with an offset of `boot_stack_end` in the `rsp` register which points to the top of the stack. After this step, the stack will be correct. You can find the definition of the `boot_stack_end` label at the end of the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S) assembly source code file: 81 | 82 | ```assembly 83 | .bss 84 | .balign 4 85 | boot_heap: 86 | .fill BOOT_HEAP_SIZE, 1, 0 87 | boot_stack: 88 | .fill BOOT_STACK_SIZE, 1, 0 89 | boot_stack_end: 90 | ``` 91 | 92 | It is located at the end of the `.bss` section, right before `.pgtable`. If you peek inside the [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/vmlinux.lds.S) linker script, you will find the definitions of `.bss` and `.pgtable` there. 93 | 94 | Since the stack is now correct, we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Before we get into the details, let's take a look at this assembly code: 95 | 96 | ```assembly 97 | pushq %rsi 98 | leaq (_bss-8)(%rip), %rsi 99 | leaq (_bss-8)(%rbx), %rdi 100 | movq $_bss, %rcx 101 | shrq $3, %rcx 102 | std 103 | rep movsq 104 | cld 105 | popq %rsi 106 | ``` 107 | 108 | This set of instructions copies the compressed kernel over to where it will be decompressed. 109 | 110 | First of all we push `rsi` to the stack. We need to preserve the value of `rsi`, because this register now stores a pointer to `boot_params` which is a real mode structure that contains booting related data (remember, this structure was populated at the start of the kernel setup). We pop the pointer to `boot_params` back to `rsi` after we execute this code. 
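The copy performed by `std` / `rep movsq` in the assembly above can be modelled in C. This is a rough sketch rather than kernel code, and `copy_backwards_qwords` is a hypothetical name. It assumes the byte count is a multiple of 8, as the `shrq $3` implies, and it illustrates the usual reason for copying from the end: when the destination lies at a higher address and overlaps the source, starting with the last word avoids overwriting data that has not been copied yet.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `bytes` bytes from src to dst one 64-bit word at a time,
 * starting with the last word and moving toward lower addresses.
 */
static void copy_backwards_qwords(uint64_t *dst, const uint64_t *src, size_t bytes)
{
	assert(bytes % 8 == 0);      /* the assembly only copies whole qwords */

	size_t count = bytes >> 3;   /* shrq $3, %rcx */

	while (count--)              /* rep with DF set by std: walk downwards */
		dst[count] = src[count]; /* movsq */
}
```

Copying the first four words of a six-word buffer to an offset of two words inside that buffer is a small overlapping case: a forward copy would clobber words 2 and 3 before reading them, while the backward copy keeps all four values intact.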
111 | 112 | The next two `leaq` instructions calculate the effective addresses of the `rip` and `rbx` registers with an offset of `_bss - 8` and assign the results to `rsi` and `rdi` respectively. Why do we calculate these addresses? The compressed kernel image is located between this code (from `startup_32` to the current code) and the decompression code. You can verify this by looking at this linker script - [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/vmlinux.lds.S): 113 | 114 | ``` 115 | . = 0; 116 | .head.text : { 117 | _head = . ; 118 | HEAD_TEXT 119 | _ehead = . ; 120 | } 121 | .rodata..compressed : { 122 | *(.rodata..compressed) 123 | } 124 | .text : { 125 | _text = .; /* Text */ 126 | *(.text) 127 | *(.text.*) 128 | _etext = . ; 129 | } 130 | ``` 131 | 132 | Note that the `.head.text` section contains `startup_32`. You may remember it from the previous part: 133 | 134 | ```assembly 135 | __HEAD 136 | .code32 137 | ENTRY(startup_32) 138 | ... 139 | ... 140 | ... 141 | ``` 142 | 143 | The `.text` section contains the decompression code: 144 | 145 | ```assembly 146 | .text 147 | relocated: 148 | ... 149 | ... 150 | ... 151 | /* 152 | * Do the decompression, and jump to the new kernel.. 153 | */ 154 | ... 155 | ``` 156 | 157 | And `.rodata..compressed` contains the compressed kernel image. So `rsi` will contain the absolute address of `_bss - 8`, and `rdi` will contain the relocation relative address of `_bss - 8`. In the same way we store these addresses in registers, we put the address of `_bss` in the `rcx` register. As you can see in the `vmlinux.lds.S` linker script, it's located at the end of all sections with the setup/kernel code. Now we can start copying data from `rsi` to `rdi`, `8` bytes at a time, with the `movsq` instruction. 158 | 159 | Note that we execute an `std` instruction before copying the data. This sets the `DF` flag, which means that `rsi` and `rdi` will be decremented. 
In other words, we will copy the bytes backwards. At the end, we clear the `DF` flag with the `cld` instruction, and restore the pointer to the `boot_params` structure in `rsi`. 160 | 161 | Now we have a pointer to the `.text` section's address after relocation, and we can jump to it: 162 | 163 | ```assembly 164 | leaq relocated(%rbx), %rax 165 | jmp *%rax 166 | ``` 167 | 168 | The final touches before kernel decompression 169 | -------------------------------------------------------------------------------- 170 | 171 | In the previous paragraph we saw that the `.text` section starts with the `relocated` label. The first thing we do is to clear the `bss` section with: 172 | 173 | ```assembly 174 | xorl %eax, %eax 175 | leaq _bss(%rip), %rdi 176 | leaq _ebss(%rip), %rcx 177 | subq %rdi, %rcx 178 | shrq $3, %rcx 179 | rep stosq 180 | ``` 181 | 182 | We need to initialize the `.bss` section, because we'll soon jump to [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code. Here we just clear `eax`, put the address of `_bss` in `rdi`, compute the size of the section in 8-byte words in `rcx` (`_ebss - _bss`, shifted right by 3), and fill `.bss` with zeros with the `rep stosq` instruction. 183 | 184 | At the end, we can see a call to the `extract_kernel` function: 185 | 186 | ```assembly 187 | pushq %rsi 188 | movq %rsi, %rdi 189 | leaq boot_heap(%rip), %rsi 190 | leaq input_data(%rip), %rdx 191 | movl $z_input_len, %ecx 192 | movq %rbp, %r8 193 | movq $z_output_len, %r9 194 | call extract_kernel 195 | popq %rsi 196 | ``` 197 | 198 | Like before, we push `rsi` onto the stack to preserve the pointer to `boot_params`. We also copy the contents of `rsi` to `rdi`. Then we set `rsi` to point to `boot_heap`, the early boot heap. The remaining registers receive the other parameters of the `extract_kernel` function, which we then call to decompress the kernel. 
The `extract_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/misc.c) source code file and takes six arguments: 199 | 200 | * `rmode` - a pointer to the [boot_params](https://github.com/torvalds/linux/blob/v4.16/arch/x86/include/uapi/asm/bootparam.h) structure which is filled by either the bootloader or during early kernel initialization; 201 | * `heap` - a pointer to `boot_heap` which represents the start address of the early boot heap; 202 | * `input_data` - a pointer to the start of the compressed kernel or in other words, a pointer to the `arch/x86/boot/compressed/vmlinux.bin.bz2` file; 203 | * `input_len` - the size of the compressed kernel; 204 | * `output` - the start address of the decompressed kernel; 205 | * `output_len` - the size of the decompressed kernel; 206 | 207 | All arguments will be passed through registers as per the [System V Application Binary Interface](https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-1.0.pdf). We've finished all the preparations and can now decompress the kernel. 208 | 209 | Kernel decompression 210 | -------------------------------------------------------------------------------- 211 | 212 | As we saw in the previous paragraph, the `extract_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/misc.c) source code file and takes six arguments. This function starts with the video/console initialization that we already saw in the previous parts. We need to do this again because we don't know if we started in [real mode](https://en.wikipedia.org/wiki/Real_mode) or if a bootloader was used, or whether the bootloader used the `32` or `64-bit` boot protocol. 
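The register setup before `call extract_kernel` lines up with the six parameters listed above. The sketch below is only an illustration of that mapping: `extract_kernel_stub`, `struct extract_args` and `seen` are hypothetical names, the parameter types are simplified, and the stub returns `output` on the assumption that the caller expects the kernel address back in `rax`. The comments name the register in which the System V x86-64 calling convention passes each of the first six integer or pointer arguments.

```C
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of what the callee received; not a kernel structure. */
struct extract_args {
	void *rmode;
	uintptr_t heap;
	unsigned char *input_data;
	unsigned long input_len;
	unsigned char *output;
	unsigned long output_len;
};

static struct extract_args seen;

/* Stand-in with the same parameter order as the list above. */
static void *extract_kernel_stub(void *rmode,               /* rdi: boot_params   */
				 uintptr_t heap,            /* rsi: boot_heap     */
				 unsigned char *input_data, /* rdx: input_data    */
				 unsigned long input_len,   /* rcx: z_input_len   */
				 unsigned char *output,     /* r8:  rbp           */
				 unsigned long output_len)  /* r9:  z_output_len  */
{
	assert(output != 0);  /* a sketch-only sanity check */

	seen = (struct extract_args){ rmode, heap, input_data,
				      input_len, output, output_len };
	return output;        /* a pointer return value comes back in rax */
}
```

Because all six arguments fit in registers, nothing has to be pushed onto the stack for the call itself; the only push in the assembly is the one that saves `rsi`.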
213 | 214 | After the first initialization steps, we store pointers to the start of the free memory and to the end of it: 215 | 216 | ```C 217 | free_mem_ptr = heap; 218 | free_mem_end_ptr = heap + BOOT_HEAP_SIZE; 219 | ``` 220 | 221 | Here, `heap` is the second parameter of the `extract_kernel` function as passed to it in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S): 222 | 223 | ```assembly 224 | leaq boot_heap(%rip), %rsi 225 | ``` 226 | 227 | As you saw above, `boot_heap` is defined as: 228 | 229 | ```assembly 230 | boot_heap: 231 | .fill BOOT_HEAP_SIZE, 1, 0 232 | ``` 233 | 234 | where `BOOT_HEAP_SIZE` is a macro which expands to `0x10000` (`0x400000` in the case of a `bzip2` kernel) and represents the size of the heap. 235 | 236 | After we initialize the heap pointers, the next step is to call the `choose_random_location` function from the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/kaslr.c) source code file. As we can guess from the function name, it chooses a memory location to write the decompressed kernel to. It may look weird that we need to find or even `choose` where to decompress the compressed kernel image, but the Linux kernel supports [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) which allows decompression of the kernel into a random address, for security reasons. 237 | 238 | We'll take a look at how the kernel's load address is randomized in the next part. 239 | 240 | Now let's get back to [misc.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/misc.c). 
After getting the address for the kernel image, we need to check that the random address we got is correctly aligned, and in general, not wrong: 241 | 242 | ```C 243 | if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1)) 244 | error("Destination physical address inappropriately aligned"); 245 | 246 | if (virt_addr & (MIN_KERNEL_ALIGN - 1)) 247 | error("Destination virtual address inappropriately aligned"); 248 | 249 | if (heap > 0x3fffffffffffUL) 250 | error("Destination address too large"); 251 | 252 | if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE) 253 | error("Destination virtual address is beyond the kernel mapping area"); 254 | 255 | if ((unsigned long)output != LOAD_PHYSICAL_ADDR) 256 | error("Destination address does not match LOAD_PHYSICAL_ADDR"); 257 | 258 | if (virt_addr != LOAD_PHYSICAL_ADDR) 259 | error("Destination virtual address changed when not relocatable"); 260 | ``` 261 | 262 | After all these checks we will see the familiar message: 263 | 264 | ``` 265 | Decompressing Linux... 
266 | ``` 267 | 268 | Now, we call the `__decompress` function to decompress the kernel: 269 | 270 | ```C 271 | __decompress(input_data, input_len, NULL, NULL, output, output_len, NULL, error); 272 | ``` 273 | 274 | The implementation of the `__decompress` function depends on what decompression algorithm was chosen during kernel compilation: 275 | 276 | ```C 277 | #ifdef CONFIG_KERNEL_GZIP 278 | #include "../../../../lib/decompress_inflate.c" 279 | #endif 280 | 281 | #ifdef CONFIG_KERNEL_BZIP2 282 | #include "../../../../lib/decompress_bunzip2.c" 283 | #endif 284 | 285 | #ifdef CONFIG_KERNEL_LZMA 286 | #include "../../../../lib/decompress_unlzma.c" 287 | #endif 288 | 289 | #ifdef CONFIG_KERNEL_XZ 290 | #include "../../../../lib/decompress_unxz.c" 291 | #endif 292 | 293 | #ifdef CONFIG_KERNEL_LZO 294 | #include "../../../../lib/decompress_unlzo.c" 295 | #endif 296 | 297 | #ifdef CONFIG_KERNEL_LZ4 298 | #include "../../../../lib/decompress_unlz4.c" 299 | #endif 300 | ``` 301 | 302 | After the kernel is decompressed, two more functions are called: `parse_elf` and `handle_relocations`. The main point of these functions is to move the decompressed kernel image to its correct place in memory. This is because the decompression is done [in-place](https://en.wikipedia.org/wiki/In-place_algorithm), and we still need to move the kernel to the correct address. As we already know, the kernel image is an [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) executable. The main goal of the `parse_elf` function is to move loadable segments to the correct address. 
We can see the kernel's loadable segments in the output of the `readelf` program: 303 | 304 | ``` 305 | readelf -l vmlinux 306 | 307 | Elf file type is EXEC (Executable file) 308 | Entry point 0x1000000 309 | There are 5 program headers, starting at offset 64 310 | 311 | Program Headers: 312 | Type Offset VirtAddr PhysAddr 313 | FileSiz MemSiz Flags Align 314 | LOAD 0x0000000000200000 0xffffffff81000000 0x0000000001000000 315 | 0x0000000000893000 0x0000000000893000 R E 200000 316 | LOAD 0x0000000000a93000 0xffffffff81893000 0x0000000001893000 317 | 0x000000000016d000 0x000000000016d000 RW 200000 318 | LOAD 0x0000000000c00000 0x0000000000000000 0x0000000001a00000 319 | 0x00000000000152d8 0x00000000000152d8 RW 200000 320 | LOAD 0x0000000000c16000 0xffffffff81a16000 0x0000000001a16000 321 | 0x0000000000138000 0x000000000029b000 RWE 200000 322 | ``` 323 | 324 | The goal of the `parse_elf` function is to load these segments to the `output` address we got from the `choose_random_location` function. This function starts by checking the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) signature: 325 | 326 | ```C 327 | Elf64_Ehdr ehdr; 328 | Elf64_Phdr *phdrs, *phdr; 329 | 330 | memcpy(&ehdr, output, sizeof(ehdr)); 331 | 332 | if (ehdr.e_ident[EI_MAG0] != ELFMAG0 || 333 | ehdr.e_ident[EI_MAG1] != ELFMAG1 || 334 | ehdr.e_ident[EI_MAG2] != ELFMAG2 || 335 | ehdr.e_ident[EI_MAG3] != ELFMAG3) { 336 | error("Kernel is not a valid ELF file"); 337 | return; 338 | } 339 | ``` 340 | 341 | If the ELF header is not valid, it prints an error message and halts. 
If we have a valid `ELF` file, we go through all the program headers from the given `ELF` file and copy every loadable segment, after checking that its alignment is a multiple of 2 megabytes, to the output buffer: 342 | 343 | ```C 344 | for (i = 0; i < ehdr.e_phnum; i++) { 345 | phdr = &phdrs[i]; 346 | 347 | switch (phdr->p_type) { 348 | case PT_LOAD: 349 | #ifdef CONFIG_X86_64 350 | if ((phdr->p_align % 0x200000) != 0) 351 | error("Alignment of LOAD segment isn't multiple of 2MB"); 352 | #endif 353 | #ifdef CONFIG_RELOCATABLE 354 | dest = output; 355 | dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR); 356 | #else 357 | dest = (void *)(phdr->p_paddr); 358 | #endif 359 | memmove(dest, output + phdr->p_offset, phdr->p_filesz); 360 | break; 361 | default: 362 | break; 363 | } 364 | } 365 | ``` 366 | 367 | That's all. 368 | 369 | From this moment, all loadable segments are in the correct place. 370 | 371 | The next step after the `parse_elf` function is to call the `handle_relocations` function. The implementation of this function depends on the `CONFIG_X86_NEED_RELOCS` kernel configuration option and if it is enabled, this function adjusts addresses in the kernel image. This function is also only called if the `CONFIG_RANDOMIZE_BASE` configuration option was enabled during kernel configuration. The implementation of the `handle_relocations` function is straightforward. It subtracts the value of `LOAD_PHYSICAL_ADDR` from the base load address of the kernel, which gives the difference between where the kernel was linked to load and where it was actually loaded. After this we can relocate the kernel, since we know the actual address where the kernel was loaded, the address where it was linked to run and the relocation table which is at the end of the kernel image. 372 | 373 | After the kernel is relocated, we return from the `extract_kernel` function to [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S). 
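The address arithmetic used by `parse_elf` and `handle_relocations` can be summarized with a few lines of C. This is a simplified sketch under stated assumptions, not the kernel's implementation: the helper names `segment_dest`, `load_delta` and `relocate64` are invented for illustration, `LOAD_PHYSICAL_ADDR` is set to the value that matches the `readelf` output shown earlier, and a single delta is applied, whereas the real code walks a relocation table and can distinguish physical from virtual offsets.

```C
#include <assert.h>
#include <stdint.h>

/* Example value only; it matches the PhysAddr of the first LOAD segment above. */
#define LOAD_PHYSICAL_ADDR 0x1000000ULL

/* Where a PT_LOAD segment is copied to in the CONFIG_RELOCATABLE case. */
static uint64_t segment_dest(uint64_t output, uint64_t p_paddr)
{
	assert(p_paddr >= LOAD_PHYSICAL_ADDR);
	return output + (p_paddr - LOAD_PHYSICAL_ADDR);
}

/* Difference between the actual load address and the link-time one. */
static uint64_t load_delta(uint64_t actual_base)
{
	return actual_base - LOAD_PHYSICAL_ADDR;
}

/* Adjust one 64-bit address stored in the image by the delta. */
static uint64_t relocate64(uint64_t linked_value, uint64_t delta)
{
	return linked_value + delta;
}
```

With the kernel loaded at `0x2000000` instead of `0x1000000`, the second segment from the `readelf` output (`PhysAddr` `0x1893000`) would be copied to `0x2893000`, and the delta applied to relocated addresses would be `0x1000000`.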
374 | 375 | The address of the kernel will be in the `rax` register and we jump to it: 376 | 377 | ```assembly 378 | jmp *%rax 379 | ``` 380 | 381 | That's all. Now we are in the kernel! 382 | 383 | Conclusion 384 | -------------------------------------------------------------------------------- 385 | 386 | This is the end of the fifth part about the Linux kernel booting process. We will not see any more posts about the kernel booting process (there may be updates to this and previous posts though), but there will be many posts about other kernel internals. 387 | 388 | The next part describes more advanced details of the Linux kernel booting process, such as load address randomization. 389 | 390 | If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX). 391 | 392 | **Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).** 393 | 394 | Links 395 | -------------------------------------------------------------------------------- 396 | 397 | * [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) 398 | * [initrd](https://en.wikipedia.org/wiki/Initrd) 399 | * [long mode](https://en.wikipedia.org/wiki/Long_mode) 400 | * [bzip2](http://www.bzip.org/) 401 | * [RdRand instruction](https://en.wikipedia.org/wiki/RdRand) 402 | * [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) 403 | * [Programmable Interval Timers](https://en.wikipedia.org/wiki/Intel_8253) 404 | * [Previous part](https://github.com/0xAX/linux-insides/blob/v4.16/Booting/linux-bootstrap-4.md) 405 | -------------------------------------------------------------------------------- /CODEOWNERS: -------------------------------------------------------------------------------- 1 | # Owner of the repository 2 | * @0xAX 3 | 4 | # Documentation 
owners 5 | *.md @0xAX @klaudiagrz 6 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for 
clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | kuleshovmail@gmail.com. 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. 
Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 
123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. 129 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | This document outlines the contribution workflow, starting from opening an issue, creating a pull request (PR), reviewing, and merging the PR. When working on this project, make sure to follow the [Code of Conduct](./CODE_OF_CONDUCT.md). 4 | 5 | Thank you for your contribution. 6 | 7 | ## New contributor guide 8 | 9 | If you are a new open source contributor, here are some resources you may find useful before providing your first contributions: 10 | 11 | - [Finding ways to contribute to open source on GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/finding-ways-to-contribute-to-open-source-on-github) 12 | - [Set up Git](https://docs.github.com/en/get-started/getting-started-with-git/set-up-git) 13 | - [GitHub flow](https://docs.github.com/en/get-started/using-github/github-flow) 14 | - [Collaborating with pull requests](https://docs.github.com/en/github/collaborating-with-pull-requests) 15 | 16 | **Working on your first pull request?** You can learn how from this free series [How to Contribute to an Open Source Project on GitHub](https://kcd.im/pull-request). 17 | 18 | ## Create an issue 19 | 20 | If you have any improvement ideas, notice a missing feature or a bug, create a GitHub issue by clicking **Issues -> New issue** in GitHub. Make sure to fill the issue template with a detailed description of the bug or suggested improvements. Provide proper argumentation and screenshots, if necessary. 
21 | 22 | If you find any existing issue to work on, you are welcome to open a PR with a fix. 23 | 24 | ## Open a pull request 25 | 26 | If you want to directly contribute to the project, create a pull request with the suggested changes. To do so: 27 | 28 | 1. [Fork the repository](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo#fork-an-example-repository). 29 | 30 | 2. Make changes on your local copy of the forked repository. 31 | 32 | 3. Commit and push the changes to GitHub. 33 | 34 | > [!IMPORTANT] 35 | > Don't forget to update your fork. Since many contributors may be working on the same content based on the `master` branch, some merge conflicts may occur. Remember to rebase with `master` every time before pushing your changes and make sure your branch doesn't have any conflicts with `master`. If you run into any merge conflicts, read the [Resolve merge conflicts](https://github.com/skills/resolve-merge-conflicts) tutorial to learn how to resolve merge conflicts and other issues. 36 | 37 | 4. Open a pull request in GitHub. Fill the pull request template with the reason and description for the provided changes. Link your pull request with the existing issue, if applicable. After submitting your PR, wait for the review from the project maintainers. 38 | 39 | ## Review and approval process 40 | 41 | After you submit your PR, wait for the review. The project maintainers will evaluate your changes and provide feedback either using [suggested changes](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/incorporating-feedback-in-your-pull-request) or pull request comments. Address the review suggestions and comments as soon as you can. If your PR looks good, the maintainers approve and merge it. 42 | 43 | ## Contributors 44 | 45 | All contributions get credit in [Contributors](contributors.md). Don't forget to add yourself there.
46 | -------------------------------------------------------------------------------- /Cgroups/README.md: -------------------------------------------------------------------------------- 1 | # Cgroups 2 | 3 | This chapter describes `control groups` mechanism in the Linux kernel. 4 | 5 | * [Introduction](linux-cgroups-1.md) 6 | -------------------------------------------------------------------------------- /Cgroups/images/menuconfig.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Cgroups/images/menuconfig.png -------------------------------------------------------------------------------- /Concepts/README.md: -------------------------------------------------------------------------------- 1 | # Linux kernel concepts 2 | 3 | This chapter describes various concepts which are used in the Linux kernel. 4 | 5 | * [Per-CPU variables](linux-cpu-1.md) 6 | * [CPU masks](linux-cpu-2.md) 7 | * [The initcall mechanism](linux-cpu-3.md) 8 | * [Notification Chains](linux-cpu-4.md) -------------------------------------------------------------------------------- /Concepts/linux-cpu-1.md: -------------------------------------------------------------------------------- 1 | Per-CPU variables 2 | ================================================================================ 3 | 4 | Per-CPU variables are one of the kernel features. You can understand the meaning of this feature by reading its name. We can create a variable and each processor core will have its own copy of this variable. In this part, we take a closer look at this feature and try to understand how it is implemented and how it works. 
5 | 6 | The kernel provides an API for creating per-cpu variables - the `DEFINE_PER_CPU` macro: 7 | 8 | ```C 9 | #define DEFINE_PER_CPU(type, name) \ 10 | DEFINE_PER_CPU_SECTION(type, name, "") 11 | ``` 12 | 13 | This macro defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/percpu-defs.h) as many other macros for work with per-cpu variables. Now we will see how this feature is implemented. 14 | 15 | Take a look at the `DEFINE_PER_CPU` definition. We see that it takes 2 parameters: `type` and `name`, so we can use it to create per-cpu variables, for example like this: 16 | 17 | ```C 18 | DEFINE_PER_CPU(int, per_cpu_n) 19 | ``` 20 | 21 | We pass the type and the name of our variable. `DEFINE_PER_CPU` calls the `DEFINE_PER_CPU_SECTION` macro and passes the same two parameters and empty string to it. Let's look at the definition of the `DEFINE_PER_CPU_SECTION`: 22 | 23 | ```C 24 | #define DEFINE_PER_CPU_SECTION(type, name, sec) \ 25 | __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \ 26 | __typeof__(type) name 27 | ``` 28 | 29 | ```C 30 | #define __PCPU_ATTRS(sec) \ 31 | __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \ 32 | PER_CPU_ATTRIBUTES 33 | ``` 34 | 35 | where `section` is: 36 | 37 | ```C 38 | #define PER_CPU_BASE_SECTION ".data..percpu" 39 | ``` 40 | 41 | After all macros are expanded we will get a global per-cpu variable: 42 | 43 | ```C 44 | __attribute__((section(".data..percpu"))) int per_cpu_n 45 | ``` 46 | 47 | It means that we will have a `per_cpu_n` variable in the `.data..percpu` section. We can find this section in the `vmlinux`: 48 | 49 | ``` 50 | .data..percpu 00013a58 0000000000000000 0000000001a5c000 00e00000 2**12 51 | CONTENTS, ALLOC, LOAD, DATA 52 | ``` 53 | 54 | Ok, now we know that when we use the `DEFINE_PER_CPU` macro, a per-cpu variable in the `.data..percpu` section will be created. 
When the kernel initializes it calls the `setup_per_cpu_areas` function which loads the `.data..percpu` section multiple times, one section per CPU. 55 | 56 | Let's look at the per-CPU areas initialization process. It starts in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) from the call of the `setup_per_cpu_areas` function which is defined in the [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup_percpu.c). 57 | 58 | ```C 59 | pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n", 60 | NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids); 61 | ``` 62 | 63 | The `setup_per_cpu_areas` starts from the output information about the maximum number of CPUs set during kernel configuration with the `CONFIG_NR_CPUS` configuration option, actual number of CPUs, `nr_cpumask_bits` is the same that `NR_CPUS` bit for the new `cpumask` operators and number of `NUMA` nodes. 64 | 65 | We can see this output in the dmesg: 66 | 67 | ``` 68 | $ dmesg | grep percpu 69 | [ 0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1 70 | ``` 71 | 72 | In the next step we check the `percpu` first chunk allocator. All percpu areas are allocated in chunks. The first chunk is used for the static percpu variables. The Linux kernel has `percpu_alloc` command line parameters which provides the type of the first chunk allocator. We can read about it in the kernel documentation: 73 | 74 | ``` 75 | percpu_alloc= Select which percpu first chunk allocator to use. 76 | Currently supported values are "embed" and "page". 77 | Archs may support subset or none of the selections. 78 | See comments in mm/percpu.c for details on each 79 | allocator. This parameter is primarily for debugging 80 | and performance comparison. 
81 | ``` 82 | 83 | The [mm/percpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/mm/percpu.c) contains the handler of this command line option: 84 | 85 | ```C 86 | early_param("percpu_alloc", percpu_alloc_setup); 87 | ``` 88 | 89 | Where the `percpu_alloc_setup` function sets the `pcpu_chosen_fc` variable depending on the `percpu_alloc` parameter value. By default the first chunk allocator is `auto`: 90 | 91 | ```C 92 | enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO; 93 | ``` 94 | 95 | If the `percpu_alloc` parameter is not given on the kernel command line, the `embed` allocator will be used, which embeds the first percpu chunk into bootmem with the [memblock](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-1). The other allocator is the first chunk `page` allocator, which maps the first chunk with `PAGE_SIZE` pages. 96 | 97 | As I wrote above, first of all we check the first chunk allocator type in `setup_per_cpu_areas`. We check that the first chunk allocator is not `page`: 98 | 99 | ```C 100 | if (pcpu_chosen_fc != PCPU_FC_PAGE) { 101 | ... 102 | ... 103 | ... 104 | } 105 | ``` 106 | 107 | If it is not `PCPU_FC_PAGE`, we will use the `embed` allocator and allocate space for the first chunk with the `pcpu_embed_first_chunk` function: 108 | 109 | ```C 110 | rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE, 111 | dyn_size, atom_size, 112 | pcpu_cpu_distance, 113 | pcpu_fc_alloc, pcpu_fc_free); 114 | ``` 115 | 116 | As described above, the `pcpu_embed_first_chunk` function embeds the first percpu chunk into bootmem. We pass several parameters to `pcpu_embed_first_chunk`.
They are as follows: 117 | 118 | * `PERCPU_FIRST_CHUNK_RESERVE` - the size of the reserved space for the static `percpu` variables; 119 | * `dyn_size` - minimum free size for dynamic allocation in bytes; 120 | * `atom_size` - all allocations are whole multiples of this and aligned to this parameter; 121 | * `pcpu_cpu_distance` - callback to determine distance between cpus; 122 | * `pcpu_fc_alloc` - function to allocate `percpu` page; 123 | * `pcpu_fc_free` - function to release `percpu` page. 124 | 125 | We calculate all of these parameters before the call of the `pcpu_embed_first_chunk`: 126 | 127 | ```C 128 | const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE; 129 | size_t atom_size; 130 | #ifdef CONFIG_X86_64 131 | atom_size = PMD_SIZE; 132 | #else 133 | atom_size = PAGE_SIZE; 134 | #endif 135 | ``` 136 | 137 | If the first chunk allocator is `PCPU_FC_PAGE`, we will use the `pcpu_page_first_chunk` instead of the `pcpu_embed_first_chunk`. Once the `percpu` areas are up, we set up the `percpu` offset and its segment for every CPU with the `setup_percpu_segment` function (only for `x86` systems) and move some early data from the arrays to the `percpu` variables (`x86_cpu_to_apicid`, `irq_stack_ptr`, and so on). After the kernel finishes the initialization process, we will have loaded N `.data..percpu` sections, where N is the number of CPUs, and the section used by the bootstrap processor will contain an uninitialized variable created with the `DEFINE_PER_CPU` macro.
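The allocator choice described above can be modelled in user space. This is a minimal sketch, not the kernel's code: the enum is re-declared only so the example compiles on its own, and `parse_percpu_alloc` and `first_chunk_routine` are hypothetical helpers invented for illustration.

```c
#include <string.h>

/* Re-declared here so the sketch builds outside the kernel tree. */
enum pcpu_fc { PCPU_FC_AUTO, PCPU_FC_EMBED, PCPU_FC_PAGE };

/*
 * Hypothetical helper: map a percpu_alloc= value to an allocator kind.
 * A missing or unknown value keeps the default, PCPU_FC_AUTO.
 */
static enum pcpu_fc parse_percpu_alloc(const char *str)
{
	if (str && strcmp(str, "embed") == 0)
		return PCPU_FC_EMBED;
	if (str && strcmp(str, "page") == 0)
		return PCPU_FC_PAGE;
	return PCPU_FC_AUTO;
}

/*
 * Hypothetical helper: name the first-chunk routine that the text says
 * setup_per_cpu_areas picks for a given allocator kind.
 */
static const char *first_chunk_routine(enum pcpu_fc fc)
{
	return fc != PCPU_FC_PAGE ? "pcpu_embed_first_chunk"
				  : "pcpu_page_first_chunk";
}
```

With this model, both the default (`auto`) and `embed` lead to `pcpu_embed_first_chunk`, and only an explicit `page` selects `pcpu_page_first_chunk`.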
138 | 139 | The kernel provides an API for per-cpu variables manipulating: 140 | 141 | * get_cpu_var(var) 142 | * put_cpu_var(var) 143 | 144 | 145 | Let's look at the `get_cpu_var` implementation: 146 | 147 | ```C 148 | #define get_cpu_var(var) \ 149 | (*({ \ 150 | preempt_disable(); \ 151 | this_cpu_ptr(&var); \ 152 | })) 153 | ``` 154 | 155 | The Linux kernel is preemptible and accessing a per-cpu variable requires us to know which processor the kernel is running on. So, current code must not be preempted and moved to the another CPU while accessing a per-cpu variable. That's why, first of all we can see a call of the `preempt_disable` function then a call of the `this_cpu_ptr` macro, which looks like: 156 | 157 | ```C 158 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr) 159 | ``` 160 | 161 | and 162 | 163 | ```C 164 | #define raw_cpu_ptr(ptr) per_cpu_ptr(ptr, 0) 165 | ``` 166 | 167 | where `per_cpu_ptr` returns a pointer to the per-cpu variable for the given cpu (second parameter). After we've created a per-cpu variable and made modifications to it, we must call the `put_cpu_var` macro which enables preemption with a call of `preempt_enable` function. So the typical usage of a per-cpu variable is as follows: 168 | 169 | ```C 170 | get_cpu_var(var); 171 | ... 172 | //Do something with the 'var' 173 | ... 174 | put_cpu_var(var); 175 | ``` 176 | 177 | Let's look at the `per_cpu_ptr` macro: 178 | 179 | ```C 180 | #define per_cpu_ptr(ptr, cpu) \ 181 | ({ \ 182 | __verify_pcpu_ptr(ptr); \ 183 | SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \ 184 | }) 185 | ``` 186 | 187 | As I wrote above, this macro returns a per-cpu variable for the given cpu. 
First of all it calls `__verify_pcpu_ptr`: 188 | 189 | ```C 190 | #define __verify_pcpu_ptr(ptr) \ 191 | do { \ 192 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \ 193 | (void)__vpp_verify; \ 194 | } while (0) 195 | ``` 196 | 197 | which only verifies that the given `ptr` is a per-cpu pointer: it assigns `NULL`, cast to the type of `ptr`, to a `const void __percpu *` variable and never evaluates `ptr` itself. 198 | 199 | After this we can see the call of the `SHIFT_PERCPU_PTR` macro with two parameters. The first parameter is our `ptr`, and the second is the result of passing the cpu number to the `per_cpu_offset` macro: 200 | 201 | ```C 202 | #define per_cpu_offset(x) (__per_cpu_offset[x]) 203 | ``` 204 | 205 | which expands to getting the `x` element from the `__per_cpu_offset` array: 206 | 207 | 208 | ```C 209 | extern unsigned long __per_cpu_offset[NR_CPUS]; 210 | ``` 211 | 212 | where `NR_CPUS` is the number of CPUs. The `__per_cpu_offset` array is filled with the distances between cpu-variable copies. For example, if all per-cpu data is `X` bytes in size, then `__per_cpu_offset[Y]` gives an offset of `X*Y`. Let's look at the `SHIFT_PERCPU_PTR` implementation: 213 | 214 | ```C 215 | #define SHIFT_PERCPU_PTR(__p, __offset) \ 216 | RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset)) 217 | ``` 218 | 219 | `RELOC_HIDE` just adds the offset to the pointer, `(typeof(ptr)) (__ptr + (off))`, and returns a pointer to the variable. 220 | 221 | That's all! Of course it is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/percpu-defs.h) magic.
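The pointer arithmetic behind `per_cpu_ptr` and `SHIFT_PERCPU_PTR` can be tried out in user space. The sketch below is a simplified model under stated assumptions: `NR_CPUS` is fixed at 4, the `.data..percpu` template is an ordinary struct, and every `model_*` name is hypothetical. It relies on the GNU `__typeof__` extension (GCC and Clang).

```c
#include <stdint.h>
#include <string.h>

#define NR_CPUS 4	/* assumed CPU count for this model */

/* Stand-in for the .data..percpu template section. */
struct percpu_section {
	int per_cpu_n;
	long per_cpu_counter;
};

static struct percpu_section template_section = { 7, 0 };
static struct percpu_section cpu_area[NR_CPUS];	/* one copy per CPU */
static uintptr_t model_per_cpu_offset[NR_CPUS];	/* plays __per_cpu_offset */

/* Copy the template once per CPU and record where each copy landed. */
static void model_setup_per_cpu_areas(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		memcpy(&cpu_area[cpu], &template_section,
		       sizeof(template_section));
		model_per_cpu_offset[cpu] =
			(uintptr_t)&cpu_area[cpu] - (uintptr_t)&template_section;
	}
}

/* Same idea as SHIFT_PERCPU_PTR: template address plus the CPU's offset. */
#define model_per_cpu_ptr(ptr, cpu) \
	((__typeof__(ptr))((uintptr_t)(ptr) + model_per_cpu_offset[(cpu)]))
```

After `model_setup_per_cpu_areas()` runs, `model_per_cpu_ptr(&template_section.per_cpu_n, 2)` points into the copy that belongs to CPU 2, so a write through it leaves the other copies and the template untouched.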
222 | 223 | Let's again look at the algorithm of getting a pointer to a per-cpu variable: 224 | 225 | * The kernel creates multiple `.data..percpu` sections (one per-cpu) during initialization process; 226 | * All variables created with the `DEFINE_PER_CPU` macro will be relocated to the first section or for CPU0; 227 | * `__per_cpu_offset` array filled with the distance (`BOOT_PERCPU_OFFSET`) between `.data..percpu` sections; 228 | * When the `per_cpu_ptr` is called, for example for getting a pointer on a certain per-cpu variable for the third CPU, the `__per_cpu_offset` array will be accessed, where every index points to the required CPU. 229 | 230 | That's all. 231 | -------------------------------------------------------------------------------- /Concepts/linux-cpu-2.md: -------------------------------------------------------------------------------- 1 | CPU masks 2 | ================================================================================ 3 | 4 | Introduction 5 | -------------------------------------------------------------------------------- 6 | 7 | `Cpumasks` is a special way provided by the Linux kernel to store information about CPUs in the system. The relevant source code and header files which contains API for `Cpumasks` manipulation: 8 | 9 | * [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h) 10 | * [lib/cpumask.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/cpumask.c) 11 | * [kernel/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpu.c) 12 | 13 | As comment says from the [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h): Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. 
We already saw a bit about cpumask in the `boot_cpu_init` function from the [Kernel entry point](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4) part. This function marks the boot cpu as online, active, and so on: 14 | 15 | ```C 16 | set_cpu_online(cpu, true); 17 | set_cpu_active(cpu, true); 18 | set_cpu_present(cpu, true); 19 | set_cpu_possible(cpu, true); 20 | ``` 21 | 22 | Before we consider the implementation of these functions, let's consider all of these masks. 23 | 24 | The `cpu_possible` mask is the set of cpu IDs which can be plugged in at any time during the life of that system boot. In other words, the mask of possible CPUs covers the maximum number of CPUs which are possible in the system. It will be equal to the value of `NR_CPUS`, which is set statically via the `CONFIG_NR_CPUS` kernel configuration option. 25 | 26 | The `cpu_present` mask represents which CPUs are currently plugged in. 27 | 28 | The `cpu_online` mask is a subset of `cpu_present` and indicates the CPUs which are available for scheduling. In other words, a bit from this mask tells the kernel whether a processor may be utilized by the Linux kernel. 29 | 30 | The last mask is `cpu_active`. The bits of this mask tell the Linux kernel whether a task may be moved to a certain processor. 31 | 32 | All of these masks depend on the `CONFIG_HOTPLUG_CPU` configuration option, and if this option is disabled, `possible == present` and `active == online`. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is `true`, it calls `cpumask_set_cpu`, otherwise it calls `cpumask_clear_cpu`. 33 | 34 | There are two ways to create a `cpumask`. The first is to use `cpumask_t`. It is defined as: 35 | 36 | ```C 37 | typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t; 38 | ``` 39 | 40 | It wraps the `cpumask` structure which contains one bitmask `bits` field.
The `DECLARE_BITMAP` macro gets two parameters: 41 | 42 | * bitmap name; 43 | * number of bits. 44 | 45 | and creates an array of `unsigned long` with the given name. Its implementation is pretty easy: 46 | 47 | ```C 48 | #define DECLARE_BITMAP(name,bits) \ 49 | unsigned long name[BITS_TO_LONGS(bits)] 50 | ``` 51 | 52 | where `BITS_TO_LONGS`: 53 | 54 | ```C 55 | #define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long)) 56 | #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d)) 57 | ``` 58 | 59 | As we are focusing on the `x86_64` architecture, `unsigned long` is 8 bytes (64 bits) in size, so with `8` bits requested our array will contain only one element: 60 | 61 | ``` 62 | (((8) + (64) - 1) / (64)) = 1 63 | ``` 64 | 65 | The `NR_CPUS` macro represents the number of CPUs in the system and depends on the `CONFIG_NR_CPUS` macro which is defined in [include/linux/threads.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/threads.h) and looks like this: 66 | 67 | ```C 68 | #ifndef CONFIG_NR_CPUS 69 | #define CONFIG_NR_CPUS 1 70 | #endif 71 | 72 | #define NR_CPUS CONFIG_NR_CPUS 73 | ``` 74 | 75 | The second way to define a cpumask is to use the `DECLARE_BITMAP` macro directly and the `to_cpumask` macro which converts the given bitmap to `struct cpumask *`: 76 | 77 | ```C 78 | #define to_cpumask(bitmap) \ 79 | ((struct cpumask *)(1 ? (bitmap) \ 80 | : (void *)sizeof(__check_is_bitmap(bitmap)))) 81 | ``` 82 | 83 | We can see the ternary operator here, whose condition is `true` every time. The `__check_is_bitmap` inline function is defined as: 84 | 85 | ```C 86 | static inline int __check_is_bitmap(const unsigned long *bitmap) 87 | { 88 | return 1; 89 | } 90 | ``` 91 | 92 | And returns `1` every time. We need it here for only one purpose: at compile time it checks that a given `bitmap` is a bitmap, or in other words it checks that a given `bitmap` has the type `unsigned long *`.
So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting an array of `unsigned long` to the `struct cpumask *`. 93 | 94 | cpumask API 95 | -------------------------------------------------------------------------------- 96 | 97 | Now that we can define a cpumask with either of these methods, the Linux kernel provides an API for manipulating it. Let's consider one of the functions presented above, for example `set_cpu_online`. This function takes two parameters: 98 | 99 | * Index of CPU; 100 | * CPU status; 101 | 102 | The implementation of this function looks like this: 103 | 104 | ```C 105 | void set_cpu_online(unsigned int cpu, bool online) 106 | { 107 | if (online) { 108 | cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits)); 109 | cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits)); 110 | } else { 111 | cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits)); 112 | } 113 | } 114 | ``` 115 | 116 | First of all it checks the second `online` parameter and calls `cpumask_set_cpu` or `cpumask_clear_cpu` depending on it. Here we can see casting to the `struct cpumask *` of the second parameter in the `cpumask_set_cpu`. In our case it is `cpu_online_bits` which is a bitmap and defined as: 117 | 118 | ```C 119 | static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly; 120 | ``` 121 | 122 | The `cpumask_set_cpu` function makes only one call to the `set_bit` function: 123 | 124 | ```C 125 | static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp) 126 | { 127 | set_bit(cpumask_check(cpu), cpumask_bits(dstp)); 128 | } 129 | ``` 130 | 131 | The `set_bit` function takes two parameters too, and sets a given bit (first parameter) in the memory (second parameter or `cpu_online_bits` bitmap). We can see here that before `set_bit` is called, its two parameters will be passed to the 132 | 133 | * cpumask_check; 134 | * cpumask_bits. 135 | 136 | Let's consider these two macros. First, `cpumask_check` does nothing in our case and just returns the given parameter.
The second, `cpumask_bits`, just returns the `bits` field from the given `struct cpumask *` structure: 137 | 138 | ```C 139 | #define cpumask_bits(maskp) ((maskp)->bits) 140 | ``` 141 | 142 | Now let's look at the `set_bit` implementation: 143 | 144 | ```C 145 | static __always_inline void 146 | set_bit(long nr, volatile unsigned long *addr) 147 | { 148 | if (IS_IMMEDIATE(nr)) { 149 | asm volatile(LOCK_PREFIX "orb %1,%0" 150 | : CONST_MASK_ADDR(nr, addr) 151 | : "iq" ((u8)CONST_MASK(nr)) 152 | : "memory"); 153 | } else { 154 | asm volatile(LOCK_PREFIX "bts %1,%0" 155 | : BITOP_ADDR(addr) : "Ir" (nr) : "memory"); 156 | } 157 | } 158 | ``` 159 | 160 | This function looks scary, but it is not as hard as it seems. First of all it passes `nr`, the number of the bit, to the `IS_IMMEDIATE` macro which just calls the GCC internal `__builtin_constant_p` function: 161 | 162 | ```C 163 | #define IS_IMMEDIATE(nr) (__builtin_constant_p(nr)) 164 | ``` 165 | 166 | `__builtin_constant_p` checks whether the given parameter is a known constant at compile time. As our `cpu` is not a compile-time constant, the `else` clause will be executed: 167 | 168 | ```C 169 | asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory"); 170 | ``` 171 | 172 | Let's try to understand how it works step by step: 173 | 174 | `LOCK_PREFIX` is the x86 `lock` prefix. This prefix tells the cpu to occupy the system bus while the prefixed instruction is executed. This allows the CPU to synchronize memory access, preventing simultaneous access of multiple processors (or devices - the DMA controller for example) to one memory cell. 175 | 176 | `BITOP_ADDR` casts the given parameter to `*(volatile long *)` and adds the `+m` constraint. `+` means that this operand is both read and written by the instruction. `m` shows that this is a memory operand.
`BITOP_ADDR` is defined as: 177 | 178 | ```C 179 | #define BITOP_ADDR(x) "+m" (*(volatile long *) (x)) 180 | ``` 181 | 182 | Next is the `memory` clobber. It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters). 183 | 184 | `Ir` - the operand may be an immediate (`I`, a constant in the range 0 to 31) or a register (`r`). 185 | 186 | 187 | The `bts` instruction sets a given bit in a bit string and stores the previous value of that bit in the `CF` flag. So we passed the cpu number, which is zero in our case, and after `set_bit` is executed, it sets the zero bit in the `cpu_online_bits` cpumask. It means that the first cpu is online at this moment. 188 | 189 | Besides the `set_cpu_*` API, cpumask of course provides other APIs for cpumask manipulation. Let's consider them briefly. 190 | 191 | Additional cpumask API 192 | -------------------------------------------------------------------------------- 193 | 194 | cpumask provides a set of macros for getting the numbers of CPUs in various states. For example: 195 | 196 | ```C 197 | #define num_online_cpus() cpumask_weight(cpu_online_mask) 198 | ``` 199 | 200 | This macro returns the number of `online` CPUs. It calls the `cpumask_weight` function with the `cpu_online_mask` bitmap. The `cpumask_weight` function makes one call of the `bitmap_weight` function with two parameters: 201 | 202 | * cpumask bitmap; 203 | * `nr_cpumask_bits` - which is `NR_CPUS` in our case. 204 | 205 | ```C 206 | static inline unsigned int cpumask_weight(const struct cpumask *srcp) 207 | { 208 | return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits); 209 | } 210 | ``` 211 | 212 | and calculates the number of set bits in the given bitmap. Besides `num_online_cpus`, cpumask provides macros for all of the CPU states: 213 | 214 | * num_possible_cpus; 215 | * num_active_cpus; 216 | * cpu_online; 217 | * cpu_possible. 218 | 219 | and many more.
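To make the bitmap layout and the "weight" computation concrete, here is a small user-space model. It is a sketch only: `MODEL_NR_CPUS` is an assumed value, all `model_*` names are hypothetical, and the bit operations are plain C rather than the atomic `lock`-prefixed instructions the kernel's `set_bit` uses.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 128	/* assumed value standing in for NR_CPUS */
#define MODEL_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, MODEL_BITS_PER_LONG)

/* One bit per CPU number, packed into an array of unsigned long. */
struct model_cpumask {
	unsigned long bits[BITS_TO_LONGS(MODEL_NR_CPUS)];
};

/* Non-atomic counterpart of cpumask_set_cpu. */
static void model_set_cpu(unsigned int cpu, struct model_cpumask *m)
{
	m->bits[cpu / MODEL_BITS_PER_LONG] |= 1UL << (cpu % MODEL_BITS_PER_LONG);
}

/* Non-atomic counterpart of cpumask_clear_cpu. */
static void model_clear_cpu(unsigned int cpu, struct model_cpumask *m)
{
	m->bits[cpu / MODEL_BITS_PER_LONG] &= ~(1UL << (cpu % MODEL_BITS_PER_LONG));
}

/* Counterpart of cpumask_test_cpu. */
static bool model_test_cpu(unsigned int cpu, const struct model_cpumask *m)
{
	return (m->bits[cpu / MODEL_BITS_PER_LONG] >>
		(cpu % MODEL_BITS_PER_LONG)) & 1UL;
}

/* Count the set bits, which is what cpumask_weight reports. */
static unsigned int model_weight(const struct model_cpumask *m)
{
	unsigned int count = 0;

	for (size_t i = 0; i < BITS_TO_LONGS(MODEL_NR_CPUS); i++)
		for (unsigned long w = m->bits[i]; w; w &= w - 1)
			count++;	/* each pass clears the lowest set bit */
	return count;
}
```

Setting CPUs 0, 3 and 65 in an empty mask and then calling `model_weight` counts three set bits, even though CPU 65 lives in a different array element than the other two on a 64-bit system.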
220 | 221 | Besides that the Linux kernel provides the following API for the manipulation of `cpumask`: 222 | 223 | * `for_each_cpu` - iterates over every cpu in a mask; 224 | * `for_each_cpu_not` - iterates over every cpu in a complemented mask; 225 | * `cpumask_clear_cpu` - clears a cpu in a cpumask; 226 | * `cpumask_test_cpu` - tests a cpu in a mask; 227 | * `cpumask_setall` - set all cpus in a mask; 228 | * `cpumask_size` - returns size to allocate for a 'struct cpumask' in bytes; 229 | 230 | and many many more... 231 | 232 | Links 233 | -------------------------------------------------------------------------------- 234 | 235 | * [cpumask documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) 236 | -------------------------------------------------------------------------------- /Concepts/linux-cpu-4.md: -------------------------------------------------------------------------------- 1 | Notification Chains in Linux Kernel 2 | ================================================================================ 3 | 4 | Introduction 5 | -------------------------------------------------------------------------------- 6 | 7 | The Linux kernel is a huge piece of [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code which consists of many different subsystems. Each subsystem has its own purpose which is independent of other subsystems. But often one subsystem wants to know something from other subsystem(s). There is a special mechanism in the Linux kernel which partly solves this problem. The name of this mechanism is `notification chains`, and its main purpose is to provide a way for different subsystems to subscribe to asynchronous events from other subsystems. Note that this mechanism is only for communication inside the kernel; there are other mechanisms for communication between the kernel and userspace.
8 | 9 | Before we consider the `notification chains` [API](https://en.wikipedia.org/wiki/Application_programming_interface) and the implementation of this API, let's look at the `notification chains` mechanism from the theoretical side, as we did in other parts of this book. Everything related to the `notification chains` mechanism is located in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file and the [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file. So let's open them and dive in. 10 | 11 | Notification Chains related data structures 12 | -------------------------------------------------------------------------------- 13 | 14 | Let's start our consideration of the `notification chains` mechanism with the related data structures. As I wrote above, the main data structures are located in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, so the Linux kernel provides a generic API which does not depend on a certain architecture. In general, the `notification chains` mechanism represents a list (that's why it's named `chains`) of [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) functions which will be executed when an event occurs. 15 | 16 | All of these callback functions are represented by the `notifier_fn_t` type in the Linux kernel: 17 | 18 | ```C 19 | typedef int (*notifier_fn_t)(struct notifier_block *nb, unsigned long action, void *data); 20 | ``` 21 | 22 | So we may see that it takes the following three arguments: 23 | 24 | * `nb` - the `notifier_block` through which the callback was registered; it is an element of a linked list of function pointers (we will see it soon); 25 | * `action` - the type of an event. A notification chain may support multiple events, so we need this parameter to distinguish an event from other events; 26 | * `data` - storage for private information. It allows the producer to pass additional information about an event.
27 | 28 | Additionally we may see that `notifier_fn_t` returns an integer value. This integer value may be one of: 29 | 30 | * `NOTIFY_DONE` - the subscriber is not interested in the notification; 31 | * `NOTIFY_OK` - the notification was processed correctly; 32 | * `NOTIFY_BAD` - something went wrong; 33 | * `NOTIFY_STOP` - the notification is done, but no further callbacks should be called for this event. 34 | 35 | All of these results are defined as macros in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file: 36 | 37 | ```C 38 | #define NOTIFY_DONE 0x0000 39 | #define NOTIFY_OK 0x0001 40 | #define NOTIFY_BAD (NOTIFY_STOP_MASK|0x0002) 41 | #define NOTIFY_STOP (NOTIFY_OK|NOTIFY_STOP_MASK) 42 | ``` 43 | 44 | Where `NOTIFY_STOP_MASK` is represented by the: 45 | 46 | ```C 47 | #define NOTIFY_STOP_MASK 0x8000 48 | ``` 49 | 50 | macro and means that the remaining callbacks in the chain will not be called for the current event. 51 | 52 | Each part of the Linux kernel which wants to be notified about a certain event should provide its own `notifier_fn_t` callback function. The main role of the `notification chains` mechanism is to call these callbacks when an asynchronous event occurs. 53 | 54 | The main building block of the `notification chains` mechanism is the `notifier_block` structure: 55 | 56 | ```C 57 | struct notifier_block { 58 | notifier_fn_t notifier_call; 59 | struct notifier_block __rcu *next; 60 | int priority; 61 | }; 62 | ``` 63 | 64 | which is defined in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) file. This struct contains a pointer to the callback function - `notifier_call`, a link to the next notification callback, and the `priority` of the callback function, as functions with higher priority are executed first.
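To see how the callback type, the return codes and the `priority` field fit together, here is a minimal user-space sketch. It is only an illustration under simplifying assumptions: there is no locking and no RCU, the `__rcu` annotation is dropped, and the `chain_register_sketch`, `chain_call_sketch`, `low_prio_cb` and `high_prio_cb` names are hypothetical, not kernel functions:

```C
#include <stddef.h>

#define NOTIFY_DONE		0x0000
#define NOTIFY_OK		0x0001
#define NOTIFY_STOP_MASK	0x8000
#define NOTIFY_STOP		(NOTIFY_OK|NOTIFY_STOP_MASK)

struct notifier_block;
typedef int (*notifier_fn_t)(struct notifier_block *nb,
			     unsigned long action, void *data);

/* Same fields as the kernel structure, without the __rcu annotation. */
struct notifier_block {
	notifier_fn_t notifier_call;
	struct notifier_block *next;
	int priority;
};

/* Hypothetical helper: insert `n` so that higher priorities come first. */
void chain_register_sketch(struct notifier_block **head,
			   struct notifier_block *n)
{
	while (*head != NULL && n->priority <= (*head)->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Hypothetical helper: call callbacks in order until one asks to stop. */
int chain_call_sketch(struct notifier_block *head,
		      unsigned long action, void *data)
{
	int ret = NOTIFY_DONE;

	for (struct notifier_block *nb = head; nb != NULL; nb = nb->next) {
		ret = nb->notifier_call(nb, action, data);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

/* Example callbacks: each appends a digit to the int that `data` points to. */
int low_prio_cb(struct notifier_block *nb, unsigned long action, void *data)
{
	(void)nb;
	(void)action;
	*(int *)data = *(int *)data * 10 + 1;
	return NOTIFY_OK;
}

int high_prio_cb(struct notifier_block *nb, unsigned long action, void *data)
{
	(void)nb;
	*(int *)data = *(int *)data * 10 + 2;
	/* Event 1 is consumed here, so lower-priority callbacks are skipped. */
	return action == 1 ? NOTIFY_STOP : NOTIFY_OK;
}
```

In this sketch a block with `priority = 10` that uses `high_prio_cb` ends up in front of a block with `priority = 0` that uses `low_prio_cb`, regardless of registration order. For event `0` both callbacks run, the high-priority one first; for event `1` only the high-priority callback runs, because its return value carries `NOTIFY_STOP_MASK`.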
65 | 66 | The Linux kernel provides notification chains of the following four types: 67 | 68 | * Blocking notifier chains; 69 | * SRCU notifier chains; 70 | * Atomic notifier chains; 71 | * Raw notifier chains. 72 | 73 | Let's consider all of these types of notification chains in order: 74 | 75 | In the first case, for the `blocking notifier chains`, callbacks are called/executed in process context. This means that the calls in a notification chain may block. 76 | 77 | The second type, `SRCU notifier chains`, is an alternative form of `blocking notifier chains`. Blocking notifier chains use the `rw_semaphore` synchronization primitive to protect chain links. `SRCU` notifier chains run in process context too, but use a special form of the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) mechanism in which it is permissible to block in a read-side critical section. 78 | 79 | In the third case, the `atomic notifier chains` run in interrupt or atomic context and are protected by the [spinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) synchronization primitive. The last type, `raw notifier chains`, provides a special type of notifier chains without any locking restrictions on callbacks. This means that protection rests on the shoulders of the caller. It is very useful when we want to protect our chain with a very specific locking mechanism. 80 | 81 | If we look at the implementation of the `notifier_block` structure, we will see that it contains a pointer to the `next` element of a notification chain list, but there is no head. The head of such a list lives in a separate structure which depends on the type of the notification chain.
For example, for the `blocking notifier chains`: 82 | 83 | ```C 84 | struct blocking_notifier_head { 85 | struct rw_semaphore rwsem; 86 | struct notifier_block __rcu *head; 87 | }; 88 | ``` 89 | 90 | or for `atomic notification chains`: 91 | 92 | ```C 93 | struct atomic_notifier_head { 94 | spinlock_t lock; 95 | struct notifier_block __rcu *head; 96 | }; 97 | ``` 98 | 99 | Now that we know a little about the `notification chains` mechanism, let's consider the implementation of its API. 100 | 101 | Notification Chains 102 | -------------------------------------------------------------------------------- 103 | 104 | Usually there are two sides in a publish/subscribe mechanism: one side which wants to get notifications and the other side(s) which generate these notifications. We will consider the notification chains mechanism from both sides. We will consider `blocking notification chains` in this part, because the other types of notification chains are similar to it and differ mostly in protection mechanisms. 105 | 106 | Before a notification producer is able to produce a notification, it should first of all initialize the head of a notification chain. For example, let's consider notification chains related to kernel [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module). If we look in the [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c) source code file, we will see the following definition: 107 | 108 | ```C 109 | static BLOCKING_NOTIFIER_HEAD(module_notify_list); 110 | ``` 111 | 112 | which defines the head of the blocking notifier chain for loadable modules.
The `BLOCKING_NOTIFIER_HEAD` macro is defined in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file; it statically defines and initializes such a head. The same header also provides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, which initializes an existing head at run time: 113 | 114 | ```C 115 | #define BLOCKING_INIT_NOTIFIER_HEAD(name) do { \ 116 | init_rwsem(&(name)->rwsem); \ 117 | (name)->head = NULL; \ 118 | } while (0) 119 | ``` 120 | 121 | So we may see that it takes a pointer to the head of a blocking notifier chain, initializes the read/write [semaphore](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-3) and sets the head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides the `ATOMIC_INIT_NOTIFIER_HEAD` and `RAW_INIT_NOTIFIER_HEAD` macros and the `srcu_init_notifier_head` function for initialization of atomic and other types of notification chains. 122 | 123 | After initialization of the head of a notification chain, a subsystem which wants to receive notifications from the given notification chain should register with a certain function which depends on the type of notification chain. If you look in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, you will see the following four functions for this: 124 | 125 | ```C 126 | extern int atomic_notifier_chain_register(struct atomic_notifier_head *nh, 127 | struct notifier_block *nb); 128 | 129 | extern int blocking_notifier_chain_register(struct blocking_notifier_head *nh, 130 | struct notifier_block *nb); 131 | 132 | extern int raw_notifier_chain_register(struct raw_notifier_head *nh, 133 | struct notifier_block *nb); 134 | 135 | extern int srcu_notifier_chain_register(struct srcu_notifier_head *nh, 136 | struct notifier_block *nb); 137 | ``` 138 | 139 | As I already wrote above, we will cover only blocking notification chains in this part, so let's consider the implementation of the `blocking_notifier_chain_register` function.
Implementation of this function is located in the [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file and as we may see the `blocking_notifier_chain_register` takes two parameters: 140 | 141 | * `nh` - head of a notification chain; 142 | * `nb` - notification descriptor. 143 | 144 | Before the blocking variant, let's look at the simplest one, `raw_notifier_chain_register`, which shows the common core of all registration functions: 145 | 146 | ```C 147 | int raw_notifier_chain_register(struct raw_notifier_head *nh, 148 | struct notifier_block *n) 149 | { 150 | return notifier_chain_register(&nh->head, n); 151 | } 152 | ``` 153 | 154 | As we may see it just returns the result of the `notifier_chain_register` function from the same source code file, and this function does all the work for us. The `blocking_notifier_chain_register` function makes the same call, but adds locking around it: 155 | 156 | ```C 157 | int blocking_notifier_chain_register(struct blocking_notifier_head *nh, 158 | struct notifier_block *n) 159 | { 160 | int ret; 161 | 162 | if (unlikely(system_state == SYSTEM_BOOTING)) 163 | return notifier_chain_register(&nh->head, n); 164 | 165 | down_write(&nh->rwsem); 166 | ret = notifier_chain_register(&nh->head, n); 167 | up_write(&nh->rwsem); 168 | return ret; 169 | } 170 | ``` 171 | 172 | As we may see the implementation of the `blocking_notifier_chain_register` is pretty simple. First of all it checks the current system state, and if the system is still booting (`SYSTEM_BOOTING`) it just calls the `notifier_chain_register` directly. Otherwise it makes the same call of the `notifier_chain_register`, but as you may see this call is protected by the read/write semaphore, which is taken for writing.
Now let's look at the implementation of the `notifier_chain_register` function: 173 | 174 | ```C 175 | static int notifier_chain_register(struct notifier_block **nl, 176 | struct notifier_block *n) 177 | { 178 | while ((*nl) != NULL) { 179 | if (n->priority > (*nl)->priority) 180 | break; 181 | nl = &((*nl)->next); 182 | } 183 | n->next = *nl; 184 | rcu_assign_pointer(*nl, n); 185 | return 0; 186 | } 187 | ``` 188 | 189 | This function just inserts the new `notifier_block` (given by a subsystem which wants to get notifications) into the notification chain list, in front of the first element with a lower priority. Besides subscribing to an event, a subscriber may unsubscribe from certain events with the set of `unregister` functions: 190 | 191 | ```C 192 | extern int atomic_notifier_chain_unregister(struct atomic_notifier_head *nh, 193 | struct notifier_block *nb); 194 | 195 | extern int blocking_notifier_chain_unregister(struct blocking_notifier_head *nh, 196 | struct notifier_block *nb); 197 | 198 | extern int raw_notifier_chain_unregister(struct raw_notifier_head *nh, 199 | struct notifier_block *nb); 200 | 201 | extern int srcu_notifier_chain_unregister(struct srcu_notifier_head *nh, 202 | struct notifier_block *nb); 203 | ``` 204 | 205 | When a producer of notifications wants to notify subscribers about an event, one of the `*_notifier_call_chain` functions will be called.
As you may already guess, each type of notification chain provides its own function to produce a notification: 206 | 207 | ```C 208 | extern int atomic_notifier_call_chain(struct atomic_notifier_head *nh, 209 | unsigned long val, void *v); 210 | 211 | extern int blocking_notifier_call_chain(struct blocking_notifier_head *nh, 212 | unsigned long val, void *v); 213 | 214 | extern int raw_notifier_call_chain(struct raw_notifier_head *nh, 215 | unsigned long val, void *v); 216 | 217 | extern int srcu_notifier_call_chain(struct srcu_notifier_head *nh, 218 | unsigned long val, void *v); 219 | ``` 220 | 221 | Let's consider the implementation of the `blocking_notifier_call_chain` function. This function is defined in the [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file: 222 | 223 | ```C 224 | int blocking_notifier_call_chain(struct blocking_notifier_head *nh, 225 | unsigned long val, void *v) 226 | { 227 | return __blocking_notifier_call_chain(nh, val, v, -1, NULL); 228 | } 229 | ``` 230 | 231 | and as we may see it just returns the result of the `__blocking_notifier_call_chain` function. As we may see, the `blocking_notifier_call_chain` takes three parameters: 232 | 233 | * `nh` - head of notification chain list; 234 | * `val` - type of a notification; 235 | * `v` - input parameter which may be used by handlers. 236 | 237 | But the `__blocking_notifier_call_chain` function takes five parameters: 238 | 239 | ```C 240 | int __blocking_notifier_call_chain(struct blocking_notifier_head *nh, 241 | unsigned long val, void *v, 242 | int nr_to_call, int *nr_calls) 243 | { 244 | ... 245 | ... 246 | ... 247 | } 248 | ``` 249 | 250 | Where `nr_to_call` is the number of notifier functions to be called (`-1` means no limit) and `nr_calls` receives the number of notifications sent. As you may guess the main goal of the `__blocking_notifier_call_chain` function and of the corresponding functions for other notification types is to call the callback functions when an event occurs.
Implementation of the `__blocking_notifier_call_chain` is pretty simple. It just calls the `notifier_call_chain` function from the same source code file, protected by the read/write semaphore: 251 | 252 | ```C 253 | int __blocking_notifier_call_chain(struct blocking_notifier_head *nh, 254 | unsigned long val, void *v, 255 | int nr_to_call, int *nr_calls) 256 | { 257 | int ret = NOTIFY_DONE; 258 | 259 | if (rcu_access_pointer(nh->head)) { 260 | down_read(&nh->rwsem); 261 | ret = notifier_call_chain(&nh->head, val, v, nr_to_call, 262 | nr_calls); 263 | up_read(&nh->rwsem); 264 | } 265 | return ret; 266 | } 267 | ``` 268 | 269 | and returns its result. In this case all the work is done by the `notifier_call_chain` function. The main purpose of this function is to inform registered notifiers about an asynchronous event: 270 | 271 | ```C 272 | static int notifier_call_chain(struct notifier_block **nl, 273 | unsigned long val, void *v, 274 | int nr_to_call, int *nr_calls) 275 | { 276 | ... 277 | ... 278 | ... 279 | ret = nb->notifier_call(nb, val, v); 280 | ... 281 | ... 282 | ... 283 | return ret; 284 | } 285 | ``` 286 | 287 | That's all. In general all looks pretty simple. 288 | 289 | Now let's consider a simple example related to [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module). As we already saw in this part, the [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c) source code file contains the following definition: 290 | 291 | ```C 292 | static BLOCKING_NOTIFIER_HEAD(module_notify_list); 293 | ``` 294 | 295 | This definition of the `module_notify_list` determines the head of the blocking notifier chain related to kernel modules. There are at least the following three events: 296 | 297 | * MODULE_STATE_LIVE 298 | * MODULE_STATE_COMING 299 | * MODULE_STATE_GOING 300 | 301 | in which some subsystems of the Linux kernel may be interested.
For example, the tracing of kernel module states. Instead of calling `atomic_notifier_chain_register`, `blocking_notifier_chain_register`, etc. directly, most notification chains come with a set of wrappers used to register on them. Registration for these module events goes through such a wrapper: 302 | 303 | ```C 304 | int register_module_notifier(struct notifier_block *nb) 305 | { 306 | return blocking_notifier_chain_register(&module_notify_list, nb); 307 | } 308 | ``` 309 | 310 | If we look in the [kernel/tracepoint.c](https://github.com/torvalds/linux/blob/master/kernel/tracepoint.c) source code file, we will see such a registration during initialization of [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt): 311 | 312 | ```C 313 | static __init int init_tracepoints(void) 314 | { 315 | int ret; 316 | 317 | ret = register_module_notifier(&tracepoint_module_nb); 318 | if (ret) 319 | pr_warn("Failed to register tracepoint module enter notifier\n"); 320 | 321 | return ret; 322 | } 323 | ``` 324 | 325 | Where `tracepoint_module_nb` provides the callback function: 326 | 327 | ```C 328 | static struct notifier_block tracepoint_module_nb = { 329 | .notifier_call = tracepoint_module_notify, 330 | .priority = 0, 331 | }; 332 | ``` 333 | 334 | This callback is called when one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurs. For example, the `MODULE_STATE_LIVE` and `MODULE_STATE_COMING` notifications are sent during execution of the [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html) [system call](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-1), and `MODULE_STATE_GOING` is sent during execution of the [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`: 335 | 336 | ```C 337 | SYSCALL_DEFINE2(delete_module, const char __user *, name_user, 338 | unsigned int, flags) 339 | { 340 | ... 341 | ... 342 | ...
343 | blocking_notifier_call_chain(&module_notify_list, 344 | MODULE_STATE_GOING, mod); 345 | ... 346 | ... 347 | ... 348 | } 349 | ``` 350 | 351 | Thus when one of these system calls is called from userspace, the Linux kernel sends a notification which depends on the system call, and the `tracepoint_module_notify` callback function is called. 352 | 353 | That's all. 354 | 355 | Links 356 | -------------------------------------------------------------------------------- 357 | 358 | * [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29) 359 | * [API](https://en.wikipedia.org/wiki/Application_programming_interface) 360 | * [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) 361 | * [RCU](https://en.wikipedia.org/wiki/Read-copy-update) 362 | * [spinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) 363 | * [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module) 364 | * [semaphore](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-3) 365 | * [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt) 366 | * [system call](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-1) 367 | * [init_module system call](http://man7.org/linux/man-pages/man2/init_module.2.html) 368 | * [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) 369 | * [previous part](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-3) 370 | -------------------------------------------------------------------------------- /DataStructures/README.md: -------------------------------------------------------------------------------- 1 | Data Structures in the Linux Kernel 2 | ======================================================================== 3 | 4 | The Linux kernel provides different implementations of data structures like doubly linked list, B+ tree, priority heap and many many more.
5 | 6 | This part considers the following data structures and algorithms: 7 | 8 | * [Doubly linked list](linux-datastructures-1.md) 9 | * [Radix tree](linux-datastructures-2.md) 10 | * [Bit arrays](linux-datastructures-3.md) 11 | -------------------------------------------------------------------------------- /DataStructures/linux-datastructures-1.md: -------------------------------------------------------------------------------- 1 | Data Structures in the Linux Kernel 2 | ================================================================================ 3 | 4 | Doubly linked list 5 | -------------------------------------------------------------------------------- 6 | 7 | The Linux kernel provides its own implementation of a doubly linked list, which you can find in the [include/linux/list.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/list.h). We will start `Data Structures in the Linux kernel` with the doubly linked list data structure. Why? Because it is very popular in the kernel, just try to [search](http://lxr.free-electrons.com/ident?i=list_head). 8 | 9 | First of all, let's look at the main structure in the [include/linux/types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/types.h): 10 | 11 | ```C 12 | struct list_head { 13 | struct list_head *next, *prev; 14 | }; 15 | ``` 16 | 17 | You can note that it is different from many implementations of a doubly linked list which you have seen. For example, the doubly linked list structure from the [glib](http://www.gnu.org/software/libc/) library looks like this: 18 | 19 | ```C 20 | struct GList { 21 | gpointer data; 22 | GList *next; 23 | GList *prev; 24 | }; 25 | ``` 26 | 27 | Usually a linked list structure contains a pointer to the item. The implementation of the linked list in the Linux kernel does not. So the main question is - `where does the list store the data?`.
The linked list in the kernel is implemented as an `intrusive list`. An intrusive linked list does not store data in its nodes. A node contains only pointers to the next and previous nodes, and the node itself is embedded in the data structure which is added to the list. This makes the list implementation generic: it does not care about the data type of the entries anymore. 28 | 29 | For example: 30 | 31 | ```C 32 | struct nmi_desc { 33 | spinlock_t lock; 34 | struct list_head head; 35 | }; 36 | ``` 37 | 38 | Let's look at some examples to understand how `list_head` is used in the kernel. As I already wrote above, there are many, really many different places where lists are used in the kernel. Let's look for an example in miscellaneous character drivers. The misc character drivers API from the [drivers/char/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/drivers/char/misc.c) is used for writing small drivers for handling simple hardware or virtual devices. Those drivers share the same major number: 39 | 40 | ```C 41 | #define MISC_MAJOR 10 42 | ``` 43 | 44 | but have their own minor number.
For example you can see it with: 45 | 46 | ``` 47 | ls -l /dev | grep 10 48 | crw------- 1 root root 10, 235 Mar 21 12:01 autofs 49 | drwxr-xr-x 10 root root 200 Mar 21 12:01 cpu 50 | crw------- 1 root root 10, 62 Mar 21 12:01 cpu_dma_latency 51 | crw------- 1 root root 10, 203 Mar 21 12:01 cuse 52 | drwxr-xr-x 2 root root 100 Mar 21 12:01 dri 53 | crw-rw-rw- 1 root root 10, 229 Mar 21 12:01 fuse 54 | crw------- 1 root root 10, 228 Mar 21 12:01 hpet 55 | crw------- 1 root root 10, 183 Mar 21 12:01 hwrng 56 | crw-rw----+ 1 root kvm 10, 232 Mar 21 12:01 kvm 57 | crw-rw---- 1 root disk 10, 237 Mar 21 12:01 loop-control 58 | crw------- 1 root root 10, 227 Mar 21 12:01 mcelog 59 | crw------- 1 root root 10, 59 Mar 21 12:01 memory_bandwidth 60 | crw------- 1 root root 10, 61 Mar 21 12:01 network_latency 61 | crw------- 1 root root 10, 60 Mar 21 12:01 network_throughput 62 | crw-r----- 1 root kmem 10, 144 Mar 21 12:01 nvram 63 | brw-rw---- 1 root disk 1, 10 Mar 21 12:01 ram10 64 | crw--w---- 1 root tty 4, 10 Mar 21 12:01 tty10 65 | crw-rw---- 1 root dialout 4, 74 Mar 21 12:01 ttyS10 66 | crw------- 1 root root 10, 63 Mar 21 12:01 vga_arbiter 67 | crw------- 1 root root 10, 137 Mar 21 12:01 vhci 68 | ``` 69 | 70 | Now let's have a close look at how lists are used in the misc device drivers. First of all, let's look on `miscdevice` structure: 71 | 72 | ```C 73 | struct miscdevice 74 | { 75 | int minor; 76 | const char *name; 77 | const struct file_operations *fops; 78 | struct list_head list; 79 | struct device *parent; 80 | struct device *this_device; 81 | const char *nodename; 82 | mode_t mode; 83 | }; 84 | ``` 85 | 86 | We can see the fourth field in the `miscdevice` structure - `list` which is a list of registered devices. 
In the beginning of the source code file we can see the definition of misc_list: 87 | 88 | ```C 89 | static LIST_HEAD(misc_list); 90 | ``` 91 | 92 | which expands to the definition of variables with `list_head` type: 93 | 94 | ```C 95 | #define LIST_HEAD(name) \ 96 | struct list_head name = LIST_HEAD_INIT(name) 97 | ``` 98 | 99 | and initializes it with the `LIST_HEAD_INIT` macro, which sets previous and next entries with the address of variable - name: 100 | 101 | ```C 102 | #define LIST_HEAD_INIT(name) { &(name), &(name) } 103 | ``` 104 | 105 | Now let's look on the `misc_register` function which registers a miscellaneous device. At the start it initializes `miscdevice->list` with the `INIT_LIST_HEAD` function: 106 | 107 | ```C 108 | INIT_LIST_HEAD(&misc->list); 109 | ``` 110 | 111 | which does the same as the `LIST_HEAD_INIT` macro: 112 | 113 | ```C 114 | static inline void INIT_LIST_HEAD(struct list_head *list) 115 | { 116 | list->next = list; 117 | list->prev = list; 118 | } 119 | ``` 120 | 121 | In the next step after a device is created by the `device_create` function, we add it to the miscellaneous devices list with: 122 | 123 | ``` 124 | list_add(&misc->list, &misc_list); 125 | ``` 126 | 127 | Kernel `list.h` provides this API for the addition of a new entry to the list. Let's look at its implementation: 128 | 129 | ```C 130 | static inline void list_add(struct list_head *new, struct list_head *head) 131 | { 132 | __list_add(new, head, head->next); 133 | } 134 | ``` 135 | 136 | It just calls internal function `__list_add` with the 3 given parameters: 137 | 138 | * new - new entry. 139 | * head - list head after which the new item will be inserted. 140 | * head->next - next item after list head. 
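To get a feeling for what inserting "right after the head" means, here is a small user-space sketch. It mirrors the kernel structures but is not the kernel code; the `init_list_head_sketch` and `list_add_sketch` names are hypothetical, chosen for this illustration only:

```C
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head which points to itself in both directions. */
void init_list_head_sketch(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* Insert `entry` between `head` and the element which followed `head`. */
void list_add_sketch(struct list_head *entry, struct list_head *head)
{
	struct list_head *next = head->next;

	next->prev = entry;
	entry->next = next;
	entry->prev = head;
	head->next = entry;
}
```

Because every new entry is placed directly after the head, adding `a` and then `b` yields the order `head -> b -> a -> head`: the most recently added entry is always the first one seen when walking the list forward, which is the behaviour of a stack.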
141 | 142 | Implementation of the `__list_add` is pretty simple: 143 | 144 | ```C 145 | static inline void __list_add(struct list_head *new, 146 | struct list_head *prev, 147 | struct list_head *next) 148 | { 149 | next->prev = new; 150 | new->next = next; 151 | new->prev = prev; 152 | prev->next = new; 153 | } 154 | ``` 155 | 156 | Here we add a new item between `prev` and `next`. So the `misc` list which we defined at the start with the `LIST_HEAD_INIT` macro will contain previous and next pointers to the `miscdevice->list`. 157 | 158 | There is still one question: how to get a list's entry. There is a special macro: 159 | 160 | ```C 161 | #define list_entry(ptr, type, member) \ 162 | container_of(ptr, type, member) 163 | ``` 164 | 165 | which gets three parameters: 166 | 167 | * ptr - the structure list_head pointer; 168 | * type - structure type; 169 | * member - the name of the list_head within the structure; 170 | 171 | For example: 172 | 173 | ```C 174 | const struct miscdevice *p = list_entry(v, struct miscdevice, list); 175 | ``` 176 | 177 | After this we can access any `miscdevice` field with `p->minor`, `p->name`, etc. Let's look at the `list_entry` implementation again: 178 | 179 | ```C 180 | #define list_entry(ptr, type, member) \ 181 | container_of(ptr, type, member) 182 | ``` 183 | 184 | As we can see it just calls the `container_of` macro with the same arguments. At first sight, the `container_of` looks strange: 185 | 186 | ```C 187 | #define container_of(ptr, type, member) ({ \ 188 | const typeof( ((type *)0)->member ) *__mptr = (ptr); \ 189 | (type *)( (char *)__mptr - offsetof(type,member) );}) 190 | ``` 191 | 192 | First of all you can note that it consists of two statements inside curly brackets which are wrapped in parentheses. This is a GCC extension called a statement expression: the compiler will evaluate the whole block in the curly braces and use the value of the last expression.
193 | 194 | For example: 195 | 196 | ```C 197 | #include <stdio.h> 198 | 199 | int main() { 200 | int i = 0; 201 | printf("i = %d\n", ({++i; ++i;})); 202 | return 0; 203 | } 204 | ``` 205 | 206 | will print `i = 2`. 207 | 208 | The next point is `typeof`, it's simple. As you can understand from its name, it just returns the type of the given variable. When I first saw the implementation of the `container_of` macro, the strangest thing I found was the zero in the `((type *)0)` expression. Actually this pointer magic calculates the offset of the given field from the address of the structure: because we pretend that the structure starts at address `0`, the address of the field is simply its offset within the structure. Let's look at a simple example: 209 | 210 | ```C 211 | #include <stdio.h> 212 | 213 | struct s { 214 | int field1; 215 | char field2; 216 | char field3; 217 | }; 218 | 219 | int main() { 220 | printf("%p\n", (void *)&((struct s*)0)->field3); 221 | return 0; 222 | } 223 | ``` 224 | 225 | will print `0x5`. 226 | 227 | The next `offsetof` macro calculates the offset from the beginning of the structure to the given structure's field. Its implementation is very similar to the previous code: 228 | 229 | ```C 230 | #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER) 231 | ``` 232 | 233 | Let's summarize all about the `container_of` macro. The `container_of` macro returns the address of the containing structure given three things: the address of the structure's field with `list_head` type, the type of the container structure and the name of that field within the structure. In the first line this macro declares the `__mptr` pointer, whose type is a pointer to the type of the `member` field, and assigns `ptr` to it. Now `ptr` and `__mptr` point to the same address. Technically we don't need this line but it's useful for type checking. The first line ensures that the given structure (`type` parameter) has a member called `member` and that `ptr` has a compatible type.
In the second line it calculates the offset of the field within the structure with the `offsetof` macro and subtracts it from the address of the field, which gives the address of the containing structure. That's all. 234 | 235 | Of course `list_add` and `list_entry` are not the only functions which `<linux/list.h>` provides. Implementation of the doubly linked list provides the following API: 236 | 237 | * list_add 238 | * list_add_tail 239 | * list_del 240 | * list_replace 241 | * list_move 242 | * list_is_last 243 | * list_empty 244 | * list_cut_position 245 | * list_splice 246 | * list_for_each 247 | * list_for_each_entry 248 | 249 | and many more. 250 | -------------------------------------------------------------------------------- /DataStructures/linux-datastructures-2.md: -------------------------------------------------------------------------------- 1 | Data Structures in the Linux Kernel 2 | ================================================================================ 3 | 4 | Radix tree 5 | -------------------------------------------------------------------------------- 6 | 7 | As you already know, the Linux kernel provides many different libraries and functions which implement different data structures and algorithms. In this part we will consider one of these data structures - [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). There are two files which are related to the `radix tree` implementation and API in the Linux kernel: 8 | 9 | * [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/radix-tree.h) 10 | * [lib/radix-tree.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/radix-tree.c) 11 | 12 | Let's talk about what a `radix tree` is. A radix tree is a `compressed trie`, where a [trie](http://en.wikipedia.org/wiki/Trie) is a data structure which implements the interface of an associative array and stores values as `key-value` pairs. The keys are usually strings, but any data type can be used.
A trie is different from an `n-tree` because of its nodes. Nodes of a trie do not store keys; instead, a node of a trie stores single character labels. The key which is related to a given node is derived by traversing from the root of the tree to this node. For example: 13 | 14 | 15 | ``` 16 |                +-----------+ 17 |                |           | 18 |                |    " "    | 19 | | | 20 |         +------+-----------+------+ 21 |         |                         | 22 |         |                         | 23 |    +----v------+            +-----v-----+ 24 |    |           |            |           | 25 |    |    g      |            |     c     | 26 | | | | | 27 |    +-----------+            +-----------+ 28 |         |                         | 29 |         |                         | 30 |    +----v------+            +-----v-----+ 31 |    |           |            |           | 32 |    |    o      |            |     a     | 33 | | | | | 34 |    +-----------+            +-----------+ 35 |                                   | 36 |                                   | 37 |                             +-----v-----+ 38 |                             |           | 39 |                             |     t     | 40 | | | 41 |                             +-----------+ 42 | ``` 43 | 44 | So in this example, we can see the `trie` with keys, `go` and `cat`. The compressed trie or `radix tree` differs from `trie` in that all intermediates nodes which have only one child are removed. 45 | 46 | Radix tree in Linux kernel is the data structure which maps values to integer keys. 
It is represented by the following structures from the file [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/radix-tree.h): 47 | 48 | ```C 49 | struct radix_tree_root { 50 | unsigned int height; 51 | gfp_t gfp_mask; 52 | struct radix_tree_node __rcu *rnode; 53 | }; 54 | ``` 55 | 56 | This structure presents the root of a radix tree and contains three fields: 57 | 58 | * `height` - height of the tree; 59 | * `gfp_mask` - tells how memory allocations will be performed; 60 | * `rnode` - pointer to the child node. 61 | 62 | The first field we will discuss is `gfp_mask`: 63 | 64 | Low-level kernel memory allocation functions take a set of flags as - `gfp_mask`, which describes how that allocation is to be performed. These `GFP_` flags which control the allocation process can have following values: (`GFP_NOIO` flag) means allocation can block but must not initiate disk I/O; (`__GFP_HIGHMEM` flag) means either ZONE_HIGHMEM or ZONE_NORMAL memory can be used; (`GFP_ATOMIC` flag) means the allocation is high-priority and must not sleep, etc. 65 | 66 | * `GFP_NOIO` - allocation can block but must not initiate disk I/O; 67 | * `__GFP_HIGHMEM` - either ZONE_HIGHMEM or ZONE_NORMAL can be used; 68 | * `GFP_ATOMIC` - allocation process is high-priority and must not sleep; 69 | 70 | etc. 71 | 72 | The next field is `rnode`: 73 | 74 | ```C 75 | struct radix_tree_node { 76 | unsigned int path; 77 | unsigned int count; 78 | union { 79 | struct { 80 | struct radix_tree_node *parent; 81 | void *private_data; 82 | }; 83 | struct rcu_head rcu_head; 84 | }; 85 | /* For tree user */ 86 | struct list_head private_list; 87 | void __rcu *slots[RADIX_TREE_MAP_SIZE]; 88 | unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS]; 89 | }; 90 | ``` 91 | 92 | This structure contains information about the offset in a parent and height from the bottom, count of the child nodes and fields for accessing and freeing a node. 
These fields are described below: 93 | 94 | * `path` - offset in parent & height from the bottom; 95 | * `count` - count of the child nodes; 96 | * `parent` - pointer to the parent node; 97 | * `private_data` - used by the user of a tree; 98 | * `rcu_head` - used for freeing a node; 99 | * `private_list` - used by the user of a tree; 100 | 101 | The two last fields of the `radix_tree_node` - `tags` and `slots` are important and interesting. Every node can contain a set of slots which store pointers to the data. Empty slots in the Linux kernel radix tree implementation store `NULL`. Radix trees in the Linux kernel also support tags which are associated with the `tags` field in the `radix_tree_node` structure. Tags allow individual bits to be set on records which are stored in the radix tree. 102 | 103 | Now that we know about the radix tree structure, it is time to look at its API. 104 | 105 | Linux kernel radix tree API 106 | --------------------------------------------------------------------------------- 107 | 108 | We start from the data structure initialization. There are two ways to initialize a new radix tree. The first is to use the `RADIX_TREE` macro: 109 | 110 | ```C 111 | RADIX_TREE(name, gfp_mask); 112 | ``` 113 | 114 | As you can see we pass the `name` parameter, so with the `RADIX_TREE` macro we can define and initialize a radix tree with the given name. Implementation of the `RADIX_TREE` is easy: 115 | 116 | ```C 117 | #define RADIX_TREE(name, mask) \ 118 | struct radix_tree_root name = RADIX_TREE_INIT(mask) 119 | 120 | #define RADIX_TREE_INIT(mask) { \ 121 | .height = 0, \ 122 | .gfp_mask = (mask), \ 123 | .rnode = NULL, \ 124 | } 125 | ``` 126 | 127 | At the beginning of the `RADIX_TREE` macro we define an instance of the `radix_tree_root` structure with the given name and initialize it with the `RADIX_TREE_INIT` macro and the given mask. The `RADIX_TREE_INIT` macro just initializes the `radix_tree_root` structure with the default values and the given mask.
128 | 129 | The second way is to define the `radix_tree_root` structure by hand and pass its address together with the mask to the `INIT_RADIX_TREE` macro: 130 | 131 | ```C 132 | struct radix_tree_root my_radix_tree; 133 | INIT_RADIX_TREE(&my_radix_tree, gfp_mask_for_my_radix_tree); 134 | ``` 135 | 136 | where: 137 | 138 | ```C 139 | #define INIT_RADIX_TREE(root, mask) \ 140 | do { \ 141 | (root)->height = 0; \ 142 | (root)->gfp_mask = (mask); \ 143 | (root)->rnode = NULL; \ 144 | } while (0) 145 | ``` 146 | 147 | makes the same initialization with default values as the `RADIX_TREE_INIT` macro does. 148 | 149 | Next are two functions for inserting and deleting records to/from a radix tree: 150 | 151 | * `radix_tree_insert`; 152 | * `radix_tree_delete`; 153 | 154 | The first `radix_tree_insert` function takes three parameters: 155 | 156 | * root of a radix tree; 157 | * index key; 158 | * data to insert; 159 | 160 | The `radix_tree_delete` function takes the same set of parameters as the `radix_tree_insert`, but without data. 161 | 162 | Searching through a radix tree is implemented in three ways: 163 | 164 | * `radix_tree_lookup`; 165 | * `radix_tree_gang_lookup`; 166 | * `radix_tree_lookup_slot`. 167 | 168 | The first `radix_tree_lookup` function takes two parameters: 169 | 170 | * root of a radix tree; 171 | * index key; 172 | 173 | This function tries to find the given key in the tree and returns the record associated with this key. The second `radix_tree_gang_lookup` function has the following signature 174 | 175 | ```C 176 | unsigned int radix_tree_gang_lookup(struct radix_tree_root *root, 177 | void **results, 178 | unsigned long first_index, 179 | unsigned int max_items); 180 | ``` 181 | 182 | and returns the number of records found, stored in `results` sorted by key, starting from the first index. The number of returned records will not be greater than the `max_items` value. 183 | 184 | And the last `radix_tree_lookup_slot` function returns the slot which contains the data.
185 | 186 | Links 187 | --------------------------------------------------------------------------------- 188 | 189 | * [Radix tree](http://en.wikipedia.org/wiki/Radix_tree) 190 | * [Trie](http://en.wikipedia.org/wiki/Trie) 191 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM lrx0014/gitbook:3.2.3 2 | COPY ./ /srv/gitbook/ 3 | EXPOSE 4000 -------------------------------------------------------------------------------- /Initialization/README.md: -------------------------------------------------------------------------------- 1 | # Kernel initialization process 2 | 3 | You will find here a couple of posts which describe the full cycle of kernel initialization from its first step after the kernel has been decompressed to the start of the first process run by the kernel itself. 4 | 5 | *Note* That there will not be a description of the all kernel initialization steps. Here will be only generic kernel part, without interrupts handling, ACPI, and many other parts. All parts which I have missed, will be described in other chapters. 6 | 7 | * [First steps after kernel decompression](linux-initialization-1.md) - describes first steps in the kernel. 8 | * [Early interrupt and exception handling](linux-initialization-2.md) - describes early interrupts initialization and early page fault handler. 9 | * [Last preparations before the kernel entry point](linux-initialization-3.md) - describes the last preparations before the call of the `start_kernel`. 10 | * [Kernel entry point](linux-initialization-4.md) - describes first steps in the kernel generic code. 11 | * [Continue of architecture-specific initializations](linux-initialization-5.md) - describes architecture-specific initialization. 12 | * [Architecture-specific initializations, again...](linux-initialization-6.md) - describes continue of the architecture-specific initialization process. 
13 | * [The End of the architecture-specific initializations, almost...](linux-initialization-7.md) - describes the end of the `setup_arch` related stuff. 14 | * [Scheduler initialization](linux-initialization-8.md) - describes preparation before scheduler initialization and initialization of it. 15 | * [RCU initialization](linux-initialization-9.md) - describes the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). 16 | * [End of the initialization](linux-initialization-10.md) - the last part about Linux kernel initialization. 17 | -------------------------------------------------------------------------------- /Initialization/images/CONFIG_NR_CPUS.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Initialization/images/CONFIG_NR_CPUS.png -------------------------------------------------------------------------------- /Initialization/images/NX.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Initialization/images/NX.png -------------------------------------------------------------------------------- /Initialization/images/brk_area.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Initialization/images/brk_area.png -------------------------------------------------------------------------------- /Initialization/images/kernel_command_line.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Initialization/images/kernel_command_line.png -------------------------------------------------------------------------------- 
/Initialization/linux-initialization-3.md: -------------------------------------------------------------------------------- 1 | Kernel initialization. Part 3. 2 | ================================================================================ 3 | 4 | Last preparations before the kernel entry point 5 | -------------------------------------------------------------------------------- 6 | 7 | This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling and will continue to dive into the Linux kernel initialization process in the current part. Our next point is 'kernel entry point' - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file. Yes, technically it is not kernel's entry point but the start of the generic kernel code which does not depend on certain architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue. 8 | 9 | boot_params again 10 | -------------------------------------------------------------------------------- 11 | 12 | In the previous part we stopped at setting Interrupt Descriptor Table and loading it in the `IDTR` register. At the next step after this we can see a call of the `copy_bootdata` function: 13 | 14 | ```C 15 | copy_bootdata(__va(real_mode_data)); 16 | ``` 17 | 18 | This function takes one argument - virtual address of the `real_mode_data`. 
Remember that we passed the address of the `boot_params` structure from [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/bootparam.h#L114) to the `x86_64_start_kernel` function as first argument in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S): 19 | 20 | ``` 21 | /* rsi is pointer to real mode structure with interesting info. 22 | pass it to C */ 23 | movq %rsi, %rdi 24 | ``` 25 | 26 | Now let's look at the `__va` macro. This macro is defined in [arch/x86/include/asm/page.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/page.h): 27 | 28 | ```C 29 | #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) 30 | ``` 31 | 32 | where `PAGE_OFFSET` is `__PAGE_OFFSET`, which is `0xffff880000000000`, the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the `boot_params` data which came along from real mode and pass it to the `copy_bootdata` function, where we copy `real_mode_data` to the `boot_params` which is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/d9919d43cbf6790d2bc0c0a2743c51fc25f26919/arch/x86/kernel/setup.c) 33 | 34 | ```C 35 | struct boot_params boot_params; 36 | ``` 37 | 38 | Let's look at the `copy_bootdata` implementation: 39 | 40 | ```C 41 | static void __init copy_bootdata(char *real_mode_data) 42 | { 43 | char * command_line; 44 | unsigned long cmd_line_ptr; 45 | 46 | memcpy(&boot_params, real_mode_data, sizeof boot_params); 47 | sanitize_boot_params(&boot_params); 48 | cmd_line_ptr = get_cmd_line_ptr(); 49 | if (cmd_line_ptr) { 50 | command_line = __va(cmd_line_ptr); 51 | memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE); 52 | } 53 | } 54 | ``` 55 | 56 | First of all, note that this function is declared with the `__init` prefix.
It means that this function will be used only during the initialization and the memory it uses will be freed afterwards. 57 | 58 | We can see the declaration of two variables for the kernel command line and the copying of `real_mode_data` to `boot_params` with the `memcpy` function. Next comes the call of the `sanitize_boot_params` function, which zeroes some fields of the `boot_params` structure, such as `ext_ramdisk_image`, for bootloaders which fail to initialize unknown fields in `boot_params` to zero. After this we get the address of the command line with the call of the `get_cmd_line_ptr` function: 59 | 60 | ```C 61 | unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr; 62 | cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32; 63 | return cmd_line_ptr; 64 | ``` 65 | 66 | which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check `cmd_line_ptr`, get its virtual address and copy the command line to the `boot_command_line` which is just an array of bytes: 67 | 68 | ```C 69 | extern char __initdata boot_command_line[]; 70 | ``` 71 | 72 | After this we will have copied the kernel command line and the `boot_params` structure. In the next step we can see the call of the `load_ucode_bsp` function which loads processor microcode, but we will not look at it here. 73 | 74 | After the microcode was loaded we can see the check of the `console_loglevel` and the `early_printk` function which prints the `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and I sent the patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) and you will see it in the mainline soon. So you can skip this code.
75 | 76 | Move on init pages 77 | -------------------------------------------------------------------------------- 78 | 79 | In the next step, as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call: 80 | 81 | ```C 82 | clear_page(init_level4_pgt); 83 | ``` 84 | 85 | function and pass `init_level4_pgt` which also defined in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) and looks: 86 | 87 | ```assembly 88 | NEXT_PAGE(init_level4_pgt) 89 | .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE 90 | .org init_level4_pgt + L4_PAGE_OFFSET*8, 0 91 | .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE 92 | .org init_level4_pgt + L4_START_KERNEL*8, 0 93 | .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE 94 | ``` 95 | 96 | which maps first 2 gigabytes and 512 megabytes for the kernel code, data and bss. 
The `clear_page` function is defined in [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/lib/clear_page_64.S). Let's look at this function: 97 | 98 | ```assembly 99 | ENTRY(clear_page) 100 | CFI_STARTPROC 101 | xorl %eax,%eax 102 | movl $4096/64,%ecx 103 | .p2align 4 104 | .Lloop: 105 | decl %ecx 106 | #define PUT(x) movq %rax,x*8(%rdi) 107 | movq %rax,(%rdi) 108 | PUT(1) 109 | PUT(2) 110 | PUT(3) 111 | PUT(4) 112 | PUT(5) 113 | PUT(6) 114 | PUT(7) 115 | leaq 64(%rdi),%rdi 116 | jnz .Lloop 117 | nop 118 | ret 119 | CFI_ENDPROC 120 | .Lclear_page_end: 121 | ENDPROC(clear_page) 122 | ``` 123 | 124 | As you can understand from the function name, it clears the page tables, or in other words fills them with zeros. First of all note that this function starts with `CFI_STARTPROC` and ends with `CFI_ENDPROC`, which expand to GNU assembler directives: 125 | 126 | ```C 127 | #define CFI_STARTPROC .cfi_startproc 128 | #define CFI_ENDPROC .cfi_endproc 129 | ``` 130 | 131 | and are used for debugging. After the `CFI_STARTPROC` macro we zero out the `eax` register (which also clears the whole `rax`) and put 64 into `ecx` (it will be a counter). Next comes the loop which starts at the `.Lloop` label and begins with the `ecx` decrement. After it we store the zero from the `rax` register at the address in `rdi`, which now contains the base address of `init_level4_pgt`, and repeat the store seven more times, each time increasing the offset from `rdi` by 8. After this we will have the first 64 bytes of `init_level4_pgt` filled with zeros. In the next step we advance `rdi` by 64 bytes and repeat all operations until `ecx` reaches zero. Note that neither `movq` nor `leaq` changes the flags, so `jnz` still tests the result of the `decl` at the top of the loop. In the end we will have `init_level4_pgt` filled with zeros.
132 | 133 | As we have `init_level4_pgt` filled with zeros, we set the last `init_level4_pgt` entry to the kernel high mapping with: 134 | 135 | ```C 136 | init_level4_pgt[511] = early_top_pgt[511]; 137 | ``` 138 | 139 | Remember that we dropped all `early_top_pgt` entries in the `reset_early_page_tables` function and kept only the kernel high mapping there. 140 | 141 | The last step in the `x86_64_start_kernel` function is the call of the: 142 | 143 | ```C 144 | x86_64_start_reservations(real_mode_data); 145 | ``` 146 | 147 | function with the `real_mode_data` as argument. The `x86_64_start_reservations` function is defined in the same source code file as the `x86_64_start_kernel` function and looks like this: 148 | 149 | ```C 150 | void __init x86_64_start_reservations(char *real_mode_data) 151 | { 152 | if (!boot_params.hdr.version) 153 | copy_bootdata(__va(real_mode_data)); 154 | 155 | reserve_ebda_region(); 156 | 157 | start_kernel(); 158 | } 159 | ``` 160 | 161 | You can see that it is the last function before we are in the kernel entry point - the `start_kernel` function. Let's look at what it does and how it works. 162 | 163 | Last step before kernel entry point 164 | -------------------------------------------------------------------------------- 165 | 166 | First of all we can see in the `x86_64_start_reservations` function the check for `boot_params.hdr.version`: 167 | 168 | ```C 169 | if (!boot_params.hdr.version) 170 | copy_bootdata(__va(real_mode_data)); 171 | ``` 172 | 173 | and if it is zero we call the `copy_bootdata` function again with the virtual address of the `real_mode_data` (see its implementation above). 174 | 175 | In the next step we can see the call of the `reserve_ebda_region` function which is defined in [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head.c). This function reserves a memory block for the `EBDA` or Extended BIOS Data Area.
The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters etc. 176 | 177 | Let's look at the `reserve_ebda_region` function. It starts by checking whether paravirtualization is enabled or not: 178 | 179 | ```C 180 | if (paravirt_enabled()) 181 | return; 182 | ``` 183 | 184 | We exit from the `reserve_ebda_region` function if paravirtualization is enabled, because in this case the extended BIOS data area is absent. In the next step we need to get the end of the low memory: 185 | 186 | ```C 187 | lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES); 188 | lowmem <<= 10; 189 | ``` 190 | 191 | We read the size of the BIOS low memory in kilobytes through its virtual address and convert it to bytes by shifting it left by 10 (multiplying by 1024 in other words). After this we need to get the address of the extended BIOS data area with: 192 | 193 | ```C 194 | ebda_addr = get_bios_ebda(); 195 | ``` 196 | 197 | where the `get_bios_ebda` function is defined in [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/bios_ebda.h) and looks like: 198 | 199 | ```C 200 | static inline unsigned int get_bios_ebda(void) 201 | { 202 | unsigned int address = *(unsigned short *)phys_to_virt(0x40E); 203 | address <<= 4; 204 | return address; 205 | } 206 | ``` 207 | 208 | Let's try to understand how it works. Here we can see that we are converting the physical address `0x40E` to a virtual one, where `0x40E` (or `0x0040:0x000e` in real-mode `segment:offset` notation) is the location which contains the segment of the base address of the extended BIOS data area. Don't worry that we are using the `phys_to_virt` function for converting a physical address to a virtual address.
You can note that previously we have used the `__va` macro for the same purpose, but `phys_to_virt` is the same: 209 | 210 | ```C 211 | static inline void *phys_to_virt(phys_addr_t address) 212 | { 213 | return __va(address); 214 | } 215 | ``` 216 | 217 | only with one difference: we pass the argument as `phys_addr_t`, which depends on `CONFIG_PHYS_ADDR_T_64BIT`: 218 | 219 | ```C 220 | #ifdef CONFIG_PHYS_ADDR_T_64BIT 221 | typedef u64 phys_addr_t; 222 | #else 223 | typedef u32 phys_addr_t; 224 | #endif 225 | ``` 226 | 227 | On x86 the `CONFIG_PHYS_ADDR_T_64BIT` option is enabled for `X86_64` and `X86_PAE` configurations. After we have read the segment value stored at this virtual address, we shift it left by 4 and return the result. After this the `ebda_addr` variable contains the base address of the extended BIOS data area. 228 | 229 | In the next step we check whether the address of the extended BIOS data area and the low memory value are less than the `INSANE_CUTOFF` macro, and if so replace them with `LOWMEM_CAP`: 230 | 231 | ```C 232 | if (ebda_addr < INSANE_CUTOFF) 233 | ebda_addr = LOWMEM_CAP; 234 | 235 | if (lowmem < INSANE_CUTOFF) 236 | lowmem = LOWMEM_CAP; 237 | ``` 238 | 239 | where `INSANE_CUTOFF` is: 240 | 241 | ```C 242 | #define INSANE_CUTOFF 0x20000U 243 | ``` 244 | 245 | or 128 kilobytes. In the last step we take the lower of the low memory value and the extended BIOS data area address, and call the `memblock_reserve` function which reserves the memory region between that address and the one megabyte mark: 246 | 247 | ```C 248 | lowmem = min(lowmem, ebda_addr); 249 | lowmem = min(lowmem, LOWMEM_CAP); 250 | memblock_reserve(lowmem, 0x100000 - lowmem); 251 | ``` 252 | 253 | The `memblock_reserve` function is defined in [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) and takes two parameters: 254 | 255 | * base physical address; 256 | * region size. 257 | 258 | and reserves the memory region for the given base address and size. `memblock_reserve` is the first function in this book from the Linux kernel memory manager framework.
We will take a closer look on memory manager soon, but now let's look at its implementation. 259 | 260 | First touch of the Linux kernel memory manager framework 261 | -------------------------------------------------------------------------------- 262 | 263 | In the previous paragraph we stopped at the call of the `memblock_reserve` function and as I said before it is the first function from the memory manager framework. Let's try to understand how it works. `memblock_reserve` function just calls: 264 | 265 | ```C 266 | memblock_reserve_region(base, size, MAX_NUMNODES, 0); 267 | ``` 268 | 269 | function and passes 4 parameters there: 270 | 271 | * physical base address of the memory region; 272 | * size of the memory region; 273 | * maximum number of numa nodes; 274 | * flags. 275 | 276 | At the start of the `memblock_reserve_region` body we can see definition of the `memblock_type` structure: 277 | 278 | ```C 279 | struct memblock_type *_rgn = &memblock.reserved; 280 | ``` 281 | 282 | which presents the type of the memory block and looks: 283 | 284 | ```C 285 | struct memblock_type { 286 | unsigned long cnt; 287 | unsigned long max; 288 | phys_addr_t total_size; 289 | struct memblock_region *regions; 290 | }; 291 | ``` 292 | 293 | As we need to reserve memory block for extended BIOS data area, the type of the current memory region is reserved where `memblock` structure is: 294 | 295 | ```C 296 | struct memblock { 297 | bool bottom_up; 298 | phys_addr_t current_limit; 299 | struct memblock_type memory; 300 | struct memblock_type reserved; 301 | #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP 302 | struct memblock_type physmem; 303 | #endif 304 | }; 305 | ``` 306 | 307 | and describes generic memory block. You can see that we initialize `_rgn` by assigning it to the address of the `memblock.reserved`. 
`memblock` is a global variable which looks like this: 308 | 309 | ```C 310 | struct memblock memblock __initdata_memblock = { 311 | .memory.regions = memblock_memory_init_regions, 312 | .memory.cnt = 1, 313 | .memory.max = INIT_MEMBLOCK_REGIONS, 314 | .reserved.regions = memblock_reserved_init_regions, 315 | .reserved.cnt = 1, 316 | .reserved.max = INIT_MEMBLOCK_REGIONS, 317 | #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP 318 | .physmem.regions = memblock_physmem_init_regions, 319 | .physmem.cnt = 1, 320 | .physmem.max = INIT_PHYSMEM_REGIONS, 321 | #endif 322 | .bottom_up = false, 323 | .current_limit = MEMBLOCK_ALLOC_ANYWHERE, 324 | }; 325 | ``` 326 | 327 | We will not dive into the details of this variable now, but we will see all of them in the parts about the memory manager. Just note that the `memblock` variable is defined with `__initdata_memblock` which is: 328 | 329 | ```C 330 | #define __initdata_memblock __meminitdata 331 | ``` 332 | 333 | and `__meminitdata` is: 334 | 335 | ```C 336 | #define __meminitdata __section(.meminit.data) 337 | ``` 338 | 339 | From this we can conclude that all memory blocks will be in the `.meminit.data` section. After we defined `_rgn` we print information about it with the `memblock_dbg` macro. You can enable it by passing `memblock=debug` to the kernel command line. 340 | 341 | After the debugging lines were printed, next is the call of the following function: 342 | 343 | ```C 344 | memblock_add_range(_rgn, base, size, nid, flags); 345 | ``` 346 | 347 | which adds a new memory block region into the `.meminit.data` section.
As we do not initialize `_rgn` but it just contains `&memblock.reserved`, we just fill passed `_rgn` with the base address of the extended BIOS data area region, size of this region and flags: 348 | 349 | ```C 350 | if (type->regions[0].size == 0) { 351 | WARN_ON(type->cnt != 1 || type->total_size); 352 | type->regions[0].base = base; 353 | type->regions[0].size = size; 354 | type->regions[0].flags = flags; 355 | memblock_set_region_node(&type->regions[0], nid); 356 | type->total_size = size; 357 | return 0; 358 | } 359 | ``` 360 | 361 | After we filled our region we can see the call of the `memblock_set_region_node` function with two parameters: 362 | 363 | * address of the filled memory region; 364 | * NUMA node id. 365 | 366 | where our regions represented by the `memblock_region` structure: 367 | 368 | ```C 369 | struct memblock_region { 370 | phys_addr_t base; 371 | phys_addr_t size; 372 | unsigned long flags; 373 | #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP 374 | int nid; 375 | #endif 376 | }; 377 | ``` 378 | 379 | NUMA node id depends on `MAX_NUMNODES` macro which is defined in the [include/linux/numa.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/numa.h): 380 | 381 | ```C 382 | #define MAX_NUMNODES (1 << NODES_SHIFT) 383 | ``` 384 | 385 | where `NODES_SHIFT` depends on `CONFIG_NODES_SHIFT` configuration parameter and defined as: 386 | 387 | ```C 388 | #ifdef CONFIG_NODES_SHIFT 389 | #define NODES_SHIFT CONFIG_NODES_SHIFT 390 | #else 391 | #define NODES_SHIFT 0 392 | #endif 393 | ``` 394 | 395 | `memblock_set_region_node` function just fills `nid` field from `memblock_region` with the given value: 396 | 397 | ```C 398 | static inline void memblock_set_region_node(struct memblock_region *r, int nid) 399 | { 400 | r->nid = nid; 401 | } 402 | ``` 403 | 404 | After this we will have first reserved `memblock` for the extended BIOS data area in the `.meminit.data` section. 
`reserve_ebda_region` function finished its work on this step and we can go back to the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head64.c). 405 | 406 | We finished all preparations before the kernel entry point! The last step in the `x86_64_start_reservations` function is the call of the: 407 | 408 | ```C 409 | start_kernel() 410 | ``` 411 | 412 | function from [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) file. 413 | 414 | That's all for this part. 415 | 416 | Conclusion 417 | -------------------------------------------------------------------------------- 418 | 419 | It is the end of the third part about Linux kernel insides. In next part we will see the first initialization steps in the kernel entry point - `start_kernel` function. It will be the first step before we will see launch of the first `init` process. 420 | 421 | If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). 422 | 423 | **Please note that English is not my first language, And I am really sorry for any inconvenience. 
If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [BIOS data area](http://stanislavs.org/helppc/bios_data_area.html)
* [What is in the extended BIOS data area on a PC?](http://www.kryslix.com/nsfaq/Q.6.html)
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md)

--------------------------------------------------------------------------------
/Interrupts/README.md:
--------------------------------------------------------------------------------
# Interrupts and Interrupt Handling

In the following posts, we will cover interrupt and exception handling in the Linux kernel.

* [Interrupts and Interrupt Handling. Part 1.](linux-interrupts-1.md) - describes interrupts and interrupt handling theory.
* [Interrupts in the Linux Kernel](linux-interrupts-2.md) - describes things related to interrupt and exception handling from the early stage.
* [Early interrupt handlers](linux-interrupts-3.md) - describes early interrupt handlers.
* [Interrupt handlers](linux-interrupts-4.md) - describes first non-early interrupt handlers.
* [Implementation of exception handlers](linux-interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
* [Handling non-maskable interrupts](linux-interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
* [External hardware interrupts](linux-interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
* [Non-early initialization of the IRQs](linux-interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
* [Softirq, Tasklets and Workqueues](linux-interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
* [Last part](linux-interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter and here we will see a real hardware driver and some interrupt-related stuff.

--------------------------------------------------------------------------------
/Interrupts/images/kernel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Interrupts/images/kernel.png

--------------------------------------------------------------------------------
/KernelStructures/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/KernelStructures/.gitkeep

--------------------------------------------------------------------------------
/KernelStructures/README.md:
--------------------------------------------------------------------------------
# Internal `system` structures of the Linux kernel

This is not a usual chapter of `linux-insides`. As you may understand from the title, it mostly describes
internal `system` structures of the Linux kernel, like the `Interrupt Descriptor Table`, the `Global Descriptor
Table` and many more.

Most of the information is taken from the official [Intel](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) and [AMD](http://developer.amd.com/resources/developer-guides-manuals/) manuals.

--------------------------------------------------------------------------------
/KernelStructures/linux-kernelstructure-1.md:
--------------------------------------------------------------------------------
interrupt-descriptor table (IDT)
================================================================================

Three general sources of interrupts and exceptions:

* Exceptions - sync;
* Software interrupts - sync;
* External interrupts - async.

Types of Exceptions:

* Faults - are precise exceptions reported on the boundary `before` the instruction causing the exception. The saved `%rip` points to the faulting instruction;
* Traps - are precise exceptions reported on the boundary `following` the instruction causing the exception. The saved `%rip` points to the instruction following the one that caused the trap;
* Aborts - are imprecise exceptions. Because they are imprecise, aborts typically do not allow reliable program restart.

`Maskable` interrupts trigger the interrupt-handling mechanism only when `RFLAGS.IF=1`. Otherwise they are held pending for as long as the `RFLAGS.IF` bit is cleared to 0.

`Nonmaskable` interrupts (NMI) are unaffected by the value of the `RFLAGS.IF` bit. However, the occurrence of an NMI masks further NMIs until an IRET instruction is executed.

Specific exception and interrupt sources are assigned a fixed vector-identification number (also called an “interrupt vector” or simply “vector”). The interrupt vector is used by the interrupt-handling mechanism to locate the system-software service routine assigned to the exception or interrupt. Up to
256 unique interrupt vectors are available. The first 32 vectors are reserved for predefined exception and interrupt conditions.
They are defined in the [arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121) header file:

```
/* Interrupts/Exceptions */
enum {
    X86_TRAP_DE = 0,    /*  0, Divide-by-zero */
    X86_TRAP_DB,        /*  1, Debug */
    X86_TRAP_NMI,       /*  2, Non-maskable Interrupt */
    X86_TRAP_BP,        /*  3, Breakpoint */
    X86_TRAP_OF,        /*  4, Overflow */
    X86_TRAP_BR,        /*  5, Bound Range Exceeded */
    X86_TRAP_UD,        /*  6, Invalid Opcode */
    X86_TRAP_NM,        /*  7, Device Not Available */
    X86_TRAP_DF,        /*  8, Double Fault */
    X86_TRAP_OLD_MF,    /*  9, Coprocessor Segment Overrun */
    X86_TRAP_TS,        /* 10, Invalid TSS */
    X86_TRAP_NP,        /* 11, Segment Not Present */
    X86_TRAP_SS,        /* 12, Stack Segment Fault */
    X86_TRAP_GP,        /* 13, General Protection Fault */
    X86_TRAP_PF,        /* 14, Page Fault */
    X86_TRAP_SPURIOUS,  /* 15, Spurious Interrupt */
    X86_TRAP_MF,        /* 16, x87 Floating-Point Exception */
    X86_TRAP_AC,        /* 17, Alignment Check */
    X86_TRAP_MC,        /* 18, Machine Check */
    X86_TRAP_XF,        /* 19, SIMD Floating-Point Exception */
    X86_TRAP_IRET = 32, /* 32, IRET Exception */
};
```

Error Codes
--------------------------------------------------------------------------------

The processor exception-handling mechanism reports error and status information for some exceptions using an error code. The error code is pushed onto the stack by the exception mechanism during the control transfer into the exception handler. The error code has two formats:

* the selector format, used by most error-reporting exceptions;
* the page-fault format.

Here is the format of the selector error code:

```
 31            16 15             3  2   1   0
+---------------------------------------------+
|                |                | T | I | E |
|    Reserved    | Selector Index | - | D | X |
|                |                | I | T | T |
+---------------------------------------------+
```

Where:

* `EXT` - If this bit is set to 1, the exception source is external to the processor. If cleared to 0, the exception source is internal to the processor;
* `IDT` - If this bit is set to 1, the error-code selector-index field references a gate descriptor located in the `interrupt-descriptor table`. If cleared to 0, the selector-index field references a descriptor in either the `global-descriptor table` or local-descriptor table `LDT`, as indicated by the `TI` bit;
* `TI` - If this bit is set to 1, the error-code selector-index field references a descriptor in the `LDT`. If cleared to 0, the selector-index field references a descriptor in the `GDT`;
* `Selector Index` - The selector-index field specifies the index into either the `GDT`, `LDT`, or `IDT`, as specified by the `IDT` and `TI` bits.

The page-fault error code format is:

```
 31                             4    3   2   1   0
+--------------------------------------------------+
|                            |     | R | U | R | - |
|          Reserved          | I/D | S | - | - | P |
|                            |     | V | S | W | - |
+--------------------------------------------------+
```

Where:

* `I/D` - If this bit is set to 1, it indicates that the access that caused the page fault was an instruction fetch;
* `RSV` - If this bit is set to 1, the page fault is a result of the processor reading a 1 from a reserved field within a page-translation-table entry;
* `U/S` - If this bit is cleared to 0, an access in supervisor mode (`CPL=0, 1, or 2`) caused the page fault.
  If this bit is set to 1, an access in user mode (`CPL=3`) caused the page fault;
* `R/W` - If this bit is cleared to 0, the access that caused the page fault was a memory read. If this bit is set to 1, the memory access that caused the page fault was a write;
* `P` - If this bit is cleared to 0, the page fault was caused by a not-present page. If this bit is set to 1, the page fault was caused by a page-protection violation.

Interrupt Control Transfers
--------------------------------------------------------------------------------

The IDT may contain any of three kinds of gate descriptors:

* `Task Gate` - contains the segment selector for a TSS for an exception and/or interrupt handler task;
* `Interrupt Gate` - contains the segment selector and offset that the processor uses to transfer program execution to a handler procedure in an interrupt handler code segment;
* `Trap Gate` - contains the segment selector and offset that the processor uses to transfer program execution to a handler procedure in an exception handler code segment.
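
Before looking at the gate layout, note that both error code formats from the `Error Codes` section above can be decoded with a few masks and shifts. The following is a minimal sketch that assumes only the bit positions given in that section; all `demo_` names are hypothetical and are not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selector format: bit 0 = EXT, bit 1 = IDT, bit 2 = TI, bits 3..15 = index. */
struct demo_selector_error {
    bool ext;        /* source is external to the processor */
    bool idt;        /* index refers to a gate in the IDT */
    bool ti;         /* index refers to the LDT (1) or the GDT (0) */
    uint16_t index;  /* selector index, 13 bits */
};

static struct demo_selector_error demo_decode_selector_error(uint32_t code)
{
    struct demo_selector_error e;

    e.ext = (code >> 0) & 0x1;
    e.idt = (code >> 1) & 0x1;
    e.ti = (code >> 2) & 0x1;
    e.index = (code >> 3) & 0x1fff;
    return e;
}

/* Page-fault format: bit 0 = P, 1 = R/W, 2 = U/S, 3 = RSV, 4 = I/D. */
struct demo_page_fault_error {
    bool protection;         /* P: 1 = protection violation, 0 = not present */
    bool write;              /* R/W: 1 = write, 0 = read */
    bool user;               /* U/S: 1 = user mode, 0 = supervisor mode */
    bool reserved_bit;       /* RSV: reserved field contained a 1 */
    bool instruction_fetch;  /* I/D: access was an instruction fetch */
};

static struct demo_page_fault_error demo_decode_page_fault_error(uint32_t code)
{
    struct demo_page_fault_error e;

    e.protection = (code >> 0) & 0x1;
    e.write = (code >> 1) & 0x1;
    e.user = (code >> 2) & 0x1;
    e.reserved_bit = (code >> 3) & 0x1;
    e.instruction_fetch = (code >> 4) & 0x1;
    return e;
}
```

For instance, a page-fault error code of `0x6` decodes to a user-mode write to a not-present page.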

The general format of gates is:

```
 127                                                          96
+---------------------------------------------------------------+
|                                                               |
|                           Reserved                            |
|                                                               |
+---------------------------------------------------------------+
 95                                                           64
+---------------------------------------------------------------+
|                                                               |
|                         Offset 63..32                         |
|                                                               |
+---------------------------------------------------------------+
 63                    48 47  46  44   42    39            34 32
+---------------------------------------------------------------+
|                        |   | D |   |     |      |   |   |     |
|     Offset 31..16      | P | P | 0 |Type |0 0 0 | 0 | 0 | IST |
|                        |   | L |   |     |      |   |   |     |
+---------------------------------------------------------------+
 31                           16 15                            0
+---------------------------------------------------------------+
|                               |                               |
|       Segment Selector        |         Offset 15..0          |
|                               |                               |
+---------------------------------------------------------------+
```

Where:

* `Selector` - Segment Selector for destination code segment;
* `Offset` - Offset to handler procedure entry point;
* `DPL` - Descriptor Privilege Level;
* `P` - Segment Present flag;
* `IST` - Interrupt Stack Table;
* `TYPE` - one of: Local descriptor-table (LDT) segment descriptor, Task-state segment (TSS) descriptor, Call-gate descriptor, Interrupt-gate descriptor, Trap-gate descriptor or Task-gate descriptor.
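
Since the handler address is split across three `Offset` fields, reading or writing it means shifting the pieces into place. Below is a minimal sketch under the layout drawn above. The `demo_gate` structure and its helpers are hypothetical names, and the 16-bit `attrs` word simply stands for bits 47..32 of the descriptor (`IST`, `Type`, `DPL` and `P`).

```c
#include <stdint.h>

/* Hypothetical 16-byte mirror of the gate layout shown above. */
struct demo_gate {
    uint16_t offset_low;     /* Offset 15..0 */
    uint16_t segment;        /* Segment Selector */
    uint16_t attrs;          /* bits 47..32: IST, Type, DPL, P */
    uint16_t offset_middle;  /* Offset 31..16 */
    uint32_t offset_high;    /* Offset 63..32 */
    uint32_t reserved;
};

/* Split a 64-bit handler address into the three offset fields. */
static void demo_gate_set_offset(struct demo_gate *g, uint64_t handler)
{
    g->offset_low = (uint16_t)(handler & 0xffff);
    g->offset_middle = (uint16_t)((handler >> 16) & 0xffff);
    g->offset_high = (uint32_t)(handler >> 32);
}

/* Reassemble the 64-bit handler address from the three offset fields. */
static uint64_t demo_gate_offset(const struct demo_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_middle << 16) |
           ((uint64_t)g->offset_high << 32);
}

/* P is bit 47 of the descriptor, which is bit 15 of attrs. */
static int demo_gate_present(const struct demo_gate *g)
{
    return (g->attrs >> 15) & 0x1;
}

/* DPL is bits 46..45 of the descriptor, which are bits 14..13 of attrs. */
static int demo_gate_dpl(const struct demo_gate *g)
{
    return (g->attrs >> 13) & 0x3;
}
```

Keeping the attribute bits in one plain 16-bit word avoids depending on how a particular compiler lays out bit-fields, which is a convenience of this sketch rather than something the layout requires.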

An `IDT` descriptor is represented by the following structure in the Linux kernel (only for `x86_64`):

```C
struct gate_struct64 {
    u16 offset_low;
    u16 segment;
    unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
    u16 offset_middle;
    u32 offset_high;
    u32 zero1;
} __attribute__((packed));
```

which is defined in the [arch/x86/include/asm/desc_defs.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/desc_defs.h#L51) header file.

A task gate descriptor does not contain the `IST` field and its format differs from that of interrupt/trap gates:

```C
struct ldttss_desc64 {
    u16 limit0;
    u16 base0;
    unsigned base1 : 8, type : 5, dpl : 2, p : 1;
    unsigned limit1 : 4, zero0 : 3, g : 1, base2 : 8;
    u32 base3;
    u32 zero1;
} __attribute__((packed));
```

Exceptions During a Task Switch
--------------------------------------------------------------------------------

An exception can occur during a task switch while loading a segment selector. Page faults can also occur when accessing a TSS. In these cases, the hardware task-switch mechanism completes loading the new task state from the TSS, and then triggers the appropriate exception mechanism.

**In long mode, an exception cannot occur during a task switch, because the hardware task-switch mechanism is disabled.**

Nonmaskable interrupt
--------------------------------------------------------------------------------

**TODO**

API
--------------------------------------------------------------------------------

**TODO**

Interrupt Stack Table
--------------------------------------------------------------------------------

**TODO**

--------------------------------------------------------------------------------
/LINKS.md:
--------------------------------------------------------------------------------
Useful links
========================

Linux boot
------------------------

* [Linux/x86 boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt)
* [Linux kernel parameters](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst)

Protected mode
------------------------

* [64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)

Memory management in the Linux kernel
--------------------------------------

* [Notes on the linux kernel VM subsystem by @lorenzo-stoakes](https://github.com/lorenzo-stoakes/linux-vm-notes)

Serial programming
------------------------

* [8250 UART Programming](http://en.wikibooks.org/wiki/Serial_Programming/8250_UART_Programming#UART_Registers)
* [Serial ports on OSDEV](http://wiki.osdev.org/Serial_Ports)

VGA
------------------------

* [Video Graphics Array (VGA)](http://en.wikipedia.org/wiki/Video_Graphics_Array)

IO
------------------------

* [IO port programming](http://www.tldp.org/HOWTO/text/IO-Port-Programming)
36 | GCC and GAS 37 | ------------------------ 38 | 39 | * [GCC type attributes](https://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html) 40 | * [Assembler Directives](http://www.chemie.fu-berlin.de/chemnet/use/info/gas/gas_toc.html#TOC65) 41 | 42 | 43 | Important data structures 44 | -------------------------- 45 | 46 | * [task_struct definition](http://lxr.free-electrons.com/source/include/linux/sched.h#L1274) 47 | 48 | 49 | Useful links 50 | ------------------------ 51 | 52 | * [Linux x86 Program Start Up](http://dbp-consulting.com/tutorials/debugging/linuxProgramStartup.html) 53 | * [Memory Layout in Program Execution (32 bits)](http://fgiasson.com/articles/memorylayout.txt) 54 | -------------------------------------------------------------------------------- /MM/README.md: -------------------------------------------------------------------------------- 1 | # Linux kernel memory management 2 | 3 | This chapter describes memory management in the Linux kernel. You will see here a 4 | couple of posts which describe different parts of the Linux memory management framework: 5 | 6 | * [Memblock](linux-mm-1.md) - describes early `memblock` allocator. 7 | * [Fix-Mapped Addresses and ioremap](linux-mm-2.md) - describes `fix-mapped` addresses and early `ioremap`. 8 | * [kmemcheck](linux-mm-3.md) - third part describes `kmemcheck` tool. 
9 | -------------------------------------------------------------------------------- /MM/images/kernel_configuration_menu1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/MM/images/kernel_configuration_menu1.png -------------------------------------------------------------------------------- /MM/images/kernel_configuration_menu2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/MM/images/kernel_configuration_menu2.png -------------------------------------------------------------------------------- /MM/images/memblock.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/MM/images/memblock.png -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | ### HELP 2 | 3 | .PHONY: help 4 | help: ## Print help 5 | @egrep "(^### |^\S+:.*##\s)" Makefile | sed 's/^###\s*//' | sed 's/^\(\S*\)\:.*##\s*\(.*\)/ \1 - \2/' 6 | 7 | ### DOCKER 8 | 9 | .PHONY: run 10 | run: image ## docker run ... 11 | (docker stop linux-insides-book 2>&1) > /dev/null || true 12 | docker run --detach --rm -p 4000:4000 --name linux-insides-book --hostname linux-insides-book linux-insides-book 13 | 14 | .PHONY: image 15 | image: ## docker image build ... 16 | docker image build --rm --squash --label linux-insides --tag linux-insides-book:latest -f Dockerfile . 2> /dev/null || \ 17 | docker image build --rm --label linux-insides --tag linux-insides-book:latest -f Dockerfile . 
18 | 19 | ### LAUNCH BROWSER 20 | 21 | .PHONY: browse 22 | browse: ## Launch broweser 23 | @timeout 60 sh -c 'until nc -z 127.0.0.1 4000; do sleep 1; done' || true 24 | @(uname | grep Darwin > /dev/null) && open http://127.0.0.1:4000 || true 25 | @(uname | grep Linux > /dev/null) && xdg-open http://127.0.0.1:4000 || true 26 | -------------------------------------------------------------------------------- /Misc/README.md: -------------------------------------------------------------------------------- 1 | # Misc 2 | 3 | This chapter contains parts which are not directly related to the Linux kernel source code and implementation of different subsystems. 4 | -------------------------------------------------------------------------------- /Misc/images/dgap_menu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Misc/images/dgap_menu.png -------------------------------------------------------------------------------- /Misc/images/git_diff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Misc/images/git_diff.png -------------------------------------------------------------------------------- /Misc/images/github.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Misc/images/github.png -------------------------------------------------------------------------------- /Misc/images/google_linux.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Misc/images/google_linux.png -------------------------------------------------------------------------------- 
/Misc/images/menuconfig.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Misc/images/menuconfig.png

--------------------------------------------------------------------------------
/Misc/images/nconfig.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Misc/images/nconfig.png

--------------------------------------------------------------------------------
/Misc/images/qemu.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Misc/images/qemu.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
linux-insides
===============

A book-in-progress about the linux kernel and its insides.

**The goal is simple** - to share my modest knowledge about the insides of the linux kernel and help people who are interested in linux kernel insides, and other low-level subject matter. Feel free to go through the book: [Start here](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md)

**Questions/Suggestions**: Feel free to send any questions or suggestions by pinging me on twitter [@0xAX](https://twitter.com/0xAX), adding an [issue](https://github.com/0xAX/linux-insides/issues/new) or just dropping me an [email](mailto:anotherworldofworld@gmail.com).

Generating eBooks and PDFs - [documentation](https://github.com/GitbookIO/gitbook/blob/master/docs/ebook.md)

# Mailing List

We have a Google Group mailing list for learning the kernel source code. Here are some instructions about how to use it.

#### Join

Send an email with any subject/content to `kernelhacking+subscribe@googlegroups.com`. Then you will receive a confirmation email. Reply to it with any content and then you are done.

> If you have a Google account, you can also open the [archive page](https://groups.google.com/forum/#!forum/kernelhacking) and click **Apply to join group**. You will be approved automatically.

#### Send emails to mailing list

Just send emails to `kernelhacking@googlegroups.com`. The basic usage is the same as other mailing lists powered by mailman.

#### Archives

https://groups.google.com/forum/#!forum/kernelhacking

In other languages
-------------------

* [Brazilian Portuguese](https://github.com/mauri870/linux-insides)
* [Chinese](https://github.com/MintCN/linux-insides-zh)
* [Japanese](https://github.com/tkmru/linux-insides-ja)
* [Korean](https://github.com/junsooo/linux-insides-ko)
* [Russian](https://github.com/proninyaroslav/linux-insides-ru)
* [Spanish](https://github.com/leolas95/linux-insides)
* [Turkish](https://github.com/ayyucedemirbas/linux-insides_Turkish)

Docker
------

In order to run your own copy of the book with gitbook within a local container:

1. Enable Docker experimental features with vim or another text editor
```bash
sudo vim /usr/lib/systemd/system/docker.service
```

Then add `--experimental=true` to the end of the `ExecStart=/usr/bin/dockerd -H fd://` line and save.

E.g.: *ExecStart=/usr/bin/dockerd -H fd:// --experimental=true*

Then, you need to reload and restart the Docker daemon:
```bash
systemctl daemon-reload
systemctl restart docker.service
```

2. Run docker image
```bash
make run
```

3.
Open your local copy of linux insides book under this url 67 | http://localhost:4000 or run `make browse` 68 | 69 | 70 | Contributions 71 | -------------- 72 | 73 | Feel free to create issues or pull-requests if you have any problems. 74 | 75 | **Please read [CONTRIBUTING.md](https://github.com/0xAX/linux-insides/blob/master/CONTRIBUTING.md) before pushing any changes.** 76 | 77 | ![linux-kernel](Assets/linux-kernel.png) 78 | 79 | Author 80 | --------------- 81 | 82 | [@0xAX](https://twitter.com/0xAX) 83 | 84 | LICENSE 85 | ------------- 86 | 87 | Licensed [BY-NC-SA Creative Commons](http://creativecommons.org/licenses/by-nc-sa/4.0/). 88 | -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | ### Summary 2 | 3 | * [Booting](Booting/README.md) 4 | * [From bootloader to kernel](Booting/linux-bootstrap-1.md) 5 | * [First steps in the kernel setup code](Booting/linux-bootstrap-2.md) 6 | * [Video mode initialization and transition to protected mode](Booting/linux-bootstrap-3.md) 7 | * [Transition to 64-bit mode](Booting/linux-bootstrap-4.md) 8 | * [Kernel decompression](Booting/linux-bootstrap-5.md) 9 | * [Kernel load address randomization](Booting/linux-bootstrap-6.md) 10 | * [Initialization](Initialization/README.md) 11 | * [First steps in the kernel](Initialization/linux-initialization-1.md) 12 | * [Early interrupts handler](Initialization/linux-initialization-2.md) 13 | * [Last preparations before the kernel entry point](Initialization/linux-initialization-3.md) 14 | * [Kernel entry point](Initialization/linux-initialization-4.md) 15 | * [Continue architecture-specific boot-time initializations](Initialization/linux-initialization-5.md) 16 | * [Architecture-specific initializations, again...](Initialization/linux-initialization-6.md) 17 | * [End of the architecture-specific initializations, almost...](Initialization/linux-initialization-7.md) 
18 | * [Scheduler initialization](Initialization/linux-initialization-8.md) 19 | * [RCU initialization](Initialization/linux-initialization-9.md) 20 | * [End of initialization](Initialization/linux-initialization-10.md) 21 | * [Interrupts](Interrupts/README.md) 22 | * [Introduction](Interrupts/linux-interrupts-1.md) 23 | * [Start to dive into interrupts](Interrupts/linux-interrupts-2.md) 24 | * [Interrupt handlers](Interrupts/linux-interrupts-3.md) 25 | * [Initialization of non-early interrupt gates](Interrupts/linux-interrupts-4.md) 26 | * [Implementation of some exception handlers](Interrupts/linux-interrupts-5.md) 27 | * [Handling Non-Maskable interrupts](Interrupts/linux-interrupts-6.md) 28 | * [Dive into external hardware interrupts](Interrupts/linux-interrupts-7.md) 29 | * [Initialization of external hardware interrupts structures](Interrupts/linux-interrupts-8.md) 30 | * [Softirq, Tasklets and Workqueues](Interrupts/linux-interrupts-9.md) 31 | * [Last part](Interrupts/linux-interrupts-10.md) 32 | * [System calls](SysCall/README.md) 33 | * [Introduction to system calls](SysCall/linux-syscall-1.md) 34 | * [How the Linux kernel handles a system call](SysCall/linux-syscall-2.md) 35 | * [vsyscall and vDSO](SysCall/linux-syscall-3.md) 36 | * [How the Linux kernel runs a program](SysCall/linux-syscall-4.md) 37 | * [Implementation of the open system call](SysCall/linux-syscall-5.md) 38 | * [Limits on resources in Linux](SysCall/linux-syscall-6.md) 39 | * [Timers and time management](Timers/README.md) 40 | * [Introduction](Timers/linux-timers-1.md) 41 | * [Clocksource framework](Timers/linux-timers-2.md) 42 | * [The tick broadcast framework and dyntick](Timers/linux-timers-3.md) 43 | * [Introduction to timers](Timers/linux-timers-4.md) 44 | * [Clockevents framework](Timers/linux-timers-5.md) 45 | * [x86 related clock sources](Timers/linux-timers-6.md) 46 | * [Time related system calls](Timers/linux-timers-7.md) 47 | * [Synchronization primitives](SyncPrim/README.md) 
48 | * [Introduction to spinlocks](SyncPrim/linux-sync-1.md) 49 | * [Queued spinlocks](SyncPrim/linux-sync-2.md) 50 | * [Semaphores](SyncPrim/linux-sync-3.md) 51 | * [Mutex](SyncPrim/linux-sync-4.md) 52 | * [Reader/Writer semaphores](SyncPrim/linux-sync-5.md) 53 | * [SeqLock](SyncPrim/linux-sync-6.md) 54 | * [RCU]() 55 | * [Lockdep]() 56 | * [Memory management](MM/README.md) 57 | * [Memblock](MM/linux-mm-1.md) 58 | * [Fixmaps and ioremap](MM/linux-mm-2.md) 59 | * [kmemcheck](MM/linux-mm-3.md) 60 | * [Cgroups](Cgroups/README.md) 61 | * [Introduction to Control Groups](Cgroups/linux-cgroups-1.md) 62 | * [SMP]() 63 | * [Concepts](Concepts/README.md) 64 | * [Per-CPU variables](Concepts/linux-cpu-1.md) 65 | * [Cpumasks](Concepts/linux-cpu-2.md) 66 | * [The initcall mechanism](Concepts/linux-cpu-3.md) 67 | * [Notification Chains](Concepts/linux-cpu-4.md) 68 | * [Data Structures in the Linux Kernel](DataStructures/README.md) 69 | * [Doubly linked list](DataStructures/linux-datastructures-1.md) 70 | * [Radix tree](DataStructures/linux-datastructures-2.md) 71 | * [Bit arrays](DataStructures/linux-datastructures-3.md) 72 | * [Theory](Theory/README.md) 73 | * [Paging](Theory/linux-theory-1.md) 74 | * [Elf64](Theory/linux-theory-2.md) 75 | * [Inline assembly](Theory/linux-theory-3.md) 76 | * [CPUID]() 77 | * [MSR]() 78 | * [Initial ram disk]() 79 | * [initrd]() 80 | * [Misc](Misc/README.md) 81 | * [Linux kernel development](Misc/linux-misc-1.md) 82 | * [How the kernel is compiled](Misc/linux-misc-2.md) 83 | * [Linkers](Misc/linux-misc-3.md) 84 | * [Program startup process in userspace](Misc/linux-misc-4.md) 85 | * [Write and Submit your first Linux kernel Patch]() 86 | * [Data types in the kernel]() 87 | * [KernelStructures](KernelStructures/README.md) 88 | * [IDT](KernelStructures/linux-kernelstructure-1.md) 89 | * [Useful links](LINKS.md) 90 | * [Contributors](contributors.md) 91 | -------------------------------------------------------------------------------- 
/Scripts/LinuxKernelInsides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Scripts/LinuxKernelInsides.pdf

--------------------------------------------------------------------------------
/Scripts/README.md:
--------------------------------------------------------------------------------
# Scripts

## Description

`get_all_links.py` : checks whether each link is live or dead (requires a network connection)

`latex.sh` : a script for converting Markdown files in each of the subdirectories into a unified PDF typeset in LaTeX

## Usage

`get_all_links.py` :

```
./get_all_links.py ../
```

`latex.sh` :

```
./latex.sh
```

--------------------------------------------------------------------------------
/Scripts/get_all_links.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

from __future__ import print_function
from socket import timeout

import os
import sys
import codecs
import re

import markdown

try:
    # compatible for python2
    from urllib2 import urlopen
    from urllib2 import HTTPError
    from urllib2 import URLError
except ImportError:
    # compatible for python3
    from urllib.request import urlopen
    from urllib.error import HTTPError
    from urllib.error import URLError

def check_live_url(url):

    result = False
    try:
        ret = urlopen(url, timeout=2)
        result = (ret.code == 200)
    except HTTPError as e:
        print(e, file=sys.stderr)
    except URLError as e:
        print(e, file=sys.stderr)
    except timeout as e:
        print(e, file=sys.stderr)
    except Exception as e:
        print(e, file=sys.stderr)

    return result


def main(path):

    filenames = []
    for
(dirpath, dnames, fnames) in os.walk(path): 46 | for fname in fnames: 47 | if fname.endswith('.md'): 48 | filenames.append(os.sep.join([dirpath, fname])) 49 | 50 | urls = [] 51 | 52 | for filename in filenames: 53 | fd = codecs.open(filename, mode="r", encoding="utf-8") 54 | for line in fd.readlines(): 55 | refs = re.findall(r'(?<=rating < cs->rating) 37 | break; 38 | entry = &tmp->list; 39 | } 40 | list_add(&cs->list, entry); 41 | } 42 | ``` 43 | 44 | If two parallel processes will try to do it simultaneously, both process may found the same `entry` may occur [race condition](https://en.wikipedia.org/wiki/Race_condition) or in other words, the second process which will execute `list_add`, will overwrite a clock source from the first thread. 45 | 46 | Besides this simple example, synchronization primitives are ubiquitous in the Linux kernel. If we will go through the previous [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) or other chapters again or if we will look at the Linux kernel source code in general, we will meet many places like this. We will not consider how `mutex` is implemented in the Linux kernel. Actually, the Linux kernel provides a set of different synchronization primitives like: 47 | 48 | * `mutex`; 49 | * `semaphores`; 50 | * `seqlocks`; 51 | * `atomic operations`; 52 | * etc. 53 | 54 | We will start this chapter from the `spinlock`. 55 | 56 | Spinlocks in the Linux kernel. 57 | -------------------------------------------------------------------------------- 58 | 59 | The `spinlock` is a low-level synchronization mechanism which in simple words, represents a variable which can be in two states: 60 | 61 | * `acquired`; 62 | * `released`. 63 | 64 | Each process which wants to acquire a `spinlock`, must write a value which represents `spinlock acquired` state to this variable and write `spinlock released` state to the variable. 
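The two-state variable described above can be modelled in userspace with a C11 `atomic_flag`. This is only a sketch to make the idea concrete: the names `toy_spinlock_t`, `toy_spin_trylock`, `toy_spin_lock`, `toy_spin_unlock` and `toy_spinlock_demo` are hypothetical, the acquire/release memory orderings are an assumption of the sketch, and this is not how the Linux kernel implements its `spinlock_t`.

```C
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model of a two-state lock, not the kernel's spinlock_t. */
typedef struct {
	atomic_flag locked;	/* clear = released, set = acquired */
} toy_spinlock_t;

#define TOY_SPINLOCK_UNLOCKED { ATOMIC_FLAG_INIT }

/* Try once: returns 1 if we moved the lock from released to acquired. */
static inline int toy_spin_trylock(toy_spinlock_t *lock)
{
	return !atomic_flag_test_and_set_explicit(&lock->locked,
						  memory_order_acquire);
}

/* Busy-wait (spin) until the lock is acquired. */
static inline void toy_spin_lock(toy_spinlock_t *lock)
{
	while (!toy_spin_trylock(lock))
		;	/* somebody else holds the lock, keep spinning */
}

/* Write the "released" state back to the variable. */
static inline void toy_spin_unlock(toy_spinlock_t *lock)
{
	atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}

/* Single-threaded walk through both states. */
static inline void toy_spinlock_demo(void)
{
	toy_spinlock_t lock = TOY_SPINLOCK_UNLOCKED;
	int got;

	toy_spin_lock(&lock);		/* released -> acquired */
	got = toy_spin_trylock(&lock);
	assert(got == 0);		/* already acquired, so this fails */
	toy_spin_unlock(&lock);		/* acquired -> released */
	got = toy_spin_trylock(&lock);
	assert(got == 1);		/* free again, so this succeeds */
	toy_spin_unlock(&lock);
}
```

The important property is that testing and setting the flag happen as one atomic step; with a separate "read, then write" two threads could both observe the released state and both believe they own the lock.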
If a process tries to execute code which is protected by a `spinlock`, it will spin (busy-wait) until the process which holds this lock releases it. In this case all related operations must be [atomic](https://en.wikipedia.org/wiki/Linearizability) to prevent [race conditions](https://en.wikipedia.org/wiki/Race_condition). The `spinlock` is represented by the `spinlock_t` type in the Linux kernel. If we look at the Linux kernel code, we will see that this type is [widely](http://lxr.free-electrons.com/ident?i=spinlock_t) used. The `spinlock_t` is defined as: 65 | 66 | ```C 67 | typedef struct spinlock { 68 | union { 69 | struct raw_spinlock rlock; 70 | 71 | #ifdef CONFIG_DEBUG_LOCK_ALLOC 72 | # define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map)) 73 | struct { 74 | u8 __padding[LOCK_PADSIZE]; 75 | struct lockdep_map dep_map; 76 | }; 77 | #endif 78 | }; 79 | } spinlock_t; 80 | ``` 81 | 82 | and is located in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We may see that its implementation depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option. We will skip this for now, because all debugging related stuff will be at the end of this part. So, if the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option is disabled, the `spinlock_t` contains a [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with a single field, `raw_spinlock`: 83 | 84 | ```C 85 | typedef struct spinlock { 86 | union { 87 | struct raw_spinlock rlock; 88 | }; 89 | } spinlock_t; 90 | ``` 91 | 92 | The `raw_spinlock` structure defined in the [same](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file represents the implementation of a `normal` spinlock.
Let's look how the `raw_spinlock` structure is defined: 93 | 94 | ```C 95 | typedef struct raw_spinlock { 96 | arch_spinlock_t raw_lock; 97 | #ifdef CONFIG_DEBUG_SPINLOCK 98 | unsigned int magic, owner_cpu; 99 | void *owner; 100 | #endif 101 | #ifdef CONFIG_DEBUG_LOCK_ALLOC 102 | struct lockdep_map dep_map; 103 | #endif 104 | } raw_spinlock_t; 105 | ``` 106 | 107 | where the `arch_spinlock_t` represents architecture-specific `spinlock` implementation. As we mentioned above, we will skip debugging kernel configuration options. As we focus on [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture in this book, the `arch_spinlock_t` that we will consider is defined in the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file and looks: 108 | 109 | ```C 110 | typedef struct qspinlock { 111 | union { 112 | atomic_t val; 113 | struct { 114 | u8 locked; 115 | u8 pending; 116 | }; 117 | struct { 118 | u16 locked_pending; 119 | u16 tail; 120 | }; 121 | }; 122 | } arch_spinlock_t; 123 | ``` 124 | 125 | We will not stop on this structures for now. Let's look at the operations on a `spinlock`. The Linux kernel provides following main operations on a `spinlock`: 126 | 127 | * `spin_lock_init` - produces initialization of the given `spinlock`; 128 | * `spin_lock` - acquires given `spinlock`; 129 | * `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquire given `spinlock`; 130 | * `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on local processor, preserve/not preserve previous interrupt state in the `flags` and acquire given `spinlock`; 131 | * `spin_unlock` - releases given `spinlock`; 132 | * `spin_unlock_bh` - releases given `spinlock` and enables software interrupts; 133 | * `spin_is_locked` - returns the state of the given `spinlock`; 134 | * and etc. 135 | 136 | Let's look on the implementation of the `spin_lock_init` macro. 
This macro and the other ones are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file and the `spin_lock_init` macro looks like this: 137 | 138 | ```C 139 | #define spin_lock_init(_lock) \ 140 | do { \ 141 | spinlock_check(_lock); \ 142 | raw_spin_lock_init(&(_lock)->rlock); \ 143 | } while (0) 144 | ``` 145 | 146 | As we may see, the `spin_lock_init` macro takes a `spinlock` and executes two operations: it checks the given `spinlock` and executes `raw_spin_lock_init`. The implementation of `spinlock_check` is pretty simple: this function just returns the `raw_spinlock_t` of the given `spinlock`, to be sure that we got exactly a `normal` raw spinlock: 147 | 148 | ```C 149 | static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock) 150 | { 151 | return &lock->rlock; 152 | } 153 | ``` 154 | 155 | The `raw_spin_lock_init` macro: 156 | 157 | ```C 158 | # define raw_spin_lock_init(lock) \ 159 | do { \ 160 | *(lock) = __RAW_SPIN_LOCK_UNLOCKED(lock); \ 161 | } while (0) 162 | ``` 163 | 164 | assigns the value of `__RAW_SPIN_LOCK_UNLOCKED`, expanded for the given `spinlock`, to the given `raw_spinlock_t`. As we may understand from the name of the `__RAW_SPIN_LOCK_UNLOCKED` macro, it initializes the given `spinlock` and sets it to the `released` state.
This macro is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file and expands to the following macros: 165 | 166 | ```C 167 | #define __RAW_SPIN_LOCK_UNLOCKED(lockname) \ 168 | (raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname) 169 | 170 | #define __RAW_SPIN_LOCK_INITIALIZER(lockname) \ 171 | { \ 172 | .raw_lock = __ARCH_SPIN_LOCK_UNLOCKED, \ 173 | SPIN_DEBUG_INIT(lockname) \ 174 | SPIN_DEP_MAP_INIT(lockname) \ 175 | } 176 | ``` 177 | 178 | As I already wrote above, we will not consider stuff which is related to debugging of synchronization primitives, so we will not consider the `SPIN_DEBUG_INIT` and the `SPIN_DEP_MAP_INIT` macros. Without them, the `__RAW_SPIN_LOCK_UNLOCKED` macro will effectively be expanded to: 179 | 180 | ```C 181 | *(&(_lock)->rlock) = __ARCH_SPIN_LOCK_UNLOCKED; 182 | ``` 183 | 184 | where the `__ARCH_SPIN_LOCK_UNLOCKED` is: 185 | 186 | ```C 187 | #define __ARCH_SPIN_LOCK_UNLOCKED { { .val = ATOMIC_INIT(0) } } 188 | ``` 189 | 190 | for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. So, after the expansion of the `spin_lock_init` macro, the given `spinlock` will be initialized and its state will be `unlocked`. 191 | 192 | Now we know how to initialize a `spinlock`, so let's consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for manipulating `spinlocks`. The first is: 193 | 194 | ```C 195 | static __always_inline void spin_lock(spinlock_t *lock) 196 | { 197 | raw_spin_lock(&lock->rlock); 198 | } 199 | ``` 200 | 201 | function which allows us to `acquire` a `spinlock`.
The `raw_spin_lock` macro is defined in the same header file and expands to the call of `_raw_spin_lock`: 202 | 203 | ```C 204 | #define raw_spin_lock(lock) _raw_spin_lock(lock) 205 | ``` 206 | 207 | Where `_raw_spin_lock` is defined depends on whether `CONFIG_SMP` option is set and `CONFIG_INLINE_SPIN_LOCK` option is set. If the [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) is disabled, `_raw_spin_lock` is defined in the [include/linux/spinlock_api_up.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_api_up.h) header file as a macro and looks like: 208 | 209 | ```C 210 | #define _raw_spin_lock(lock) __LOCK(lock) 211 | ``` 212 | 213 | If the SMP is enabled and `CONFIG_INLINE_SPIN_LOCK` is set, it is defined in [include/linux/spinlock_api_smp.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_api_smp.h) header file as the following: 214 | 215 | ```C 216 | #define _raw_spin_lock(lock) __raw_spin_lock(lock) 217 | ``` 218 | 219 | If the SMP is enabled and `CONFIG_INLINE_SPIN_LOCK` is not set, it is defined in [kernel/locking/spinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/spinlock.c) source code file as the following: 220 | 221 | ```C 222 | void __lockfunc _raw_spin_lock(raw_spinlock_t *lock) 223 | { 224 | __raw_spin_lock(lock); 225 | } 226 | ``` 227 | 228 | Here we will consider the latter form of `_raw_spin_lock`. 
The `__raw_spin_lock` function looks like this: 229 | 230 | ```C 231 | static inline void __raw_spin_lock(raw_spinlock_t *lock) 232 | { 233 | preempt_disable(); 234 | spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); 235 | LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock); 236 | } 237 | ``` 238 | 239 | As you may see, first of all we disable [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) by the call of the `preempt_disable` macro from the [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) (you may read more about this in the ninth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-9) of the Linux kernel initialization process chapter). When we unlock the given `spinlock`, preemption will be enabled again: 240 | 241 | ```C 242 | static inline void __raw_spin_unlock(raw_spinlock_t *lock) 243 | { 244 | ... 245 | ... 246 | ... 247 | preempt_enable(); 248 | } 249 | ``` 250 | 251 | We need to do this to prevent other processes from preempting the process while it is spinning on a lock.
The `spin_acquire` macro which through a chain of other macros expands to the call of the: 252 | 253 | ```C 254 | #define spin_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i) 255 | #define lock_acquire_exclusive(l, s, t, n, i) lock_acquire(l, s, t, 0, 1, n, i) 256 | ``` 257 | 258 | The `lock_acquire` function: 259 | 260 | ```C 261 | void lock_acquire(struct lockdep_map *lock, unsigned int subclass, 262 | int trylock, int read, int check, 263 | struct lockdep_map *nest_lock, unsigned long ip) 264 | { 265 | unsigned long flags; 266 | 267 | if (unlikely(current->lockdep_recursion)) 268 | return; 269 | 270 | raw_local_irq_save(flags); 271 | check_flags(flags); 272 | 273 | current->lockdep_recursion = 1; 274 | trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip); 275 | __lock_acquire(lock, subclass, trylock, read, check, 276 | irqs_disabled_flags(flags), nest_lock, ip, 0, 0); 277 | current->lockdep_recursion = 0; 278 | raw_local_irq_restore(flags); 279 | } 280 | ``` 281 | 282 | As I wrote above, we will not consider stuff here which is related to debugging or tracing. The main point of the `lock_acquire` function is to disable hardware interrupts by the call of the `raw_local_irq_save` macro, because the given spinlock might be acquired with enabled hardware interrupts. In this way the process will not be preempted. Note that in the end of the `lock_acquire` function we will enable hardware interrupts again with the help of the `raw_local_irq_restore` macro. As you already may guess, the main work will be in the `__lock_acquire` function which is defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c) source code file. 283 | 284 | The `__lock_acquire` function looks big. We will try to understand what this function does, but not in this part. 
Actually this function is mostly related to the Linux kernel [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) and it is not the topic of this part. If we return to the definition of the `__raw_spin_lock` function, we will see that it contains the following statement at the end: 285 | 286 | ```C 287 | LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock); 288 | ``` 289 | 290 | The `LOCK_CONTENDED` macro is defined in the [include/linux/lockdep.h](https://github.com/torvalds/linux/blob/master/include/linux/lockdep.h) header file and just calls the given function with the given `spinlock`: 291 | 292 | ```C 293 | #define LOCK_CONTENDED(_lock, try, lock) \ 294 | lock(_lock) 295 | ``` 296 | 297 | In our case, the `lock` is the `do_raw_spin_lock` function from the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file and the `_lock` is the given `raw_spinlock_t`: 298 | 299 | ```C 300 | static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock) 301 | { 302 | __acquire(lock); 303 | arch_spin_lock(&lock->raw_lock); 304 | } 305 | ``` 306 | 307 | The `__acquire` here is just a [Sparse](https://en.wikipedia.org/wiki/Sparse) related macro and we are not interested in it at this moment. The `arch_spin_lock` macro is defined in the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file as the following: 308 | 309 | ```C 310 | #define arch_spin_lock(l) queued_spin_lock(l) 311 | ``` 312 | 313 | We stop here for this part. In the next part, we'll dive into how queued spinlocks work and related concepts. 314 | 315 | Conclusion 316 | -------------------------------------------------------------------------------- 317 | 318 | This concludes the first part covering synchronization primitives in the Linux kernel. In this part, we met the first synchronization primitive provided by the Linux kernel, the `spinlock`.
In the next part we will continue to dive into this interesting theme and will see other `synchronization` related stuff. 319 | 320 | If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](mailto:anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new). 321 | 322 | **Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** 323 | 324 | Links 325 | -------------------------------------------------------------------------------- 326 | 327 | * [Concurrent computing](https://en.wikipedia.org/wiki/Concurrent_computing) 328 | * [Synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) 329 | * [Clocksource framework](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-2) 330 | * [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) 331 | * [Race condition](https://en.wikipedia.org/wiki/Race_condition) 332 | * [Atomic operations](https://en.wikipedia.org/wiki/Linearizability) 333 | * [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) 334 | * [x86_64](https://en.wikipedia.org/wiki/X86-64) 335 | * [Interrupts](https://en.wikipedia.org/wiki/Interrupt) 336 | * [Preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) 337 | * [Linux kernel lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) 338 | * [Sparse](https://en.wikipedia.org/wiki/Sparse) 339 | * [xadd instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html) 340 | * [NOP](https://en.wikipedia.org/wiki/NOP) 341 | * [Memory barriers](https://www.kernel.org/doc/Documentation/memory-barriers.txt) 342 | * [Previous chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) 343 | 
-------------------------------------------------------------------------------- /SysCall/README.md: -------------------------------------------------------------------------------- 1 | # System calls 2 | 3 | This chapter describes the `system call` concept in the Linux kernel. 4 | 5 | * [Introduction to system call concept](linux-syscall-1.md) - this part is introduction to the `system call` concept in the Linux kernel. 6 | * [How the Linux kernel handles a system call](linux-syscall-2.md) - this part describes how the Linux kernel handles a system call from a userspace application. 7 | * [vsyscall and vDSO](linux-syscall-3.md) - third part describes `vsyscall` and `vDSO` concepts. 8 | * [How the Linux kernel runs a program](linux-syscall-4.md) - this part describes startup process of a program. 9 | * [Implementation of the open system call](linux-syscall-5.md) - this part describes implementation of the [open](http://man7.org/linux/man-pages/man2/open.2.html) system call. 10 | * [Limits on resources in Linux](linux-syscall-6.md) - this part describes implementation of the [getrlimit/setrlimit](https://linux.die.net/man/2/getrlimit) system calls. 11 | -------------------------------------------------------------------------------- /SysCall/images/ls_shell.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/SysCall/images/ls_shell.png -------------------------------------------------------------------------------- /SysCall/linux-syscall-6.md: -------------------------------------------------------------------------------- 1 | Limits on resources in Linux 2 | ================================================================================ 3 | 4 | Each process in the system uses certain amount of different resources like files, CPU time, memory and so on. 
5 | 6 | Such resources are not infinite, so we should have an instrument to manage how much of them each process may use. Sometimes it is useful to know the current limit for a certain resource or to change its value. In this post we will consider the instruments that allow us to get information about limits for a process and to increase or decrease such limits. 7 | 8 | We will start from the userspace view and then we will look at how it is implemented in the Linux kernel. 9 | 10 | There are three fundamental [system calls](https://en.wikipedia.org/wiki/System_call) to manage resource limits for a process: 11 | 12 | * `getrlimit` 13 | * `setrlimit` 14 | * `prlimit` 15 | 16 | The first two allow a process to read and set limits on a system resource. The last one is an extension of the previous functions: `prlimit` allows us to set and read the resource limits of a process specified by [PID](https://en.wikipedia.org/wiki/Process_identifier). The definitions of these functions look as follows: 17 | 18 | The `getrlimit` is: 19 | 20 | ```C 21 | int getrlimit(int resource, struct rlimit *rlim); 22 | ``` 23 | 24 | The `setrlimit` is: 25 | 26 | ```C 27 | int setrlimit(int resource, const struct rlimit *rlim); 28 | ``` 29 | 30 | And the definition of the `prlimit` is: 31 | 32 | ```C 33 | int prlimit(pid_t pid, int resource, const struct rlimit *new_limit, 34 | struct rlimit *old_limit); 35 | ``` 36 | 37 | In the first two cases, the functions take two parameters: 38 | 39 | * `resource` - represents resource type (we will see available types later); 40 | * `rlim` - combination of `soft` and `hard` limits. 41 | 42 | There are two types of limits: 43 | 44 | * `soft` 45 | * `hard` 46 | 47 | The first is the actual limit that is enforced for a resource of a process. The second is a ceiling for the `soft` limit; an unprivileged process may only lower it, and only a privileged process (superuser) can raise it. So, the `soft` limit can never exceed the related `hard` limit.
48 | 49 | Both these values are combined in the `rlimit` structure: 50 | 51 | ```C 52 | struct rlimit { 53 | rlim_t rlim_cur; 54 | rlim_t rlim_max; 55 | }; 56 | ``` 57 | 58 | The last one function looks a little bit complex and takes `4` arguments. Besides `resource` argument, it takes: 59 | 60 | * `pid` - specifies an ID of a process on which the `prlimit` should be executed; 61 | * `new_limit` - provides new limits values if it is not `NULL`; 62 | * `old_limit` - current `soft` and `hard` limits will be placed here if it is not `NULL`. 63 | 64 | Exactly `prlimit` function is used by [ulimit](https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html#index-ulimit) util. We can verify this with the help of [strace](https://linux.die.net/man/1/strace) util. 65 | 66 | For example: 67 | 68 | ``` 69 | ~$ strace ulimit -s 2>&1 | grep rl 70 | 71 | prlimit64(0, RLIMIT_NPROC, NULL, {rlim_cur=63727, rlim_max=63727}) = 0 72 | prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=4*1024}) = 0 73 | prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0 74 | ``` 75 | 76 | Here we can see `prlimit64`, but not the `prlimit`. The fact is that we see underlying system call here instead of library call. 77 | 78 | Now let's look at list of available resources: 79 | 80 | | Resource | Description 81 | |-------------------|------------------------------------------------------------------------------------------| 82 | | RLIMIT_CPU | CPU time limit given in seconds | 83 | | RLIMIT_FSIZE | the maximum size of files that a process may create | 84 | | RLIMIT_DATA | the maximum size of the process's data segment | 85 | | RLIMIT_STACK | the maximum size of the process stack in bytes | 86 | | RLIMIT_CORE | the maximum size of a [core](http://man7.org/linux/man-pages/man5/core.5.html) file. 
| 87 | | RLIMIT_RSS | the number of bytes that can be allocated for a process in RAM | 88 | | RLIMIT_NPROC | the maximum number of processes that can be created by a user | 89 | | RLIMIT_NOFILE | the maximum number of a file descriptor that can be opened by a process | 90 | | RLIMIT_MEMLOCK | the maximum number of bytes of memory that may be locked into RAM by [mlock](http://man7.org/linux/man-pages/man2/mlock.2.html).| 91 | | RLIMIT_AS | the maximum size of virtual memory in bytes. | 92 | | RLIMIT_LOCKS | the maximum number [flock](https://linux.die.net/man/1/flock) and locking related [fcntl](http://man7.org/linux/man-pages/man2/fcntl.2.html) calls| 93 | | RLIMIT_SIGPENDING | maximum number of [signals](http://man7.org/linux/man-pages/man7/signal.7.html) that may be queued for a user of the calling process| 94 | | RLIMIT_MSGQUEUE | the number of bytes that can be allocated for [POSIX message queues](http://man7.org/linux/man-pages/man7/mq_overview.7.html) | 95 | | RLIMIT_NICE | the maximum [nice](https://linux.die.net/man/1/nice) value that can be set by a process | 96 | | RLIMIT_RTPRIO | maximum real-time priority value | 97 | | RLIMIT_RTTIME | maximum number of microseconds that a process may be scheduled under real-time scheduling policy without making blocking system call| 98 | 99 | If you're looking into source code of open source projects, you will note that reading or updating of a resource limit is quite widely used operation. 
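As a minimal sketch of such an operation, the helper below reads `RLIMIT_NOFILE` with `getrlimit` and then installs a new soft limit with `setrlimit`, clamping it to the current hard limit. The function name `set_nofile_soft_limit` is hypothetical and exists only for this illustration.

```C
#include <assert.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: set the soft RLIMIT_NOFILE limit to `wanted`,
 * clamped to the current hard limit.
 * Returns 0 on success and -1 on error (errno is set by the libc call).
 */
static int set_nofile_soft_limit(rlim_t wanted)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
		return -1;

	/* a limit read back from the kernel always has soft <= hard */
	assert(lim.rlim_cur <= lim.rlim_max);

	/* the soft limit may never exceed the hard limit */
	if (wanted > lim.rlim_max)
		wanted = lim.rlim_max;

	lim.rlim_cur = wanted;
	return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Only the soft limit is touched here, so the call does not need any privileges: the hard limit read by `getrlimit` is passed back unchanged.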
100 | 101 | For example: [systemd](https://github.com/systemd/systemd/blob/01a45898fce8def67d51332bccc410eb1e8710e7/src/core/main.c) 102 | 103 | ```C 104 | /* Don't limit the coredump size */ 105 | (void) setrlimit(RLIMIT_CORE, &RLIMIT_MAKE_CONST(RLIM_INFINITY)); 106 | ``` 107 | 108 | Or [haproxy](https://github.com/haproxy/haproxy/blob/25f067ccec52f53b0248a05caceb7841a3cb99df/src/haproxy.c): 109 | 110 | ```C 111 | getrlimit(RLIMIT_NOFILE, &limit); 112 | if (limit.rlim_cur < global.maxsock) { 113 | Warning("[%s.main()] FD limit (%d) too low for maxconn=%d/maxsock=%d. Please raise 'ulimit-n' to %d or more to avoid any trouble.\n", 114 | argv[0], (int)limit.rlim_cur, global.maxconn, global.maxsock, global.maxsock); 115 | } 116 | ``` 117 | 118 | We've just saw a little bit about resources limits related stuff in the userspace, now let's look at the same system calls in the Linux kernel. 119 | 120 | Limits on resource in the Linux kernel 121 | -------------------------------------------------------------------------------- 122 | 123 | Both implementation of `getrlimit` system call and `setrlimit` looks similar. Both they execute `do_prlimit` function that is core implementation of the `prlimit` system call and copy from/to given `rlimit` from/to userspace: 124 | 125 | The `getrlimit`: 126 | 127 | ```C 128 | SYSCALL_DEFINE2(getrlimit, unsigned int, resource, struct rlimit __user *, rlim) 129 | { 130 | struct rlimit value; 131 | int ret; 132 | 133 | ret = do_prlimit(current, resource, NULL, &value); 134 | if (!ret) 135 | ret = copy_to_user(rlim, &value, sizeof(*rlim)) ? 
-EFAULT : 0; 136 | 137 | return ret; 138 | } 139 | ``` 140 | 141 | and `setrlimit`: 142 | 143 | ```C 144 | SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim) 145 | { 146 | struct rlimit new_rlim; 147 | 148 | if (copy_from_user(&new_rlim, rlim, sizeof(*rlim))) 149 | return -EFAULT; 150 | return do_prlimit(current, resource, &new_rlim, NULL); 151 | } 152 | ``` 153 | 154 | Implementations of these system calls are defined in the [kernel/sys.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/sys.c) kernel source code file. 155 | 156 | First of all the `do_prlimit` function checks that the given resource is valid: 157 | 158 | ```C 159 | if (resource >= RLIM_NLIMITS) 160 | return -EINVAL; 161 | ``` 162 | 163 | and returns the `-EINVAL` error if it is not. If this check passes and new limits were passed as a non-`NULL` value, the two following checks: 164 | 165 | ```C 166 | if (new_rlim) { 167 | if (new_rlim->rlim_cur > new_rlim->rlim_max) 168 | return -EINVAL; 169 | if (resource == RLIMIT_NOFILE && 170 | new_rlim->rlim_max > sysctl_nr_open) 171 | return -EPERM; 172 | } 173 | ``` 174 | 175 | verify that the given `soft` limit does not exceed the `hard` limit and, when the given resource is the maximum number of file descriptors, that the hard limit is not greater than the `sysctl_nr_open` value. The value of the `sysctl_nr_open` can be found via [procfs](https://en.wikipedia.org/wiki/Procfs): 176 | 177 | ``` 178 | ~$ cat /proc/sys/fs/nr_open 179 | 1048576 180 | ``` 181 | 182 | After all of these checks we lock `tasklist` to be sure that [signal](http://man7.org/linux/man-pages/man7/signal.7.html) handler related things will not be destroyed while we are updating limits for a given resource: 183 | 184 | ```C 185 | read_lock(&tasklist_lock); 186 | ... 187 | ... 188 | ... 189 | read_unlock(&tasklist_lock); 190 | ``` 191 | 192 | We need to do this because the `prlimit` system call allows us to update limits of another task by the given pid.
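Going back to the validation step for a moment: the `rlim_cur > rlim_max` check shown above can be observed from userspace. The sketch below deliberately asks the kernel for a soft limit that is larger than the hard limit and returns the resulting `errno`. The helper name `errno_for_soft_above_hard` is hypothetical, and the choice of `RLIMIT_CORE` is arbitrary.

```C
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: request a soft limit above the hard limit for the
 * calling process and report the errno the kernel answers with.
 * Returns 0 if the request was unexpectedly accepted.
 */
static int errno_for_soft_above_hard(void)
{
	struct rlimit bad = { .rlim_cur = 2, .rlim_max = 1 };

	/* the request is invalid on purpose: soft > hard */
	assert(bad.rlim_cur > bad.rlim_max);

	if (setrlimit(RLIMIT_CORE, &bad) == 0)
		return 0;

	return errno;
}
```

On Linux this is expected to return `EINVAL`, which corresponds to the `-EINVAL` returned by the kernel code above. Since the request is rejected, the limits of the calling process stay unchanged.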
As the task list is locked, we take the `rlimit` instance that is responsible for the given resource limit of the given process: 193 | 194 | ```C 195 | rlim = tsk->signal->rlim + resource; 196 | ``` 197 | 198 | where the `tsk->signal->rlim` is just an array of `struct rlimit`, one element per resource. If the `new_rlim` is not `NULL` we just update its value. If `old_rlim` is not `NULL` we fill it: 199 | 200 | ```C 201 | if (old_rlim) 202 | *old_rlim = *rlim; 203 | ``` 204 | 205 | That's all. 206 | 207 | Conclusion 208 | -------------------------------------------------------------------------------- 209 | 210 | This is the end of the sixth part that describes implementation of the system calls in the Linux kernel. If you have questions or suggestions, ping me on Twitter [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). 211 | 212 | **Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** 213 | 214 | Links 215 | -------------------------------------------------------------------------------- 216 | 217 | * [system calls](https://en.wikipedia.org/wiki/System_call) 218 | * [PID](https://en.wikipedia.org/wiki/Process_identifier) 219 | * [ulimit](https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html#index-ulimit) 220 | * [strace](https://linux.die.net/man/1/strace) 221 | * [POSIX message queues](http://man7.org/linux/man-pages/man7/mq_overview.7.html) 222 | -------------------------------------------------------------------------------- /Theory/README.md: -------------------------------------------------------------------------------- 1 | # Theory 2 | 3 | This chapter describes various theoretical concepts and concepts which are not directly related to practice but useful to know.
4 | 5 | * [Paging](linux-theory-1.md) 6 | * [Elf64 format](linux-theory-2.md) 7 | * [Inline assembly](linux-theory-3.md) 8 | -------------------------------------------------------------------------------- /Theory/images/4_level_paging.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Theory/images/4_level_paging.png -------------------------------------------------------------------------------- /Theory/linux-theory-1.md: -------------------------------------------------------------------------------- 1 | Paging 2 | ================================================================================ 3 | 4 | Introduction 5 | -------------------------------------------------------------------------------- 6 | 7 | In the fifth [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-5) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many other things, before we can see how the kernel runs the first init process. 8 | 9 | Yeah, there will be many different things, but many many and once again many work with **memory**. 10 | 11 | In my view, memory management is one of the most complex parts of the Linux kernel and system programming in general. This is why we need to get acquainted with paging, before we proceed with the kernel initialization stuff. 12 | 13 | `Paging` is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode when physical addresses are calculated by shifting a segment register by four and adding an offset. 
We also saw segmentation in protected mode, where we used the descriptor tables and base addresses from descriptors with offsets to calculate the physical addresses. Now we will see paging in 64-bit mode. 14 | 15 | As the Intel manual says: 16 | 17 | > Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program’s execution environment are mapped into physical memory as needed. 18 | 19 | So... In this post I will try to explain the theory behind paging. Of course it will be closely related to the `x86_64` version of the Linux kernel, but we will not go into too much details (at least in this post). 20 | 21 | Enabling paging 22 | -------------------------------------------------------------------------------- 23 | 24 | There are three paging modes: 25 | 26 | * 32-bit paging; 27 | * PAE paging; 28 | * IA-32e paging. 29 | 30 | We will only explain the last mode here. To enable the `IA-32e paging` paging mode we need to do the following things: 31 | 32 | * set the `CR0.PG` bit; 33 | * set the `CR4.PAE` bit; 34 | * set the `IA32_EFER.LME` bit. 35 | 36 | We already saw where those bits were set in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S): 37 | 38 | ```assembly 39 | movl $(X86_CR0_PG | X86_CR0_PE), %eax 40 | movl %eax, %cr0 41 | ``` 42 | 43 | and 44 | 45 | ```assembly 46 | movl $MSR_EFER, %ecx 47 | rdmsr 48 | btsl $_EFER_LME, %eax 49 | wrmsr 50 | ``` 51 | 52 | Paging structures 53 | -------------------------------------------------------------------------------- 54 | 55 | Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or external storage. This fixed size is `4096` bytes for the `x86_64` Linux kernel. To perform the translation from linear address to physical address, special structures are used. 
Every structure is `4096` bytes and contains `512` entries (this applies only to the `PAE` and `IA-32e` paging modes). Paging structures are hierarchical and the Linux kernel uses 4 levels of paging in the `x86_64` architecture. The CPU uses parts of the linear address to identify an entry in the paging structure at the next lower level, a physical memory region (`page frame`), or a physical address within this region (`page offset`). The address of the top level paging structure is located in the `cr3` register. We have already seen this in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S): 56 | 57 | ```assembly 58 | leal pgtable(%ebx), %eax 59 | movl %eax, %cr3 60 | ``` 61 | 62 | We build the page table structures and put the address of the top-level structure in the `cr3` register. Here `cr3` is used to store the address of the top-level structure, the `PML4` or `Page Global Directory` as it is called in the Linux kernel. `cr3` is a 64-bit register and has the following structure: 63 | 64 | ``` 65 | 63 52 51 32 66 | -------------------------------------------------------------------------------- 67 | | | | 68 | | Reserved MBZ | Address of the top level structure | 69 | | | | 70 | -------------------------------------------------------------------------------- 71 | 31 12 11 5 4 3 2 0 72 | -------------------------------------------------------------------------------- 73 | | | | P | P | | 74 | | Address of the top level structure | Reserved | C | W | Reserved | 75 | | | | D | T | | 76 | -------------------------------------------------------------------------------- 77 | ``` 78 | 79 | These fields have the following meanings: 80 | 81 | * Bits 63:52 - reserved, must be 0; 82 | * Bits 51:12 - store the address of the top level paging structure; 83 | * Bits 11: 5 - reserved, must be 0; 84 | * Bits 4 : 3 - PCD (Page-level Cache Disable) and PWT (Page-level Write-Through).
These bits control the way the page or Page Table is handled by the hardware cache; 85 | * Bits 2 : 0 - ignored; 86 | 87 | The linear address translation works as follows: 88 | 89 | * A given linear address arrives at the [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) instead of the memory bus. 90 | * The 64-bit linear address is split into several parts. Only the low 48 bits are significant, which means that `2^48` bytes or 256 TBytes of linear-address space may be accessed at any given time. 91 | * The `cr3` register stores the address of the top-level (level-4) paging structure. 92 | * `47:39` bits of the given linear address store an index into the paging structure level-4, `38:30` bits store an index into the paging structure level-3, `29:21` bits store an index into the paging structure level-2, `20:12` bits store an index into the paging structure level-1 and `11:0` bits provide the offset into the physical page in bytes. 93 | 94 | Schematically, we can imagine it like this: 95 | 96 | ![4-level paging](images/4_level_paging.png) 97 | 98 | Every access to a linear address is either a supervisor-mode access or a user-mode access. The kind of access is determined by the `CPL` (current privilege level). If `CPL < 3` it is a supervisor-mode access, otherwise it is a user-mode access.
For example, the top level page table entry contains access bits and has the following structure (See [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/pgtable_types.h) for the bit offset definitions): 99 | 100 | ``` 101 | 63 62 52 51 32 102 | -------------------------------------------------------------------------------- 103 | | N | | | 104 | | | Available | Address of the paging structure on lower level | 105 | | X | | | 106 | -------------------------------------------------------------------------------- 107 | 31 12 11 9 8 7 6 5 4 3 2 1 0 108 | -------------------------------------------------------------------------------- 109 | | | | M |I| | P | P |U|W| | 110 | | Address of the paging structure on lower level | AVL | B |G|A| C | W | | | P | 111 | | | | Z |N| | D | T |S|R| | 112 | -------------------------------------------------------------------------------- 113 | ``` 114 | 115 | Where: 116 | 117 | * Bit 63 - N/X bit (No Execute bit), which controls whether code can be executed from the physical pages mapped by the table entry; 118 | * Bits 62:52 - ignored by the CPU, used by system software; 119 | * Bits 51:12 - store the physical address of the lower level paging structure; 120 | * Bits 11: 9 - ignored by the CPU; 121 | * MBZ - must be zero bits; 122 | * IGN - ignored bits; 123 | * A - accessed bit, indicates whether the physical page or paging structure was accessed; 124 | * PWT and PCD - used for cache control; 125 | * U/S - user/supervisor bit, controls user access to all the physical pages mapped by this table entry; 126 | * R/W - read/write bit, controls read/write access to all the physical pages mapped by this table entry; 127 | * P - present bit. It indicates whether the page table or physical page is loaded into primary memory. 128 | 129 | Ok, we know about the paging structures and their entries. Now let's see some details about 4-level paging in the Linux kernel.
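
To make the bit layout above a bit more concrete, here is a minimal C sketch that picks a paging-structure entry apart. Please note that the `PTE_*` names and the helper functions are made up for this illustration (the kernel's own definitions live in `pgtable_types.h` under different names), and that the helpers look at a single entry only, while the effective access rights depend on the entries at every level:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names only; these are not the kernel's macros. */
#define PTE_PRESENT   (1ULL << 0)            /* P   */
#define PTE_RW        (1ULL << 1)            /* R/W */
#define PTE_USER      (1ULL << 2)            /* U/S */
#define PTE_PWT       (1ULL << 3)            /* PWT */
#define PTE_PCD       (1ULL << 4)            /* PCD */
#define PTE_ACCESSED  (1ULL << 5)            /* A   */
#define PTE_NX        (1ULL << 63)           /* N/X */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL  /* bits 51:12 */

/* Physical address of the lower level structure referenced by the entry. */
static inline uint64_t entry_next_table(uint64_t entry)
{
    assert(entry & PTE_PRESENT); /* the address is meaningful only if P is set */
    return entry & PTE_ADDR_MASK;
}

/* True if this entry alone allows user-mode writes (P, R/W and U/S all set). */
static inline bool entry_user_writable(uint64_t entry)
{
    const uint64_t need = PTE_PRESENT | PTE_RW | PTE_USER;
    return (entry & need) == need;
}

/* True if the entry is present and its N/X bit is clear. */
static inline bool entry_executable(uint64_t entry)
{
    return (entry & PTE_PRESENT) && !(entry & PTE_NX);
}
```

The mask `PTE_ADDR_MASK` simply keeps bits `51:12` and throws away both the flag bits at the bottom and the available/N/X bits at the top.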
 130 | 131 | Paging structures in the Linux kernel 132 | -------------------------------------------------------------------------------- 133 | 134 | As we've seen, the Linux kernel in `x86_64` uses 4-level page tables. Their names are: 135 | 136 | * Page Global Directory 137 | * Page Upper Directory 138 | * Page Middle Directory 139 | * Page Table Entry 140 | 141 | After you've compiled and installed the Linux kernel, you can see the `System.map` file which stores the virtual addresses of the functions that are used by the kernel. For example: 142 | 143 | ``` 144 | $ grep "start_kernel" System.map 145 | ffffffff81efe497 T x86_64_start_kernel 146 | ffffffff81efeaa2 T start_kernel 147 | ``` 148 | 149 | We can see `0xffffffff81efe497` here. I doubt you really have that much RAM installed. But anyway, `start_kernel` and `x86_64_start_kernel` will be executed. The address space in `x86_64` is `2^64` bytes wide, but that is too large, which is why a smaller address space is used, only 48 bits wide. So we have a situation where the virtual address space is limited to 48 bits, but addressing is still performed with 64-bit pointers. How is this problem solved? Look at this diagram: 150 | 151 | ``` 152 | 0xffffffffffffffff +-----------+ 153 | | | 154 | | | Kernelspace 155 | | | 156 | 0xffff800000000000 +-----------+ 157 | | | 158 | | | 159 | | hole | 160 | | | 161 | | | 162 | 0x00007fffffffffff +-----------+ 163 | | | 164 | | | Userspace 165 | | | 166 | 0x0000000000000000 +-----------+ 167 | ``` 168 | 169 | The solution is `sign extension`. Here we can see that the lower 48 bits of a virtual address can be used for addressing. Bits `63:48` must be copies of bit `47`, so they are either all zeroes or all ones.
Note that the virtual address space is split into 2 parts: 170 | 171 | * Kernel space 172 | * Userspace 173 | 174 | Userspace occupies the lower part of the virtual address space, from `0x0000000000000000` to `0x00007fffffffffff` and kernel space occupies the highest part from `0xffff800000000000` to `0xffffffffffffffff`. Note that bits `63:47` are 0 for userspace and 1 for kernel space. All addresses in kernel space or userspace, in other words addresses whose upper bits `63:48` are all zeroes or all ones, are called `canonical` addresses. There is a `non-canonical` area between these memory regions. Together these two memory regions (kernel space and user space) are exactly `2^48` bytes wide. We can find the virtual memory map with 4 level page tables in the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt): 175 | 176 | ``` 177 | 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm 178 | hole caused by [48:63] sign extension 179 | ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor 180 | ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory 181 | ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole 182 | ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space 183 | ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole 184 | ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB) 185 | ... unused hole ... 186 | ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB) 187 | ... unused hole ... 188 | ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks 189 | ... unused hole ...
 190 | ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0 191 | ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space 192 | ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls 193 | ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole 194 | ``` 195 | 196 | We can see here the memory map for user space, kernel space and the non-canonical area in between them. The user space memory map is simple. Let's take a closer look at the kernel space. We can see that it starts from the guard hole which is reserved for the hypervisor. The address right after this guard hole is defined in [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/page_64_types.h): 197 | 198 | ```C 199 | #define __PAGE_OFFSET _AC(0xffff880000000000, UL) 200 | ``` 201 | 202 | Previously this guard hole and `__PAGE_OFFSET` were from `0xffff800000000000` to `0xffff87ffffffffff` to prevent access to the non-canonical area, but the hole was later extended by 3 bits for the hypervisor. 203 | 204 | Next is the lowest usable address in kernel space - `ffff880000000000`. This virtual memory region is for direct mapping of all the physical memory. After the memory space which maps all the physical addresses comes another hole. It needs to be between the direct mapping of all the physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the `kasan` shadow memory. It was added by this [commit](https://github.com/torvalds/linux/commit/ef7f0d6a6ca8c9e4b27d78895af86c2fbfaeedb2) and is used by the kernel address sanitizer. After the next unused hole we can see the `esp` fixup stacks (we will talk about them in other parts of this book) and the start of the kernel text mapping from the physical address - `0`.
We can find the definition of this address in the same file as the `__PAGE_OFFSET`: 205 | 206 | ```C 207 | #define __START_KERNEL_map _AC(0xffffffff80000000, UL) 208 | ``` 209 | 210 | Usually the kernel's `.text` starts here with the `CONFIG_PHYSICAL_START` offset. We have seen it in the post about [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md): 211 | 212 | ``` 213 | readelf -s vmlinux | grep ffffffff81000000 214 | 1: ffffffff81000000 0 SECTION LOCAL DEFAULT 1 215 | 65099: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 _text 216 | 90766: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 startup_64 217 | ``` 218 | 219 | Here I checked a `vmlinux` built with `CONFIG_PHYSICAL_START` set to `0x1000000`. So we have the start point of the kernel `.text` - `0xffffffff80000000` and the offset - `0x1000000`, and the resulting virtual address will be `0xffffffff80000000 + 0x1000000 = 0xffffffff81000000`. 220 | 221 | After the kernel `.text` region there is the virtual memory region for kernel modules, `vsyscalls` and an unused hole of 2 megabytes. 222 | 223 | We've seen how the virtual memory map in the kernel is laid out and how a virtual address is translated into a physical one. Let's take the following address as an example: 224 | 225 | ``` 226 | 0xffffffff81000000 227 | ``` 228 | 229 | In binary it will be: 230 | 231 | ``` 232 | 1111111111111111 111111111 111111110 000001000 000000000 000000000000 233 | 63:48 47:39 38:30 29:21 20:12 11:0 234 | ``` 235 | 236 | This virtual address is split into parts as described above: 237 | 238 | * `63:48` - bits not used; 239 | * `47:39` - bits store an index into the paging structure level-4; 240 | * `38:30` - bits store an index into the paging structure level-3; 241 | * `29:21` - bits store an index into the paging structure level-2; 242 | * `20:12` - bits store an index into the paging structure level-1; 243 | * `11:0` - bits provide the offset into the physical page in bytes. 244 | 245 | That is all.
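
The split above can be sketched in a few lines of C. This is only an illustration under the assumptions of this post (4-level paging, `4096`-byte pages, `512` entries per table); the macro and function names are hypothetical and are not taken from the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                        /* 4096-byte pages: bits 11:0 are the offset */
#define INDEX_BITS 9                         /* 512 entries per table: 9 bits per level   */
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

/* Index into the paging structure at `level` (1 is the lowest, 4 is the top). */
static inline unsigned int table_index(uint64_t vaddr, unsigned int level)
{
    assert(level >= 1 && level <= 4);
    return (unsigned int)((vaddr >> (PAGE_SHIFT + INDEX_BITS * (level - 1))) & INDEX_MASK);
}

/* Offset into the physical page, bits 11:0. */
static inline unsigned int page_offset(uint64_t vaddr)
{
    return (unsigned int)(vaddr & ((1u << PAGE_SHIFT) - 1));
}

/* An address is canonical when bits 63:47 are all zeroes or all ones. */
static inline bool is_canonical(uint64_t vaddr)
{
    uint64_t upper = vaddr >> 47;            /* the 17 bits 63:47 */
    return upper == 0 || upper == 0x1ffff;
}
```

For `0xffffffff81000000` these helpers give the indexes `511`, `510`, `8` and `0` for levels 4 to 1 and a page offset of `0`, which matches the binary representation shown above.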
Now you know a little about theory of `paging` and we can go ahead in the kernel source code and see the first initialization steps. 246 | 247 | Conclusion 248 | -------------------------------------------------------------------------------- 249 | 250 | It's the end of this short part about paging theory. Of course this post doesn't cover every detail of paging, but soon we'll see in practice how the Linux kernel builds paging structures and works with them. 251 | 252 | **Please note that English is not my first language and I am really sorry for any inconvenience. If you've found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** 253 | 254 | Links 255 | -------------------------------------------------------------------------------- 256 | 257 | * [Paging on Wikipedia](http://en.wikipedia.org/wiki/Paging) 258 | * [Intel 64 and IA-32 architectures software developer's manual volume 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) 259 | * [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) 260 | * [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md) 261 | * [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt) 262 | * [Last part - Kernel booting process](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-5) 263 | -------------------------------------------------------------------------------- /Theory/linux-theory-2.md: -------------------------------------------------------------------------------- 1 | Executable and Linkable Format 2 | ================================================================================ 3 | 4 | ELF (Executable and Linkable Format) is a standard file format for executable files, object code, shared libraries and core dumps. Linux and many UNIX-like operating systems use this format. 
Let's look at the structure of the ELF-64 Object File Format and some definitions in the Linux kernel source code which are related to it. 5 | 6 | An ELF object file consists of the following parts: 7 | 8 | * ELF header - describes the main characteristics of the object file: type, CPU architecture, the virtual address of the entry point, the size and offset of the remaining parts, etc...; 9 | * Program header table - lists the available segments and their attributes. Loaders need the program header table to place sections of the file in memory as virtual memory segments; 10 | * Section header table - contains the description of the sections. 11 | 12 | Now let's have a closer look at these components. 13 | 14 | **ELF header** 15 | 16 | The ELF header is located at the beginning of the object file. Its main purpose is to locate all other parts of the object file. The file header contains the following fields: 17 | 18 | * ELF identification - array of bytes which helps identify the file as an ELF object file and also provides information about general object file characteristics; 19 | * Object file type - identifies the object file type. This field indicates whether the ELF file is a relocatable object file, an executable file, etc...; 20 | * Target architecture; 21 | * Version of the object file format; 22 | * Virtual address of the program entry point; 23 | * File offset of the program header table; 24 | * File offset of the section header table; 25 | * Size of an ELF header; 26 | * Size of a program header table entry; 27 | * and other fields...
 28 | 29 | You can find the `elf64_hdr` structure, which represents the ELF64 header, in the Linux kernel source code: 30 | 31 | ```C 32 | typedef struct elf64_hdr { 33 | unsigned char e_ident[EI_NIDENT]; 34 | Elf64_Half e_type; 35 | Elf64_Half e_machine; 36 | Elf64_Word e_version; 37 | Elf64_Addr e_entry; 38 | Elf64_Off e_phoff; 39 | Elf64_Off e_shoff; 40 | Elf64_Word e_flags; 41 | Elf64_Half e_ehsize; 42 | Elf64_Half e_phentsize; 43 | Elf64_Half e_phnum; 44 | Elf64_Half e_shentsize; 45 | Elf64_Half e_shnum; 46 | Elf64_Half e_shstrndx; 47 | } Elf64_Ehdr; 48 | ``` 49 | 50 | This structure is defined in the [elf.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/uapi/linux/elf.h#L220) 51 | 52 | **Sections** 53 | 54 | All data is stored in sections in an ELF object file. Sections are identified by their index in the section header table. A section header contains the following fields: 55 | 56 | * Section name; 57 | * Section type; 58 | * Section attributes; 59 | * Virtual address in memory; 60 | * Offset in file; 61 | * Size of section; 62 | * Link to other section; 63 | * Miscellaneous information; 64 | * Address alignment boundary; 65 | * Size of entries, if section has table; 66 | 67 | It is represented by the following `elf64_shdr` structure in the Linux kernel: 68 | 69 | ```C 70 | typedef struct elf64_shdr { 71 | Elf64_Word sh_name; 72 | Elf64_Word sh_type; 73 | Elf64_Xword sh_flags; 74 | Elf64_Addr sh_addr; 75 | Elf64_Off sh_offset; 76 | Elf64_Xword sh_size; 77 | Elf64_Word sh_link; 78 | Elf64_Word sh_info; 79 | Elf64_Xword sh_addralign; 80 | Elf64_Xword sh_entsize; 81 | } Elf64_Shdr; 82 | ``` 83 | 84 | [elf.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/uapi/linux/elf.h#L312) 85 | 86 | **Program header table** 87 | 88 | All sections are grouped into segments in an executable or shared object file. The program header table is an array of structures, each of which describes one segment.
It looks like: 89 | 90 | ```C 91 | typedef struct elf64_phdr { 92 | Elf64_Word p_type; 93 | Elf64_Word p_flags; 94 | Elf64_Off p_offset; 95 | Elf64_Addr p_vaddr; 96 | Elf64_Addr p_paddr; 97 | Elf64_Xword p_filesz; 98 | Elf64_Xword p_memsz; 99 | Elf64_Xword p_align; 100 | } Elf64_Phdr; 101 | ``` 102 | 103 | in the Linux kernel source code. 104 | 105 | `elf64_phdr` is defined in the same [elf.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/uapi/linux/elf.h#L254). 106 | 107 | The ELF object file also contains other fields/structures which you can find in the [Documentation](http://www.uclibc.org/docs/elf-64-gen.pdf). Now let's take a look at the `vmlinux` ELF object. 108 | 109 | vmlinux 110 | -------------------------------------------------------------------------------- 111 | 112 | `vmlinux` is also a relocatable ELF object file. We can take a look at it with the `readelf` utility. First of all let's look at the header: 113 | 114 | ``` 115 | $ readelf -h vmlinux 116 | ELF Header: 117 | Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 118 | Class: ELF64 119 | Data: 2's complement, little endian 120 | Version: 1 (current) 121 | OS/ABI: UNIX - System V 122 | ABI Version: 0 123 | Type: EXEC (Executable file) 124 | Machine: Advanced Micro Devices X86-64 125 | Version: 0x1 126 | Entry point address: 0x1000000 127 | Start of program headers: 64 (bytes into file) 128 | Start of section headers: 381608416 (bytes into file) 129 | Flags: 0x0 130 | Size of this header: 64 (bytes) 131 | Size of program headers: 56 (bytes) 132 | Number of program headers: 5 133 | Size of section headers: 64 (bytes) 134 | Number of section headers: 73 135 | Section header string table index: 70 136 | ``` 137 | 138 | Here we can see that `vmlinux` is a 64-bit executable file.
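
The first bytes of the `Magic` line are already enough to recognize such a file. Here is a small sketch in C that checks them. The helper and constant names are invented for this example; the byte positions and values (class `2` for 64-bit objects, data encoding `1` for little endian) follow the ELF-64 specification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Positions inside the 16-byte identification array (names are illustrative). */
enum { IDENT_SIZE = 16, IDENT_CLASS = 4, IDENT_DATA = 5 };

/* True if the buffer starts with the ELF magic: 0x7f 'E' 'L' 'F'. */
static inline bool is_elf(const uint8_t *ident, size_t len)
{
    assert(ident != NULL);
    return len >= IDENT_SIZE &&
           ident[0] == 0x7f && ident[1] == 'E' &&
           ident[2] == 'L'  && ident[3] == 'F';
}

/* True for a 64-bit (class 2), little-endian (data 1) ELF object. */
static inline bool is_elf64_little_endian(const uint8_t *ident, size_t len)
{
    return is_elf(ident, len) &&
           ident[IDENT_CLASS] == 2 &&
           ident[IDENT_DATA] == 1;
}
```

With the `7f 45 4c 46 02 01 01 ...` bytes from the output above, both checks succeed, which agrees with the `Class: ELF64` and `little endian` lines printed by `readelf`.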
 139 | 140 | We can read from the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt#L21): 141 | 142 | ``` 143 | ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0 144 | ``` 145 | 146 | We can then look this address up in the `vmlinux` ELF object with: 147 | 148 | ``` 149 | $ readelf -s vmlinux | grep ffffffff81000000 150 | 1: ffffffff81000000 0 SECTION LOCAL DEFAULT 1 151 | 65099: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 _text 152 | 90766: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 startup_64 153 | ``` 154 | 155 | Note that the address of the `startup_64` routine is not `ffffffff80000000`, but `ffffffff81000000`, and now I'll explain why. 156 | 157 | We can see the following definition in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/vmlinux.lds.S): 158 | 159 | ``` 160 | . = __START_KERNEL; 161 | ... 162 | ... 163 | .. 164 | /* Text and read-only data */ 165 | .text : AT(ADDR(.text) - LOAD_OFFSET) { 166 | _text = .; 167 | ... 168 | ... 169 | ... 170 | } 171 | ``` 172 | 173 | Where `__START_KERNEL` is: 174 | 175 | ``` 176 | #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) 177 | ``` 178 | 179 | `__START_KERNEL_map` is the value from the documentation - `ffffffff80000000` and `__PHYSICAL_START` is `0x1000000`. That's why the address of `startup_64` is `ffffffff81000000`.
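
The arithmetic above is simple enough to write down as code. A minimal sketch, assuming the two values quoted in this post; the macro and function names here are illustrative and are not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the text; the names are made up for this example. */
#define START_KERNEL_MAP 0xffffffff80000000ULL /* base of the kernel text mapping */
#define PHYSICAL_START   0x1000000ULL          /* physical start of the kernel    */

/* Virtual address in the kernel text mapping for a given physical address. */
static inline uint64_t kernel_text_virt(uint64_t phys)
{
    return START_KERNEL_MAP + phys;
}

/* The reverse direction: physical address behind a kernel text address. */
static inline uint64_t kernel_text_phys(uint64_t virt)
{
    assert(virt >= START_KERNEL_MAP); /* only valid inside the text mapping */
    return virt - START_KERNEL_MAP;
}
```

Calling `kernel_text_virt(PHYSICAL_START)` gives `0xffffffff81000000`, the address `readelf` reported for `_text` and `startup_64`.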
180 | 181 | And at last we can get program headers from `vmlinux` with the following command: 182 | 183 | ``` 184 | readelf -l vmlinux 185 | 186 | Elf file type is EXEC (Executable file) 187 | Entry point 0x1000000 188 | There are 5 program headers, starting at offset 64 189 | 190 | Program Headers: 191 | Type Offset VirtAddr PhysAddr 192 | FileSiz MemSiz Flags Align 193 | LOAD 0x0000000000200000 0xffffffff81000000 0x0000000001000000 194 | 0x0000000000cfd000 0x0000000000cfd000 R E 200000 195 | LOAD 0x0000000001000000 0xffffffff81e00000 0x0000000001e00000 196 | 0x0000000000100000 0x0000000000100000 RW 200000 197 | LOAD 0x0000000001200000 0x0000000000000000 0x0000000001f00000 198 | 0x0000000000014d98 0x0000000000014d98 RW 200000 199 | LOAD 0x0000000001315000 0xffffffff81f15000 0x0000000001f15000 200 | 0x000000000011d000 0x0000000000279000 RWE 200000 201 | NOTE 0x0000000000b17284 0xffffffff81917284 0x0000000001917284 202 | 0x0000000000000024 0x0000000000000024 4 203 | 204 | Section to Segment mapping: 205 | Segment Sections... 206 | 00 .text .notes __ex_table .rodata __bug_table .pci_fixup .builtin_fw 207 | .tracedata __ksymtab __ksymtab_gpl __kcrctab __kcrctab_gpl 208 | __ksymtab_strings __param __modver 209 | 01 .data .vvar 210 | 02 .data..percpu 211 | 03 .init.text .init.data .x86_cpu_dev.init .altinstructions 212 | .altinstr_replacement .iommu_table .apicdrivers .exit.text 213 | .smp_locks .data_nosave .bss .brk 214 | ``` 215 | 216 | Here we can see five segments with sections list. You can find all of these sections in the generated linker script at - `arch/x86/kernel/vmlinux.lds`. 217 | 218 | That's all. 
Of course it's not a full description of ELF (Executable and Linkable Format), but if you want to know more, you can find the documentation - [here](http://www.uclibc.org/docs/elf-64-gen.pdf) 219 | -------------------------------------------------------------------------------- /Timers/README.md: -------------------------------------------------------------------------------- 1 | # Timers and time management 2 | 3 | This chapter describes timers and time management related concepts in the Linux kernel. 4 | 5 | * [Introduction](linux-timers-1.md) - An introduction to the timers in the Linux kernel. 6 | * [Introduction to the clocksource framework](linux-timers-2.md) - Describes `clocksource` framework in the Linux kernel. 7 | * [The tick broadcast framework and dyntick](linux-timers-3.md) - Describes tick broadcast framework and dyntick concept. 8 | * [Introduction to timers](linux-timers-4.md) - Describes timers in the Linux kernel. 9 | * [Introduction to the clockevents framework](linux-timers-5.md) - Describes yet another clock/time management related framework : `clockevents`. 10 | * [x86 related clock sources](linux-timers-6.md) - Describes `x86_64` related clock sources. 11 | * [Time related system calls in the Linux kernel](linux-timers-7.md) - Describes time related system calls. 
12 | -------------------------------------------------------------------------------- /Timers/images/HZ.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Timers/images/HZ.png -------------------------------------------------------------------------------- /Timers/images/base_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/Timers/images/base_small.png -------------------------------------------------------------------------------- /contributors.md: -------------------------------------------------------------------------------- 1 | # Contributors 2 | 3 | Special thanks to all the people who helped to develop this project: 4 | 5 | * [Akash Shende](https://github.com/akash0x53) 6 | * [Jakub Kramarz](https://github.com/jkramarz) 7 | * [ckrooss](https://github.com/ckrooss) 8 | * [ecksun](https://github.com/ecksun) 9 | * [Maciek Makowski](https://github.com/mmakowski) 10 | * [Thomas Marcelis](https://github.com/ThomasMarcelis) 11 | * [Chris Costes](https://github.com/ccostes) 12 | * [nathansoz](https://github.com/nathansoz) 13 | * [RubanDeventhiran](https://github.com/RubanDeventhiran) 14 | * [fuzhli](https://github.com/fuzhli) 15 | * [andars](https://github.com/andars) 16 | * [Alexandru Pana](https://github.com/alexpana) 17 | * [Bogdan Rădulescu](https://github.com/bogdanr) 18 | * [zil](https://github.com/zil) 19 | * [codelitt](https://github.com/codelitt) 20 | * [gulyasm](https://github.com/gulyasm) 21 | * [alx741](https://github.com/alx741) 22 | * [Haddayn](https://github.com/Haddayn) 23 | * [Daniel Campoverde Carrión](https://github.com/alx741) 24 | * [Guillaume Gomez](https://github.com/GuillaumeGomez) 25 | * [Leandro Moreira](https://github.com/leandromoreira) 26 | * [Jonatan 
Pålsson](https://github.com/jonte) 27 | * [George Horrell](https://github.com/georgehorrell) 28 | * [Ciro Santilli](https://github.com/cirosantilli) 29 | * [Kevin Soules](https://github.com/eax64) 30 | * [Fabio Pozzi](https://github.com/fabiopozzi) 31 | * [Kevin Swinton](https://github.com/kevinjswinton) 32 | * [Leandro Moreira](https://github.com/leandromoreira) 33 | * [LYF610400210](https://github.com/LYF610400210) 34 | * [Cam Cope](https://github.com/ccope) 35 | * [Miquel Sabaté Solà](https://github.com/mssola) 36 | * [Michael Aquilina](https://github.com/MichaelAquilina) 37 | * [Gabriel Sullice](https://github.com/gabesullice) 38 | * [Michael Drüing](https://github.com/darkstar) 39 | * [Alexander Polakov](https://github.com/polachok) 40 | * [Anton Davydov](https://github.com/davydovanton) 41 | * [Arpan Kapoor](https://github.com/arpankapoor) 42 | * [Brandon Fosdick](https://github.com/bfoz) 43 | * [Ashleigh Newman-Jones](https://github.com/anewmanjones) 44 | * [Terrell Russell](https://github.com/trel) 45 | * [Mario](https://github.com/bedna-KU) 46 | * [Ewoud Kohl van Wijngaarden](https://github.com/ekohl) 47 | * [Jochen Maes](https://github.com/sejo) 48 | * [Brother-Lal](https://github.com/Brother-Lal) 49 | * [Brian McKenna](https://github.com/puffnfresh) 50 | * [Josh Triplett](https://github.com/joshtriplett) 51 | * [James Flowers](https://github.com/comjf) 52 | * [Alexander Harding](https://github.com/aeharding) 53 | * [Dzmitry Plashchynski](https://github.com/plashchynski) 54 | * [Simarpreet Singh](https://github.com/simar7) 55 | * [umatomba](https://github.com/umatomba) 56 | * [Vaibhav Tulsyan](https://github.com/xennygrimmato) 57 | * [Brandon Wamboldt](https://github.com/brandonwamboldt) 58 | * [Maxime Leboeuf](https://github.com/leboeuf) 59 | * [Maximilien Richer](https://github.com/halfa) 60 | * [marmeladema](https://github.com/marmeladema) 61 | * [Anisse Astier](https://github.com/anisse) 62 | * [TheCodeArtist](https://github.com/TheCodeArtist) 63 | * 
[Ehsun N](https://github.com/imehsunn) 64 | * [Adam Shannon](https://github.com/adamdecaf) 65 | * [Donny Nadolny](https://github.com/dnadolny) 66 | * [Ehsun N](https://github.com/imehsunn) 67 | * [Waqar Ahmed](https://github.com/Waqar144) 68 | * [Ian Miell](https://github.com/ianmiell) 69 | * [DongLiang Mu](https://github.com/mudongliang) 70 | * [Johan Manuel](https://github.com/29jm) 71 | * [Brian Rak](https://github.com/brakthehack) 72 | * [Robin Peiremans](https://github.com/rpeiremans) 73 | * [xiaoqiang zhao](https://github.com/hitmoon) 74 | * [aouelete](https://github.com/aouelete) 75 | * [Dennis Birkholz](https://github.com/dennisbirkholz) 76 | * [Anton Tyurin](https://github.com/noxiouz) 77 | * [Bogdan Kulbida](https://github.com/kulbida) 78 | * [Matt Hudgins](https://github.com/mhudgins) 79 | * [Ruth Grace Wong](https://github.com/ruthgrace) 80 | * [Jeremy Lacomis](https://github.com/jlacomis) 81 | * [Dubyah](https://github.com/Dubyah) 82 | * [Matthieu Tardy](https://github.com/c0riolis) 83 | * [michaelian ennis](https://github.com/mennis) 84 | * [Amitay Stern](https://github.com/amist) 85 | * [Matt Todd](https://github.com/mtodd) 86 | * [Piyush Pangtey](https://github.com/pangteypiyush) 87 | * [Alfred Agrell](https://github.com/Alcaro) 88 | * [Jakub Wilk](https://github.com/jwilk) 89 | * [Justus Adam](https://github.com/JustusAdam) 90 | * [Roy Wellington Ⅳ](https://github.com/thanatos) 91 | * [Jonathan Rennison](https://github.com/JGRennison) 92 | * [Mack Stump](https://github.com/rmbreak) 93 | * [Pushpinder Singh](https://github.com/PrinceDhaliwal) 94 | * [Xiaoqin Hu](https://github.com/huxq) 95 | * [Jeremy Cline](https://github.com/jeremycline) 96 | * [Kavindra Nikhurpa](https://github.com/kavi-nikhurpa) 97 | * [Connor Mullen](https://github.com/mullen3) 98 | * [Alex Gonzalez](https://github.com/alex-gonz) 99 | * [Tim Konick](https://github.com/tijko) 100 | * [Anastas Stoyanovsky](https://github.com/anastasds) 101 | * [Faiz 
Halde](https://github.com/7coder7) 102 | * [Andrew Hayes](https://github.com/AndrewRussellHayes) 103 | * [Matthew Fernandez](https://github.com/Smattr) 104 | * [Yoshihiro YUNOMAE](https://github.com/yunomae) 105 | * [paulch](https://github.com/paulch) 106 | * [Nathan Dautenhahn](https://github.com/ndauten) 107 | * [Sachin Patil](https://github.com/psachin) 108 | * [Stéphan Gorget](https://github.com/phantez) 109 | * [Adrian Reyes](https://github.com/int3rrupt) 110 | * [Chandan Rai](https://github.com/crowchirp) 111 | * [JB Cayrou](https://github.com/jbcayrou) 112 | * [Cornelius Diekmann](https://github.com/diekmann) 113 | * [Andrés Rojas](https://github.com/c0r3dump3d) 114 | * [Beomsu Kim](https://github.com/0xF0D0) 115 | * [Firo Yang](https://github.com/firogh) 116 | * [Edward Hu](https://github.com/BDHU) 117 | * [WarpspeedSCP](https://github.com/WarpspeedSCP) 118 | * [Gabriela Moldovan](https://github.com/gabi-250) 119 | * [kuritonasu](https://github.com/kuritonasu/) 120 | * [Miles Frain](https://github.com/milesfrain) 121 | * [Horace Heaven](https://github.com/horaceheaven) 122 | * [Miha Zidar](https://github.com/zidarsk8) 123 | * [Ivan Kovnatsky](https://github.com/sevenfourk) 124 | * [Takuya Yamamoto](https://github.com/tkyymmt) 125 | * [Dragonly](https://github.com/dragonly) 126 | * [Blameying](https://github.com/Blameying) 127 | * [Junsoo Lee](https://github.com/junsooo) 128 | * [SeongJae Park](https://github.com/sjp38) 129 | * [Stefan20162016](https://github.com/stefan20162016) 130 | * [Marco Torsello](https://github.com/md1512) 131 | * [Bruno Meneguele](https://github.com/bmeneguele) 132 | * [Sebastian Fricke](https://github.com/initBasti) 133 | * [Zhouyi Zhou](https://github.com/zhouzhouyi-hub) 134 | * [Mingzhe Yang](https://github.com/Mutated1994) 135 | * [Yuxin Wu](https://github.com/chaffz) 136 | * [Biao Ding](https://github.com/SmallPond) 137 | * [Arfy slowy](https://github.com/slowy07) 138 | * [Junbo Jiang](https://github.com/junbo42) 139 | * [Dexter 
Plameras](https://github.com/dexterp) 140 | * [Jun Duan](https://github.com/waltforme) 141 | * [Guochao Xie](https://github.com/XieGuochao) 142 | * [Davide Benini](https://github.com/beninidavide/) 143 | -------------------------------------------------------------------------------- /cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0xAX/linux-insides/3504d5315fff9586401590cc2513e23ad9a87b1b/cover.jpg --------------------------------------------------------------------------------