├── .gitbook
│   └── assets
│       └── picture1.png
├── architectural
│   ├── README.md
│   ├── x86-64-0x0f-prefix-two-byte-opcode.md
│   └── intel-sgx-in-linux-under-construction.md
├── hacking-interrupts-exceptions-and-trap-handlers
│   ├── README.md
│   ├── accessing-user-memory-in-trap-handlers.md
│   └── hooking-an-idt-handler.md
├── SUMMARY.md
├── README.md
├── kvm.md
├── kvm
│   ├── README.md
│   ├── inject-an-interrupt.md
│   └── amd-v-and-sev.md
├── target-a-specific-thread.md
├── your-own-syscall.md
├── a-good-environment.md
├── accessing-the-non-exported-in-modules.md
└── get-a-kernel-and-build-it.md

/.gitbook/assets/picture1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NSKernel/Kernel-Play-Guide/HEAD/.gitbook/assets/picture1.png
--------------------------------------------------------------------------------
/architectural/README.md:
--------------------------------------------------------------------------------
# Architectural

The kernel deals directly with hardware, so you can never leave architectural knowledge behind. Your x86-64 assembly knowledge may cover a lot of what is necessary, but some machine-level details are still required.

This chapter covers those miscellaneous things about different architectures that you might want to know.

--------------------------------------------------------------------------------
/hacking-interrupts-exceptions-and-trap-handlers/README.md:
--------------------------------------------------------------------------------
# Hacking Interrupts, Exceptions and Trap Handlers

On x86-64, several things can take a user program into another world - the kernel space. You have interrupts, exceptions and traps. All of these work basically the same way except for some tiny differences.

When any of them happens, the CPU pauses the current task, saves the register state onto the IRQ stack, finds the corresponding handler in the kernel and lets that handler deal with the upcoming things. You can find more on the web.

In this chapter, we will mainly be discussing the shitty and tricky things you might meet when hacking a handler.
--------------------------------------------------------------------------------
/SUMMARY.md:
--------------------------------------------------------------------------------
# Table of contents

* [Introduction](README.md)
* [Get a Kernel and Build It](get-a-kernel-and-build-it.md)
* [Tools And Environment](a-good-environment.md)
* [Your Own Syscall](your-own-syscall.md)
* [Target a Specific Thread](target-a-specific-thread.md)
* [KVM](kvm/README.md)
  * [Inject An Interrupt](kvm/inject-an-interrupt.md)
  * [AMD-V and SEV](kvm/amd-v-and-sev.md)
* [Architectural](architectural/README.md)
  * [x86-64: 0x0F Prefix - Two-Byte Opcode](architectural/x86-64-0x0f-prefix-two-byte-opcode.md)
  * [Intel SGX in Linux - Under Construction](architectural/intel-sgx-in-linux-under-construction.md)
* [Hacking Interrupts, Exceptions and Trap Handlers](hacking-interrupts-exceptions-and-trap-handlers/README.md)
  * [Accessing User Memory in Trap Handlers](hacking-interrupts-exceptions-and-trap-handlers/accessing-user-memory-in-trap-handlers.md)
  * [Hooking an IDT handler](hacking-interrupts-exceptions-and-trap-handlers/hooking-an-idt-handler.md)
* [Accessing the Non-Exported in Modules](accessing-the-non-exported-in-modules.md)

--------------------------------------------------------------------------------
/architectural/x86-64-0x0f-prefix-two-byte-opcode.md:
--------------------------------------------------------------------------------
# x86-64: 0x0F Prefix - Two-Byte Opcode

A single-byte opcode can offer 256 possible instructions, but we have far more instructions and extensions than that now. There's no magic. The way to achieve this is called a Two-Byte Opcode. On x86-64, `0x0F` indicates that the instruction's opcode consists of two bytes.

## `0x0F` "Prefix"

An x86-64 instruction is encoded like this

| Prefix | Opcode | Other Garbage |
| :--- | :--- | :--- |

You might think that `0x0F` is a prefix. But rather than a _prefix_ as defined by x86-64, it's an opcode prefix. So the way it works is that once you read an opcode that starts with `0x0F`, you should read the following two bytes as the opcode.

A two-byte opcode is formatted as

| 0x0F Prefix | Primary Opcode | Secondary Opcode |
| :--- | :--- | :--- |

Take `vmlaunch` from Intel VT as an example: the opcode is `0F 01 C2`, so the encoding is

| 0x0F Prefix | Primary Opcode | Secondary Opcode |
| :--- | :--- | :--- |
| `0F` | `01` | `C2` |

--------------------------------------------------------------------------------
/architectural/intel-sgx-in-linux-under-construction.md:
--------------------------------------------------------------------------------
# Intel SGX in Linux - Under Construction

Intel SGX provides something that is known as an _enclave_. Basically a secret room in which no one else knows what's happening. Its memory is fully encrypted and integrity-checked, and every time it exits, the CPU flushes all buffers and caches. So if you are doing something like computing an SSH key, SGX is a very safe place to do it.

The code inside the enclave works like a shared library, except that it is called not by the `call` instruction but rather by something else.
Of course, the code inside the enclave can access data on the outside and can also call outside libraries to achieve certain functions.

## ECalls and OCalls

Invoking an ordinary function is called a _call_. Similarly, invoking an enclave function is called an ECall, and inside an enclave, invoking an outside function is called an OCall.

## EENTER - Enter an Enclave

In SGX, you call an enclave function using `eenter`. This instruction \(?, more on that later\) will take you to the enclave function you want. Basically you provide two parameters: one is called a TCS \(Thread Control Structure\), the other is called an AEP \(Asynchronous Exit Pointer\).

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Introduction

Read this book on [https://nskernel.gitbook.io/kernel-play-guide/](https://nskernel.gitbook.io/kernel-play-guide/).

## Why I write this book

This book is created because RTFSC is hard and RTFM is simply impossible for Linux due to the lack of true _manuals._ For years, millions of CS students have gone into the Systems and Networking area, and most of them finally decide to go for the network. They are afraid of the Linux kernel. When they face a million-lines-of-code project they don't know what to do. They know it in concept, but when it comes to the exact detail - the detail that is poorly documented - like adding a flag to a process's metadata, things get totally frustrating because they simply don't know where to start. Most of the documentation of the kernel is generated by tools, and if you want to find a book, it is usually too massive to even start reading. Luckily, if you RTFSC, things can be done. Just with some pain in the ass.

As a CS student doing systems research, I totally understand that feeling. I have read the code, so I hope you don't need to do it again. Or at least not from scratch. I hope that once you have the concept and the idea of what you want to do, you can find a way to achieve your goal here.

## About this book

This book is NOT DOCUMENTATION, and is NOT AN INTRODUCTION TO THE LINUX KERNEL. It functions as a manual, a what-to-do, instead of a how-it-works. It covers everything from the most fundamental things, like where to find the kernel source code and how to build a custom kernel, all the way to the structures and flows in the kernel.

Some technologies outside the Linux kernel will be introduced as how-it-works, however; those are written for a better understanding of the what-to-do in the kernel rather than as an introduction.

## Should you read it

GEEK OUT. If you do not know what EXACTLY you want to find in this book then you should probably not read it. If you want to know how Linux works or even just want to learn some OS basics, then you should seek something else. Like Apple says in their documentation of the Darwin kernel \(macOS's kernel\), _there are many good introductory operating systems texts; this is not one of them_.

To read this book I assume you have a decent knowledge of x86-64 assembly, the C language and operating systems, and probably a little bit of compilers as well.

## How will I write this book

I will update this book as I run into things and solve them. It's like a blog, but more organized.

--------------------------------------------------------------------------------
/kvm.md:
--------------------------------------------------------------------------------
# KVM

## Overview

In this chapter, we will discuss KVM on x86-64 systems. KVM is a cool technology that is based on support from both the kernel and the hardware, in our case AMD SVM and Intel VT. KVM is a huge framework and understanding it is not an easy task. Making matters worse, hardware security features like Intel SGX and AMD SEV can add way too many features to the KVM layer, and those can evolve quickly. I'm not an expert in KVM, so this part is more likely to cover what I met while solving my own problems rather than to be a full, comprehensive introduction.

## Check your machine

KVM is not always available on every machine. Since your CPU is either an Intel one or an AMD one \(we are talking about x86-64\), cat `/proc/cpuinfo` and see if `vmx` or `svm` exists in the feature set. Note that KVM needs to run on a bare metal machine. You cannot run a KVM session inside a KVM session.

## KVM in brief

Virtualization has changed dramatically, from an instruction-by-instruction emulation method to today's hardware-accelerated virtualization. Think of a _computer_: most of the instructions it executes are arithmetic or branch-and-condition instructions. These are not dangerous and have no side effects on the host and other VMs. So they thought: instead of slowly emulating every instruction, why don't we just let the VM run directly on the CPU and only kick in the hypervisor when things like I/O happen? And they did.

#### Some jargon

So the first thing is that a VM should get its own address space. Actually, the VM is going to _think_ it's running on its own memory. This requires something called SLAT, or Second Level Address Translation. What it does is pretty much like another layer of page tables. If you have a decent knowledge of OS you may still remember the concept of virtual memory. In virtual memory, every process has its own address space that is mapped to the real physical memory. But you may also recall that half of such an address space is mapped to the same region - the kernel. For a VM, obviously, that is not acceptable. Even if we don't care about the feelings of the VM, it is dangerous to give a VM access to the hypervisor's memory. So they put another layer of translation onto the stack - meet SLAT. Working just like page tables, SLAT creates a whole area of memory available to the VM, except that this time the address space is completely isolated from the hypervisor.

Another important concept is VMExit. Remember that we said not all instructions can be executed without side effects, so some instructions, like `int $0x80`, need to be processed by the hypervisor. It is like exiting the VM and handing the control flow back to the hypervisor, hence the name VMExit. Obviously a VMExit is a heavy action since you need to flush all the caches, unload SLAT, etc. The capitalization of the word VMExit is an Intel convention. On AMD you are doing VMEXIT.

## In this chapter

I will write this chapter in sub-chapters. Right now I am doing the AMD side of things, so you are going to read a lot about that.

--------------------------------------------------------------------------------
/kvm/README.md:
--------------------------------------------------------------------------------
# KVM

## Overview

In this chapter, we will discuss KVM on x86-64 systems. KVM is a cool technology that is based on support from both the kernel and the hardware, in our case AMD-V \(AMD SVM\) and Intel VT. KVM is a huge framework and understanding it is not an easy task. Making matters worse, hardware security features like Intel SGX and AMD SEV can add way too many features to the KVM layer, and those can evolve quickly. I'm not an expert in KVM, so this part is more likely to cover what I met while solving my own problems rather than to be a full, comprehensive introduction.

## Check your machine

KVM is not always available on every machine. Since your CPU is either an Intel one or an AMD one \(we are talking about x86-64\), cat `/proc/cpuinfo` and see if `vmx` or `svm` exists in the feature set \(a one-line version of this check is shown at the end of this chapter\). Note that KVM needs to run on a bare metal machine. You cannot run a KVM session inside a KVM session.

## KVM in brief

Virtualization has changed dramatically, from an instruction-by-instruction emulation method to today's hardware-accelerated virtualization. Think of a _computer_: most of the instructions it executes are arithmetic or branch-and-condition instructions. These are not dangerous and have no side effects on the host and other VMs. So they thought: instead of slowly emulating every instruction, why don't we just let the VM run directly on the CPU and only kick in the hypervisor when things like I/O happen? And they did.

#### Some jargon

So the first thing is that a VM should get its own address space. Actually, the VM is going to _think_ it's running on its own memory. This requires something called SLAT, or Second Level Address Translation. What it does is pretty much like another layer of page tables. If you have a decent knowledge of OS you may still remember the concept of virtual memory. In virtual memory, every process has its own address space that is mapped to the real physical memory. But you may also recall that half of such an address space is mapped to the same region - the kernel. For a VM, obviously, that is not acceptable. Even if we don't care about the feelings of the VM, it is dangerous to give a VM access to the hypervisor's memory. So they put another layer of translation onto the stack - meet SLAT. Working just like page tables, SLAT creates a whole area of memory available to the VM, except that this time the address space is completely isolated from the hypervisor.

Another important concept is VMExit. Remember that we said not all instructions can be executed without side effects, so some instructions, like `int $0x80`, need to be processed by the hypervisor. It is like exiting the VM and handing the control flow back to the hypervisor, hence the name VMExit. Obviously a VMExit is a heavy action since you need to flush all the caches, unload SLAT, etc. The capitalization of the word VMExit is an Intel convention. On AMD you are doing VMEXIT.

## In this chapter

I will write this chapter in sub-chapters. Right now I am doing the AMD side of things, so you are going to read a lot about that.
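As promised above, here is the one-liner for the "Check your machine" step. It only counts the logical CPUs that advertise hardware virtualization; which flag shows up depends on your vendor.

```bash
# Prints the number of logical CPUs advertising hardware virtualization.
# Zero means no Intel VT (vmx) and no AMD-V (svm) - no KVM for you.
grep -E -c 'vmx|svm' /proc/cpuinfo
```
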
--------------------------------------------------------------------------------
/target-a-specific-thread.md:
--------------------------------------------------------------------------------
# Target a Specific Thread

Sometimes you just want to do some special actions for a thread you created, or just want to attach a value to a thread. Pretty obviously, you are going to add a field to the thread's metadata. In this section we will discuss just how to do that.

## `task_struct` - A Thread's Metadata

In the Linux kernel, a thread is also known as a task. A process is a view from resource management, while threads, well, are a view from thread management. Linux stores each thread's metadata inside a struct called `task_struct`. The definition of `task_struct` can be found in `/include/linux/sched.h`. As you can see, the whole struct contains tons of fields that sit inside some `#ifdef CONFIG_XXX_XXX`, so this is also a warning that you should not define yours inside them unless you know what you are doing.

## Your Own Field in a Thread

It's not my job to teach you how to add a field to a struct in the C language. This section only contains tips.

* A few lines after the start of `task_struct`, you will see something like the following. Follow the comment and add your field after this.

```c
	/*
	 * This begins the randomizable portion of task_struct. Only
	 * scheduling-critical items should be added above here.
	 */
	randomized_struct_fields_start
```

* If you are just adding a generic field to the `task_struct`, be careful not to add your field into an area inside a `#ifdef CONFIG_XXX_XXX`.
* Be sure that you are not making the `task_struct` too big.

The third tip is obvious. The first two tips boil down to a simple and brainless rule: always add your fields directly after `randomized_struct_fields_start`.

## Initializing a `task_struct`

Now you have your fields, and you need to be responsible for them. You should already know that except for the very first thread, every thread is born from `fork`. So now things are clear. Two _tasks_: initialize the first thread, and deal with `fork`.

### Initialize the very first thread

The very first task is called `init_task`. You can find it in `/init/init_task.c`. It's pretty much a classic C struct variable initialization. Nothing much to say: add your fields and it's done.

Note that in early versions of the kernel, `init_task` is initialized in `/include/linux/init_task.h` through a macro, `#define INIT_TASK(tsk)`.

### Deal with fork

`fork` clones one thread into two, so the `task_struct` is also copied. The copying of the `task_struct` happens in `copy_process` in `/kernel/fork.c`. `copy_process` is a really huge function that does tons of dirty work for us. Understanding it is not easy, but we can quickly observe that several tens of lines in there are a lot of `p->xxx = xxxx` assignments, which _seems_ to be the place where the `task_struct` copying happens. And indeed it is. Now you just add `p->your_field = your_value` and it's done. Do remember to be careful with `#ifdef CONFIG_XXX_XXX`.

If you want to inherit the field value from the parent task\_struct, you can do

```text
p->your_field = current->your_field;
```

For the meaning of `current`, read the following section.
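To tie the pieces above together, here is a minimal sketch of the three edits, using a purely hypothetical field name `my_flag` \(the `current` macro it uses is explained in the next section\). These are fragments of the files named in the comments, not a standalone patch.

```c
/* /include/linux/sched.h - inside struct task_struct,
 * right after randomized_struct_fields_start */
	unsigned long my_flag;		/* hypothetical example field */

/* /init/init_task.c - inside the init_task initializer */
	.my_flag	= 0,

/* /kernel/fork.c - inside copy_process(), next to the other p->... copies */
	p->my_flag = current->my_flag;
```
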
## `current` - Current `task_struct`

A lot of the time you want to access the `task_struct` of the current thread. Linux provides you with the `current` identifier. To use it, include `linux/sched.h`. `current` is defined per-CPU, so on each CPU it resolves to the `task_struct` of the thread currently running there.

--------------------------------------------------------------------------------
/your-own-syscall.md:
--------------------------------------------------------------------------------
# Your Own Syscall

As the only software interface that bridges the kernel and user space, syscalls work like kernel APIs, except that most of them are fixed and rarely change. Every now and then you might find yourself wanting some new function provided by the kernel. And that's when you need to create your own syscall.

In this chapter we will discuss how to define a syscall, and then how to register it for the architecture you want.

## Wait before you read!

In this chapter we are talking about x86-64 syscalls. You might be doing your work on another platform, but that's OK. Most of the ideas are the same except for the way you declare them in the architecture-specific syscall tables.

## How syscalls work on x86-64

You may skip to the next section if you know this already. The following is based on an x86-64 machine.

Unlike its x86 \(32 bit\) counterpart, there's an instruction `syscall` for syscalls instead of using `int $0x80`. You set all the parameters into the registers \(`rdi, rsi, rdx, r10, r8, r9` - note that `r10` takes the place of `rcx`, since `syscall` clobbers `rcx`\), set the syscall number \(`rax`\) and execute the instruction `syscall`. The `syscall` instruction works much like an `int $0x80`: it takes you into the kernel, where a handler does the rest of the job.

## Defining a syscall

Articles on defining a syscall are quite a mess on the Internet. Some of them work on an older version of the kernel but suddenly stop working on a new one. Obviously there is one and only one correct way to do it.

Defining a syscall is done by using the macro `SYSCALL_DEFINEx` provided by `<linux/syscalls.h>`, where `x` is the number of parameters of your syscall. For example, say you are creating an adder in your kernel; the syscall would be

```c
SYSCALL_DEFINE2(adder, int, x, int, y) {
    return x + y;
}
```

Just like a C function, right? Because it is. If you know about those _wrong_ examples of creating a syscall, they usually look like:

```c
// WRONG! DO NOT DO THIS!
asmlinkage long sys_adder(int x, int y) {
    return x + y;
}
```

The above code shows a way of defining a syscall that once worked. It tells us that a syscall is merely a `long`-typed function. As for why this is wrong now, well, it is a calling-convention problem. Read `SYSCALL_DEFINEx` in `/arch/x86/include/asm/syscall_wrapper.h` to learn more.

As a platform limit, remember that you cannot have more than 6 parameters.

If you are not doing something large, just create a C file in `/kernel`, write your syscall, and add the C file to the `obj-y`. This is a safe place to put it.

## Register your syscall

### Register your syscall into the kernel

Just like when you write a function and have to add it to a header, your syscall needs to be added to the kernel's header. The header is located at `/include/linux/syscalls.h`.
Find the syscall `sys_ni_syscall` and add your syscall declaration after it \(before the `#endif`; try to find out why by reading the code\). This time, however, you add your declaration as

```c
asmlinkage long sys_adder(int x, int y);
```

If your syscall has no parameters, do remember to add `void` between the parentheses, like

```c
asmlinkage long sys_no_parameter(void);
```

### Register your syscall into the architecture

Here on x86-64, the syscall table is located at `/arch/x86/entry/syscalls/syscall_64.tbl`. Open it, find the last syscall before the x32 section and add yours after it. Remember to add the `__x64_sys_` prefix to your syscall name. For the second column you generally choose `common`.

You are all set! Compile your kernel and your syscall is ready to use!

--------------------------------------------------------------------------------
/a-good-environment.md:
--------------------------------------------------------------------------------
# Tools And Environment

Now you have your kernel, but you might still not know where to start. So before you start your adventure into the kernel, you might want some appropriate gear.

## A Cross Referencer

It's never fun to play with a project as large as Linux without knowing who's who. Knowing where each identifier is used and defined is really important. Such a tool is called a cross referencer. Tons of cross referencers are available and any one of them works. But to get on board quicker, without hours of code analysis, an online service might be preferred. Luckily enough, such a service exists.

My choice is [https://elixir.bootlin.com](https://elixir.bootlin.com). You can select the kernel version on the left, view each file and click identifiers on the page.

You may have your own favourite. As long as it works, it's good.

## QEMU

A pro tip from every software tester is that you should never put software under test onto a machine you care about. Except for some hardware features you might want to test with, most kernel experiments can be done in a virtual machine. The best VM for us, packed with features, is QEMU.

Unlike commercial VMs that are optimized for cloud environments and are designed to boot commercial OSes, QEMU provides support for switching kernels and loading up your own image. These are really useful when having fun with the kernel.

## Debootstrap

A kernel is never an OS until you have a minimum boot environment. In the past, kernel researchers loved Busybox, a really tiny set of UNIX tools in one binary. But everyone using it knows that it's a piece of masterwork in terms of _small_, but not much else. Meet Debootstrap. Debootstrap is a tool provided by Debian to create a minimal Debian environment. You will get GNU Coreutils, you will get the APT package manager, and best of all, `systemd` \(Yes, that's not a given. No `systemd` on Busybox.\). A good example of using Debootstrap is provided by Google Syzkaller. See the Image section of [https://github.com/google/syzkaller/blob/master/docs/linux/setup\_ubuntu-host\_qemu-vm\_x86-64-kernel.md](https://github.com/google/syzkaller/blob/master/docs/linux/setup_ubuntu-host_qemu-vm_x86-64-kernel.md). They provide a script; just follow it and you're done.
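If you want to see roughly what such a script does \(or roll your own\), a minimal sketch looks something like the following. The image name, size and extra packages here are arbitrary choices, not what the Syzkaller script uses.

```bash
# Create and format an empty 2 GiB disk image (size is arbitrary)
dd if=/dev/zero of=stretch.img bs=1M count=2048
mkfs.ext4 -F stretch.img

# Bootstrap a minimal Debian Stretch into a directory, then copy it into the image
mkdir -p chroot mnt
sudo debootstrap --include=openssh-server,vim stretch chroot http://deb.debian.org/debian
sudo mount -o loop stretch.img mnt
sudo cp -a chroot/. mnt/
sudo umount mnt
```

You will still want to chroot in and set a root password \(and maybe an fstab\) before the first boot - housekeeping that the Syzkaller script already takes care of for you.
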
## A First Boot

Now you have your kernel and your boot environment; time to boot your system for the very first time.

```bash
sudo qemu-system-x86_64 \
  -kernel your/build/dir/arch/x86/boot/bzImage \
  -append "root=/dev/sda debug earlyprintk=serial slub_debug=QUZ" \
  -hda your/debootstrap/dir/stretch.img \
  -net user,hostfwd=tcp::10021-:22 -net nic \
  -enable-kvm \
  -m 2G \
  -smp 2
```

On line 2 you give your kernel's path. Line 3 is the arguments for the kernel, so it only works with the `-kernel` option. If you packed your kernel into the image then you have to pass the arguments inside the image. That's why packing the kernel into the image is not recommended. Line 4 is the boot image. Line 5 gives the VM a network card. Line 6 tells QEMU to run under KVM so it is accelerated. Line 7 means the VM's memory is 2GB. Line 8 means you are offering 2 cores to the VM.

Now a QEMU window will pop up and you are good to go.

However, there are times when a GUI is not preferred or simply not available, like in an SSH session. The following script has you covered.

```bash
sudo qemu-system-x86_64 \
  -kernel your/build/dir/arch/x86/boot/bzImage \
  -append "console=ttyS0 root=/dev/sda debug earlyprintk=serial slub_debug=QUZ" \
  -hda your/debootstrap/dir/stretch.img \
  -net user,hostfwd=tcp::10021-:22 -net nic \
  -enable-kvm \
  -nographic \
  -m 2G \
  -smp 2
```

Line 3 adds `console=ttyS0` to forward the console output to `ttyS0`. Line 7's `-nographic` tells QEMU to forward `ttyS0` to the terminal you are using. Now have fun in your terminal.

--------------------------------------------------------------------------------
/kvm/inject-an-interrupt.md:
--------------------------------------------------------------------------------
# Inject An Interrupt

It's always easy for a VM to tell the hypervisor something. You just use `cpuid` and this will cause an easy-to-handle VMEXIT. The weird thing is that asking a VM to handle something can be a real pain in the ass.

On a real computer, this is done with an interrupt. KVM provides another thing called a request, which asks the virtualization software, like QEMU, to trigger a VMEXIT. But that way the VMEXIT is most likely to be handled by the virtualization software instead of something more controllable inside the VM. So we are back to the good old fashioned interrupt.

## Your Quick Start Guide on x86 \(or x86-64\) Interrupts

_Everyone_ knows that interrupts, along with traps and exceptions, are what we call vectored events, meaning that you register handlers in an array and when the event happens, the CPU jumps to the corresponding handler.

The most familiar interrupt to us is the `syscall` interrupt, number `0x80`. And you may have heard that x86 can support at least 256 interrupts. Well, it turns out those interrupts are software interrupts, which are generated by executing an instruction \(`int`\). The _real_ interrupts, I mean the hardware ones, come in such a small number that you can count them on your fingers. Since we are talking about KVM, all you can use are those. And actually, even fewer.

On an x86 system, hardware interrupts are defined by how many pins the programmable interrupt controller \(aka PIC\) has.
That PIC is the Intel 8259, a real chip that once upon a time was not part of the actual CPU. That PIC tells you that you have 16 hardware interrupts. And that's all you have. To make matters even worse, most of these interrupts have a designated usage. Only interrupts 9, 10 and 11 are free for "peripherals, SCSI and NIC". It's most likely that when you are asking for something as a hypervisor, you are asking as a _peripheral_, so you have to check and pick one of these 3 interrupts.

## KVM Interrupt Injection, the Legacy One

**In modern OSes, you are not gonna use the method in this section.**

When KVM was not quite a mature thing, they designed a terrible way to inject an interrupt. Interrupts are async, meaning they can happen at any moment. But since the host kernel did not necessarily emulate a PIC, those interrupts must be injected during what you would call an interrupt injection window. That is, when an interrupt happens, the hypervisor queues the interrupt. Later, when a VMEXIT happens for whatever reason, that gives the hypervisor a window to inject those queued interrupts into the VM.

If you are dealing with one of those kinds of situations, you are gonna queue your interrupt by doing

```c
kvm_queue_interrupt(vcpu, interrupt_number, false);
```

Once you've done this, the interrupt is queued for that vCPU and will be injected on the next injection window. You may also do

```c
kvm_x86_ops->set_irq(vcpu);
```

or

```c
kvm_make_request(KVM_REQ_EVENT, vcpu);
```

which forces a VMEXIT to make sure the interrupt is injected immediately.

## The Modern Way of Interrupt Injection

So later people figured out that hey, we can emulate a PIC inside the kernel, so why use such a shitty method? Tons of code still exists to handle the legacy way, but the modern one is quicker and cooler. **And the cherry on top is, if your OS is using the modern one and you try to inject using the old way, your OS will crash**. SHITTY.

How do you tell if your friendly little guest is using the modern way?

```c
if (pic_in_kernel(vcpu->kvm) || lapic_in_kernel(vcpu)) {
    // A modern guest
}
```

LAPIC stands for Local Advanced Programmable Interrupt Controller. In most cases, the guest you are dealing with is using the modern one. So now the problem: how are we gonna inject?

```c
kvm_set_irq(vcpu->kvm, 0, interrupt_number, 1, false);
kvm_set_irq(vcpu->kvm, 0, interrupt_number, 0, false);
```

Why two lines? You are simulating a real voltage level on the _pins_ of the PIC, so it goes to 1 and then back to 0.

## Check Well Before You Inject

As mentioned before, injecting with the legacy method into a modern machine will crash it. But you should also check whether the machine you are injecting into has its interrupts enabled. When falling into an interrupt handler or just masking out some interrupts, a guest could simply be refusing the interrupt. If you force an interrupt, you may crash that machine.
To avoid this, check with

```c
if (kvm_x86_ops->interrupt_allowed(vcpu)) {
    // You may inject here
}
```

--------------------------------------------------------------------------------
/hacking-interrupts-exceptions-and-trap-handlers/accessing-user-memory-in-trap-handlers.md:
--------------------------------------------------------------------------------
# Accessing User Memory in Trap Handlers

The parameters of a trap handler provide valuable information: the register state. The handling of nearly half of the available traps can benefit from knowing the very last instruction that was executed by the CPU. Nearly all of these traps can benefit from reading user space memory.

However, reading user space memory from a trap handler is rather tricky. There is no explicit context switch when entering a trap, but somehow you just cannot directly dereference a user space pointer \(some older kernels could, but not the newer ones; more on the details later\). But, just take the example of reading the instruction: you now have the `rip` register, but you cannot `*rip`. What should you do?

## A Weird Way to Read and Write

The shittiest way to do such a job is to manually do a page table walk. Obviously we will never do this. The kernel provides us with a set of very fun but weird functions: `get_user`, `put_user`, `copy_from_user`, `copy_to_user`. The detailed usage is left for you to search the man pages, but let's talk about how and why these functions work.

I call these functions weird because of their definitions. They are actually macros instead of regular C functions. Let's see how they work, shall we? Take `get_user` as an example: suppose you have a variable `short hello` and a pointer into user space, `short *user_ptr`, and you want to do what would regularly be

```text
hello = *user_ptr;
```

Now you should do

```text
get_user(hello, user_ptr);
```

It is worth noting that `hello` and `user_ptr` need to be of a simple C type, like `short, int, long`. Looking at this, something bizarre comes up in our \(or just my\) head: where is the pid? You never specify which process to copy from.

If you have a good knowledge of kernel thread management or you have read the [Target a Specific Thread](https://nskernel.gitbook.io/kernel-play-guide/target-a-specific-thread) chapter, the `current` macro should be familiar to you. Basically, as we said, the context is not switched, so `current` remains the same. That means the page tables are still there... ? Yes indeed. So `get_user`, as well as the others, simply targets the `current` thread.

## Why Won't Dereferencing Directly Work?

As curious or frustrated as you might be: why won't a simple dereference work? Generally you get the following error:

```text
[   24.114923] BUG: unable to handle kernel paging request at 0xADDRESS
```

Actually it's rather tricky. Now RTFSC gives us some hints. In `/arch/x86/include/asm/uaccess.h` you can find `get_user`. `get_user` does some simple checks, and then you will find the true worker function: `__get_user_%P4`, which is one of `__get_user_1,2,4,8` depending on the size you request. In `/arch/x86/lib/getuser.S`, where these functions are actually defined, the secret is no more.
Let's take the example of `__get_user_1`; the code is

```text
ENTRY(__get_user_1)
	mov PER_CPU_VAR(current_task), %_ASM_DX
	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
	jae bad_get_user
	sbb %_ASM_DX, %_ASM_DX		/* array_index_mask_nospec() */
	and %_ASM_DX, %_ASM_AX
	ASM_STAC
1:	movzbl (%_ASM_AX),%edx
	xor %eax,%eax
	ASM_CLAC
	ret
ENDPROC(__get_user_1)
EXPORT_SYMBOL(__get_user_1)
```

Do you see the `ASM_STAC` and `ASM_CLAC` that wrap the `mov` instruction? A quick search tells us that these are related to a feature called SMAP \(Supervisor Mode Access Prevention\). Basically the processor developers believe that allowing unrestricted kernel access to user space is fucking dangerous, so we need a fucking feature to prevent that. When SMAP is enabled, the kernel will get a page fault if it tries to access user memory. But we all know that the kernel HAS to access user memory, so they added two instructions to temporarily disable SMAP and to re-enable it. If you'd prefer to disable it yourself instead of using the kernel-provided functions, that would be absolutely fine as long as you don't break things. So the basic idea behind it is still some sort of [poka-yoke](https://en.wikipedia.org/wiki/Poka-yoke) thing. I personally dislike the idea of having this kind of thing inside the kernel since, in the kernel, if you are breaking it, you will break it. Like a Chinese network slang goes, “防呆不防傻,大力出奇迹” \(It's poka-yoke, not stupid-yoke; if you go brute force you are still gonna break it\).

## Risks?

You must take great care with these functions since they can sleep. If for any reason you turned off interrupts and then called these functions, you might poop in your pants. They do not necessarily sleep, though. Looking at the code, there's no sign that they could sleep at all. But when a page fault happens, the OS might put the faulting thread to sleep so it can load the page from disk. Now you have a dead system.

--------------------------------------------------------------------------------
/accessing-the-non-exported-in-modules.md:
--------------------------------------------------------------------------------
# Accessing the Non-Exported in Modules

The developers of Linux have "carefully" chosen the useful and non-shitty functions and marked them as exported for you to use in your module. That is a bizarre choice, since a kernel module is already a piece of kernel code, and even if you don't export those functions, one can still access them through their addresses. Personally I dislike this, and luckily enough we somehow have a way to get around it.

In the good old days, we had a mechanism called `kallsyms`. Basically it's a subsystem that can tell you the address of a given symbol name. However, since Linux 5.7.7, a shitty change was made to the kernel such that even `kallsyms` is not exported. This change was made by a kernel maintainer called Will Deacon and you can learn more at [https://lore.kernel.org/lkml/20200221114404.14641-1-will@kernel.org/](https://lore.kernel.org/lkml/20200221114404.14641-1-will@kernel.org/).

So if you are writing a kernel module for versions before 5.7.7, you don't have to read the last subsection. But first of all, let's talk about how to use `kallsyms`.

## Not All Kernels Are Equal

To use kallsyms, you will have to make sure your kernel supports it. This is enabled by turning on `CONFIG_KALLSYMS=y` and `CONFIG_KALLSYMS_ALL=y`. The first config allows you to find any function inside the kernel, while the second one allows you to also find variables.

## Find Your Address

To find a symbol, you first define a pointer prototype for it. Here we take a function as an example. Suppose the function inside the kernel is defined as

```c
int a_hidden_function(int a, int b);
```

What you need to define in your module is something like

```c
int (*a_pointer_to_the_function)(int, int);
```

So you are calling it with something like `(*a_pointer_to_the_function)(1, 2)`. Now the only thing left is to fill that pointer. To get it, you are gonna use `kallsyms_lookup_name`. So you are going to do

```c
a_pointer_to_the_function = (void *)kallsyms_lookup_name("a_hidden_function");
```

After that you are able to call the function / use that variable through the pointer.

## A Hack You Might Need

Everything works just fine in C, except that you might need to use a non-exported symbol in ASM. This is not quite related to finding symbols, but more of a question: how do you call a function through a pointer to it in memory? Ordinarily you would load such a pointer into a register and do `call` or `jmp`. However, in some cases you have absolutely no idea which register does what, and it's not in your control to restore the modified register after the `call`.

To solve it, we are going to do some dirty work. That is, push the address onto the stack, restore all registers and do `ret`. If you work with ASM you should already know what I want to do. So if the original code is:

```c
call a_hidden_function
```

Then you are going to do

```text
ENTRY(a_hidden_function_wrapper)
    subq $16, %rsp                          /* reserve 2 qword slots on the stack */
    movq %rax, (%rsp)                       /* (%rsp) for %rax */
    movq a_pointer_to_the_function, %rax    /* move pointer to register */
    movq %rax, 8(%rsp)                      /* 8(%rsp) for the return address */
    popq %rax
    ret                                     /* goto a_hidden_function */
END(a_hidden_function_wrapper)

...

call a_hidden_function_wrapper
```

So basically you define a wrapper that saves `%rax` and moves the pointer onto the stack. Then you restore `%rax` and do `ret`, which jumps to the hidden function.

## Paperwork - For Using Before 5.7.7

kallsyms is a dirty beauty. Since we demand more, we use it; in return, it demands more from us. kallsyms is exported as a GPL symbol, so if your module is not GPL-licensed then you are not able to use it. You need

```c
MODULE_LICENSE("GPL");
```

to use the symbol. But in most cases I hardly think anyone cares about this. You just use it and the world is saved.

## For Using After 5.7.7

Like we said before, after 5.7.7, `kallsyms_lookup_name` is not exported anymore. So you don't have to care about whether it is exported as GPL or not, but the one question is: how do you get around this? This is going to be really hacky, and you are about to use `kprobes`.

`kprobes` is a debugging subsystem in the kernel that helps developers probe functions so they can understand things like performance bottlenecks. And yes, you guessed it, we are going to abuse it.
You see, if you try to probe a function, kprobes will first gather information about that function - including its address. Except for some functions that are explicitly blacklisted from probing for security reasons, all other functions can be probed. And of course `kallsyms_lookup_name` is one of those functions that can be probed.

To use kprobes, remember to have `CONFIG_KPROBES=y` in your configuration and include `<linux/kprobes.h>`. You will want some code like:

```c
unsigned long lookup_kallsyms_lookup_name() {
    struct kprobe kp;
    unsigned long addr;

    memset(&kp, 0, sizeof(struct kprobe));
    kp.symbol_name = "kallsyms_lookup_name";
    if (register_kprobe(&kp) < 0) {
        return 0;
    }
    addr = (unsigned long)kp.addr;
    unregister_kprobe(&kp);
    return addr;
}
```

One might ask: if kprobes can probe most functions, why don't we just replace kallsyms with kprobes? The reason is that, first of all, kprobes has a blacklist while kallsyms doesn't. Also, kallsyms can get variable addresses but kprobes can't.

Funnily enough, if you go and read how kprobes gets the address of a function, it eventually calls `kallsyms_lookup_name`. What a shame.

--------------------------------------------------------------------------------
/kvm/amd-v-and-sev.md:
--------------------------------------------------------------------------------
# AMD-V and SEV

AMD-V is not something new. In the past it was called AMD SVM \(Secure Virtual Machine\). For whatever reason, AMD renamed it to AMD-V. It was not secure in the first place anyway, so the name change is not a big deal. For convention reasons, we still call it SVM inside the kernel, just like we call x86-64 AMD64.

In the age of the cloud, VM security is growing in importance. It is always cool to hide things, and SEV has you covered. AMD SEV is a technology that helps you encrypt your VM. It does not trust the hypervisor, but it could still make you poop in your pants. More on that later.

Most of the code implementing AMD-V and SEV can be found under `/arch/x86/kvm/svm.c`. In Linux 4.20 it's a 7209-line file and reading it can be a pain in the ass.

In this sub-chapter we will discuss how you can mess around with a VM and a hypervisor on an AMD platform.

## Guest OS and Host OS Support

To use AMD-V, your host OS needs to support it. Luckily, unless you are using an ancient kernel, you get that support out of the box. There are no requirements for the guest OS.

AMD SEV, however, comes with a few more requirements. First of all, AMD SEV is quite a new technology, so a much more recent kernel is needed. If you are using 4.20 or later you are probably fine. The second thing is that both guest and host need to support SEV to get things working.

## VMEXIT

VMEXIT, as we talked about in the KVM chapter, is probably the only way to hand the control flow from a VM over to the hypervisor. If you want to have some fun with KVM, you will most likely want to know what happens during and after it.

VMEXIT is handled by `static int nested_svm_vmexit(struct vcpu_svm *svm)`. Looking at this function you can see that it copies and saves the state of the CPU into the VMCB, then deals with exceptions and interrupts, unmaps the memory, and it's done. Pretty simple, right?
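If you just want to watch what a guest is exiting for while you poke around, a tiny bit of instrumentation in the exit path is enough. The sketch below assumes a 4.20-era tree where `struct vcpu_svm` has a `vmcb` pointer and the control area \(described in the next section\) carries `exit_code`, `exit_info_1` and `exit_info_2`; where exactly you call it from inside `svm.c` is up to you.

```c
/* Hypothetical debugging helper dropped into /arch/x86/kvm/svm.c.
 * It only reads the VMCB control area and logs why the guest exited. */
static void dump_exit_reason(struct vcpu_svm *svm)
{
	struct vmcb_control_area *control = &svm->vmcb->control;

	pr_info("svm: exit_code=%#x info1=%#llx info2=%#llx\n",
		control->exit_code,
		control->exit_info_1,
		control->exit_info_2);
}
```
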
## VMCB

Now, for the saved state of the CPU, you are talking about the VMCB, or Virtual Machine Control Block. VMCB is an AMD term. In team blue it's called the VMCS, or Virtual Machine Control Structure. They are very similar and are both maintained by the CPU. When a VMEXIT is triggered, the CPU saves the data into the VMCB in a predefined format.

![VMCB Layout](../.gitbook/assets/picture1.png)

A VMCB looks like this. You can find the definition of the VMCB in `/arch/x86/include/asm/svm.h`.

```c
struct __attribute__ ((__packed__)) vmcb {
	struct vmcb_control_area control;
	struct vmcb_save_area save;
};
```

The VMCB is divided into two areas, the control area and the save area. The control area is where information for the hypervisor is saved. For example, the interrupt number will be in it. The save area, on the other hand, holds the registers and some other stuff related to the VM's state.

Combining this with the code from VMEXIT, you can actually see how the VMCB works. Each CPU has its structure with a VMCB. When a VMEXIT triggers, `nested_svm_vmexit` saves the **real** VMCB of the CPU to a place in memory and unloads that VM.

## AMD SEV

AMD SEV \(Secure Encrypted Virtualization\) is a technology that encrypts the memory of the VM so that the hypervisor is not able to access the information in it. It is based on the belief that the hypervisor could be malicious and thus should not be trusted.

The way SEV works is based on the nature of paged memory management. You see, the VM is also managing its memory using pages, so the pages the VM uses are actually mapped to real physical pages in memory by the hypervisor. If we add an encryption engine that encrypts each page on the CPU side, and drop the key on VMEXIT, then the hypervisor is never able to know the content of a page. It can still manage the pages as always, just without the ability to know what's inside.

So now the question is: who manages the keys of each VM? It may sound ridiculous, but the CPU does. During the SLAT process, each VM gets its own ASID \(Address Space ID\), which is managed by the hypervisor. So when the hypervisor does VMRUN, the corresponding ASID is loaded into the CPU, and the CPU is able to use the ASID as the index to find the corresponding key. The hypervisor is never able to know what exactly the key is. On the guest OS side, since the encryption is done by the CPU, nothing much is a concern either.

OK, the CPU manages the key, the hypervisor manages the memory, the guest OS works happily, smooth sailing, right? Not really.

### AMD SEV-ES

Suddenly a new word here. SEV-ES means Secure Encrypted Virtualization - Encrypted State. Still remember the VMCB we talked about before? Each VMEXIT saves the registers in the save area of the VMCB. Indeed, in AMD SEV the memory is encrypted, but NOT the VMCB. Think of it: if the hypervisor unloads all the pages of the VM onto the disk and causes a page fault \(which will lead to a VMEXIT\) on every memory access, then every move of the VM could be exposed to the hypervisor through the leaked registers.

SEV-ES was introduced to solve the problem. Now the save area of the VMCB is also encrypted. Done.

... or is it?

### GHCB

Think about it. The guest is now doing a `cpuid`.
Obviously, as a hypervisor, to really implement that `cpuid` you have to know `eax`, and you have to change other registers. Now they are encrypted; you are screwed. Don't worry, AMD has you covered - introducing the GHCB, the Guest Hypervisor Communication Block.

The GHCB is an area of unencrypted memory after the save area that serves as a somewhat trusted area between the guest OS and the hypervisor. If you, as a guest OS, intend to tell the hypervisor something, put it here. If you, as a hypervisor, change some registers, put them here. Done.

## Is AMD SEV-ES Safe After All

The answer is ... in one way or another, yes. In most cases the hypervisor is never going to be able to steal a hair from the VM. But if the VM is running special software like an HTTP server, things could be done in a very unexpected way. See [https://arxiv.org/pdf/1805.09604.pdf](https://arxiv.org/pdf/1805.09604.pdf) for a real life case. AMD is implementing a next generation of SEV with integrity checking; hopefully that will be secure enough.

--------------------------------------------------------------------------------
/get-a-kernel-and-build-it.md:
--------------------------------------------------------------------------------
# Get a Kernel and Build It

Linux is open source software, so everything starts from the source code. In this chapter, we will discuss where you can find the source code of Linux and how to build your own kernel.

## Get the source code

For most of us, cloning the whole tree is quite stupid since it will cover decades of code changes that we don't care about. Indeed, you may clone a copy using git and specify the depth, but the code you get might contain some Linus-Will-Say-Fuck pieces. What I recommend is to download the tarball of the specific version you want from [https://mirrors.kernel.org/pub/linux/kernel/](https://mirrors.kernel.org/pub/linux/kernel/). After getting your code, decompress it. A little bit of warning: **the folder is named linux-x.x.x. If you have a customized version and you want another copy of a clean version, DO NOT DECOMPRESS USING THE TAR COMMAND IN THE SAME FOLDER, IT WILL OVERWRITE YOUR OLD VERSION WITHOUT A WARNING**.

## Build the kernel

Building the kernel can be as simple as 123 or as complex as repairing a car. If you just want to build your kernel without understanding what's happening, the 123 version is here:

```bash
make defconfig O=your/build/dir
make kvmconfig O=your/build/dir
make O=your/build/dir -j8
```

Make sure your `your/build/dir` is empty before you run the above script. Line 2 is used if you want to use your kernel inside a virtual machine like QEMU \(most people reading this doc are kernel developers or researchers, so this is quite widely used among these guys\). As for the `-j8` option in line 3, change 8 to the number of threads you want during the build. In most cases, pick the number of cores your CPU has.

Yeah! You built your kernel! But I think the 123 version is not enough for the people who are reading this book, so let's dig deeper.

### Overview

The kernel is built with a system called Kbuild. It may sound fascinating to have a build system, but to be honest Kbuild is just a bunch of scripts for `make` - well, a well-written bunch of scripts.
Kbuild \(or `make`, if you insist\) will read the configuration from the file `.config`, choose the correct architecture, and declare configuration flags as macros in all the C files of the kernel, so combined with `#ifdef`, we are able to select the features we want.

### Kbuild

Obviously we are not going to explain how Kbuild works; it's too complicated. Luckily we don't have to. All we need to know is some tags \(or targets, to use the `make` jargon\) and how to interpret the output of Kbuild.

#### What files will be compiled as the kernel?

Adding a C file but not seeing it compiled? You have something to do. Unlike most C projects, not all files inside the kernel are compiled. The key secret behind it is a target called `obj-y`. When you add your own C file in a folder, say `/kernel/new.c`, you need to edit the Makefile in that directory, which in our example is `/kernel/Makefile`. Opening the Makefile, you will see the target `obj-y` at the top. Add your file but change the extension to `.o` - in our example that would be `new.o` - and you are all set.

#### What files will be compiled as a kernel module?

Like `obj-y` means you are adding a file into the kernel, `obj-m` makes your code into a kernel module. More will be discussed when we talk about how to write a kernel module.

#### Knowing the output of Kbuild

Generally, when building the kernel, logs will fly across your screen. In the output, you will generally see things like

```text
CC      /some/files
YACC    /some/gramma
CC [M]  /some/module/files
LD      /the/final/kernel
AR      /the/final/kernel
```

CC and LD should be quite familiar to you. Here AR means archive, which packs built object files into an archive. YACC generates a C file from a grammar file. An action with `[M]` means it is about a kernel module. It is normal to see modules built during the build process of the kernel using a default configuration, even if you did not specify that the kernel should build any module.

### Build Directory

Selecting a build directory is not always necessary. If you just `make` at the root of your source code, it will compile all the files locally. But if you are modifying your kernel, this will generate a huge pile of garbage in your working tree. To build your kernel somewhere else, add `O=the/dir/you/want` to every one of your `make` commands. As long as the build directory is the same, configs and existing built files will remain and be reused.

### Configurations

Configuration files are a bunch of lines like `CONFIG_XXX_XXX=xxx` that are used by Kbuild to enable specific features and to customize some values or limits in the kernel. If nothing goes wrong, Kbuild will use the `.config` in the build directory you specify. If there's no `.config`, Kbuild will copy a default configuration file there and use it. In most cases you might want to have a `.config` file there before you `make` so you can customize your kernel.

There are thousands of configurations available in the kernel and we will not be discussing them here. You will know what to change if you run into any of them.

#### Initialize a build directory with a clean config file

To initialize a clean-state build directory using a default configuration, do `make defconfig O=the/build/dir`. Make sure your build directory is an empty folder.
### Build it!

Finally you are ready to build your kernel. Don't hesitate,

```text
make O=your/build/dir -j128
```

and let the wheels turn!... well, change the `128` to something your CPU can handle. Grab a cup of coffee and take a break.

#### How long does it take?

If you are using a desktop computer with a recent 4-to-8-thread CPU and you are building the kernel from scratch, it generally takes 40 minutes to 2 hours depending on your specific environment. If you have already built your kernel and only changed a `.c` file, it will take less than a minute. If you change a header, however, it may affect multiple `.c` files, so it will take longer.

--------------------------------------------------------------------------------
/hacking-interrupts-exceptions-and-trap-handlers/hooking-an-idt-handler.md:
--------------------------------------------------------------------------------

# Hooking an IDT handler

You might just want to hook an IDT handler, for whatever reason. You search the web, then find out that things have changed rapidly. Even a somewhat _modern_ tutorial targeting Linux 4.x is outdated. Now, in the year 2020, hooking an IDT entry is quite different from what you would think.

## A Brief History...

**Skip if you don't care**

In most cases hooking an IDT entry is more or less hacking behaviour. Back in 3.x you would manually create your own IDT table and replace the default one. In 4.x there was a function called `set_intr_gate`, which would replace the original interrupt handler with yours. In 2017, however, Thomas Gleixner [made a change to the kernel](https://lkml.org/lkml/2017/8/25/226). He argued that `set_intr_gate` is only used during the boot process, so it should be an internal function; the only _good_ reason for changing the IDT later is KVM handling page faults, so there should be a specific, explicitly named function called `update_intr_gate`, with `set_intr_gate` made private \(`static`\).
Basically what he did was \(diff was changed to a more readable way\) 10 | 11 | ```text 12 | -void set_intr_gate(unsigned int n, const void *addr) 13 | +static void set_intr_gate(unsigned int n, const void *addr) 14 | 15 | ... 16 | 17 | +void __init update_intr_gate(unsigned int n, const void *addr) 18 | +{ 19 | + if (WARN_ON_ONCE(test_bit(n, used_vectors))) 20 | + return; 21 | + set_intr_gate(n, addr); 22 | +} 23 | ``` 24 | 25 | Well, fine. Things are gonna be hard because of this. 26 | 27 | ## How Linux Defines the IDT 28 | 29 | Writing your own handler is shitty. This is the very few cases inside the kernel where you need to write some assembly. To get started, let's look at how things are defined. Most of the IDT related things are in `/arch/x86/kernel/idt.c`. From top to bottom you will fist see that there's an early IDT, a default IDT, and a bunch of other functions. So the thing is: When the kernel is at its early boot stage, only 3 handlers are defined. After that, the `idt_setup_traps` is called to set all default handlers to the IDT. Now everything is up and running. 30 | 31 | Looking at the `def_idts` you can see 32 | 33 | ```c 34 | #elif defined(CONFIG_X86_32) 35 | SYSG(IA32_SYSCALL_VECTOR, entry_INT80_32), 36 | #endif 37 | ``` 38 | 39 | which shows the process of setting up syscalls in 32-bit x86 machine. Each of those `divide_error`, `invalid_op`, etc. is a somewhat wrapped handler. Take the `invalid_op` handler as an example. This handler deals with, well, an invalid opcode in x86. Using a cross referencer, you can see that it is defined in `/arch/x86/entry/entry_64.S` as 40 | 41 | ```text 42 | idtentry invalid_op do_invalid_op has_error_code=0 43 | ``` 44 | 45 | This is frustrating. It should be an assembly code but it doesn't seem to be one. The thing is that `idtentry` is a macro. 46 | 47 | ```text 48 | .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 49 | ENTRY(\sym) 50 | ... 51 | tons of code 52 | ... 53 | 54 | call \do_sym 55 | 56 | ... 57 | 58 | .if \paranoid 59 | jmp paranoid_exit 60 | .else 61 | jmp error_exit 62 | .endif 63 | 64 | ... 65 | END(\sym) 66 | ``` 67 | 68 | It's a little too complicated. Basically most of the things it does are setting up a kernel context and switch stacks. The core event happens is calling the `do_sym` which is the actual handler. After things are handled, it calls an exit function which would switch back the context and stack then do `iret`. So now you know that if you are going to create a handler, you are gonna manually do all of the things in `entry_64.S`. Shit. 69 | 70 | The simplist way to do it is simply copy the whole `idtentry` macro and some other macros to your own file. I've done it, copy it on your own: 71 | 72 | ```c 73 | #include 74 | 75 | #include 76 | #include 77 | #include 78 | #include 79 | #include 80 | #include 81 | #include 82 | #include 83 | #include 84 | #include 85 | #include 86 | #include 87 | #include 88 | #include 89 | #include 90 | #include 91 | #include 92 | #include 93 | #include 94 | 95 | #include 96 | #include 97 | #include 98 | #include 99 | #include 100 | #include 101 | #include 102 | 103 | .code64 104 | .section .entry.text, "ax" 105 | 106 | /* 107 | * 64-bit system call stack frame layout defines and helpers, 108 | * for assembly code: 109 | */ 110 | 111 | /* The layout forms the "struct pt_regs" on the stack: */ 112 | /* 113 | * C ABI says these regs are callee-preserved. 
They aren't saved on kernel entry 114 | * unless syscall needs a complete, fully filled "struct pt_regs". 115 | */ 116 | #define R15 0*8 117 | #define R14 1*8 118 | #define R13 2*8 119 | #define R12 3*8 120 | #define RBP 4*8 121 | #define RBX 5*8 122 | /* These regs are callee-clobbered. Always saved on kernel entry. */ 123 | #define R11 6*8 124 | #define R10 7*8 125 | #define R9 8*8 126 | #define R8 9*8 127 | #define RAX 10*8 128 | #define RCX 11*8 129 | #define RDX 12*8 130 | #define RSI 13*8 131 | #define RDI 14*8 132 | /* 133 | * On syscall entry, this is syscall#. On CPU exception, this is error code. 134 | * On hw interrupt, it's IRQ number: 135 | */ 136 | #define ORIG_RAX 15*8 137 | /* Return frame for iretq */ 138 | #define RIP 16*8 139 | #define CS 17*8 140 | #define EFLAGS 18*8 141 | #define RSP 19*8 142 | #define SS 20*8 143 | 144 | #define SIZEOF_PTREGS 21*8 145 | 146 | /* 147 | * This does 'call enter_from_user_mode' unless we can avoid it based on 148 | * kernel config or using the static jump infrastructure. 149 | */ 150 | .macro CALL_enter_from_user_mode 151 | #ifdef CONFIG_CONTEXT_TRACKING 152 | #ifdef CONFIG_JUMP_LABEL 153 | STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_enabled, def=0 154 | #endif 155 | call enter_from_user_mode 156 | .Lafter_call_\@: 157 | #endif 158 | .endm 159 | 160 | #ifdef CONFIG_PARAVIRT_XXL 161 | #define GET_CR2_INTO(reg) GET_CR2_INTO_AX ; _ASM_MOV %_ASM_AX, reg 162 | #else 163 | #define GET_CR2_INTO(reg) _ASM_MOV %cr2, reg 164 | #endif 165 | 166 | /* 167 | * When dynamic function tracer is enabled it will add a breakpoint 168 | * to all locations that it is about to modify, sync CPUs, update 169 | * all the code, sync CPUs, then remove the breakpoints. In this time 170 | * if lockdep is enabled, it might jump back into the debug handler 171 | * outside the updating of the IST protection. (TRACE_IRQS_ON/OFF). 172 | * 173 | * We need to change the IDT table before calling TRACE_IRQS_ON/OFF to 174 | * make sure the stack pointer does not get reset back to the top 175 | * of the debug stack, and instead just reuses the current stack. 176 | */ 177 | #if defined(CONFIG_DYNAMIC_FTRACE) && defined(CONFIG_TRACE_IRQFLAGS) 178 | 179 | .macro TRACE_IRQS_OFF_DEBUG 180 | call debug_stack_set_zero 181 | TRACE_IRQS_OFF 182 | call debug_stack_reset 183 | .endm 184 | 185 | .macro TRACE_IRQS_ON_DEBUG 186 | call debug_stack_set_zero 187 | TRACE_IRQS_ON 188 | call debug_stack_reset 189 | .endm 190 | 191 | .macro TRACE_IRQS_IRETQ_DEBUG 192 | btl $9, EFLAGS(%rsp) /* interrupts off? */ 193 | jnc 1f 194 | TRACE_IRQS_ON_DEBUG 195 | 1: 196 | .endm 197 | 198 | #else 199 | # define TRACE_IRQS_OFF_DEBUG TRACE_IRQS_OFF 200 | # define TRACE_IRQS_ON_DEBUG TRACE_IRQS_ON 201 | # define TRACE_IRQS_IRETQ_DEBUG TRACE_IRQS_IRETQ 202 | #endif 203 | 204 | 205 | /* 206 | * Exception entry points. 207 | */ 208 | #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + (x) * 8) 209 | 210 | .macro idtentry_part do_sym, has_error_code:req, read_cr2:req, paranoid:req, shift_ist=-1, ist_offset=0 211 | 212 | .if \paranoid 213 | call paranoid_entry 214 | /* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */ 215 | .else 216 | call error_entry 217 | .endif 218 | UNWIND_HINT_REGS 219 | 220 | .if \read_cr2 221 | /* 222 | * Store CR2 early so subsequent faults cannot clobber it. Use R12 as 223 | * intermediate storage as RDX can be clobbered in enter_from_user_mode(). 224 | * GET_CR2_INTO can clobber RAX. 
225 | */ 226 | GET_CR2_INTO(%r12); 227 | .endif 228 | 229 | .if \shift_ist != -1 230 | TRACE_IRQS_OFF_DEBUG /* reload IDT in case of recursion */ 231 | .else 232 | TRACE_IRQS_OFF 233 | .endif 234 | 235 | .if \paranoid == 0 236 | testb $3, CS(%rsp) 237 | jz .Lfrom_kernel_no_context_tracking_\@ 238 | CALL_enter_from_user_mode 239 | .Lfrom_kernel_no_context_tracking_\@: 240 | .endif 241 | 242 | movq %rsp, %rdi /* pt_regs pointer */ 243 | 244 | .if \has_error_code 245 | movq ORIG_RAX(%rsp), %rsi /* get error code */ 246 | movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */ 247 | .else 248 | xorl %esi, %esi /* no error code */ 249 | .endif 250 | 251 | .if \shift_ist != -1 252 | subq $\ist_offset, CPU_TSS_IST(\shift_ist) 253 | .endif 254 | 255 | .if \read_cr2 256 | movq %r12, %rdx /* Move CR2 into 3rd argument */ 257 | .endif 258 | 259 | call \do_sym 260 | 261 | .if \shift_ist != -1 262 | addq $\ist_offset, CPU_TSS_IST(\shift_ist) 263 | .endif 264 | 265 | .if \paranoid 266 | /* this procedure expect "no swapgs" flag in ebx */ 267 | jmp paranoid_exit 268 | .else 269 | jmp error_exit 270 | .endif 271 | 272 | .endm 273 | 274 | /** 275 | * idtentry - Generate an IDT entry stub 276 | * @sym: Name of the generated entry point 277 | * @do_sym: C function to be called 278 | * @has_error_code: True if this IDT vector has an error code on the stack 279 | * @paranoid: non-zero means that this vector may be invoked from 280 | * kernel mode with user GSBASE and/or user CR3. 281 | * 2 is special -- see below. 282 | * @shift_ist: Set to an IST index if entries from kernel mode should 283 | * decrement the IST stack so that nested entries get a 284 | * fresh stack. (This is for #DB, which has a nasty habit 285 | * of recursing.) 286 | * @create_gap: create a 6-word stack gap when coming from kernel mode. 287 | * @read_cr2: load CR2 into the 3rd argument; done before calling any C code 288 | * 289 | * idtentry generates an IDT stub that sets up a usable kernel context, 290 | * creates struct pt_regs, and calls @do_sym. The stub has the following 291 | * special behaviors: 292 | * 293 | * On an entry from user mode, the stub switches from the trampoline or 294 | * IST stack to the normal thread stack. On an exit to user mode, the 295 | * normal exit-to-usermode path is invoked. 296 | * 297 | * On an exit to kernel mode, if @paranoid == 0, we check for preemption, 298 | * whereas we omit the preemption check if @paranoid != 0. This is purely 299 | * because the implementation is simpler this way. The kernel only needs 300 | * to check for asynchronous kernel preemption when IRQ handlers return. 301 | * 302 | * If @paranoid == 0, then the stub will handle IRET faults by pretending 303 | * that the fault came from user mode. It will handle gs_change faults by 304 | * pretending that the fault happened with kernel GSBASE. Since this handling 305 | * is omitted for @paranoid != 0, the #GP, #SS, and #NP stubs must have 306 | * @paranoid == 0. This special handling will do the wrong thing for 307 | * espfix-induced #DF on IRET, so #DF must not use @paranoid == 0. 308 | * 309 | * @paranoid == 2 is special: the stub will never switch stacks. This is for 310 | * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS. 
 */
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 read_cr2=0
ENTRY(\sym)
UNWIND_HINT_IRET_REGS offset=\has_error_code*8

/* Sanity check */
.if \shift_ist != -1 && \paranoid != 1
.error "using shift_ist requires paranoid=1"
.endif

.if \create_gap && \paranoid
.error "using create_gap requires paranoid=0"
.endif

ASM_CLAC

.if \has_error_code == 0
pushq $-1 /* ORIG_RAX: no syscall to restart */
.endif

.if \paranoid == 1
testb $3, CS-ORIG_RAX(%rsp) /* If coming from userspace, switch stacks */
jnz .Lfrom_usermode_switch_stack_\@
.endif

.if \create_gap == 1
/*
* If coming from kernel space, create a 6-word gap to allow the
* int3 handler to emulate a call instruction.
*/
testb $3, CS-ORIG_RAX(%rsp)
jnz .Lfrom_usermode_no_gap_\@
.rept 6
pushq 5*8(%rsp)
.endr
UNWIND_HINT_IRET_REGS offset=8
.Lfrom_usermode_no_gap_\@:
.endif

idtentry_part \do_sym, \has_error_code, \read_cr2, \paranoid, \shift_ist, \ist_offset

.if \paranoid == 1
/*
* Entry from userspace. Switch stacks and treat it
* as a normal entry. This means that paranoid handlers
* run in real process context if user_mode(regs).
*/
.Lfrom_usermode_switch_stack_\@:
idtentry_part \do_sym, \has_error_code, \read_cr2, paranoid=0
.endif

_ASM_NOKPROBE(\sym)
END(\sym)
.endm

```

Whoops...

## The Handler Wrapper

OK, shitty, but we are still gonna write our own, right? If you are modifying the kernel, just add your own handler in `entry_64.S` using the `idtentry` macro like the other handlers, and you may skip to the next section. But if you want to do it outside the kernel, like in a kernel module, things get tricky. The first thing you need to know is how all of that pile of shit works. The `idtentry` macro relies on a lot of helper "functions" like `error_exit`. Will those be accessible outside the file? We can see that those "functions" are defined using `ENTRY`. Digging deeper, we find in `/include/linux/linkage.h`:

```text
#ifndef ENTRY
#define ENTRY(name) \
.globl name ASM_NL \
ALIGN ASM_NL \
name:
#endif
```

Well, fine \(again\). This macro shows that everything defined by `ENTRY` is merely a label that is accessible everywhere, since there's a `.globl`. So all we need is to copy `idtentry` into our own assembly file and define the handler wrapper there. The `do_sym` handler, which actually handles the interrupt, is a simple C function.
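Once the macro and its helpers live in your own `.S` file, generating the wrapper is a single line. A minimal sketch, where `my_hook_entry` and `do_my_hook` are hypothetical names for your wrapper symbol and your C handler:

```text
/* at the bottom of your copied .S file, after the idtentry/idtentry_part macros */
idtentry my_hook_entry do_my_hook has_error_code=0
```

Because `idtentry` expands to `ENTRY(my_hook_entry)`, the symbol is global; you can declare it in C as `extern void my_hook_entry(void);` and hand its address to whatever ends up writing the IDT entry.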
## The `do_sym` Functions

However you got through the wrapper, you have one now. The next step is to create your very own C handler. It is always easier to copy an existing handler, but searching through the code you might find that there is simply no `do_divide_error` \(or whichever other handler\) to be found. The reason is that Linux uses macros to define that stuff. In `/arch/x86/kernel/traps.c` we find

```c
#define IP ((void __user *)uprobe_get_trap_addr(regs))
#define DO_ERROR(trapnr, signr, sicode, addr, str, name)		   \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
{									   \
	do_error_trap(regs, error_code, str, trapnr, signr, sicode, addr); \
}

DO_ERROR(X86_TRAP_DE, SIGFPE, FPE_INTDIV, IP, "divide error", divide_error)
DO_ERROR(X86_TRAP_OF, SIGSEGV, 0, NULL, "overflow", overflow)
DO_ERROR(X86_TRAP_UD, SIGILL, ILL_ILLOPN, IP, "invalid opcode", invalid_op)
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, 0, NULL, "coprocessor segment overrun", coprocessor_segment_overrun)
DO_ERROR(X86_TRAP_TS, SIGSEGV, 0, NULL, "invalid TSS", invalid_TSS)
DO_ERROR(X86_TRAP_NP, SIGBUS, 0, NULL, "segment not present", segment_not_present)
DO_ERROR(X86_TRAP_SS, SIGBUS, 0, NULL, "stack segment", stack_segment)
DO_ERROR(X86_TRAP_AC, SIGBUS, BUS_ADRALN, NULL, "alignment check", alignment_check)
#undef IP
```

Well, fine \(for the third time\). So it seems that every `do_xxx` just reports an error and returns. That means your C handler could even simply return without doing anything useful, and the world is saved.

So to define your own C handler, just define a function of type `dotraplinkage void(struct pt_regs *, long)`. But do remember that hooking overrides the original handler, so if your code only handles some edge cases, make sure a default routine \(here, calling `do_error_trap`\) is still included so the kernel won't crash.

## Make It Real

OK, now with our wrapper and our actual C handler defined, we are ready to hook the handler. As we discussed in the Brief History, we will be using `update_intr_gate` to achieve this. Just

```text
update_intr_gate(YOUR_TARGET_TRAP, your_trap_wrapper);
```

and you are done.

...or are you? True story: if you call this function you will find

```text
[ 124.629625] Initializing hook module...
[ 124.629626] Hooking IDT...
[ 124.647487] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 124.648651] BUG: unable to handle kernel paging request at ffffffffaaab0dbd
[ 124.649666] IP: update_intr_gate+0x0/0x20
```

What? NX-protected? Yes. The reason is how `update_intr_gate` is defined.

```text
void __init update_intr_gate(unsigned int n, const void *addr);
```

A huge `__init` is there. What's `__init`? It's a macro telling the kernel to free that code after booting. So once Linux is up and running, you are not gonna be able to use this function.
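To make that concrete, here is a minimal sketch of what `__init` means - the function name is made up, but the mechanism \(code in `.init.text` being freed after boot\) is the standard one:

```c
#include <linux/init.h>
#include <linux/printk.h>

/*
 * __init places this function in the .init.text section. Once booting is
 * finished the kernel logs "Freeing unused kernel memory" and gives those
 * pages back, so calling boot_time_only() afterwards jumps into memory that
 * is no longer valid kernel text - which is exactly the NX splat above.
 */
static int __init boot_time_only(void)
{
	pr_info("I only exist while the kernel is booting\n");
	return 0;
}
```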
So the only true way to do this is to copy the whole code of `set_intr_gate` and its related functions into your own source. I've done this, so just copy my code:

```c
#include

#include
#include
#include

struct idt_data {
	unsigned int vector;
	unsigned int segment;
	struct idt_bits bits;
	const void *addr;
};

static inline void another_idt_init_desc(gate_desc *gate, const struct idt_data *d)
{
	unsigned long addr = (unsigned long) d->addr;

	gate->offset_low = (u16) addr;
	gate->segment = (u16) d->segment;
	gate->bits = d->bits;
	gate->offset_middle = (u16) (addr >> 16);
#ifdef CONFIG_X86_64
	gate->offset_high = (u32) (addr >> 32);
	gate->reserved = 0;
#endif
}

static void
another_idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size)
{
	gate_desc desc;

	for (; size > 0; t++, size--) {
		another_idt_init_desc(&desc, t);
		write_idt_entry(idt, t->vector, &desc);
	}
}

static void another_set_intr_gate(unsigned int n, const void *addr)
{
	struct idt_data data;

	BUG_ON(n > 0xFF);

	memset(&data, 0, sizeof(data));
	data.vector = n;
	data.addr = addr;
	data.segment = __KERNEL_CS;
	data.bits.type = GATE_INTERRUPT;
	data.bits.p = 1;

	another_idt_setup_from_table(idt_table, &data, 1);
}
```

You should know that an exported \(or even just exposed\) function for setting an interrupt gate is extremely dangerous. If you are hooking the IDT, you know the danger - but not everyone does. So copy the code into your file and leave it `static`. Use it wisely.

## Writing a Module

If you are writing a module, you will find that some symbols mentioned in the previous sections are not exported. Please check the chapter on accessing the non-exported in modules. To know exactly which symbols are missing, just `make` and the build will complain about every symbol it cannot find.

Another thing: you did it, you clean it up. Since you hooked the handler, when you are done you have to restore the original. Looking at `/arch/x86/include/asm/desc.h`, you will find the actual IDT table `idt_table` and a bunch of inline functions to load and store IDT entries. Go read it - you know what to do.
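To make the clean-up concrete, here is a minimal sketch of a module that saves the original descriptor before hooking and writes it back on exit. `TARGET_VECTOR`, `my_hook_entry` and `another_set_intr_gate` are the hypothetical names used in this chapter, and the sketch assumes you have already solved the non-exported `idt_table` problem mentioned above:

```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/string.h>
#include <asm/desc.h>
#include <asm/traps.h>		/* X86_TRAP_* vector numbers */

#define TARGET_VECTOR	X86_TRAP_UD	/* whichever vector you are hooking */

extern void my_hook_entry(void);	/* the wrapper generated by your copied idtentry */

static gate_desc saved_gate;

static int __init hook_init(void)
{
	/* keep a copy of the original descriptor before overwriting it */
	memcpy(&saved_gate, &idt_table[TARGET_VECTOR], sizeof(saved_gate));
	another_set_intr_gate(TARGET_VECTOR, my_hook_entry);
	return 0;
}

static void __exit hook_exit(void)
{
	/* put the original descriptor back so the kernel outlives your module */
	write_idt_entry(idt_table, TARGET_VECTOR, &saved_gate);
}

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");
```

Restoring through `write_idt_entry` reuses the same helper the copied gate-setting code relies on, so the exit path needs nothing beyond the saved descriptor.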