├── .github └── FUNDING.yml ├── Architecture ├── README.MD ├── architecture-of-linux.png ├── architecture-of-linux2.png └── architecture-of-linux3.png ├── Find a List of Built-in Kernel Modules └── command.md ├── Inodes ├── definition.md └── wall │ └── monster.jpg ├── Kernel and Types of kernels.md ├── Linux └── exit │ └── exit.c ├── Memory ├── README.md └── wall │ └── Memory.png ├── PTE ├── README.MD └── screen │ ├── Capture-24.png │ └── screen ├── Paging ├── README.MD └── screen │ ├── paging-2.jpg │ ├── paging-3.jpg │ ├── paging.jpg │ └── screen ├── Process execution priorities └── README.MD ├── README.md ├── SystemCall ├── README.MD └── SystemCall+LKM+LSM.pdf ├── TLB ├── README.MD └── screen │ ├── screen │ └── tlb1.jpg ├── The-secrets-of-Kconfig ├── README.MD └── wall │ ├── 1.png │ ├── 2.png │ ├── 3.png │ └── 4.png ├── Uninstalling └── kernels.md ├── Userspace ├── README.MD ├── logo │ └── red-hat-logo.png └── what_mean.md ├── file_system └── File System in Linux Kernel.md ├── kernel-types-networknuts.png ├── logo ├── Linux.png └── logo ├── modules └── example-lkm-module │ ├── .clang-format │ ├── .editorconfig │ ├── .gitignore │ ├── .ycm_extra_conf.py │ ├── Kbuild │ ├── Makefile │ ├── README.MD │ └── katil.c ├── scheme └── kernels.png ├── type of programming └── types.md ├── understanding the linux kernel └── Understanding Linux Kernel.pdf └── wall ├── gV8hn.png └── wall /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # Link for sponsoring 2 | custom: ["https://www.nu11secur1ty.com/", "https://github.com/nu11secur1ty"] 3 | custom: ["https://www.paypal.com/donate/?hosted_button_id=ZPQZT5XMC5RFY"] 4 | -------------------------------------------------------------------------------- /Architecture/README.MD: -------------------------------------------------------------------------------- 1 | ## Architecture 2 | 
![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Architecture/architecture-of-linux.png) 3 | 4 | 1. Kernel:- The kernel is the core component of an operating system and is responsible for all of the major operations of the Linux OS. It consists of distinct types of modules and interacts directly with the underlying hardware. The kernel provides the abstraction needed to hide low-level hardware details from application programs. Some of the important kernel types are mentioned below: 5 | ----------------------------------------- 6 | - Monolithic kernels 7 | - Microkernels 8 | - Exokernels 9 | - Hybrid kernels 10 | ----------------------------------------- 11 | 12 | 2. System Libraries:- These libraries are special functions used to implement the operating system's functionality; they do not need the code access rights of kernel modules. 13 | ----------------------------------------- 14 | 3. System Utility Programs:- These are responsible for performing specialized, individual tasks. 15 | ----------------------------------------- 16 | 4. Hardware layer:- The Linux operating system's hardware layer consists of peripheral devices such as the CPU, HDD, and RAM. 17 | ----------------------------------------- 18 | 5. Shell:- The shell is an interface between the kernel and the user. It exposes the services of the kernel: it takes commands from the user and runs the corresponding kernel functions. Shells fall into two categories: graphical shells and command-line shells. 19 | ----------------------------------------- 20 | 21 | Graphical shells provide a graphical user interface, while command-line shells provide a command-line interface. Both implement the same operations.
However, graphical shells tend to work more slowly than command-line shells. 22 | 23 | A few common types of shells are: 24 | 25 | - Korn shell 26 | - Bourne shell 27 | - C shell 28 | - POSIX shell 29 | 30 | ## Linux Operating System Features 31 | `Some of the primary features of Linux OS are as follows:` 32 | 33 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Architecture/architecture-of-linux2.png) 34 | 35 | - Portable: Linux can run on many different types of hardware, and the Linux kernel supports installation in any kind of hardware environment. 36 | - Open source: The Linux source code is freely available, and several teams work in collaboration to enhance the capabilities of the Linux OS. 37 | - Multiprogramming: Linux is a multiprogramming system, which means more than one application can be executing at the same time. 38 | - Multi-user: Linux is also a multi-user system, which means more than one user can use system resources such as application programs, memory, or RAM at the same time. 39 | - Hierarchical file system: Linux provides a standard file structure in which system files and user files are arranged. 40 | - Security: Linux provides user security through authentication features such as controlled access to specific files, password protection, and data encryption. 41 | - Shell: Linux provides a special interpreter program that can be used to execute operating system commands and to perform various tasks such as launching application programs.
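The multiprogramming feature above can be demonstrated from any shell. A minimal sketch (plain shell, nothing beyond standard utilities assumed): two processes run in the background at the same time while the shell, itself an ordinary user process, waits for both.

```bash
# Start two sleeps as separate background processes; both run concurrently.
sleep 1 &
sleep 1 &

# The shell blocks here until both child processes have exited.
wait
echo "both background processes finished"
```

Run one after the other, the two sleeps would take about two seconds; started as background jobs they finish in about one, because the kernel schedules both processes at once.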
42 | 43 | ## Drawbacks of Linux 44 | 45 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Architecture/architecture-of-linux3.png) 46 | 47 | - Hardware drivers: Many Linux users face driver issues. Hardware companies prefer to build drivers for Windows or Mac because those systems have more users than Linux, so Linux has fewer drivers for peripheral hardware than Windows does. 48 | - Software alternatives: Take Photoshop, a famous graphic editing tool, as an example. Photoshop exists for Windows but is not available on Linux. There are other photo editing tools, but Photoshop is more powerful than the alternatives. Another example is MS Office, which is not available for Linux users. 49 | - Learning curve: Linux isn't a very user-friendly operating system, so it can be confusing for beginners. Getting started with Windows is easy and efficient for many beginners, whereas understanding how Linux works is more complex. 50 | We have to understand the command-line interface, and finding new software is a little complex as well. When we face an issue in the OS, finding a solution can be problematic, and there are more experts for Mac and Windows than for Linux. 51 | - Games: Many games are developed for Windows but unfortunately not for Linux, because the Windows platform is so widely used that game developers are more interested in it. 52 | 53 | ## Linux Operating System Applications 54 | 55 | Linux is a billion-dollar industry nowadays. Thousands of companies and governments across the world use the Linux operating system because of its low cost, low licensing fees, and the time and money it saves. Linux is used inside many types of electronic devices that are readily available to users worldwide.
A few of the famous Linux-based electronic devices are listed below: 56 | 57 | - Yamaha Motive Keyboard 58 | - Volvo In-Car Navigation System 59 | - TiVo Digital Video Recorder 60 | - Sony Reader 61 | - Sony Bravia Television 62 | - One Laptop Per Child XO2 63 | - Motorola MotoRokr EM35 phone 64 | - Lenovo IdeaPad S9 65 | - HP Mini 1000 66 | - Google Android Dev Phone 1 67 | - Garmin Nuvi 860, 880, and 5000 68 | - Dell Inspiron Mini 9 and 12 69 | 70 | ## Linux Distribution 71 | 72 | A Linux distribution is an operating system built from a collection of software on top of the Linux kernel; in other words, a distribution includes the Linux kernel plus supporting software and libraries. We can obtain a Linux-based OS by downloading any Linux distribution. Distributions exist for many types of devices, such as personal computers and embedded devices. More than 600 Linux distributions exist; a few of the famous ones are listed as follows: 73 | 74 | - Deepin 75 | - OpenSUSE 76 | - Fedora 77 | - Solus 78 | - Debian 79 | - Ubuntu 80 | - Elementary 81 | - Linux Mint 82 | - Manjaro 83 | - MX Linux 84 | 85 | ## Are Ubuntu and Linux Different? 86 | 87 | `YES` 88 | The primary difference is that Ubuntu is a free and open-source OS, a Linux distribution based on Debian, whereas Linux is a large family of open-source OSes that are all built on the Linux kernel. 89 | 90 | In other words, Ubuntu is a distribution of Linux, and Linux is the core system. Ubuntu is developed by Canonical Ltd. and was first released in 2004, while Linux was created by Linus Torvalds and first released in 1991. 91 | 92 | 93 | ## User mode vs Kernel mode: 94 | 95 | The kernel's code runs in a privileged mode known as kernel mode, with complete access to every resource of the computer. It runs as a single process in a single address space and does not require context switching.
Hence, it is very fast and efficient. 96 | The kernel executes all processes, provides the various system services to them, and gives them secure access to the hardware. 97 | Support code that does not need to run in kernel mode lives in the system libraries. User programs and other system programs run in user mode, 98 | with no direct access to kernel mode or to the system hardware. User utilities/programs use the system libraries to access kernel functions and carry out low-level system tasks. 99 | 100 | -------------------------------------------------------------------------------- /Architecture/architecture-of-linux.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Architecture/architecture-of-linux.png -------------------------------------------------------------------------------- /Architecture/architecture-of-linux2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Architecture/architecture-of-linux2.png -------------------------------------------------------------------------------- /Architecture/architecture-of-linux3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Architecture/architecture-of-linux3.png -------------------------------------------------------------------------------- /Find a List of Built-in Kernel Modules/command.md: -------------------------------------------------------------------------------- 1 | # Find a List of Built-in Kernel Modules 2 | ```bash 3 | cat /lib/modules/$(uname -r)/modules.builtin 4 | ``` 5 |
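A quick count is often more useful than the full list. A hedged sketch along the same lines: the `modules.builtin` file only exists when module metadata for the running kernel is installed under `/lib/modules`, so the path is checked first.

```bash
# Count the built-in modules of the running kernel, one per line.
f="/lib/modules/$(uname -r)/modules.builtin"
if [ -r "$f" ]; then
    wc -l < "$f"
else
    echo "no modules.builtin found for kernel $(uname -r)"
fi
```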
-------------------------------------------------------------------------------- /Inodes/definition.md: -------------------------------------------------------------------------------- 1 | # Understanding disk inodes 2 | 3 | You try creating a file on a server and see this error message: 4 | ```bash 5 | No space left on device 6 | ``` 7 | ...but you've got plenty of space: 8 | 9 | ```bash 10 | df 11 | Filesystem 1K-blocks Used Available Use% Mounted on 12 | /dev/xvda1 10321208 3159012 6637908 33% / 13 | ``` 14 | 15 | Who is the invisible monster chewing up all of your space? 16 | 17 | Why, the inode monster of course! 18 | 19 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Inodes/wall/monster.jpg) 20 | 21 | # What are inodes? 22 | An index node (or inode) contains metadata (file size, file type, etc.) for a file system object (like a file or a directory). There is one inode per file system object. 23 | 24 | An inode doesn't store the file contents or the name: it simply points to a specific file or directory. 25 | 26 | # The problem with inodes 27 | The total number of inodes and the space reserved for them is set when the filesystem is first created. The inode limit can't be changed dynamically, and every file system object must have an inode. 28 | 29 | While it's unusual to run out of inodes before actual disk space, you are more likely to have inode shortages if: 30 | 31 | - You are creating lots of directories, symlinks, and small files. 32 | - You created your ext3 filesystem with smaller block sizes. The ext3 default block size is 4096 bytes. If you are using your filesystem for storing lots of very small files, you might create the filesystem with a block size of 1024 or 2048. This would let you use your disk space more efficiently, but raises the likelihood of running low on inodes. 33 | - Your server is containerized (Docker, LXC, OpenVZ, etc).
Containerized servers often share the same filesystem as the host node. For stability and security purposes, the container's resources, such as RAM, CPU, disk space, and inodes, are limited. In this situation the number of inodes allocated to your container is determined by the administrator of the host node, and it is very common to run into inode issues in containers with filesystems of this type. 34 | 35 | 36 | # Viewing inode usage 37 | 38 | - Use the `-i` flag to view inode usage: 39 | ```bash 40 | df -i 41 | Filesystem Inodes IUsed IFree IUse% Mounted on 42 | /dev/xvda1 1310720 1310720 0 100% / 43 | ``` 44 | 45 | You can view more detailed inode information with the following command: 46 | 47 | ```bash 48 | tune2fs -l /dev/xvda1 | grep -i inode 49 | ``` 50 | 51 | # Where are the small files? 52 | If you are using up all of your inode capacity, the next step is figuring out where all of those little files are. This is a bit of a manual process. 53 | 54 | The command below will output the number of files and directories under each entry at the top of your file system: 55 | 56 | ```bash 57 | for i in /*; do echo $i; find $i | wc -l; done 58 | ``` 59 | - Example output: 60 | ```bash 61 | /bin 62 | 119 63 | /sys 64 | 9293 65 | /tmp 66 | 1 67 | /usr 68 | 10583 69 | ``` 70 | The counts above include all nested directories. If you have a single directory with many files, the command above may stall; that's a good hint that it is the problem directory. Once you've narrowed it down to a specific directory, you can execute the previous command on that directory: 71 | 72 | 73 | ```bash 74 | for i in /usr/*; do echo $i; find $i | wc -l; done 75 | ``` 76 | 77 | Delete all of the small files in that directory. 78 | 79 | # I can't delete those files. Can I increase the inode limit? 80 | The bad news: on the ext* file systems, you can't simply increase the inode limit on an existing volume. You have two options: 81 | 82 | 1. If the disk is an LVM, increase the size of the volume. 83 | 84 | 2.
Back up your data and create a new file system, specifying a higher inode count: 85 | 86 | ```bash 87 | mke2fs -N <number-of-inodes> /dev/<device> 88 | ``` 89 | 90 | 91 | # Are there filesystems that don't have inode limits? 92 | Yes. Modern filesystems like `Btrfs` and `XFS` use dynamic inode allocation to avoid fixed inode limits, and `ZFS` allocates its equivalent structures (dnodes) dynamically as well. 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /Inodes/wall/monster.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Inodes/wall/monster.jpg -------------------------------------------------------------------------------- /Kernel and Types of kernels.md: -------------------------------------------------------------------------------- 1 | # 1. What Is a Kernel? 2 | 3 | A kernel is the central component of an operating system. It acts as an interface between the user applications and the hardware. 4 | The sole aim of the kernel is to manage the communication between the software (user-level applications) and the hardware (CPU, disk, memory, etc.). The main tasks of the kernel are: 5 | 6 | Process management 7 | Device management 8 | Memory management 9 | Interrupt handling 10 | I/O communication 11 | File system management, etc. 12 | 13 | # 2. Is LINUX A Kernel Or An Operating System? 14 | 15 | Well, there is a difference between a kernel and an OS. The kernel, as described above, is the heart of the OS and manages its core features, while an OS becomes a complete package once useful applications and utilities are added on top of the kernel. 16 | So, it can easily be said that an operating system consists of a kernel space and a user space.
17 | 18 | So, we can say that Linux is a kernel, as it does not include applications like file-system utilities, 19 | windowing systems and graphical desktops, system administrator commands, text editors, compilers, etc. 20 | So, various companies add these kinds of applications on top of the Linux kernel and provide their own operating systems, such as Ubuntu, SUSE, CentOS, Red Hat, etc. 21 | 22 | # 3. Types Of Kernels 23 | 24 | Kernels may be classified mainly into two categories: 25 | 26 | Monolithic 27 | Micro Kernel 28 | 29 | # 1 Monolithic Kernels 30 | 31 | In earlier versions of this kernel architecture, all the basic system services, like process and memory management, interrupt handling, etc., were packaged into a single module in kernel space. This architecture had some serious drawbacks: 1) the size of the kernel, which was huge, and 2) poor maintainability, meaning that a bug fix or the addition of a new feature required recompiling the whole kernel, which could consume hours. 32 | 33 | In the modern-day approach to monolithic architecture, the kernel consists of different modules which can be dynamically loaded and unloaded. This modular approach allows easy extension of the OS's capabilities. Maintainability becomes very easy, as only the concerned module needs to be loaded and unloaded when there is a change or bug fix in it, so there is no need to bring down and recompile the whole kernel for the smallest change. Also, stripping a kernel down for a particular platform (say, embedded devices) became very easy, as we can simply leave out the modules we do not want. 34 | 35 | Linux follows the monolithic modular approach. 36 | 37 | # 2 Microkernels 38 | 39 | This architecture mainly addresses the problem of the ever-growing size of kernel code, which could not be controlled in the monolithic approach.
This architecture allows some basic services, like device driver management, the protocol stack, the file system, etc., to run in user space. This reduces the kernel code size and also increases the security and stability of the OS, as only the bare minimum code runs in the kernel. If a basic service like the networking service crashes due to a buffer overflow, only the networking service's memory would be corrupted, leaving the rest of the system still functional. 40 | 41 | In this architecture, all the basic OS services which are made part of user space run as servers which are used by other programs in the system through inter-process communication (IPC). E.g., we have servers for device drivers, network protocol stacks, file systems, graphics, etc. Microkernel servers are essentially daemon programs like any others, except that the kernel grants some of them privileges to interact with parts of physical memory that are otherwise off limits to most programs. This allows some servers, particularly device drivers, to interact directly with hardware. These servers are started at system start-up. 42 | 43 | So, what is the bare minimum that the microkernel architecture recommends keeping in kernel space? 44 | 45 | Managing memory protection 46 | Process scheduling 47 | Inter-process communication (IPC) 48 | 49 | Apart from the above, all other basic services can be made part of user space and run in the form of servers. 50 | 51 | ----------------------------------------------------------------------------------------- 52 | 53 | # 1. Microkernels — In these, only the most elementary functions are implemented directly 54 | in a central kernel — the microkernel. All other functions are delegated to autonomous 55 | processes that communicate with the central kernel via clearly defined communication 56 | interfaces — for example, various filesystems, memory management, and so on.
(Of 57 | course, the most elementary level of memory management that controls communication 58 | with the system itself is in the microkernel. However, handling on the system call level is 59 | implemented in external servers.) Theoretically, this is a very elegant approach because 60 | the individual parts are clearly segregated from each other, and this forces programmers 61 | to use ‘‘clean‘‘ programming techniques. Other benefits of this approach are dynamic 62 | extensibility and the ability to swap important components at run time. However, owing 63 | to the additional CPU time needed to support complex communication between the 64 | components, microkernels have not really established themselves in practice although they 65 | have been the subject of active and varied research for some time now. 66 | 67 | # 2. Monolithic Kernels — They are the alternative, traditional concept. Here, the entire code 68 | of the kernel — including all its subsystems such as memory management, filesystems, or 69 | device drivers — is packed into a single file. Each function has access to all other parts of 70 | the kernel; this can result in elaborately nested source code if programming is not done with 71 | great care. 72 | Because, at the moment, the performance of monolithic kernels is still greater than that of microkernels, 73 | Linux was and still is implemented according to this paradigm. However, one major innovation has been 74 | introduced. Modules with kernel code that can be inserted or removed while the system is up-and-running 75 | support the dynamic addition of a whole range of functions to the kernel, thus compensating for some of 76 | the disadvantages of monolithic kernels. This is assisted by elaborate means of communication between 77 | the kernel and userland that allows for implementing hotplugging and dynamic loading of modules. 
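The dynamic loading and unloading of modules described above can be watched from user space. A minimal sketch, assuming a Linux system: listing the currently loaded modules via `/proc/modules` needs no privileges, while actually inserting or removing a module requires root, so those commands are shown only as comments.

```bash
# Each line of /proc/modules describes one dynamically loaded module:
# name, memory size, reference count, users, state, load address.
if [ -r /proc/modules ]; then
    head -n 3 /proc/modules
else
    echo "/proc/modules not available (modules may all be built in)"
fi

# Inserting and removing a module (root required; 'loop' is just an example):
#   modprobe loop   # load loop.ko together with any modules it depends on
#   rmmod loop      # unload it again once its reference count drops to zero
```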
78 | 79 | 80 | ![image](https://github.com/nu11secur1ty/pictures/blob/master/gV8hn.png) 81 | 82 | 83 | -------------------------------------------------------------------------------------------------- 84 | 85 | 86 | # Kernel Space Definition 87 | 88 | 89 | 90 | System memory in Linux can be divided into two distinct regions: kernel space and user space. Kernel space is where the kernel (i.e., the core of the operating system) executes (i.e., runs) and provides its services. 91 | 92 | Memory consists of RAM (random access memory) cells, whose contents can be accessed (i.e., read and written to) at extremely high speeds but are retained only temporarily (i.e., while in use or, at most, while the power supply remains on). Its purpose is to hold programs and data that are currently in use and thereby serve as a high speed intermediary between the CPU (central processing unit) and the much slower storage, which most commonly consists of one or more hard disk drives (HDDs). 93 | 94 | User space is that set of memory locations in which user processes (i.e., everything other than the kernel) run. A process is an executing instance of a program. One of the roles of the kernel is to manage individual user processes within this space and to prevent them from interfering with each other. 95 | 96 | Kernel space can be accessed by user processes only through the use of system calls. System calls are requests in a Unix-like operating system by an active process for a service performed by the kernel, such as input/output (I/O) or process creation. An active process is a process that is currently progressing in the CPU, as contrasted with a process that is waiting for its next turn in the CPU. I/O is any program, operation or device that transfers data to or from a CPU and to or from a peripheral device (such as disk drives, keyboards, mice and printers). 
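The system-call boundary described above is easy to observe from user space: reading a file under `/proc` issues ordinary open() and read() system calls, which the kernel answers with data from its own data structures. A minimal sketch, assuming a mounted `/proc`:

```bash
# /proc/self always refers to the process that opens it; reading it is a
# system call into the kernel, which replies with data about that process.
if [ -r /proc/self/status ]; then
    head -n 1 /proc/self/status   # first line names the current process
else
    echo "/proc is not mounted here"
fi
```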
97 | 98 | 99 | ------------------------------------------------------------------------------------------------------- 100 | 101 | 102 | # User Space Definition 103 | 104 | 105 | 106 | User space is that portion of system memory in which user processes run. This contrasts with kernel space, which is that portion of memory in which the kernel executes and provides its services. 107 | 108 | The contents of memory, which consists of dedicated RAM (random access memory) VLSI (very large scale integrated circuit) semiconductor chips, can be accessed (i.e., read and written to) at extremely high speeds but are retained only temporarily (i.e., while in use or, at most, while the power supply remains on). This contrasts with storage (e.g., disk drives), which has much slower access speeds but whose contents are retained after the power is turned off and which usually has a far greater capacity. 109 | 110 | A process is an executing (i.e., running) instance of a program. User processes are instances of all programs other than the kernel (i.e., utility and application programs). When a program is to be run, it is copied from storage into user space so that it can be accessed at high speed by the CPU (central processing unit). 111 | 112 | The kernel is a program that constitutes the central core of a computer operating system. It is not a process, but rather a controller of processes, and it has complete control over everything that occurs on the system. This includes managing individual user processes within user space and preventing them from interfering with each other. 113 | 114 | The division of system memory in Unix-like operating systems into user space and kernel space plays an important role in maintaining system stability and security. 
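The user-space memory locations described above can be inspected directly on Linux. A hedged sketch, assuming `/proc` is mounted: `/proc/self/maps` lists every region mapped into the reading process's own address space.

```bash
# Each line is one mapped region of this process's user-space address range:
# start-end addresses, permissions, offset, device, inode, and backing file.
if [ -r /proc/self/maps ]; then
    head -n 2 /proc/self/maps
else
    echo "/proc is not mounted here"
fi
```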
115 | 116 | ![image](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/kernel-types-networknuts.png) 117 | 118 | # [See also:](https://gist.github.com/nu11secur1ty/e7ad7ca9bd5391727c8e513250839eec) 119 | 120 | 121 | # Example: [See...](http://www.nu11secur1ty.com/2017/12/buildingkernelmoduleexample.html) 122 | -------------------------------------------------------------------------------- /Linux/exit/exit.c: -------------------------------------------------------------------------------- 1 | // SPDX-License-Identifier: GPL-2.0-only 2 | /* 3 | * linux/kernel/exit.c 4 | * 5 | * Copyright (C) 1991, 1992 Linus Torvalds 6 | */ 7 | 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | #include 23 | #include 24 | #include 25 | #include 26 | #include 27 | #include 28 | #include 29 | #include 30 | #include 31 | #include 32 | #include 33 | #include 34 | #include 35 | #include 36 | #include 37 | #include 38 | #include 39 | #include 40 | #include 41 | #include 42 | #include 43 | #include 44 | #include 45 | #include 46 | #include 47 | #include 48 | #include 49 | #include /* for audit_free() */ 50 | #include 51 | #include 52 | #include 53 | #include 54 | #include 55 | #include 56 | #include 57 | #include 58 | #include 59 | #include 60 | #include 61 | #include 62 | #include 63 | #include 64 | #include 65 | #include 66 | 67 | #include 68 | #include 69 | #include 70 | #include 71 | 72 | static void __unhash_process(struct task_struct *p, bool group_dead) 73 | { 74 | nr_threads--; 75 | detach_pid(p, PIDTYPE_PID); 76 | if (group_dead) { 77 | detach_pid(p, PIDTYPE_TGID); 78 | detach_pid(p, PIDTYPE_PGID); 79 | detach_pid(p, PIDTYPE_SID); 80 | 81 | list_del_rcu(&p->tasks); 82 | list_del_init(&p->sibling); 83 | __this_cpu_dec(process_counts); 84 | } 85 | list_del_rcu(&p->thread_group); 86 | list_del_rcu(&p->thread_node); 
87 | } 88 | 89 | /* 90 | * This function expects the tasklist_lock write-locked. 91 | */ 92 | static void __exit_signal(struct task_struct *tsk) 93 | { 94 | struct signal_struct *sig = tsk->signal; 95 | bool group_dead = thread_group_leader(tsk); 96 | struct sighand_struct *sighand; 97 | struct tty_struct *uninitialized_var(tty); 98 | u64 utime, stime; 99 | 100 | sighand = rcu_dereference_check(tsk->sighand, 101 | lockdep_tasklist_lock_is_held()); 102 | spin_lock(&sighand->siglock); 103 | 104 | #ifdef CONFIG_POSIX_TIMERS 105 | posix_cpu_timers_exit(tsk); 106 | if (group_dead) { 107 | posix_cpu_timers_exit_group(tsk); 108 | } else { 109 | /* 110 | * This can only happen if the caller is de_thread(). 111 | * FIXME: this is the temporary hack, we should teach 112 | * posix-cpu-timers to handle this case correctly. 113 | */ 114 | if (unlikely(has_group_leader_pid(tsk))) 115 | posix_cpu_timers_exit_group(tsk); 116 | } 117 | #endif 118 | 119 | if (group_dead) { 120 | tty = sig->tty; 121 | sig->tty = NULL; 122 | } else { 123 | /* 124 | * If there is any task waiting for the group exit 125 | * then notify it: 126 | */ 127 | if (sig->notify_count > 0 && !--sig->notify_count) 128 | wake_up_process(sig->group_exit_task); 129 | 130 | if (tsk == sig->curr_target) 131 | sig->curr_target = next_thread(tsk); 132 | } 133 | 134 | add_device_randomness((const void*) &tsk->se.sum_exec_runtime, 135 | sizeof(unsigned long long)); 136 | 137 | /* 138 | * Accumulate here the counters for all threads as they die. We could 139 | * skip the group leader because it is the last user of signal_struct, 140 | * but we want to avoid the race with thread_group_cputime() which can 141 | * see the empty ->thread_head list. 
	 */
	task_cputime(tsk, &utime, &stime);
	write_seqlock(&sig->stats_lock);
	sig->utime += utime;
	sig->stime += stime;
	sig->gtime += task_gtime(tsk);
	sig->min_flt += tsk->min_flt;
	sig->maj_flt += tsk->maj_flt;
	sig->nvcsw += tsk->nvcsw;
	sig->nivcsw += tsk->nivcsw;
	sig->inblock += task_io_get_inblock(tsk);
	sig->oublock += task_io_get_oublock(tsk);
	task_io_accounting_add(&sig->ioac, &tsk->ioac);
	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
	sig->nr_threads--;
	__unhash_process(tsk, group_dead);
	write_sequnlock(&sig->stats_lock);

	/*
	 * Do this under ->siglock, we can race with another thread
	 * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
	 */
	flush_sigqueue(&tsk->pending);
	tsk->sighand = NULL;
	spin_unlock(&sighand->siglock);

	__cleanup_sighand(sighand);
	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
	if (group_dead) {
		flush_sigqueue(&sig->shared_pending);
		tty_kref_put(tty);
	}
}

static void delayed_put_task_struct(struct rcu_head *rhp)
{
	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

	perf_event_delayed_put(tsk);
	trace_sched_process_free(tsk);
	put_task_struct(tsk);
}


void release_task(struct task_struct *p)
{
	struct task_struct *leader;
	int zap_leader;
repeat:
	/* don't need to get the RCU readlock here - the process is dead and
	 * can't be modifying its own credentials. But shut RCU-lockdep up */
	rcu_read_lock();
	atomic_dec(&__task_cred(p)->user->processes);
	rcu_read_unlock();

	proc_flush_task(p);
	cgroup_release(p);

	write_lock_irq(&tasklist_lock);
	ptrace_release_task(p);
	__exit_signal(p);

	/*
	 * If we are the last non-leader member of the thread
	 * group, and the leader is zombie, then notify the
	 * group leader's parent process. (if it wants notification.)
	 */
	zap_leader = 0;
	leader = p->group_leader;
	if (leader != p && thread_group_empty(leader)
			&& leader->exit_state == EXIT_ZOMBIE) {
		/*
		 * If we were the last child thread and the leader has
		 * exited already, and the leader's parent ignores SIGCHLD,
		 * then we are the one who should release the leader.
		 */
		zap_leader = do_notify_parent(leader, leader->exit_signal);
		if (zap_leader)
			leader->exit_state = EXIT_DEAD;
	}

	write_unlock_irq(&tasklist_lock);
	release_thread(p);
	call_rcu(&p->rcu, delayed_put_task_struct);

	p = leader;
	if (unlikely(zap_leader))
		goto repeat;
}

/*
 * Note that if this function returns a valid task_struct pointer (!NULL)
 * task->usage must remain >0 for the duration of the RCU critical section.
 */
struct task_struct *task_rcu_dereference(struct task_struct **ptask)
{
	struct sighand_struct *sighand;
	struct task_struct *task;

	/*
	 * We need to verify that release_task() was not called and thus
	 * delayed_put_task_struct() can't run and drop the last reference
	 * before rcu_read_unlock(). We check task->sighand != NULL,
	 * but we can read the already freed and reused memory.
	 */
retry:
	task = rcu_dereference(*ptask);
	if (!task)
		return NULL;

	probe_kernel_address(&task->sighand, sighand);

	/*
	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
	 * was already freed we can not miss the preceding update of this
	 * pointer.
	 */
	smp_rmb();
	if (unlikely(task != READ_ONCE(*ptask)))
		goto retry;

	/*
	 * We've re-checked that "task == *ptask", now we have two different
	 * cases:
	 *
	 * 1. This is actually the same task/task_struct. In this case
	 *    sighand != NULL tells us it is still alive.
	 *
	 * 2. This is another task which got the same memory for task_struct.
	 *    We can't know this of course, and we can not trust
	 *    sighand != NULL.
	 *
	 *    In this case we actually return a random value, but this is
	 *    correct.
	 *
	 *    If we return NULL - we can pretend that we actually noticed that
	 *    *ptask was updated when the previous task has exited. Or pretend
	 *    that probe_slab_address(&sighand) reads NULL.
	 *
	 *    If we return the new task (because sighand is not NULL for any
	 *    reason) - this is fine too. This (new) task can't go away before
	 *    another gp pass.
	 *
	 *    And note: We could even eliminate the false positive if re-read
	 *    task->sighand once again to avoid the falsely NULL. But this case
	 *    is very unlikely so we don't care.
	 */
	if (!sighand)
		return NULL;

	return task;
}

void rcuwait_wake_up(struct rcuwait *w)
{
	struct task_struct *task;

	rcu_read_lock();

	/*
	 * Order condition vs @task, such that everything prior to the load
	 * of @task is visible. This is the condition as to why the user called
	 * rcuwait_trywake() in the first place. Pairs with set_current_state()
	 * barrier (A) in rcuwait_wait_event().
	 *
	 *    WAIT                WAKE
	 *    [S] tsk = current   [S] cond = true
	 *        MB (A)              MB (B)
	 *    [L] cond            [L] tsk
	 */
	smp_mb(); /* (B) */

	/*
	 * Avoid using task_rcu_dereference() magic as long as we are careful,
	 * see comment in rcuwait_wait_event() regarding ->exit_state.
	 */
	task = rcu_dereference(w->task);
	if (task)
		wake_up_process(task);
	rcu_read_unlock();
}

/*
 * Determine if a process group is "orphaned", according to the POSIX
 * definition in 2.2.2.52. Orphaned process groups are not to be affected
 * by terminal-generated stop signals. Newly orphaned process groups are
 * to receive a SIGHUP and a SIGCONT.
 *
 * "I ask you, have you ever known what it is to be an orphan?"
 */
static int will_become_orphaned_pgrp(struct pid *pgrp,
					struct task_struct *ignored_task)
{
	struct task_struct *p;

	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
		if ((p == ignored_task) ||
		    (p->exit_state && thread_group_empty(p)) ||
		    is_global_init(p->real_parent))
			continue;

		if (task_pgrp(p->real_parent) != pgrp &&
		    task_session(p->real_parent) == task_session(p))
			return 0;
	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);

	return 1;
}

int is_current_pgrp_orphaned(void)
{
	int retval;

	read_lock(&tasklist_lock);
	retval = will_become_orphaned_pgrp(task_pgrp(current), NULL);
	read_unlock(&tasklist_lock);

	return retval;
}

static bool has_stopped_jobs(struct pid *pgrp)
{
	struct task_struct *p;

	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
		if (p->signal->flags & SIGNAL_STOP_STOPPED)
			return true;
	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);

	return false;
}

/*
 * Check to see if any process groups have become orphaned as
 * a result of our exiting, and if they have any stopped jobs,
 * send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
 */
static void
kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
{
	struct pid *pgrp = task_pgrp(tsk);
	struct task_struct *ignored_task = tsk;

	if (!parent)
		/* exit: our father is in a different pgrp than
		 * we are and we were the only connection outside.
		 */
		parent = tsk->real_parent;
	else
		/* reparent: our child is in a different pgrp than
		 * we are, and it was the only connection outside.
		 */
		ignored_task = NULL;

	if (task_pgrp(parent) != pgrp &&
	    task_session(parent) == task_session(tsk) &&
	    will_become_orphaned_pgrp(pgrp, ignored_task) &&
	    has_stopped_jobs(pgrp)) {
		__kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp);
		__kill_pgrp_info(SIGCONT, SEND_SIG_PRIV, pgrp);
	}
}

#ifdef CONFIG_MEMCG
/*
 * A task is exiting. If it owned this mm, find a new owner for the mm.
 */
void mm_update_next_owner(struct mm_struct *mm)
{
	struct task_struct *c, *g, *p = current;

retry:
	/*
	 * If the exiting or execing task is not the owner, it's
	 * someone else's problem.
	 */
	if (mm->owner != p)
		return;
	/*
	 * The current owner is exiting/execing and there are no other
	 * candidates. Do not leave the mm pointing to a possibly
	 * freed task structure.
	 */
	if (atomic_read(&mm->mm_users) <= 1) {
		WRITE_ONCE(mm->owner, NULL);
		return;
	}

	read_lock(&tasklist_lock);
	/*
	 * Search in the children
	 */
	list_for_each_entry(c, &p->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}

	/*
	 * Search in the siblings
	 */
	list_for_each_entry(c, &p->real_parent->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}

	/*
	 * Search through everything else, we should not get here often.
	 */
	for_each_process(g) {
		if (g->flags & PF_KTHREAD)
			continue;
		for_each_thread(g, c) {
			if (c->mm == mm)
				goto assign_new_owner;
			if (c->mm)
				break;
		}
	}
	read_unlock(&tasklist_lock);
	/*
	 * We found no owner yet mm_users > 1: this implies that we are
	 * most likely racing with swapoff (try_to_unuse()) or /proc or
	 * ptrace or page migration (get_task_mm()). Mark owner as NULL.
	 */
	WRITE_ONCE(mm->owner, NULL);
	return;

assign_new_owner:
	BUG_ON(c == p);
	get_task_struct(c);
	/*
	 * The task_lock protects c->mm from changing.
	 * We always want mm->owner->mm == mm
	 */
	task_lock(c);
	/*
	 * Delay read_unlock() till we have the task_lock()
	 * to ensure that c does not slip away underneath us
	 */
	read_unlock(&tasklist_lock);
	if (c->mm != mm) {
		task_unlock(c);
		put_task_struct(c);
		goto retry;
	}
	WRITE_ONCE(mm->owner, c);
	task_unlock(c);
	put_task_struct(c);
}
#endif /* CONFIG_MEMCG */

/*
 * Turn us into a lazy TLB process if we
 * aren't already..
 */
static void exit_mm(void)
{
	struct mm_struct *mm = current->mm;
	struct core_state *core_state;

	mm_release(current, mm);
	if (!mm)
		return;
	sync_mm_rss(mm);
	/*
	 * Serialize with any possible pending coredump.
	 * We must hold mmap_sem around checking core_state
	 * and clearing tsk->mm. The core-inducing thread
	 * will increment ->nr_threads for each thread in the
	 * group with ->mm != NULL.
	 */
	down_read(&mm->mmap_sem);
	core_state = mm->core_state;
	if (core_state) {
		struct core_thread self;

		up_read(&mm->mmap_sem);

		self.task = current;
		self.next = xchg(&core_state->dumper.next, &self);
		/*
		 * Implies mb(), the result of xchg() must be visible
		 * to core_state->dumper.
		 */
		if (atomic_dec_and_test(&core_state->nr_threads))
			complete(&core_state->startup);

		for (;;) {
			set_current_state(TASK_UNINTERRUPTIBLE);
			if (!self.task) /* see coredump_finish() */
				break;
			freezable_schedule();
		}
		__set_current_state(TASK_RUNNING);
		down_read(&mm->mmap_sem);
	}
	mmgrab(mm);
	BUG_ON(mm != current->active_mm);
	/* more a memory barrier than a real lock */
	task_lock(current);
	current->mm = NULL;
	up_read(&mm->mmap_sem);
	enter_lazy_tlb(mm, current);
	task_unlock(current);
	mm_update_next_owner(mm);
	mmput(mm);
	if (test_thread_flag(TIF_MEMDIE))
		exit_oom_victim();
}

static struct task_struct *find_alive_thread(struct task_struct *p)
{
	struct task_struct *t;

	for_each_thread(p, t) {
		if (!(t->flags & PF_EXITING))
			return t;
	}
	return NULL;
}

static struct task_struct *find_child_reaper(struct task_struct *father,
						struct list_head *dead)
	__releases(&tasklist_lock)
	__acquires(&tasklist_lock)
{
	struct pid_namespace *pid_ns = task_active_pid_ns(father);
	struct task_struct *reaper = pid_ns->child_reaper;
	struct task_struct *p, *n;

	if (likely(reaper != father))
		return reaper;

	reaper = find_alive_thread(father);
	if (reaper) {
		pid_ns->child_reaper = reaper;
		return reaper;
	}

	write_unlock_irq(&tasklist_lock);
	if (unlikely(pid_ns == &init_pid_ns)) {
		panic("Attempted to kill init! exitcode=0x%08x\n",
			father->signal->group_exit_code ?: father->exit_code);
	}

	list_for_each_entry_safe(p, n, dead, ptrace_entry) {
		list_del_init(&p->ptrace_entry);
		release_task(p);
	}

	zap_pid_ns_processes(pid_ns);
	write_lock_irq(&tasklist_lock);

	return father;
}

/*
 * When we die, we re-parent all our children, and try to:
 * 1. give them to another thread in our thread group, if such a member exists
 * 2. give it to the first ancestor process which prctl'd itself as a
 *    child_subreaper for its children (like a service manager)
 * 3. give it to the init process (PID 1) in our pid namespace
 */
static struct task_struct *find_new_reaper(struct task_struct *father,
					   struct task_struct *child_reaper)
{
	struct task_struct *thread, *reaper;

	thread = find_alive_thread(father);
	if (thread)
		return thread;

	if (father->signal->has_child_subreaper) {
		unsigned int ns_level = task_pid(father)->level;
		/*
		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
		 * We can't check reaper != child_reaper to ensure we do not
		 * cross the namespaces, the exiting parent could be injected
		 * by setns() + fork().
		 * We check pid->level, this is slightly more efficient than
		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).
		 */
		for (reaper = father->real_parent;
		     task_pid(reaper)->level == ns_level;
		     reaper = reaper->real_parent) {
			if (reaper == &init_task)
				break;
			if (!reaper->signal->is_child_subreaper)
				continue;
			thread = find_alive_thread(reaper);
			if (thread)
				return thread;
		}
	}

	return child_reaper;
}

/*
 * Any that need to be release_task'd are put on the @dead list.
 */
static void reparent_leader(struct task_struct *father, struct task_struct *p,
				struct list_head *dead)
{
	if (unlikely(p->exit_state == EXIT_DEAD))
		return;

	/* We don't want people slaying init. */
	p->exit_signal = SIGCHLD;

	/* If it has exited notify the new parent about this child's death. */
	if (!p->ptrace &&
	    p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) {
		if (do_notify_parent(p, p->exit_signal)) {
			p->exit_state = EXIT_DEAD;
			list_add(&p->ptrace_entry, dead);
		}
	}

	kill_orphaned_pgrp(p, father);
}

/*
 * This does two things:
 *
 * A.  Make init inherit all the child processes
 * B.  Check to see if any process groups have become orphaned
 *	as a result of our exiting, and if they have any stopped
 *	jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
 */
static void forget_original_parent(struct task_struct *father,
					struct list_head *dead)
{
	struct task_struct *p, *t, *reaper;

	if (unlikely(!list_empty(&father->ptraced)))
		exit_ptrace(father, dead);

	/* Can drop and reacquire tasklist_lock */
	reaper = find_child_reaper(father, dead);
	if (list_empty(&father->children))
		return;

	reaper = find_new_reaper(father, reaper);
	list_for_each_entry(p, &father->children, sibling) {
		for_each_thread(p, t) {
			t->real_parent = reaper;
			BUG_ON((!t->ptrace) != (t->parent == father));
			if (likely(!t->ptrace))
				t->parent = t->real_parent;
			if (t->pdeath_signal)
				group_send_sig_info(t->pdeath_signal,
						    SEND_SIG_NOINFO, t,
						    PIDTYPE_TGID);
		}
		/*
		 * If this is a threaded reparent there is no need to
		 * notify anyone anything has happened.
		 */
		if (!same_thread_group(reaper, father))
			reparent_leader(father, p, dead);
	}
	list_splice_tail_init(&father->children, &reaper->children);
}

/*
 * Send signals to all our closest relatives so that they know
 * to properly mourn us..
 */
static void exit_notify(struct task_struct *tsk, int group_dead)
{
	bool autoreap;
	struct task_struct *p, *n;
	LIST_HEAD(dead);

	write_lock_irq(&tasklist_lock);
	forget_original_parent(tsk, &dead);

	if (group_dead)
		kill_orphaned_pgrp(tsk->group_leader, NULL);

	if (unlikely(tsk->ptrace)) {
		int sig = thread_group_leader(tsk) &&
				thread_group_empty(tsk) &&
				!ptrace_reparented(tsk) ?
			tsk->exit_signal : SIGCHLD;
		autoreap = do_notify_parent(tsk, sig);
	} else if (thread_group_leader(tsk)) {
		autoreap = thread_group_empty(tsk) &&
			do_notify_parent(tsk, tsk->exit_signal);
	} else {
		autoreap = true;
	}

	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
	if (tsk->exit_state == EXIT_DEAD)
		list_add(&tsk->ptrace_entry, &dead);

	/* mt-exec, de_thread() is waiting for group leader */
	if (unlikely(tsk->signal->notify_count < 0))
		wake_up_process(tsk->signal->group_exit_task);
	write_unlock_irq(&tasklist_lock);

	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
		list_del_init(&p->ptrace_entry);
		release_task(p);
	}
}

#ifdef CONFIG_DEBUG_STACK_USAGE
static void check_stack_usage(void)
{
	static DEFINE_SPINLOCK(low_water_lock);
	static int lowest_to_date = THREAD_SIZE;
	unsigned long free;

	free = stack_not_used(current);

	if (free >= lowest_to_date)
		return;

	spin_lock(&low_water_lock);
	if (free < lowest_to_date) {
		pr_info("%s (%d) used greatest stack depth: %lu bytes left\n",
			current->comm, task_pid_nr(current), free);
		lowest_to_date = free;
	}
	spin_unlock(&low_water_lock);
}
#else
static inline void check_stack_usage(void) {}
#endif

void __noreturn do_exit(long code)
{
	struct task_struct *tsk = current;
	int group_dead;

	profile_task_exit(tsk);
	kcov_task_exit(tsk);

	WARN_ON(blk_needs_flush_plug(tsk));

	if (unlikely(in_interrupt()))
		panic("Aiee, killing interrupt handler!");
	if (unlikely(!tsk->pid))
		panic("Attempted to kill the idle task!");

	/*
	 * If do_exit is called because this processes oopsed, it's possible
	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
	 * continuing. Amongst other possible reasons, this is to prevent
	 * mm_release()->clear_child_tid() from writing to a user-controlled
	 * kernel address.
	 */
	set_fs(USER_DS);

	ptrace_event(PTRACE_EVENT_EXIT, code);

	validate_creds_for_do_exit(tsk);

	/*
	 * We're taking recursive faults here in do_exit. Safest is to just
	 * leave this task alone and wait for reboot.
	 */
	if (unlikely(tsk->flags & PF_EXITING)) {
		pr_alert("Fixing recursive fault but reboot is needed!\n");
		/*
		 * We can do this unlocked here. The futex code uses
		 * this flag just to verify whether the pi state
		 * cleanup has been done or not. In the worst case it
		 * loops once more. We pretend that the cleanup was
		 * done as there is no way to return. Either the
		 * OWNER_DIED bit is set by now or we push the blocked
		 * task into the wait for ever nirwana as well.
		 */
		tsk->flags |= PF_EXITPIDONE;
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule();
	}

	exit_signals(tsk);  /* sets PF_EXITING */
	/*
	 * Ensure that all new tsk->pi_lock acquisitions must observe
	 * PF_EXITING. Serializes against futex.c:attach_to_pi_owner().
	 */
	smp_mb();
	/*
	 * Ensure that we must observe the pi_state in exit_mm() ->
	 * mm_release() -> exit_pi_state_list().
	 */
	raw_spin_lock_irq(&tsk->pi_lock);
	raw_spin_unlock_irq(&tsk->pi_lock);

	if (unlikely(in_atomic())) {
		pr_info("note: %s[%d] exited with preempt_count %d\n",
			current->comm, task_pid_nr(current),
			preempt_count());
		preempt_count_set(PREEMPT_ENABLED);
	}

	/* sync mm's RSS info before statistics gathering */
	if (tsk->mm)
		sync_mm_rss(tsk->mm);
	acct_update_integrals(tsk);
	group_dead = atomic_dec_and_test(&tsk->signal->live);
	if (group_dead) {
#ifdef CONFIG_POSIX_TIMERS
		hrtimer_cancel(&tsk->signal->real_timer);
		exit_itimers(tsk->signal);
#endif
		if (tsk->mm)
			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
	}
	acct_collect(code, group_dead);
	if (group_dead)
		tty_audit_exit();
	audit_free(tsk);

	tsk->exit_code = code;
	taskstats_exit(tsk, group_dead);

	exit_mm();

	if (group_dead)
		acct_process();
	trace_sched_process_exit(tsk);

	exit_sem(tsk);
	exit_shm(tsk);
	exit_files(tsk);
	exit_fs(tsk);
	if (group_dead)
		disassociate_ctty(1);
	exit_task_namespaces(tsk);
	exit_task_work(tsk);
	exit_thread(tsk);
	exit_umh(tsk);

	/*
	 * Flush inherited counters to the parent - before the parent
	 * gets woken up by child-exit notifications.
	 *
	 * because of cgroup mode, must be called before cgroup_exit()
	 */
	perf_event_exit_task(tsk);

	sched_autogroup_exit_task(tsk);
	cgroup_exit(tsk);

	/*
	 * FIXME: do that only when needed, using sched_exit tracepoint
	 */
	flush_ptrace_hw_breakpoint(tsk);

	exit_tasks_rcu_start();
	exit_notify(tsk, group_dead);
	proc_exit_connector(tsk);
	mpol_put_task_policy(tsk);
#ifdef CONFIG_FUTEX
	if (unlikely(current->pi_state_cache))
		kfree(current->pi_state_cache);
#endif
	/*
	 * Make sure we are holding no locks:
	 */
	debug_check_no_locks_held();
	/*
	 * We can do this unlocked here. The futex code uses this flag
	 * just to verify whether the pi state cleanup has been done
	 * or not. In the worst case it loops once more.
	 */
	tsk->flags |= PF_EXITPIDONE;

	if (tsk->io_context)
		exit_io_context(tsk);

	if (tsk->splice_pipe)
		free_pipe_info(tsk->splice_pipe);

	if (tsk->task_frag.page)
		put_page(tsk->task_frag.page);

	validate_creds_for_do_exit(tsk);

	check_stack_usage();
	preempt_disable();
	if (tsk->nr_dirtied)
		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
	exit_rcu();
	exit_tasks_rcu_finish();

	lockdep_free_task(tsk);
	do_task_dead();
}
EXPORT_SYMBOL_GPL(do_exit);

void complete_and_exit(struct completion *comp, long code)
{
	if (comp)
		complete(comp);

	do_exit(code);
}
EXPORT_SYMBOL(complete_and_exit);

SYSCALL_DEFINE1(exit, int, error_code)
{
	do_exit((error_code&0xff)<<8);
}

/*
 * Take down every thread in the group. This is called by fatal signals
 * as well as by sys_exit_group (below).
 */
void
do_group_exit(int exit_code)
{
	struct signal_struct *sig = current->signal;

	BUG_ON(exit_code & 0x80); /* core dumps don't get here */

	if (signal_group_exit(sig))
		exit_code = sig->group_exit_code;
	else if (!thread_group_empty(current)) {
		struct sighand_struct *const sighand = current->sighand;

		spin_lock_irq(&sighand->siglock);
		if (signal_group_exit(sig))
			/* Another thread got here before we took the lock. */
			exit_code = sig->group_exit_code;
		else {
			sig->group_exit_code = exit_code;
			sig->flags = SIGNAL_GROUP_EXIT;
			zap_other_threads(current);
		}
		spin_unlock_irq(&sighand->siglock);
	}

	do_exit(exit_code);
	/* NOTREACHED */
}

/*
 * this kills every thread in the thread group. Note that any externally
 * wait4()-ing process will get the correct exit code - even if this
 * thread is not the thread group leader.
 */
SYSCALL_DEFINE1(exit_group, int, error_code)
{
	do_group_exit((error_code & 0xff) << 8);
	/* NOTREACHED */
	return 0;
}

struct waitid_info {
	pid_t pid;
	uid_t uid;
	int status;
	int cause;
};

struct wait_opts {
	enum pid_type		wo_type;
	int			wo_flags;
	struct pid		*wo_pid;

	struct waitid_info	*wo_info;
	int			wo_stat;
	struct rusage		*wo_rusage;

	wait_queue_entry_t	child_wait;
	int			notask_error;
};

static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
{
	return	wo->wo_type == PIDTYPE_MAX ||
		task_pid_type(p, wo->wo_type) == wo->wo_pid;
}

static int
eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
{
	if (!eligible_pid(wo, p))
		return 0;

	/*
	 * Wait for all children (clone and not) if __WALL is set or
	 * if it is traced by us.
	 */
	if (ptrace || (wo->wo_flags & __WALL))
		return 1;

	/*
	 * Otherwise, wait for clone children *only* if __WCLONE is set;
	 * otherwise, wait for non-clone children *only*.
	 *
	 * Note: a "clone" child here is one that reports to its parent
	 * using a signal other than SIGCHLD, or a non-leader thread which
	 * we can only see if it is traced by us.
	 */
	if ((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE))
		return 0;

	return 1;
}

/*
 * Handle sys_wait4 work for one task in state EXIT_ZOMBIE. We hold
 * read_lock(&tasklist_lock) on entry. If we return zero, we still hold
 * the lock and this task is uninteresting. If we return nonzero, we have
 * released the lock and the system call should return.
 */
static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
{
	int state, status;
	pid_t pid = task_pid_vnr(p);
	uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p));
	struct waitid_info *infop;

	if (!likely(wo->wo_flags & WEXITED))
		return 0;

	if (unlikely(wo->wo_flags & WNOWAIT)) {
		status = p->exit_code;
		get_task_struct(p);
		read_unlock(&tasklist_lock);
		sched_annotate_sleep();
		if (wo->wo_rusage)
			getrusage(p, RUSAGE_BOTH, wo->wo_rusage);
		put_task_struct(p);
		goto out_info;
	}
	/*
	 * Move the task's state to DEAD/TRACE, only one thread can do this.
	 */
	state = (ptrace_reparented(p) && thread_group_leader(p)) ?
		EXIT_TRACE : EXIT_DEAD;
	if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
		return 0;
	/*
	 * We own this thread, nobody else can reap it.
	 */
	read_unlock(&tasklist_lock);
	sched_annotate_sleep();

	/*
	 * Check thread_group_leader() to exclude the traced sub-threads.
	 */
	if (state == EXIT_DEAD && thread_group_leader(p)) {
		struct signal_struct *sig = p->signal;
		struct signal_struct *psig = current->signal;
		unsigned long maxrss;
		u64 tgutime, tgstime;

		/*
		 * The resource counters for the group leader are in its
		 * own task_struct. Those for dead threads in the group
		 * are in its signal_struct, as are those for the child
		 * processes it has previously reaped. All these
		 * accumulate in the parent's signal_struct c* fields.
		 *
		 * We don't bother to take a lock here to protect these
		 * p->signal fields because the whole thread group is dead
		 * and nobody can change them.
		 *
		 * psig->stats_lock also protects us from our sub-theads
		 * which can reap other children at the same time. Until
		 * we change k_getrusage()-like users to rely on this lock
		 * we have to take ->siglock as well.
		 *
		 * We use thread_group_cputime_adjusted() to get times for
		 * the thread group, which consolidates times for all threads
		 * in the group including the group leader.
		 */
		thread_group_cputime_adjusted(p, &tgutime, &tgstime);
		spin_lock_irq(&current->sighand->siglock);
		write_seqlock(&psig->stats_lock);
		psig->cutime += tgutime + sig->cutime;
		psig->cstime += tgstime + sig->cstime;
		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
		psig->cmin_flt +=
			p->min_flt + sig->min_flt + sig->cmin_flt;
		psig->cmaj_flt +=
			p->maj_flt + sig->maj_flt + sig->cmaj_flt;
		psig->cnvcsw +=
			p->nvcsw + sig->nvcsw + sig->cnvcsw;
		psig->cnivcsw +=
			p->nivcsw + sig->nivcsw + sig->cnivcsw;
		psig->cinblock +=
			task_io_get_inblock(p) +
			sig->inblock + sig->cinblock;
		psig->coublock +=
			task_io_get_oublock(p) +
			sig->oublock + sig->coublock;
		maxrss = max(sig->maxrss, sig->cmaxrss);
		if (psig->cmaxrss < maxrss)
			psig->cmaxrss = maxrss;
		task_io_accounting_add(&psig->ioac, &p->ioac);
		task_io_accounting_add(&psig->ioac, &sig->ioac);
		write_sequnlock(&psig->stats_lock);
		spin_unlock_irq(&current->sighand->siglock);
	}

	if (wo->wo_rusage)
		getrusage(p, RUSAGE_BOTH, wo->wo_rusage);
	status = (p->signal->flags & SIGNAL_GROUP_EXIT)
		? p->signal->group_exit_code : p->exit_code;
	wo->wo_stat = status;

	if (state == EXIT_TRACE) {
		write_lock_irq(&tasklist_lock);
		/* We dropped tasklist, ptracer could die and untrace */
		ptrace_unlink(p);

		/* If parent wants a zombie, don't release it now */
		state = EXIT_ZOMBIE;
		if (do_notify_parent(p, p->exit_signal))
			state = EXIT_DEAD;
		p->exit_state = state;
		write_unlock_irq(&tasklist_lock);
	}
	if (state == EXIT_DEAD)
		release_task(p);

out_info:
	infop = wo->wo_info;
	if (infop) {
		if ((status & 0x7f) == 0) {
			infop->cause = CLD_EXITED;
			infop->status = status >> 8;
		} else {
			infop->cause = (status & 0x80) ? CLD_DUMPED : CLD_KILLED;
			infop->status = status & 0x7f;
		}
		infop->pid = pid;
		infop->uid = uid;
	}

	return pid;
}

static int *task_stopped_code(struct task_struct *p, bool ptrace)
{
	if (ptrace) {
		if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING))
			return &p->exit_code;
	} else {
		if (p->signal->flags & SIGNAL_STOP_STOPPED)
			return &p->signal->group_exit_code;
	}
	return NULL;
}

/**
 * wait_task_stopped - Wait for %TASK_STOPPED or %TASK_TRACED
 * @wo: wait options
 * @ptrace: is the wait for ptrace
 * @p: task to wait for
 *
 * Handle sys_wait4() work for %p in state %TASK_STOPPED or %TASK_TRACED.
 *
 * CONTEXT:
 * read_lock(&tasklist_lock), which is released if return value is
 * non-zero. Also, grabs and releases @p->sighand->siglock.
 *
 * RETURNS:
 * 0 if wait condition didn't exist and search for other wait conditions
 * should continue. Non-zero return, -errno on failure and @p's pid on
 * success, implies that tasklist_lock is released and wait condition
 * search should terminate.
 */
static int wait_task_stopped(struct wait_opts *wo,
				int ptrace, struct task_struct *p)
{
	struct waitid_info *infop;
	int exit_code, *p_code, why;
	uid_t uid = 0; /* unneeded, required by compiler */
	pid_t pid;

	/*
	 * Traditionally we see ptrace'd stopped tasks regardless of options.
	 */
	if (!ptrace && !(wo->wo_flags & WUNTRACED))
		return 0;

	if (!task_stopped_code(p, ptrace))
		return 0;

	exit_code = 0;
	spin_lock_irq(&p->sighand->siglock);

	p_code = task_stopped_code(p, ptrace);
	if (unlikely(!p_code))
		goto unlock_sig;

	exit_code = *p_code;
	if (!exit_code)
		goto unlock_sig;

	if (!unlikely(wo->wo_flags & WNOWAIT))
		*p_code = 0;

	uid = from_kuid_munged(current_user_ns(), task_uid(p));
unlock_sig:
	spin_unlock_irq(&p->sighand->siglock);
	if (!exit_code)
		return 0;

	/*
	 * Now we are pretty sure this task is interesting.
	 * Make sure it doesn't get reaped out from under us while we
	 * give up the lock and then examine it below. We don't want to
	 * keep holding onto the tasklist_lock while we call getrusage and
	 * possibly take page faults for user memory.
	 */
	get_task_struct(p);
	pid = task_pid_vnr(p);
	why = ptrace ? CLD_TRAPPED : CLD_STOPPED;
	read_unlock(&tasklist_lock);
	sched_annotate_sleep();
	if (wo->wo_rusage)
		getrusage(p, RUSAGE_BOTH, wo->wo_rusage);
	put_task_struct(p);

	if (likely(!(wo->wo_flags & WNOWAIT)))
		wo->wo_stat = (exit_code << 8) | 0x7f;

	infop = wo->wo_info;
	if (infop) {
		infop->cause = why;
		infop->status = exit_code;
		infop->pid = pid;
		infop->uid = uid;
	}
	return pid;
}

/*
 * Handle do_wait work for one task in a live, non-stopped state.
 * read_lock(&tasklist_lock) on entry. If we return zero, we still hold
 * the lock and this task is uninteresting. If we return nonzero, we have
 * released the lock and the system call should return.
 */
static int wait_task_continued(struct wait_opts *wo, struct task_struct *p)
{
	struct waitid_info *infop;
	pid_t pid;
	uid_t uid;

	if (!unlikely(wo->wo_flags & WCONTINUED))
		return 0;

	if (!(p->signal->flags & SIGNAL_STOP_CONTINUED))
		return 0;

	spin_lock_irq(&p->sighand->siglock);
	/* Re-check with the lock held. */
	if (!(p->signal->flags & SIGNAL_STOP_CONTINUED)) {
		spin_unlock_irq(&p->sighand->siglock);
		return 0;
	}
	if (!unlikely(wo->wo_flags & WNOWAIT))
		p->signal->flags &= ~SIGNAL_STOP_CONTINUED;
	uid = from_kuid_munged(current_user_ns(), task_uid(p));
	spin_unlock_irq(&p->sighand->siglock);

	pid = task_pid_vnr(p);
	get_task_struct(p);
	read_unlock(&tasklist_lock);
	sched_annotate_sleep();
	if (wo->wo_rusage)
		getrusage(p, RUSAGE_BOTH, wo->wo_rusage);
	put_task_struct(p);

	infop = wo->wo_info;
	if (!infop) {
		wo->wo_stat = 0xffff;
	} else {
		infop->cause = CLD_CONTINUED;
		infop->pid = pid;
		infop->uid = uid;
		infop->status = SIGCONT;
	}
	return pid;
}

/*
 * Consider @p for a wait by @parent.
 *
 * -ECHILD should be in ->notask_error before the first call.
 * Returns nonzero for a final return, when we have unlocked tasklist_lock.
 * Returns zero if the search for a child should continue;
 * then ->notask_error is 0 if @p is an eligible child,
 * or still -ECHILD.
 */
static int wait_consider_task(struct wait_opts *wo, int ptrace,
				struct task_struct *p)
{
	/*
	 * We can race with wait_task_zombie() from another thread.
	 * Ensure that EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE transition
	 * can't confuse the checks below.
	 */
	int exit_state = READ_ONCE(p->exit_state);
	int ret;

	if (unlikely(exit_state == EXIT_DEAD))
		return 0;

	ret = eligible_child(wo, ptrace, p);
	if (!ret)
		return ret;

	if (unlikely(exit_state == EXIT_TRACE)) {
		/*
		 * ptrace == 0 means we are the natural parent. In this case
		 * we should clear notask_error, debugger will notify us.
		 */
		if (likely(!ptrace))
			wo->notask_error = 0;
		return 0;
	}

	if (likely(!ptrace) && unlikely(p->ptrace)) {
		/*
		 * If it is traced by its real parent's group, just pretend
		 * the caller is ptrace_do_wait() and reap this child if it
		 * is zombie.
		 *
		 * This also hides group stop state from real parent; otherwise
		 * a single stop can be reported twice as group and ptrace stop.
		 * If a ptracer wants to distinguish these two events for its
		 * own children it should create a separate process which takes
		 * the role of real parent.
		 */
		if (!ptrace_reparented(p))
			ptrace = 1;
	}

	/* slay zombie? */
	if (exit_state == EXIT_ZOMBIE) {
		/* we don't reap group leaders with subthreads */
		if (!delay_group_leader(p)) {
			/*
			 * A zombie ptracee is only visible to its ptracer.
			 * Notification and reaping will be cascaded to the
			 * real parent when the ptracer detaches.
			 */
			if (unlikely(ptrace) || likely(!p->ptrace))
				return wait_task_zombie(wo, p);
		}

		/*
		 * Allow access to stopped/continued state via zombie by
		 * falling through. Clearing of notask_error is complex.
		 *
		 * When !@ptrace:
		 *
		 * If WEXITED is set, notask_error should naturally be
		 * cleared. If not, subset of WSTOPPED|WCONTINUED is set,
		 * so, if there are live subthreads, there are events to
		 * wait for. If all subthreads are dead, it's still safe
		 * to clear - this function will be called again in finite
		 * amount time once all the subthreads are released and
		 * will then return without clearing.
		 *
		 * When @ptrace:
		 *
		 * Stopped state is per-task and thus can't change once the
		 * target task dies. Only continued and exited can happen.
		 * Clear notask_error if WCONTINUED | WEXITED.
1415 | */ 1416 | if (likely(!ptrace) || (wo->wo_flags & (WCONTINUED | WEXITED))) 1417 | wo->notask_error = 0; 1418 | } else { 1419 | /* 1420 | * @p is alive and it's gonna stop, continue or exit, so 1421 | * there always is something to wait for. 1422 | */ 1423 | wo->notask_error = 0; 1424 | } 1425 | 1426 | /* 1427 | * Wait for stopped. Depending on @ptrace, different stopped state 1428 | * is used and the two don't interact with each other. 1429 | */ 1430 | ret = wait_task_stopped(wo, ptrace, p); 1431 | if (ret) 1432 | return ret; 1433 | 1434 | /* 1435 | * Wait for continued. There's only one continued state and the 1436 | * ptracer can consume it which can confuse the real parent. Don't 1437 | * use WCONTINUED from ptracer. You don't need or want it. 1438 | */ 1439 | return wait_task_continued(wo, p); 1440 | } 1441 | 1442 | /* 1443 | * Do the work of do_wait() for one thread in the group, @tsk. 1444 | * 1445 | * -ECHILD should be in ->notask_error before the first call. 1446 | * Returns nonzero for a final return, when we have unlocked tasklist_lock. 1447 | * Returns zero if the search for a child should continue; then 1448 | * ->notask_error is 0 if there were any eligible children, 1449 | * or still -ECHILD. 
1450 | */ 1451 | static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk) 1452 | { 1453 | struct task_struct *p; 1454 | 1455 | list_for_each_entry(p, &tsk->children, sibling) { 1456 | int ret = wait_consider_task(wo, 0, p); 1457 | 1458 | if (ret) 1459 | return ret; 1460 | } 1461 | 1462 | return 0; 1463 | } 1464 | 1465 | static int ptrace_do_wait(struct wait_opts *wo, struct task_struct *tsk) 1466 | { 1467 | struct task_struct *p; 1468 | 1469 | list_for_each_entry(p, &tsk->ptraced, ptrace_entry) { 1470 | int ret = wait_consider_task(wo, 1, p); 1471 | 1472 | if (ret) 1473 | return ret; 1474 | } 1475 | 1476 | return 0; 1477 | } 1478 | 1479 | static int child_wait_callback(wait_queue_entry_t *wait, unsigned mode, 1480 | int sync, void *key) 1481 | { 1482 | struct wait_opts *wo = container_of(wait, struct wait_opts, 1483 | child_wait); 1484 | struct task_struct *p = key; 1485 | 1486 | if (!eligible_pid(wo, p)) 1487 | return 0; 1488 | 1489 | if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent) 1490 | return 0; 1491 | 1492 | return default_wake_function(wait, mode, sync, key); 1493 | } 1494 | 1495 | void __wake_up_parent(struct task_struct *p, struct task_struct *parent) 1496 | { 1497 | __wake_up_sync_key(&parent->signal->wait_chldexit, 1498 | TASK_INTERRUPTIBLE, 1, p); 1499 | } 1500 | 1501 | static long do_wait(struct wait_opts *wo) 1502 | { 1503 | struct task_struct *tsk; 1504 | int retval; 1505 | 1506 | trace_sched_process_wait(wo->wo_pid); 1507 | 1508 | init_waitqueue_func_entry(&wo->child_wait, child_wait_callback); 1509 | wo->child_wait.private = current; 1510 | add_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait); 1511 | repeat: 1512 | /* 1513 | * If there is nothing that can match our criteria, just get out. 1514 | * We will clear ->notask_error to zero if we see any child that 1515 | * might later match our criteria, even if we are not able to reap 1516 | * it yet. 
1517 | */ 1518 | wo->notask_error = -ECHILD; 1519 | if ((wo->wo_type < PIDTYPE_MAX) && 1520 | (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type]))) 1521 | goto notask; 1522 | 1523 | set_current_state(TASK_INTERRUPTIBLE); 1524 | read_lock(&tasklist_lock); 1525 | tsk = current; 1526 | do { 1527 | retval = do_wait_thread(wo, tsk); 1528 | if (retval) 1529 | goto end; 1530 | 1531 | retval = ptrace_do_wait(wo, tsk); 1532 | if (retval) 1533 | goto end; 1534 | 1535 | if (wo->wo_flags & __WNOTHREAD) 1536 | break; 1537 | } while_each_thread(current, tsk); 1538 | read_unlock(&tasklist_lock); 1539 | 1540 | notask: 1541 | retval = wo->notask_error; 1542 | if (!retval && !(wo->wo_flags & WNOHANG)) { 1543 | retval = -ERESTARTSYS; 1544 | if (!signal_pending(current)) { 1545 | schedule(); 1546 | goto repeat; 1547 | } 1548 | } 1549 | end: 1550 | __set_current_state(TASK_RUNNING); 1551 | remove_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait); 1552 | return retval; 1553 | } 1554 | 1555 | static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop, 1556 | int options, struct rusage *ru) 1557 | { 1558 | struct wait_opts wo; 1559 | struct pid *pid = NULL; 1560 | enum pid_type type; 1561 | long ret; 1562 | 1563 | if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED| 1564 | __WNOTHREAD|__WCLONE|__WALL)) 1565 | return -EINVAL; 1566 | if (!(options & (WEXITED|WSTOPPED|WCONTINUED))) 1567 | return -EINVAL; 1568 | 1569 | switch (which) { 1570 | case P_ALL: 1571 | type = PIDTYPE_MAX; 1572 | break; 1573 | case P_PID: 1574 | type = PIDTYPE_PID; 1575 | if (upid <= 0) 1576 | return -EINVAL; 1577 | break; 1578 | case P_PGID: 1579 | type = PIDTYPE_PGID; 1580 | if (upid <= 0) 1581 | return -EINVAL; 1582 | break; 1583 | default: 1584 | return -EINVAL; 1585 | } 1586 | 1587 | if (type < PIDTYPE_MAX) 1588 | pid = find_get_pid(upid); 1589 | 1590 | wo.wo_type = type; 1591 | wo.wo_pid = pid; 1592 | wo.wo_flags = options; 1593 | wo.wo_info = infop; 1594 | wo.wo_rusage = 
ru; 1595 | ret = do_wait(&wo); 1596 | 1597 | put_pid(pid); 1598 | return ret; 1599 | } 1600 | 1601 | SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *, 1602 | infop, int, options, struct rusage __user *, ru) 1603 | { 1604 | struct rusage r; 1605 | struct waitid_info info = {.status = 0}; 1606 | long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL); 1607 | int signo = 0; 1608 | 1609 | if (err > 0) { 1610 | signo = SIGCHLD; 1611 | err = 0; 1612 | if (ru && copy_to_user(ru, &r, sizeof(struct rusage))) 1613 | return -EFAULT; 1614 | } 1615 | if (!infop) 1616 | return err; 1617 | 1618 | if (!user_access_begin(infop, sizeof(*infop))) 1619 | return -EFAULT; 1620 | 1621 | unsafe_put_user(signo, &infop->si_signo, Efault); 1622 | unsafe_put_user(0, &infop->si_errno, Efault); 1623 | unsafe_put_user(info.cause, &infop->si_code, Efault); 1624 | unsafe_put_user(info.pid, &infop->si_pid, Efault); 1625 | unsafe_put_user(info.uid, &infop->si_uid, Efault); 1626 | unsafe_put_user(info.status, &infop->si_status, Efault); 1627 | user_access_end(); 1628 | return err; 1629 | Efault: 1630 | user_access_end(); 1631 | return -EFAULT; 1632 | } 1633 | 1634 | long kernel_wait4(pid_t upid, int __user *stat_addr, int options, 1635 | struct rusage *ru) 1636 | { 1637 | struct wait_opts wo; 1638 | struct pid *pid = NULL; 1639 | enum pid_type type; 1640 | long ret; 1641 | 1642 | if (options & ~(WNOHANG|WUNTRACED|WCONTINUED| 1643 | __WNOTHREAD|__WCLONE|__WALL)) 1644 | return -EINVAL; 1645 | 1646 | /* -INT_MIN is not defined */ 1647 | if (upid == INT_MIN) 1648 | return -ESRCH; 1649 | 1650 | if (upid == -1) 1651 | type = PIDTYPE_MAX; 1652 | else if (upid < 0) { 1653 | type = PIDTYPE_PGID; 1654 | pid = find_get_pid(-upid); 1655 | } else if (upid == 0) { 1656 | type = PIDTYPE_PGID; 1657 | pid = get_task_pid(current, PIDTYPE_PGID); 1658 | } else /* upid > 0 */ { 1659 | type = PIDTYPE_PID; 1660 | pid = find_get_pid(upid); 1661 | } 1662 | 1663 | wo.wo_type = type; 1664 
| wo.wo_pid = pid; 1665 | wo.wo_flags = options | WEXITED; 1666 | wo.wo_info = NULL; 1667 | wo.wo_stat = 0; 1668 | wo.wo_rusage = ru; 1669 | ret = do_wait(&wo); 1670 | put_pid(pid); 1671 | if (ret > 0 && stat_addr && put_user(wo.wo_stat, stat_addr)) 1672 | ret = -EFAULT; 1673 | 1674 | return ret; 1675 | } 1676 | 1677 | SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr, 1678 | int, options, struct rusage __user *, ru) 1679 | { 1680 | struct rusage r; 1681 | long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL); 1682 | 1683 | if (err > 0) { 1684 | if (ru && copy_to_user(ru, &r, sizeof(struct rusage))) 1685 | return -EFAULT; 1686 | } 1687 | return err; 1688 | } 1689 | 1690 | #ifdef __ARCH_WANT_SYS_WAITPID 1691 | 1692 | /* 1693 | * sys_waitpid() remains for compatibility. waitpid() should be 1694 | * implemented by calling sys_wait4() from libc.a. 1695 | */ 1696 | SYSCALL_DEFINE3(waitpid, pid_t, pid, int __user *, stat_addr, int, options) 1697 | { 1698 | return kernel_wait4(pid, stat_addr, options, NULL); 1699 | } 1700 | 1701 | #endif 1702 | 1703 | #ifdef CONFIG_COMPAT 1704 | COMPAT_SYSCALL_DEFINE4(wait4, 1705 | compat_pid_t, pid, 1706 | compat_uint_t __user *, stat_addr, 1707 | int, options, 1708 | struct compat_rusage __user *, ru) 1709 | { 1710 | struct rusage r; 1711 | long err = kernel_wait4(pid, stat_addr, options, ru ? &r : NULL); 1712 | if (err > 0) { 1713 | if (ru && put_compat_rusage(&r, ru)) 1714 | return -EFAULT; 1715 | } 1716 | return err; 1717 | } 1718 | 1719 | COMPAT_SYSCALL_DEFINE5(waitid, 1720 | int, which, compat_pid_t, pid, 1721 | struct compat_siginfo __user *, infop, int, options, 1722 | struct compat_rusage __user *, uru) 1723 | { 1724 | struct rusage ru; 1725 | struct waitid_info info = {.status = 0}; 1726 | long err = kernel_waitid(which, pid, &info, options, uru ? 
&ru : NULL); 1727 | int signo = 0; 1728 | if (err > 0) { 1729 | signo = SIGCHLD; 1730 | err = 0; 1731 | if (uru) { 1732 | /* kernel_waitid() overwrites everything in ru */ 1733 | if (COMPAT_USE_64BIT_TIME) 1734 | err = copy_to_user(uru, &ru, sizeof(ru)); 1735 | else 1736 | err = put_compat_rusage(&ru, uru); 1737 | if (err) 1738 | return -EFAULT; 1739 | } 1740 | } 1741 | 1742 | if (!infop) 1743 | return err; 1744 | 1745 | if (!user_access_begin(infop, sizeof(*infop))) 1746 | return -EFAULT; 1747 | 1748 | unsafe_put_user(signo, &infop->si_signo, Efault); 1749 | unsafe_put_user(0, &infop->si_errno, Efault); 1750 | unsafe_put_user(info.cause, &infop->si_code, Efault); 1751 | unsafe_put_user(info.pid, &infop->si_pid, Efault); 1752 | unsafe_put_user(info.uid, &infop->si_uid, Efault); 1753 | unsafe_put_user(info.status, &infop->si_status, Efault); 1754 | user_access_end(); 1755 | return err; 1756 | Efault: 1757 | user_access_end(); 1758 | return -EFAULT; 1759 | } 1760 | #endif 1761 | 1762 | __weak void abort(void) 1763 | { 1764 | BUG(); 1765 | 1766 | /* if that doesn't kill us, halt */ 1767 | panic("Oops failed to kill thread"); 1768 | } 1769 | EXPORT_SYMBOL(abort); 1770 | -------------------------------------------------------------------------------- /Memory/README.md: -------------------------------------------------------------------------------- 1 | # Video materials: 2 | [![alt text](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Memory/wall/Memory.png)](https://www.youtube.com/playlist?list=PLVEEYpRafmwEwdHqvk-81VkrFq8ZLkXQ7 "Linux-Memory") 3 | -------------------------------------------------------------------------------- /Memory/wall/Memory.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Memory/wall/Memory.png 
-------------------------------------------------------------------------------- /PTE/README.MD: -------------------------------------------------------------------------------- 1 | ## Page Table Entries in Page Table 2 | 3 | Prerequisite – Paging 4 | 5 | A page table consists of page table entries, and each entry stores a frame number plus optional status bits (such as protection bits). Many of these status bits are used by the virtual memory system. The most important field in a PTE is the frame number. 6 | 7 | A page table entry holds the following information – 8 | 9 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/PTE/screen/Capture-24.png) 10 | 11 | ## Frame Number – 12 | It gives the number of the frame in which the requested page resides. The number of bits required depends on the number of frames. The frame field is also known as the address translation field. 13 | ``` 14 | Number of bits for frame number = log2(Size of physical memory / Frame size) 15 | ``` 16 | ## Present/Absent bit – 17 | The present/absent bit says whether the page you are looking for is currently in memory. If it is not present, a page fault occurs. The bit is set to 0 when the corresponding page is not in memory; the operating system uses it to handle page faults and support virtual memory. This bit is also known as the valid/invalid bit. 18 | 19 | ## Protection bit – 20 | The protection bits specify what kind of access is permitted on the page frame (read, write, and so on). 21 | 22 | ## Referenced bit – 23 | The referenced bit records whether the page has been referenced recently (for example, in the last clock interval). It is set to 1 by hardware when the page is accessed. 24 | 25 | ## Caching enabled/disabled – 26 | Sometimes we need fresh data. Say the user is typing input on the keyboard and your program must respond to that input. In that case, the information comes into main memory.
Therefore main memory holds the latest information typed by the user. If that page were cached, the cache could still show stale data. So whenever freshness is required, we do not want to cache the page or go through multiple levels of the memory hierarchy: the data in the level closest to the CPU and the data in the level closest to the user might differ. We want the two to stay consistent, meaning the CPU should see the user's input as soon as possible; that is why caching is disabled in such cases. This bit, then, enables or disables caching of the page. 27 | 28 | ## Modified bit – 29 | The modified bit records whether the page has been written to. If a page has been modified, then before it can be replaced by another page its contents must be written back (saved) to the disk. The bit is set to 1 by hardware on a write access to the page, which lets the system skip the write-back when an unmodified page is swapped out. This modified bit is also called the 30 | ## Dirty bit. 31 | 32 | ## GATE CS Corner Questions 33 | 34 | Practicing the following questions will help you test your knowledge. All of them have been asked in GATE in previous years or in GATE mock tests, so it is highly recommended that you practice them. 35 | 36 | 37 | 1. GATE CS 2001, Question 46 38 | 39 | 2. GATE CS 2004, Question 66 40 | 41 | 3.
GATE CS 2015 (Set 1), Question 65 42 | 43 | -------------------------------------------------------------------------------- /PTE/screen/Capture-24.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/PTE/screen/Capture-24.png -------------------------------------------------------------------------------- /PTE/screen/screen: -------------------------------------------------------------------------------- 1 | screen 2 | -------------------------------------------------------------------------------- /Paging/README.MD: -------------------------------------------------------------------------------- 1 | ## Paging in Operating System 2 | 3 | Paging is a memory management scheme that eliminates the need for contiguous allocation of physical memory. This scheme permits the physical address space of a process to be non-contiguous. 4 | 5 | - Logical Address or Virtual Address (represented in bits): an address generated by the CPU 6 | 7 | - Logical Address Space or Virtual Address Space (represented in words or bytes): the set of all logical addresses generated by a program 8 | 9 | - Physical Address (represented in bits): an address actually available in the memory unit 10 | 11 | - Physical Address Space (represented in words or bytes): the set of all physical addresses corresponding to the logical addresses 12 | 13 | ## Example: 14 | 15 | - If Logical Address = 31 bits, then Logical Address Space = 2^31 words = 2 G words (1 G = 2^30) 16 | 17 | - If Logical Address Space = 128 M words = 2^7 * 2^20 words, then Logical Address = log2(2^27) = 27 bits 18 | 19 | - If Physical Address = 22 bits, then Physical Address Space = 2^22 words = 4 M words (1 M = 2^20) 20 | 21 | - If Physical Address Space = 16 M words = 2^4 * 2^20 words, then Physical Address = log2(2^24) = 24 bits 22 | 23 | The mapping from virtual to physical address is done by the
memory management unit (MMU), a hardware device; this mapping is known as the paging technique. 24 | 25 | The Physical Address Space is conceptually divided into a number of fixed-size blocks, called frames. 26 | The Logical Address Space is likewise split into fixed-size blocks, called pages. 27 | Page Size = Frame Size 28 | 29 | ## Let us consider an example: 30 | 31 | - Physical Address = 12 bits, then Physical Address Space = 4 K words 32 | - Logical Address = 13 bits, then Logical Address Space = 8 K words 33 | - Page size = frame size = 1 K words (assumption) 34 | 35 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Paging/screen/paging.jpg) 36 | 37 | ## Address generated by CPU is divided into 38 | 39 | - Page number (p): the number of bits required to represent the pages in the Logical Address Space, i.e. the page number 40 | - Page offset (d): the number of bits required to represent a particular word in a page, i.e. the word number within a page, or page offset 41 | 42 | ## Physical Address is divided into 43 | 44 | - Frame number (f): the number of bits required to represent the frames in the Physical Address Space, i.e. the frame number 45 | - Frame offset (d): the number of bits required to represent a particular word in a frame, i.e. the word number within a frame, or frame offset 46 | 47 | The page table can be implemented in hardware with dedicated registers, but registers are satisfactory only if the page table is small. If the page table contains a large number of entries, we can use a TLB (translation look-aside buffer), a special, small, fast hardware lookup cache. 48 | 49 | - The TLB is an associative, high-speed memory. 50 | - Each TLB entry consists of two parts: a tag and a value. 51 | - When this memory is searched, an item is compared with all tags simultaneously; if the item is found, the corresponding value is returned.
52 | 53 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Paging/screen/paging-2.jpg) 54 | 55 | Main memory access time = m 56 | If the page table is kept in main memory, 57 | Effective access time = m (to access the page table) + m (to access the required word) = 2m 58 | 59 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Paging/screen/paging-3.jpg) 60 | 61 | -------------------------------------------------------------------------------- /Paging/screen/paging-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Paging/screen/paging-2.jpg -------------------------------------------------------------------------------- /Paging/screen/paging-3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Paging/screen/paging-3.jpg -------------------------------------------------------------------------------- /Paging/screen/paging.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Paging/screen/paging.jpg -------------------------------------------------------------------------------- /Paging/screen/screen: -------------------------------------------------------------------------------- 1 | screen 2 | -------------------------------------------------------------------------------- /Process execution priorities/README.MD: -------------------------------------------------------------------------------- 1 | # Learn Linux: Process execution priorities 2 | 3 | - Keeping your eye on what's going on!!!
4 | 5 | Overview 6 | 7 | This tutorial grounds you in the basic Linux techniques for managing execution process priorities. Learn to: 8 | 9 | - Understand process priorities 10 | - Set process priorities 11 | - Change process priorities 12 | 13 | This tutorial helps you prepare for Objective 103.6 in Topic 103 of the Linux Server Professional (LPIC-1) exam 101. The objective has a weight of 2. 14 | # Linux task priorities 15 | 16 | Linux, like most modern operating systems, can run multiple processes. It does this by sharing the CPU and other resources among the processes. If one process uses 100 percent of the CPU, other processes may become unresponsive. We’ll introduce you to the way Linux assigns priorities for tasks. 17 | 18 | 19 | - Prerequisites 20 | 21 | To get the most from the tutorials in this series, you should have a basic knowledge of Linux and a working Linux system on which you can practice the commands covered in this tutorial. Different versions of a program sometimes format output differently, so your results may not always look exactly like the listings and figures shown here. The results in the examples shown here were obtained on an Ubuntu 15.04 distribution. 22 | 23 | - Knowing your priorities 24 | 25 | If you run the `top` command, by default it displays processes in decreasing order of CPU usage, as shown in Listing 1. A process that spends most of its time idle rather than using the CPU probably won’t make it onto top’s output list at all. 26 | 27 | 28 | # Listing 1. Typical output from top on a Linux workstation 29 | 30 | ```bash 31 | 32 | top ‑ 22:47:44 up 1 day, 12:44, 3 users, load average: 0.00, 0.01, 0.05 33 | Tasks: 188 total, 1 running, 187 sleeping, 0 stopped, 0 zombie 34 | %Cpu(s): 0.2 us, 0.0 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st 35 | KiB Mem: 8090144 total, 2145616 used, 5944528 free, 81880 buffers 36 | KiB Swap: 4095996 total, 100660 used, 3995336 free.
1464920 cached Mem 37 | 38 | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 39 | 2215 ian 20 0 1549252 162644 79992 S 0.7 2.0 16:11.61 compiz 40 | 9 root 20 0 0 0 0 S 0.3 0.0 0:08.04 rcuos/0 41 | 4918 ian 20 0 29184 3120 2612 R 0.3 0.0 0:00.47 top 42 | 1 root 20 0 182732 5392 3648 S 0.0 0.1 0:03.74 systemd 43 | 2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd 44 | 3 root 20 0 0 0 0 S 0.0 0.0 0:00.10 ksoftirqd/0 45 | 5 root 0 ‑20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H 46 | 7 root 20 0 0 0 0 S 0.0 0.0 0:08.42 rcu_sched 47 | 8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh 48 | 10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/0 49 | 11 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0 50 | 12 root rt 0 0 0 0 S 0.0 0.0 0:00.77 watchdog/0 51 | 13 root rt 0 0 0 0 S 0.0 0.0 0:00.78 watchdog/1 52 | 14 root rt 0 0 0 0 S 0.0 0.0 0:00.18 migration/1 53 | 15 root 20 0 0 0 0 S 0.0 0.0 0:00.11 ksoftirqd/1 54 | 17 root 0 ‑20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H 55 | 18 root 20 0 0 0 0 S 0.0 0.0 0:00.46 rcuos/1 56 | ``` 57 | 58 | Your system may have many commands that are capable of using lots of CPU. Examples include movie editing tools and programs that convert between different image types or different sound encodings, such as MP3 to Ogg. 59 | 60 | When you only have one or a limited number of CPUs, you need to decide how to share those limited CPU resources among several competing processes. This is generally done by selecting one process for execution and letting it run for a short period (called a timeslice), or until it needs to wait for some event, such as I/O completion. To ensure that important processes don’t get starved out by CPU hogs, the selection is based on a scheduling priority. The NI column in Listing 1 above shows the scheduling priority, or niceness, of each process. Niceness generally ranges from -20 to 19, with -20 being the most favorable or highest priority for scheduling and 19 being the least favorable or lowest priority.
61 | 62 | 63 | # Using ps to find niceness 64 | 65 | In addition to the top command, you can also display niceness values using the ps command. You can either customize the output as you saw in an earlier tutorial, or simply use ps -l; the output of ps -l is shown in Listing 2. As with top, look for the niceness value in the NI column. 66 | 67 | - Listing 2. Using ps to find niceness 68 | 69 | ```bash 70 | 71 | :~$ ps ‑l 72 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 73 | 0 S 1000 3850 3849 0 80 0 ‑ 6726 wait pts/5 00:00:00 bash 74 | 0 R 1000 4924 3850 0 80 0 ‑ 3561 ‑ pts/5 00:00:00 ps 75 | ``` 76 | 77 | # Default niceness 78 | 79 | You may have guessed from Listing 1 or Listing 2 that the default niceness, at least for processes started by regular users, is 0. This is usually the case on current Linux systems. You can verify the value for your shell and system by running the nice command with no parameters, as shown in Listing 3. 80 | 81 | 82 | - Listing 3. Checking default niceness 83 | 84 | ```bash 85 | :~$ nice 86 | 0 87 | ``` 88 | 89 | # Setting priorities 90 | 91 | Before we look at how to set or change niceness values, let’s build a little CPU-intensive script that will show how niceness really works. 92 | 93 | - A CPU-intensive script 94 | 95 | We’ll create a small script that just uses CPU and does little else. The script takes two inputs, a count and a label. It prints the label and the current date and time, then spins, decrementing the count till it reaches 0, and finally prints the label and the date again. The script, shown in Listing 4, has no error checking and is not very robust, but it illustrates our point. 96 | 97 | 98 | - Listing 4.
CPU-intensive script 99 | ```bash 100 | 101 | :~$ echo 'x="$1"'>count1.sh 102 | :~$ echo 'echo "$2" $(date)'>>count1.sh 103 | :~$ echo 'while [ $x ‑gt 0 ]; do x=$(( x‑1 ));done'>>count1.sh 104 | :~$ echo 'echo "$2" $(date)'>>count1.sh 105 | :~$ cat count1.sh 106 | x="$1" 107 | echo "$2" $(date) 108 | while [ $x ‑gt 0 ]; do x=$(( x‑1 ));done 109 | echo "$2" $(date) 110 | ``` 111 | 112 | If you run this on your own system, you might see output similar to Listing 5. Depending on the speed of your system, you may have to increase the count value to even see a difference in the times. This script uses lots of CPU, as we’ll see in a moment. If your default shell is not Bash, and if the script does not work for you, then use the second form of calling shown below. If you are not using your own workstation, make sure that it is okay to use lots of CPU before you run the script. 113 | 114 | - Listing 5. Running count1.sh 115 | 116 | ```bash 117 | 118 | :~$ sh count1.sh 10000 A 119 | A Thu Jul 16 23:13:07 EDT 2015 120 | A Thu Jul 16 23:13:07 EDT 2015 121 | :~$ bash count1.sh 99000 A 122 | A Thu Jul 16 23:13:53 EDT 2015 123 | A Thu Jul 16 23:13:54 EDT 2015 124 | ``` 125 | 126 | 127 | 128 | So far, so good. Now let’s create a command list to run the script in background and launch the top command to see how much CPU the script is using. The command list is shown in Listing 6 and the output from top in Listing 7. 129 | 130 | - Listing 6. Running count1.sh and top 131 | 132 | ```bash 133 | :~$ (sh count1.sh 5000000 A&);top 134 | ``` 135 | 136 | - Listing 7. Using lots of CPU 137 | 138 | ```bash 139 | 140 | top ‑ 23:19:30 up 1 day, 13:16, 3 users, load average: 0.15, 0.06, 0.05 141 | Tasks: 190 total, 2 running, 188 sleeping, 0 stopped, 0 zombie 142 | %Cpu(s): 25.1 us, 0.0 sy, 0.0 ni, 74.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st 143 | KiB Mem: 8090144 total, 2145600 used, 5944544 free, 82024 buffers 144 | KiB Swap: 4095996 total, 100644 used, 3995352 free. 
1464940 cached Mem 145 | 146 | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 147 | 4952 ian 20 0 4472 736 652 R 100.0 0.0 0:06.18 sh 148 | 2043 ian 20 0 552900 34544 24464 S 0.3 0.4 0:15.14 unity‑panel‑+ 149 | 2215 ian 20 0 1549252 162644 79992 S 0.3 2.0 16:28.20 compiz 150 | 1 root 20 0 182732 5392 3648 S 0.0 0.1 0:03.76 systemd 151 | 2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd 152 | 3 root 20 0 0 0 0 S 0.0 0.0 0:00.10 ksoftirqd/0 153 | ``` 154 | Not bad. We are using 100 percent of one of the CPUs on this system with just a simple script. If you want to stress multiple CPUs, you can add an extra invocation of count1.sh to the command list. If we had a long-running job such as this, we might find that it interfered with our ability (or the ability of other users) to do other work on our system. 155 | 156 | # Using nice to set priorities 157 | 158 | Now that we can keep a CPU busy for a while, we’ll see how to set a priority for a process. To summarize what we’ve learned so far: 159 | 160 | - Linux and UNIX® systems use a priority system with 40 priorities, ranging from -20 (highest priority) to 19 (lowest priority). 161 | - Processes started by regular users usually have priority 0. 162 | - The ps command can display the priority (the nice, or NI, level) using the -l option. 163 | - The nice command displays our default priority. 164 | 165 | The nice command can also be used to start a process with a different priority. You use the -n (or --adjustment) option with a positive value to increase the priority value and a negative value to decrease it. Remember that processes with the lowest priority value run at the highest scheduling priority, so think of increasing the priority value as being nice to other processes. Note that you usually need to be the superuser (root) to specify negative priority adjustments. In other words, regular users can usually only make their processes nicer.
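The adjustment behavior is easy to see with nice itself: run nice (which, with no arguments, prints the current niceness) under another nice. A quick check, assuming the shell is at the usual default niceness of 0:

```shell
# With no arguments, nice prints the shell's current niceness (usually 0).
nice
# Run nice itself under a +10 adjustment: the child reports its
# parent's niceness plus 10.
nice -n 10 nice
```

Adjustments accumulate relative to the parent, which is why renice (covered below) takes an absolute value instead.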
166 | 167 | To demonstrate the use of nice to set priorities, let’s start two copies of the count1.sh script in different subshells at the same time, but give one the maximum niceness of 19. After a second we’ll use ps -l to display the process status, including niceness. Finally, we’ll add an arbitrary 10-second sleep to ensure the command sequence finishes after the two subshells do. That way, we won’t get a new prompt while we’re still waiting for output. The result is shown in Listing 8. 168 | 169 | - Listing 8. Using nice to set priorities for a pair of processes 170 | 171 | ```bash 172 | 173 | :~$ (sh count1.sh 2000000 A&);(nice -n 19 sh count1.sh 2000000 B&);> sleep 1;ps -l;sleep 10 174 | A Fri Jul 17 17:10:33 EDT 2015 175 | B Fri Jul 17 17:10:33 EDT 2015 176 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 177 | 0 S 1000 3850 3849 0 80 0 - 6726 wait pts/5 00:00:00 bash 178 | 0 R 1000 5614 1 99 80 0 - 1118 - pts/5 00:00:01 sh 179 | 0 R 1000 5617 1 99 99 19 - 1118 - pts/5 00:00:01 sh 180 | 0 R 1000 5620 3850 0 80 0 - 3561 - pts/5 00:00:00 ps 181 | A Fri Jul 17 17:10:36 EDT 2015 182 | B Fri Jul 17 17:10:36 EDT 2015 183 | ``` 184 | 185 | Are you surprised that the two jobs finished at the same time? What happened to our priority setting? Remember that the script occupied one of our CPUs. This particular system runs on an Intel(R) Core(TM) i7 processor, which is very lightly loaded, so each core ran one process, and there wasn’t any need to prioritize them. 186 | 187 | So let’s try starting four processes at four different niceness levels (0, 6, 12, and 18) and see what happens. We’ll increase the busy count parameter for each so they run a little longer. Before you look at Listing 9, think about what you might expect, given what you’ve already seen. 188 | 189 | - Listing 9.
Using nice to set priorities for four processes 190 | 191 | ```bash 192 | 193 | :~$ (sh count1.sh 5000000 A&);(nice -n 6 sh count1.sh 5000000 B&); 194 | > (nice -n 12 sh count1.sh 5000000 C&);(nice -n 18 sh count1.sh 5000000 D&); 195 | > sleep 1;ps -l;sleep 30 196 | A Fri Jul 17 17:13:05 EDT 2015 197 | C Fri Jul 17 17:13:05 EDT 2015 198 | B Fri Jul 17 17:13:05 EDT 2015 199 | D Fri Jul 17 17:13:05 EDT 2015 200 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 201 | 0 S 1000 3850 3849 0 80 0 - 6726 wait pts/5 00:00:00 bash 202 | 0 R 1000 5626 1 99 80 0 - 1118 - pts/5 00:00:01 sh 203 | 0 R 1000 5629 1 99 86 6 - 1118 - pts/5 00:00:01 sh 204 | 0 R 1000 5631 1 90 92 12 - 1118 - pts/5 00:00:00 sh 205 | 0 R 1000 5633 1 58 98 18 - 1118 - pts/5 00:00:00 sh 206 | 0 R 1000 5638 3850 0 80 0 - 3561 - pts/5 00:00:00 ps 207 | A Fri Jul 17 17:13:15 EDT 2015 208 | B Fri Jul 17 17:13:21 EDT 2015 209 | C Fri Jul 17 17:13:26 EDT 2015 210 | D Fri Jul 17 17:13:27 EDT 2015 211 | ``` 212 | With four different priorities, we see the effect of the different niceness values as each job finishes in priority order. Try experimenting with different nice values to demonstrate the different possibilities for yourself. 213 | 214 | A final note on starting processes with nice: as with the nohup command, you cannot use a command list or a pipeline as the argument of nice. 215 | 216 | # Changing priorities 217 | - `renice` 218 | 219 | If you happen to start a process and realize that it should run at a different priority, there is a way to change it after it has started, using the renice command. You specify an absolute priority (and not an adjustment) for the process or processes to be changed as shown in Listing 10. 220 | 221 | - Listing 10.
Using renice to change priorities 222 | 223 | ```bash 224 | 225 | :~$ sh count1.sh 100000000 A& 226 | [1] 5724 227 | :~$ A Fri Jul 17 17:30:20 EDT 2015 228 | renice 1 5724;ps -l 5724 229 | 5724 (process ID) old priority 0, new priority 1 230 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 231 | 0 R 1000 5724 3850 99 81 1 - 1118 - pts/5 0:35 sh count1.sh 10000 232 | :~$ renice +3 5724;ps -l 5724 233 | 5724 (process ID) old priority 1, new priority 3 234 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 235 | 0 R 1000 5724 3850 99 83 3 - 1118 - pts/5 0:50 sh count1.sh 10000 236 | :~$ sudo renice -8 5724;ps -l 5724 237 | 5724 (process ID) old priority 3, new priority -8 238 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 239 | 0 R 1000 5724 3850 99 72 -8 - 1118 - pts/5 1:01 sh count1.sh 10000 240 | ``` 241 | 242 | Remember that you have to be the superuser to give your processes higher scheduling priority and make them less nice. 243 | 244 | You can find more information on nice and renice in the man pages. 245 | 246 | This concludes your introduction to process execution priorities.
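The same renice workflow can be scripted against any throwaway background job. The sketch below is hypothetical (the `sleep` job and the niceness value 5 are arbitrary, not taken from Listing 10), but it shows the pattern: capture the PID with `$!`, renice it, and confirm the new NI value with `ps`.

```shell
# Start a harmless background job and remember its PID.
sleep 60 &
pid=$!

# As a regular user we can only raise niceness; 5 is an arbitrary choice.
renice 5 -p "$pid"

# Confirm the change: the NI column for this PID should now read 5.
ps -o pid=,ni= -p "$pid"

# Clean up the throwaway job.
kill "$pid"
```

Because the priority given to renice is absolute rather than an adjustment, running the same `renice 5 -p "$pid"` twice leaves the niceness at 5, unlike stacking two `nice -n 5` wrappers at process start.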
247 | 248 | 249 | 250 | 251 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Kernel-and-Types-of-kernels 2 | ![image](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/wall/gV8hn.png) 3 | 4 | - Last update from nu11secur1ty: https://github.com/nu11secur1ty/kernel-4.19.0V.Varbanovski_lp150.12.22_default 5 | 6 | # Importance_of_Linux_partitions 7 | [Read...](https://github.com/nu11secur1ty/Linux_Deployment_Administration_Hacks/tree/master/Importance_of_Linux_partitions) 8 | 9 | # Kernel Space, User Space Interfaces 10 | [Read...](https://www.nu11secur1ty.com/2019/01/kernel-space-user-space-interfaces.html) 11 | 12 | ----------------------------------------------------------------------------------------------------------------- 13 | # Reference 14 | 15 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/scheme/kernels.png) 16 | 17 | The kernel is the core of an operating system. It is the software responsible for running programs and providing secure access to the machine's hardware. Since there are many programs, and resources are limited, the kernel also decides when and how long a program should run. This is called scheduling. Accessing the hardware directly can be very complex, since there are many different hardware designs for the same type of component. Kernels usually implement some level of hardware abstraction (a set of instructions universal to all devices of a certain type) to hide the underlying complexity from applications and provide a clean and uniform interface. This helps application programmers to develop programs without having to know how to program for specific devices. The kernel relies upon software drivers that translate the generic command into instructions specific to that device. 18 | 19 | An operating system kernel is not strictly needed to run a computer. 
Programs can be directly loaded and executed on the "bare metal" machine, provided that the authors of those programs are willing to do without any hardware abstraction or operating system support. This was the normal operating method of many early computers, which were reset and reloaded between the running of different programs. Eventually, small ancillary programs such as program loaders and debuggers were typically left in-core between runs, or loaded from read-only memory. As these were developed, they formed the basis of what became early operating system kernels. The "bare metal" approach is still used today on many video game consoles and embedded systems, but in general, newer systems use kernels and operating systems. 20 | 21 | Four broad categories of kernels: 22 | 23 | - Monolithic kernels provide rich and powerful abstractions of the underlying hardware. 24 | - Microkernels provide a small set of simple hardware abstractions and use applications called servers to provide more functionality. 25 | - Exokernels provide minimal abstractions, allowing low-level hardware access. In exokernel systems, library operating systems provide the abstractions typically present in monolithic kernels. 26 | - Hybrid (modified microkernels) are much like pure microkernels, except that they include some additional code in kernelspace to increase performance. 27 | 28 | 29 | 30 | ------------------------------------------------------------------------------------------------------------- 31 | # Monolithic 32 | 33 | The monolithic approach is to define a high-level virtual interface over the hardware, with a set of primitives or system calls to implement operating system services such as process management, concurrency, and memory management in several modules that run in supervisor mode.
34 | 35 | Even if every module servicing these operations is separate from the whole, the code integration is very tight and difficult to do correctly, and, since all the modules run in the same address space, a bug in one module can bring down the whole system. However, when the implementation is complete and trustworthy, the tight internal integration of components allows the low-level features of the underlying system to be effectively utilized, making a good monolithic kernel highly efficient. Proponents of the monolithic kernel approach make the case that if code is incorrect, it does not belong in a kernel, and if it is, there is little advantage in the microkernel approach. More modern monolithic kernels such as Linux, FreeBSD and Solaris can load executable modules at runtime, allowing easy extension of the kernel's capabilities as required, while helping to keep the amount of code running in kernel-space to a minimum. The monolithic kernel runs alone in supervisor mode. 36 | 37 | ----------------------------------------------------------------------------------------------------------------- 38 | 39 | # Microkernels 40 | The microkernel approach is to define a very simple abstraction over the hardware, with a set of primitives or system calls to implement minimal OS services such as thread management, address spaces and interprocess communication. All other services, those normally provided by the kernel such as networking, are implemented in user-space programs referred to as servers. Servers are programs like any others, allowing the operating system to be modified simply by starting and stopping programs. For a small machine without networking support, for instance, the networking server simply isn't started. Under a traditional system this would require the kernel to be recompiled, something well beyond the capabilities of the average end-user.
In theory the system is also more stable, because a failing server simply stops a single program, rather than causing the kernel itself to crash. 41 | 42 | However, part of the system state is lost with the failing server, and it is generally difficult to continue execution of applications, or even of other servers, with a fresh copy. For example, if a (hypothetical) server responsible for TCP/IP connections is restarted, applications could be told the connection was "lost" and reconnect, going through the new instance of the server. However, other system objects, like files, do not have these convenient semantics: they are supposed to be reliable, not to become unavailable randomly, and to keep all the information written to them previously. So, database techniques like transactions, replication and checkpointing need to be used between servers in order to preserve essential state across single server restarts. 43 | 44 | Microkernels generally underperform traditional designs, sometimes dramatically. This is due in large part to the overhead of moving in and out of the kernel, a context switch, in order to move data between the various applications and servers. It was originally believed that careful tuning could reduce this overhead dramatically, but by the mid-90s most researchers had given up. In more recent times newer microkernels, designed for performance first, have addressed these problems to a very large degree. Nevertheless, the market for existing operating systems is so entrenched that little work continues on microkernel design. 45 | 46 | ------------------------------------------------------------------------------------------------------------------ 47 | 48 | # Exokernel 49 | 50 | - General 51 | An exokernel is a type of operating system where the kernel is limited to extending resources to sub operating systems called LibOS's, resulting in a very small, fast kernel environment.
The theory behind this method is that by providing as few abstractions as possible, programs are able to do exactly what they want in a controlled environment, much as MS-DOS achieved through real mode, except with paging and other modern programming techniques. 52 | - LibOS 53 | LibOS's provide a way for the programmer of an exokernel type system to easily program cross-platform programs using familiar interfaces, instead of having to write their own. Moreover, they provide an additional advantage over monolithic kernels in that by having multiple LibOS's running at the same time, one can theoretically run programs from Linux, Windows, and Mac (provided that there is a LibOS for that system) all at the same time, on the same OS, and with no performance issues. 54 | 55 | -------------------------------------------------------------------------------------------------------------------- 56 | 57 | # Hybrid 58 | 59 | A hybrid kernel is one that combines aspects of both micro and monolithic kernels, but there is no exact definition. Often, "hybrid kernel" means that the kernel is highly modular, but all runs in the same address space. This allows the kernel to avoid the overhead of a complicated message passing system within the kernel, while still retaining some microkernel-like features.
60 | 61 | -------------------------------------------------------------------------------------------------------------------- 62 | 63 | # INIT and zombie 64 | [see the rule](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Linux/exit/exit.c#L652) 65 | 66 | - Or all code: 67 | ```c 68 | // SPDX-License-Identifier: GPL-2.0-only 69 | /* 70 | * linux/kernel/exit.c 71 | * 72 | * Copyright (C) 1991, 1992 Linus Torvalds 73 | */ 74 | 75 | #include 76 | #include 77 | #include 78 | #include 79 | #include 80 | #include 81 | #include 82 | #include 83 | #include 84 | #include 85 | #include 86 | #include 87 | #include 88 | #include 89 | #include 90 | #include 91 | #include 92 | #include 93 | #include 94 | #include 95 | #include 96 | #include 97 | #include 98 | #include 99 | #include 100 | #include 101 | #include 102 | #include 103 | #include 104 | #include 105 | #include 106 | #include 107 | #include 108 | #include 109 | #include 110 | #include 111 | #include 112 | #include 113 | #include 114 | #include 115 | #include 116 | #include /* for audit_free() */ 117 | #include 118 | #include 119 | #include 120 | #include 121 | #include 122 | #include 123 | #include 124 | #include 125 | #include 126 | #include 127 | #include 128 | #include 129 | #include 130 | #include 131 | #include 132 | #include 133 | 134 | #include 135 | #include 136 | #include 137 | #include 138 | 139 | static void __unhash_process(struct task_struct *p, bool group_dead) 140 | { 141 | nr_threads--; 142 | detach_pid(p, PIDTYPE_PID); 143 | if (group_dead) { 144 | detach_pid(p, PIDTYPE_TGID); 145 | detach_pid(p, PIDTYPE_PGID); 146 | detach_pid(p, PIDTYPE_SID); 147 | 148 | list_del_rcu(&p->tasks); 149 | list_del_init(&p->sibling); 150 | __this_cpu_dec(process_counts); 151 | } 152 | list_del_rcu(&p->thread_group); 153 | list_del_rcu(&p->thread_node); 154 | } 155 | 156 | /* 157 | * This function expects the tasklist_lock write-locked. 
158 | */ 159 | static void __exit_signal(struct task_struct *tsk) 160 | { 161 | struct signal_struct *sig = tsk->signal; 162 | bool group_dead = thread_group_leader(tsk); 163 | struct sighand_struct *sighand; 164 | struct tty_struct *uninitialized_var(tty); 165 | u64 utime, stime; 166 | 167 | sighand = rcu_dereference_check(tsk->sighand, 168 | lockdep_tasklist_lock_is_held()); 169 | spin_lock(&sighand->siglock); 170 | 171 | #ifdef CONFIG_POSIX_TIMERS 172 | posix_cpu_timers_exit(tsk); 173 | if (group_dead) { 174 | posix_cpu_timers_exit_group(tsk); 175 | } else { 176 | /* 177 | * This can only happen if the caller is de_thread(). 178 | * FIXME: this is the temporary hack, we should teach 179 | * posix-cpu-timers to handle this case correctly. 180 | */ 181 | if (unlikely(has_group_leader_pid(tsk))) 182 | posix_cpu_timers_exit_group(tsk); 183 | } 184 | #endif 185 | 186 | if (group_dead) { 187 | tty = sig->tty; 188 | sig->tty = NULL; 189 | } else { 190 | /* 191 | * If there is any task waiting for the group exit 192 | * then notify it: 193 | */ 194 | if (sig->notify_count > 0 && !--sig->notify_count) 195 | wake_up_process(sig->group_exit_task); 196 | 197 | if (tsk == sig->curr_target) 198 | sig->curr_target = next_thread(tsk); 199 | } 200 | 201 | add_device_randomness((const void*) &tsk->se.sum_exec_runtime, 202 | sizeof(unsigned long long)); 203 | 204 | /* 205 | * Accumulate here the counters for all threads as they die. We could 206 | * skip the group leader because it is the last user of signal_struct, 207 | * but we want to avoid the race with thread_group_cputime() which can 208 | * see the empty ->thread_head list. 
209 | */ 210 | task_cputime(tsk, &utime, &stime); 211 | write_seqlock(&sig->stats_lock); 212 | sig->utime += utime; 213 | sig->stime += stime; 214 | sig->gtime += task_gtime(tsk); 215 | sig->min_flt += tsk->min_flt; 216 | sig->maj_flt += tsk->maj_flt; 217 | sig->nvcsw += tsk->nvcsw; 218 | sig->nivcsw += tsk->nivcsw; 219 | sig->inblock += task_io_get_inblock(tsk); 220 | sig->oublock += task_io_get_oublock(tsk); 221 | task_io_accounting_add(&sig->ioac, &tsk->ioac); 222 | sig->sum_sched_runtime += tsk->se.sum_exec_runtime; 223 | sig->nr_threads--; 224 | __unhash_process(tsk, group_dead); 225 | write_sequnlock(&sig->stats_lock); 226 | 227 | /* 228 | * Do this under ->siglock, we can race with another thread 229 | * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals. 230 | */ 231 | flush_sigqueue(&tsk->pending); 232 | tsk->sighand = NULL; 233 | spin_unlock(&sighand->siglock); 234 | 235 | __cleanup_sighand(sighand); 236 | clear_tsk_thread_flag(tsk, TIF_SIGPENDING); 237 | if (group_dead) { 238 | flush_sigqueue(&sig->shared_pending); 239 | tty_kref_put(tty); 240 | } 241 | } 242 | 243 | static void delayed_put_task_struct(struct rcu_head *rhp) 244 | { 245 | struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); 246 | 247 | perf_event_delayed_put(tsk); 248 | trace_sched_process_free(tsk); 249 | put_task_struct(tsk); 250 | } 251 | 252 | 253 | void release_task(struct task_struct *p) 254 | { 255 | struct task_struct *leader; 256 | int zap_leader; 257 | repeat: 258 | /* don't need to get the RCU readlock here - the process is dead and 259 | * can't be modifying its own credentials. 
But shut RCU-lockdep up */ 260 | rcu_read_lock(); 261 | atomic_dec(&__task_cred(p)->user->processes); 262 | rcu_read_unlock(); 263 | 264 | proc_flush_task(p); 265 | cgroup_release(p); 266 | 267 | write_lock_irq(&tasklist_lock); 268 | ptrace_release_task(p); 269 | __exit_signal(p); 270 | 271 | /* 272 | * If we are the last non-leader member of the thread 273 | * group, and the leader is zombie, then notify the 274 | * group leader's parent process. (if it wants notification.) 275 | */ 276 | zap_leader = 0; 277 | leader = p->group_leader; 278 | if (leader != p && thread_group_empty(leader) 279 | && leader->exit_state == EXIT_ZOMBIE) { 280 | /* 281 | * If we were the last child thread and the leader has 282 | * exited already, and the leader's parent ignores SIGCHLD, 283 | * then we are the one who should release the leader. 284 | */ 285 | zap_leader = do_notify_parent(leader, leader->exit_signal); 286 | if (zap_leader) 287 | leader->exit_state = EXIT_DEAD; 288 | } 289 | 290 | write_unlock_irq(&tasklist_lock); 291 | release_thread(p); 292 | call_rcu(&p->rcu, delayed_put_task_struct); 293 | 294 | p = leader; 295 | if (unlikely(zap_leader)) 296 | goto repeat; 297 | } 298 | 299 | /* 300 | * Note that if this function returns a valid task_struct pointer (!NULL) 301 | * task->usage must remain >0 for the duration of the RCU critical section. 302 | */ 303 | struct task_struct *task_rcu_dereference(struct task_struct **ptask) 304 | { 305 | struct sighand_struct *sighand; 306 | struct task_struct *task; 307 | 308 | /* 309 | * We need to verify that release_task() was not called and thus 310 | * delayed_put_task_struct() can't run and drop the last reference 311 | * before rcu_read_unlock(). We check task->sighand != NULL, 312 | * but we can read the already freed and reused memory. 
313 | */ 314 | retry: 315 | task = rcu_dereference(*ptask); 316 | if (!task) 317 | return NULL; 318 | 319 | probe_kernel_address(&task->sighand, sighand); 320 | 321 | /* 322 | * Pairs with atomic_dec_and_test() in put_task_struct(). If this task 323 | * was already freed we can not miss the preceding update of this 324 | * pointer. 325 | */ 326 | smp_rmb(); 327 | if (unlikely(task != READ_ONCE(*ptask))) 328 | goto retry; 329 | 330 | /* 331 | * We've re-checked that "task == *ptask", now we have two different 332 | * cases: 333 | * 334 | * 1. This is actually the same task/task_struct. In this case 335 | * sighand != NULL tells us it is still alive. 336 | * 337 | * 2. This is another task which got the same memory for task_struct. 338 | * We can't know this of course, and we can not trust 339 | * sighand != NULL. 340 | * 341 | * In this case we actually return a random value, but this is 342 | * correct. 343 | * 344 | * If we return NULL - we can pretend that we actually noticed that 345 | * *ptask was updated when the previous task has exited. Or pretend 346 | * that probe_slab_address(&sighand) reads NULL. 347 | * 348 | * If we return the new task (because sighand is not NULL for any 349 | * reason) - this is fine too. This (new) task can't go away before 350 | * another gp pass. 351 | * 352 | * And note: We could even eliminate the false positive if re-read 353 | * task->sighand once again to avoid the falsely NULL. But this case 354 | * is very unlikely so we don't care. 355 | */ 356 | if (!sighand) 357 | return NULL; 358 | 359 | return task; 360 | } 361 | 362 | void rcuwait_wake_up(struct rcuwait *w) 363 | { 364 | struct task_struct *task; 365 | 366 | rcu_read_lock(); 367 | 368 | /* 369 | * Order condition vs @task, such that everything prior to the load 370 | * of @task is visible. This is the condition as to why the user called 371 | * rcuwait_trywake() in the first place. Pairs with set_current_state() 372 | * barrier (A) in rcuwait_wait_event(). 
373 | * 374 | * WAIT WAKE 375 | * [S] tsk = current [S] cond = true 376 | * MB (A) MB (B) 377 | * [L] cond [L] tsk 378 | */ 379 | smp_mb(); /* (B) */ 380 | 381 | /* 382 | * Avoid using task_rcu_dereference() magic as long as we are careful, 383 | * see comment in rcuwait_wait_event() regarding ->exit_state. 384 | */ 385 | task = rcu_dereference(w->task); 386 | if (task) 387 | wake_up_process(task); 388 | rcu_read_unlock(); 389 | } 390 | 391 | /* 392 | * Determine if a process group is "orphaned", according to the POSIX 393 | * definition in 2.2.2.52. Orphaned process groups are not to be affected 394 | * by terminal-generated stop signals. Newly orphaned process groups are 395 | * to receive a SIGHUP and a SIGCONT. 396 | * 397 | * "I ask you, have you ever known what it is to be an orphan?" 398 | */ 399 | static int will_become_orphaned_pgrp(struct pid *pgrp, 400 | struct task_struct *ignored_task) 401 | { 402 | struct task_struct *p; 403 | 404 | do_each_pid_task(pgrp, PIDTYPE_PGID, p) { 405 | if ((p == ignored_task) || 406 | (p->exit_state && thread_group_empty(p)) || 407 | is_global_init(p->real_parent)) 408 | continue; 409 | 410 | if (task_pgrp(p->real_parent) != pgrp && 411 | task_session(p->real_parent) == task_session(p)) 412 | return 0; 413 | } while_each_pid_task(pgrp, PIDTYPE_PGID, p); 414 | 415 | return 1; 416 | } 417 | 418 | int is_current_pgrp_orphaned(void) 419 | { 420 | int retval; 421 | 422 | read_lock(&tasklist_lock); 423 | retval = will_become_orphaned_pgrp(task_pgrp(current), NULL); 424 | read_unlock(&tasklist_lock); 425 | 426 | return retval; 427 | } 428 | 429 | static bool has_stopped_jobs(struct pid *pgrp) 430 | { 431 | struct task_struct *p; 432 | 433 | do_each_pid_task(pgrp, PIDTYPE_PGID, p) { 434 | if (p->signal->flags & SIGNAL_STOP_STOPPED) 435 | return true; 436 | } while_each_pid_task(pgrp, PIDTYPE_PGID, p); 437 | 438 | return false; 439 | } 440 | 441 | /* 442 | * Check to see if any process groups have become orphaned as 443 | * a result 
of our exiting, and if they have any stopped jobs, 444 | * send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2) 445 | */ 446 | static void 447 | kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent) 448 | { 449 | struct pid *pgrp = task_pgrp(tsk); 450 | struct task_struct *ignored_task = tsk; 451 | 452 | if (!parent) 453 | /* exit: our father is in a different pgrp than 454 | * we are and we were the only connection outside. 455 | */ 456 | parent = tsk->real_parent; 457 | else 458 | /* reparent: our child is in a different pgrp than 459 | * we are, and it was the only connection outside. 460 | */ 461 | ignored_task = NULL; 462 | 463 | if (task_pgrp(parent) != pgrp && 464 | task_session(parent) == task_session(tsk) && 465 | will_become_orphaned_pgrp(pgrp, ignored_task) && 466 | has_stopped_jobs(pgrp)) { 467 | __kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp); 468 | __kill_pgrp_info(SIGCONT, SEND_SIG_PRIV, pgrp); 469 | } 470 | } 471 | 472 | #ifdef CONFIG_MEMCG 473 | /* 474 | * A task is exiting. If it owned this mm, find a new owner for the mm. 475 | */ 476 | void mm_update_next_owner(struct mm_struct *mm) 477 | { 478 | struct task_struct *c, *g, *p = current; 479 | 480 | retry: 481 | /* 482 | * If the exiting or execing task is not the owner, it's 483 | * someone else's problem. 484 | */ 485 | if (mm->owner != p) 486 | return; 487 | /* 488 | * The current owner is exiting/execing and there are no other 489 | * candidates. Do not leave the mm pointing to a possibly 490 | * freed task structure. 
491 | */ 492 | if (atomic_read(&mm->mm_users) <= 1) { 493 | WRITE_ONCE(mm->owner, NULL); 494 | return; 495 | } 496 | 497 | read_lock(&tasklist_lock); 498 | /* 499 | * Search in the children 500 | */ 501 | list_for_each_entry(c, &p->children, sibling) { 502 | if (c->mm == mm) 503 | goto assign_new_owner; 504 | } 505 | 506 | /* 507 | * Search in the siblings 508 | */ 509 | list_for_each_entry(c, &p->real_parent->children, sibling) { 510 | if (c->mm == mm) 511 | goto assign_new_owner; 512 | } 513 | 514 | /* 515 | * Search through everything else, we should not get here often. 516 | */ 517 | for_each_process(g) { 518 | if (g->flags & PF_KTHREAD) 519 | continue; 520 | for_each_thread(g, c) { 521 | if (c->mm == mm) 522 | goto assign_new_owner; 523 | if (c->mm) 524 | break; 525 | } 526 | } 527 | read_unlock(&tasklist_lock); 528 | /* 529 | * We found no owner yet mm_users > 1: this implies that we are 530 | * most likely racing with swapoff (try_to_unuse()) or /proc or 531 | * ptrace or page migration (get_task_mm()). Mark owner as NULL. 532 | */ 533 | WRITE_ONCE(mm->owner, NULL); 534 | return; 535 | 536 | assign_new_owner: 537 | BUG_ON(c == p); 538 | get_task_struct(c); 539 | /* 540 | * The task_lock protects c->mm from changing. 541 | * We always want mm->owner->mm == mm 542 | */ 543 | task_lock(c); 544 | /* 545 | * Delay read_unlock() till we have the task_lock() 546 | * to ensure that c does not slip away underneath us 547 | */ 548 | read_unlock(&tasklist_lock); 549 | if (c->mm != mm) { 550 | task_unlock(c); 551 | put_task_struct(c); 552 | goto retry; 553 | } 554 | WRITE_ONCE(mm->owner, c); 555 | task_unlock(c); 556 | put_task_struct(c); 557 | } 558 | #endif /* CONFIG_MEMCG */ 559 | 560 | /* 561 | * Turn us into a lazy TLB process if we 562 | * aren't already.. 
563 | */ 564 | static void exit_mm(void) 565 | { 566 | struct mm_struct *mm = current->mm; 567 | struct core_state *core_state; 568 | 569 | mm_release(current, mm); 570 | if (!mm) 571 | return; 572 | sync_mm_rss(mm); 573 | /* 574 | * Serialize with any possible pending coredump. 575 | * We must hold mmap_sem around checking core_state 576 | * and clearing tsk->mm. The core-inducing thread 577 | * will increment ->nr_threads for each thread in the 578 | * group with ->mm != NULL. 579 | */ 580 | down_read(&mm->mmap_sem); 581 | core_state = mm->core_state; 582 | if (core_state) { 583 | struct core_thread self; 584 | 585 | up_read(&mm->mmap_sem); 586 | 587 | self.task = current; 588 | self.next = xchg(&core_state->dumper.next, &self); 589 | /* 590 | * Implies mb(), the result of xchg() must be visible 591 | * to core_state->dumper. 592 | */ 593 | if (atomic_dec_and_test(&core_state->nr_threads)) 594 | complete(&core_state->startup); 595 | 596 | for (;;) { 597 | set_current_state(TASK_UNINTERRUPTIBLE); 598 | if (!self.task) /* see coredump_finish() */ 599 | break; 600 | freezable_schedule(); 601 | } 602 | __set_current_state(TASK_RUNNING); 603 | down_read(&mm->mmap_sem); 604 | } 605 | mmgrab(mm); 606 | BUG_ON(mm != current->active_mm); 607 | /* more a memory barrier than a real lock */ 608 | task_lock(current); 609 | current->mm = NULL; 610 | up_read(&mm->mmap_sem); 611 | enter_lazy_tlb(mm, current); 612 | task_unlock(current); 613 | mm_update_next_owner(mm); 614 | mmput(mm); 615 | if (test_thread_flag(TIF_MEMDIE)) 616 | exit_oom_victim(); 617 | } 618 | 619 | static struct task_struct *find_alive_thread(struct task_struct *p) 620 | { 621 | struct task_struct *t; 622 | 623 | for_each_thread(p, t) { 624 | if (!(t->flags & PF_EXITING)) 625 | return t; 626 | } 627 | return NULL; 628 | } 629 | 630 | static struct task_struct *find_child_reaper(struct task_struct *father, 631 | struct list_head *dead) 632 | __releases(&tasklist_lock) 633 | __acquires(&tasklist_lock) 634 | { 
635 | struct pid_namespace *pid_ns = task_active_pid_ns(father); 636 | struct task_struct *reaper = pid_ns->child_reaper; 637 | struct task_struct *p, *n; 638 | 639 | if (likely(reaper != father)) 640 | return reaper; 641 | 642 | reaper = find_alive_thread(father); 643 | if (reaper) { 644 | pid_ns->child_reaper = reaper; 645 | return reaper; 646 | } 647 | 648 | write_unlock_irq(&tasklist_lock); 649 | if (unlikely(pid_ns == &init_pid_ns)) { 650 | panic("Attempted to kill init! exitcode=0x%08x\n", 651 | father->signal->group_exit_code ?: father->exit_code); 652 | } 653 | 654 | list_for_each_entry_safe(p, n, dead, ptrace_entry) { 655 | list_del_init(&p->ptrace_entry); 656 | release_task(p); 657 | } 658 | 659 | zap_pid_ns_processes(pid_ns); 660 | write_lock_irq(&tasklist_lock); 661 | 662 | return father; 663 | } 664 | 665 | /* 666 | * When we die, we re-parent all our children, and try to: 667 | * 1. give them to another thread in our thread group, if such a member exists 668 | * 2. give it to the first ancestor process which prctl'd itself as a 669 | * child_subreaper for its children (like a service manager) 670 | * 3. give it to the init process (PID 1) in our pid namespace 671 | */ 672 | static struct task_struct *find_new_reaper(struct task_struct *father, 673 | struct task_struct *child_reaper) 674 | { 675 | struct task_struct *thread, *reaper; 676 | 677 | thread = find_alive_thread(father); 678 | if (thread) 679 | return thread; 680 | 681 | if (father->signal->has_child_subreaper) { 682 | unsigned int ns_level = task_pid(father)->level; 683 | /* 684 | * Find the first ->is_child_subreaper ancestor in our pid_ns. 685 | * We can't check reaper != child_reaper to ensure we do not 686 | * cross the namespaces, the exiting parent could be injected 687 | * by setns() + fork(). 688 | * We check pid->level, this is slightly more efficient than 689 | * task_active_pid_ns(reaper) != task_active_pid_ns(father). 
690 | */ 691 | for (reaper = father->real_parent; 692 | task_pid(reaper)->level == ns_level; 693 | reaper = reaper->real_parent) { 694 | if (reaper == &init_task) 695 | break; 696 | if (!reaper->signal->is_child_subreaper) 697 | continue; 698 | thread = find_alive_thread(reaper); 699 | if (thread) 700 | return thread; 701 | } 702 | } 703 | 704 | return child_reaper; 705 | } 706 | 707 | /* 708 | * Any that need to be release_task'd are put on the @dead list. 709 | */ 710 | static void reparent_leader(struct task_struct *father, struct task_struct *p, 711 | struct list_head *dead) 712 | { 713 | if (unlikely(p->exit_state == EXIT_DEAD)) 714 | return; 715 | 716 | /* We don't want people slaying init. */ 717 | p->exit_signal = SIGCHLD; 718 | 719 | /* If it has exited notify the new parent about this child's death. */ 720 | if (!p->ptrace && 721 | p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) { 722 | if (do_notify_parent(p, p->exit_signal)) { 723 | p->exit_state = EXIT_DEAD; 724 | list_add(&p->ptrace_entry, dead); 725 | } 726 | } 727 | 728 | kill_orphaned_pgrp(p, father); 729 | } 730 | 731 | /* 732 | * This does two things: 733 | * 734 | * A. Make init inherit all the child processes 735 | * B. Check to see if any process groups have become orphaned 736 | * as a result of our exiting, and if they have any stopped 737 | * jobs, send them a SIGHUP and then a SIGCONT. 
(POSIX 3.2.2.2) 738 | */ 739 | static void forget_original_parent(struct task_struct *father, 740 | struct list_head *dead) 741 | { 742 | struct task_struct *p, *t, *reaper; 743 | 744 | if (unlikely(!list_empty(&father->ptraced))) 745 | exit_ptrace(father, dead); 746 | 747 | /* Can drop and reacquire tasklist_lock */ 748 | reaper = find_child_reaper(father, dead); 749 | if (list_empty(&father->children)) 750 | return; 751 | 752 | reaper = find_new_reaper(father, reaper); 753 | list_for_each_entry(p, &father->children, sibling) { 754 | for_each_thread(p, t) { 755 | t->real_parent = reaper; 756 | BUG_ON((!t->ptrace) != (t->parent == father)); 757 | if (likely(!t->ptrace)) 758 | t->parent = t->real_parent; 759 | if (t->pdeath_signal) 760 | group_send_sig_info(t->pdeath_signal, 761 | SEND_SIG_NOINFO, t, 762 | PIDTYPE_TGID); 763 | } 764 | /* 765 | * If this is a threaded reparent there is no need to 766 | * notify anyone anything has happened. 767 | */ 768 | if (!same_thread_group(reaper, father)) 769 | reparent_leader(father, p, dead); 770 | } 771 | list_splice_tail_init(&father->children, &reaper->children); 772 | } 773 | 774 | /* 775 | * Send signals to all our closest relatives so that they know 776 | * to properly mourn us.. 777 | */ 778 | static void exit_notify(struct task_struct *tsk, int group_dead) 779 | { 780 | bool autoreap; 781 | struct task_struct *p, *n; 782 | LIST_HEAD(dead); 783 | 784 | write_lock_irq(&tasklist_lock); 785 | forget_original_parent(tsk, &dead); 786 | 787 | if (group_dead) 788 | kill_orphaned_pgrp(tsk->group_leader, NULL); 789 | 790 | if (unlikely(tsk->ptrace)) { 791 | int sig = thread_group_leader(tsk) && 792 | thread_group_empty(tsk) && 793 | !ptrace_reparented(tsk) ? 
794 | tsk->exit_signal : SIGCHLD; 795 | autoreap = do_notify_parent(tsk, sig); 796 | } else if (thread_group_leader(tsk)) { 797 | autoreap = thread_group_empty(tsk) && 798 | do_notify_parent(tsk, tsk->exit_signal); 799 | } else { 800 | autoreap = true; 801 | } 802 | 803 | tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE; 804 | if (tsk->exit_state == EXIT_DEAD) 805 | list_add(&tsk->ptrace_entry, &dead); 806 | 807 | /* mt-exec, de_thread() is waiting for group leader */ 808 | if (unlikely(tsk->signal->notify_count < 0)) 809 | wake_up_process(tsk->signal->group_exit_task); 810 | write_unlock_irq(&tasklist_lock); 811 | 812 | list_for_each_entry_safe(p, n, &dead, ptrace_entry) { 813 | list_del_init(&p->ptrace_entry); 814 | release_task(p); 815 | } 816 | } 817 | 818 | #ifdef CONFIG_DEBUG_STACK_USAGE 819 | static void check_stack_usage(void) 820 | { 821 | static DEFINE_SPINLOCK(low_water_lock); 822 | static int lowest_to_date = THREAD_SIZE; 823 | unsigned long free; 824 | 825 | free = stack_not_used(current); 826 | 827 | if (free >= lowest_to_date) 828 | return; 829 | 830 | spin_lock(&low_water_lock); 831 | if (free < lowest_to_date) { 832 | pr_info("%s (%d) used greatest stack depth: %lu bytes left\n", 833 | current->comm, task_pid_nr(current), free); 834 | lowest_to_date = free; 835 | } 836 | spin_unlock(&low_water_lock); 837 | } 838 | #else 839 | static inline void check_stack_usage(void) {} 840 | #endif 841 | 842 | void __noreturn do_exit(long code) 843 | { 844 | struct task_struct *tsk = current; 845 | int group_dead; 846 | 847 | profile_task_exit(tsk); 848 | kcov_task_exit(tsk); 849 | 850 | WARN_ON(blk_needs_flush_plug(tsk)); 851 | 852 | if (unlikely(in_interrupt())) 853 | panic("Aiee, killing interrupt handler!"); 854 | if (unlikely(!tsk->pid)) 855 | panic("Attempted to kill the idle task!"); 856 | 857 | /* 858 | * If do_exit is called because this processes oopsed, it's possible 859 | * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before 860 | * 
continuing. Amongst other possible reasons, this is to prevent 861 | * mm_release()->clear_child_tid() from writing to a user-controlled 862 | * kernel address. 863 | */ 864 | set_fs(USER_DS); 865 | 866 | ptrace_event(PTRACE_EVENT_EXIT, code); 867 | 868 | validate_creds_for_do_exit(tsk); 869 | 870 | /* 871 | * We're taking recursive faults here in do_exit. Safest is to just 872 | * leave this task alone and wait for reboot. 873 | */ 874 | if (unlikely(tsk->flags & PF_EXITING)) { 875 | pr_alert("Fixing recursive fault but reboot is needed!\n"); 876 | /* 877 | * We can do this unlocked here. The futex code uses 878 | * this flag just to verify whether the pi state 879 | * cleanup has been done or not. In the worst case it 880 | * loops once more. We pretend that the cleanup was 881 | * done as there is no way to return. Either the 882 | * OWNER_DIED bit is set by now or we push the blocked 883 | * task into the wait for ever nirwana as well. 884 | */ 885 | tsk->flags |= PF_EXITPIDONE; 886 | set_current_state(TASK_UNINTERRUPTIBLE); 887 | schedule(); 888 | } 889 | 890 | exit_signals(tsk); /* sets PF_EXITING */ 891 | /* 892 | * Ensure that all new tsk->pi_lock acquisitions must observe 893 | * PF_EXITING. Serializes against futex.c:attach_to_pi_owner(). 894 | */ 895 | smp_mb(); 896 | /* 897 | * Ensure that we must observe the pi_state in exit_mm() -> 898 | * mm_release() -> exit_pi_state_list(). 
899 | */ 900 | raw_spin_lock_irq(&tsk->pi_lock); 901 | raw_spin_unlock_irq(&tsk->pi_lock); 902 | 903 | if (unlikely(in_atomic())) { 904 | pr_info("note: %s[%d] exited with preempt_count %d\n", 905 | current->comm, task_pid_nr(current), 906 | preempt_count()); 907 | preempt_count_set(PREEMPT_ENABLED); 908 | } 909 | 910 | /* sync mm's RSS info before statistics gathering */ 911 | if (tsk->mm) 912 | sync_mm_rss(tsk->mm); 913 | acct_update_integrals(tsk); 914 | group_dead = atomic_dec_and_test(&tsk->signal->live); 915 | if (group_dead) { 916 | #ifdef CONFIG_POSIX_TIMERS 917 | hrtimer_cancel(&tsk->signal->real_timer); 918 | exit_itimers(tsk->signal); 919 | #endif 920 | if (tsk->mm) 921 | setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm); 922 | } 923 | acct_collect(code, group_dead); 924 | if (group_dead) 925 | tty_audit_exit(); 926 | audit_free(tsk); 927 | 928 | tsk->exit_code = code; 929 | taskstats_exit(tsk, group_dead); 930 | 931 | exit_mm(); 932 | 933 | if (group_dead) 934 | acct_process(); 935 | trace_sched_process_exit(tsk); 936 | 937 | exit_sem(tsk); 938 | exit_shm(tsk); 939 | exit_files(tsk); 940 | exit_fs(tsk); 941 | if (group_dead) 942 | disassociate_ctty(1); 943 | exit_task_namespaces(tsk); 944 | exit_task_work(tsk); 945 | exit_thread(tsk); 946 | exit_umh(tsk); 947 | 948 | /* 949 | * Flush inherited counters to the parent - before the parent 950 | * gets woken up by child-exit notifications. 
951 | * 952 | * because of cgroup mode, must be called before cgroup_exit() 953 | */ 954 | perf_event_exit_task(tsk); 955 | 956 | sched_autogroup_exit_task(tsk); 957 | cgroup_exit(tsk); 958 | 959 | /* 960 | * FIXME: do that only when needed, using sched_exit tracepoint 961 | */ 962 | flush_ptrace_hw_breakpoint(tsk); 963 | 964 | exit_tasks_rcu_start(); 965 | exit_notify(tsk, group_dead); 966 | proc_exit_connector(tsk); 967 | mpol_put_task_policy(tsk); 968 | #ifdef CONFIG_FUTEX 969 | if (unlikely(current->pi_state_cache)) 970 | kfree(current->pi_state_cache); 971 | #endif 972 | /* 973 | * Make sure we are holding no locks: 974 | */ 975 | debug_check_no_locks_held(); 976 | /* 977 | * We can do this unlocked here. The futex code uses this flag 978 | * just to verify whether the pi state cleanup has been done 979 | * or not. In the worst case it loops once more. 980 | */ 981 | tsk->flags |= PF_EXITPIDONE; 982 | 983 | if (tsk->io_context) 984 | exit_io_context(tsk); 985 | 986 | if (tsk->splice_pipe) 987 | free_pipe_info(tsk->splice_pipe); 988 | 989 | if (tsk->task_frag.page) 990 | put_page(tsk->task_frag.page); 991 | 992 | validate_creds_for_do_exit(tsk); 993 | 994 | check_stack_usage(); 995 | preempt_disable(); 996 | if (tsk->nr_dirtied) 997 | __this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied); 998 | exit_rcu(); 999 | exit_tasks_rcu_finish(); 1000 | 1001 | lockdep_free_task(tsk); 1002 | do_task_dead(); 1003 | } 1004 | EXPORT_SYMBOL_GPL(do_exit); 1005 | 1006 | void complete_and_exit(struct completion *comp, long code) 1007 | { 1008 | if (comp) 1009 | complete(comp); 1010 | 1011 | do_exit(code); 1012 | } 1013 | EXPORT_SYMBOL(complete_and_exit); 1014 | 1015 | SYSCALL_DEFINE1(exit, int, error_code) 1016 | { 1017 | do_exit((error_code&0xff)<<8); 1018 | } 1019 | 1020 | /* 1021 | * Take down every thread in the group. This is called by fatal signals 1022 | * as well as by sys_exit_group (below). 
1023 | */ 1024 | void 1025 | do_group_exit(int exit_code) 1026 | { 1027 | struct signal_struct *sig = current->signal; 1028 | 1029 | BUG_ON(exit_code & 0x80); /* core dumps don't get here */ 1030 | 1031 | if (signal_group_exit(sig)) 1032 | exit_code = sig->group_exit_code; 1033 | else if (!thread_group_empty(current)) { 1034 | struct sighand_struct *const sighand = current->sighand; 1035 | 1036 | spin_lock_irq(&sighand->siglock); 1037 | if (signal_group_exit(sig)) 1038 | /* Another thread got here before we took the lock. */ 1039 | exit_code = sig->group_exit_code; 1040 | else { 1041 | sig->group_exit_code = exit_code; 1042 | sig->flags = SIGNAL_GROUP_EXIT; 1043 | zap_other_threads(current); 1044 | } 1045 | spin_unlock_irq(&sighand->siglock); 1046 | } 1047 | 1048 | do_exit(exit_code); 1049 | /* NOTREACHED */ 1050 | } 1051 | 1052 | /* 1053 | * this kills every thread in the thread group. Note that any externally 1054 | * wait4()-ing process will get the correct exit code - even if this 1055 | * thread is not the thread group leader. 
1056 | */ 1057 | SYSCALL_DEFINE1(exit_group, int, error_code) 1058 | { 1059 | do_group_exit((error_code & 0xff) << 8); 1060 | /* NOTREACHED */ 1061 | return 0; 1062 | } 1063 | 1064 | struct waitid_info { 1065 | pid_t pid; 1066 | uid_t uid; 1067 | int status; 1068 | int cause; 1069 | }; 1070 | 1071 | struct wait_opts { 1072 | enum pid_type wo_type; 1073 | int wo_flags; 1074 | struct pid *wo_pid; 1075 | 1076 | struct waitid_info *wo_info; 1077 | int wo_stat; 1078 | struct rusage *wo_rusage; 1079 | 1080 | wait_queue_entry_t child_wait; 1081 | int notask_error; 1082 | }; 1083 | 1084 | static int eligible_pid(struct wait_opts *wo, struct task_struct *p) 1085 | { 1086 | return wo->wo_type == PIDTYPE_MAX || 1087 | task_pid_type(p, wo->wo_type) == wo->wo_pid; 1088 | } 1089 | 1090 | static int 1091 | eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p) 1092 | { 1093 | if (!eligible_pid(wo, p)) 1094 | return 0; 1095 | 1096 | /* 1097 | * Wait for all children (clone and not) if __WALL is set or 1098 | * if it is traced by us. 1099 | */ 1100 | if (ptrace || (wo->wo_flags & __WALL)) 1101 | return 1; 1102 | 1103 | /* 1104 | * Otherwise, wait for clone children *only* if __WCLONE is set; 1105 | * otherwise, wait for non-clone children *only*. 1106 | * 1107 | * Note: a "clone" child here is one that reports to its parent 1108 | * using a signal other than SIGCHLD, or a non-leader thread which 1109 | * we can only see if it is traced by us. 1110 | */ 1111 | if ((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE)) 1112 | return 0; 1113 | 1114 | return 1; 1115 | } 1116 | 1117 | /* 1118 | * Handle sys_wait4 work for one task in state EXIT_ZOMBIE. We hold 1119 | * read_lock(&tasklist_lock) on entry. If we return zero, we still hold 1120 | * the lock and this task is uninteresting. If we return nonzero, we have 1121 | * released the lock and the system call should return. 
1122 | */ 1123 | static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) 1124 | { 1125 | int state, status; 1126 | pid_t pid = task_pid_vnr(p); 1127 | uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p)); 1128 | struct waitid_info *infop; 1129 | 1130 | if (!likely(wo->wo_flags & WEXITED)) 1131 | return 0; 1132 | 1133 | if (unlikely(wo->wo_flags & WNOWAIT)) { 1134 | status = p->exit_code; 1135 | get_task_struct(p); 1136 | read_unlock(&tasklist_lock); 1137 | sched_annotate_sleep(); 1138 | if (wo->wo_rusage) 1139 | getrusage(p, RUSAGE_BOTH, wo->wo_rusage); 1140 | put_task_struct(p); 1141 | goto out_info; 1142 | } 1143 | /* 1144 | * Move the task's state to DEAD/TRACE, only one thread can do this. 1145 | */ 1146 | state = (ptrace_reparented(p) && thread_group_leader(p)) ? 1147 | EXIT_TRACE : EXIT_DEAD; 1148 | if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE) 1149 | return 0; 1150 | /* 1151 | * We own this thread, nobody else can reap it. 1152 | */ 1153 | read_unlock(&tasklist_lock); 1154 | sched_annotate_sleep(); 1155 | 1156 | /* 1157 | * Check thread_group_leader() to exclude the traced sub-threads. 1158 | */ 1159 | if (state == EXIT_DEAD && thread_group_leader(p)) { 1160 | struct signal_struct *sig = p->signal; 1161 | struct signal_struct *psig = current->signal; 1162 | unsigned long maxrss; 1163 | u64 tgutime, tgstime; 1164 | 1165 | /* 1166 | * The resource counters for the group leader are in its 1167 | * own task_struct. Those for dead threads in the group 1168 | * are in its signal_struct, as are those for the child 1169 | * processes it has previously reaped. All these 1170 | * accumulate in the parent's signal_struct c* fields. 1171 | * 1172 | * We don't bother to take a lock here to protect these 1173 | * p->signal fields because the whole thread group is dead 1174 | * and nobody can change them. 
1175 | * 1176 | * psig->stats_lock also protects us from our sub-theads 1177 | * which can reap other children at the same time. Until 1178 | * we change k_getrusage()-like users to rely on this lock 1179 | * we have to take ->siglock as well. 1180 | * 1181 | * We use thread_group_cputime_adjusted() to get times for 1182 | * the thread group, which consolidates times for all threads 1183 | * in the group including the group leader. 1184 | */ 1185 | thread_group_cputime_adjusted(p, &tgutime, &tgstime); 1186 | spin_lock_irq(¤t->sighand->siglock); 1187 | write_seqlock(&psig->stats_lock); 1188 | psig->cutime += tgutime + sig->cutime; 1189 | psig->cstime += tgstime + sig->cstime; 1190 | psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime; 1191 | psig->cmin_flt += 1192 | p->min_flt + sig->min_flt + sig->cmin_flt; 1193 | psig->cmaj_flt += 1194 | p->maj_flt + sig->maj_flt + sig->cmaj_flt; 1195 | psig->cnvcsw += 1196 | p->nvcsw + sig->nvcsw + sig->cnvcsw; 1197 | psig->cnivcsw += 1198 | p->nivcsw + sig->nivcsw + sig->cnivcsw; 1199 | psig->cinblock += 1200 | task_io_get_inblock(p) + 1201 | sig->inblock + sig->cinblock; 1202 | psig->coublock += 1203 | task_io_get_oublock(p) + 1204 | sig->oublock + sig->coublock; 1205 | maxrss = max(sig->maxrss, sig->cmaxrss); 1206 | if (psig->cmaxrss < maxrss) 1207 | psig->cmaxrss = maxrss; 1208 | task_io_accounting_add(&psig->ioac, &p->ioac); 1209 | task_io_accounting_add(&psig->ioac, &sig->ioac); 1210 | write_sequnlock(&psig->stats_lock); 1211 | spin_unlock_irq(¤t->sighand->siglock); 1212 | } 1213 | 1214 | if (wo->wo_rusage) 1215 | getrusage(p, RUSAGE_BOTH, wo->wo_rusage); 1216 | status = (p->signal->flags & SIGNAL_GROUP_EXIT) 1217 | ? 
p->signal->group_exit_code : p->exit_code; 1218 | wo->wo_stat = status; 1219 | 1220 | if (state == EXIT_TRACE) { 1221 | write_lock_irq(&tasklist_lock); 1222 | /* We dropped tasklist, ptracer could die and untrace */ 1223 | ptrace_unlink(p); 1224 | 1225 | /* If parent wants a zombie, don't release it now */ 1226 | state = EXIT_ZOMBIE; 1227 | if (do_notify_parent(p, p->exit_signal)) 1228 | state = EXIT_DEAD; 1229 | p->exit_state = state; 1230 | write_unlock_irq(&tasklist_lock); 1231 | } 1232 | if (state == EXIT_DEAD) 1233 | release_task(p); 1234 | 1235 | out_info: 1236 | infop = wo->wo_info; 1237 | if (infop) { 1238 | if ((status & 0x7f) == 0) { 1239 | infop->cause = CLD_EXITED; 1240 | infop->status = status >> 8; 1241 | } else { 1242 | infop->cause = (status & 0x80) ? CLD_DUMPED : CLD_KILLED; 1243 | infop->status = status & 0x7f; 1244 | } 1245 | infop->pid = pid; 1246 | infop->uid = uid; 1247 | } 1248 | 1249 | return pid; 1250 | } 1251 | 1252 | static int *task_stopped_code(struct task_struct *p, bool ptrace) 1253 | { 1254 | if (ptrace) { 1255 | if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING)) 1256 | return &p->exit_code; 1257 | } else { 1258 | if (p->signal->flags & SIGNAL_STOP_STOPPED) 1259 | return &p->signal->group_exit_code; 1260 | } 1261 | return NULL; 1262 | } 1263 | 1264 | /** 1265 | * wait_task_stopped - Wait for %TASK_STOPPED or %TASK_TRACED 1266 | * @wo: wait options 1267 | * @ptrace: is the wait for ptrace 1268 | * @p: task to wait for 1269 | * 1270 | * Handle sys_wait4() work for %p in state %TASK_STOPPED or %TASK_TRACED. 1271 | * 1272 | * CONTEXT: 1273 | * read_lock(&tasklist_lock), which is released if return value is 1274 | * non-zero. Also, grabs and releases @p->sighand->siglock. 1275 | * 1276 | * RETURNS: 1277 | * 0 if wait condition didn't exist and search for other wait conditions 1278 | * should continue. 
Non-zero return, -errno on failure and @p's pid on 1279 | * success, implies that tasklist_lock is released and wait condition 1280 | * search should terminate. 1281 | */ 1282 | static int wait_task_stopped(struct wait_opts *wo, 1283 | int ptrace, struct task_struct *p) 1284 | { 1285 | struct waitid_info *infop; 1286 | int exit_code, *p_code, why; 1287 | uid_t uid = 0; /* unneeded, required by compiler */ 1288 | pid_t pid; 1289 | 1290 | /* 1291 | * Traditionally we see ptrace'd stopped tasks regardless of options. 1292 | */ 1293 | if (!ptrace && !(wo->wo_flags & WUNTRACED)) 1294 | return 0; 1295 | 1296 | if (!task_stopped_code(p, ptrace)) 1297 | return 0; 1298 | 1299 | exit_code = 0; 1300 | spin_lock_irq(&p->sighand->siglock); 1301 | 1302 | p_code = task_stopped_code(p, ptrace); 1303 | if (unlikely(!p_code)) 1304 | goto unlock_sig; 1305 | 1306 | exit_code = *p_code; 1307 | if (!exit_code) 1308 | goto unlock_sig; 1309 | 1310 | if (!unlikely(wo->wo_flags & WNOWAIT)) 1311 | *p_code = 0; 1312 | 1313 | uid = from_kuid_munged(current_user_ns(), task_uid(p)); 1314 | unlock_sig: 1315 | spin_unlock_irq(&p->sighand->siglock); 1316 | if (!exit_code) 1317 | return 0; 1318 | 1319 | /* 1320 | * Now we are pretty sure this task is interesting. 1321 | * Make sure it doesn't get reaped out from under us while we 1322 | * give up the lock and then examine it below. We don't want to 1323 | * keep holding onto the tasklist_lock while we call getrusage and 1324 | * possibly take page faults for user memory. 1325 | */ 1326 | get_task_struct(p); 1327 | pid = task_pid_vnr(p); 1328 | why = ptrace ? 
CLD_TRAPPED : CLD_STOPPED; 1329 | read_unlock(&tasklist_lock); 1330 | sched_annotate_sleep(); 1331 | if (wo->wo_rusage) 1332 | getrusage(p, RUSAGE_BOTH, wo->wo_rusage); 1333 | put_task_struct(p); 1334 | 1335 | if (likely(!(wo->wo_flags & WNOWAIT))) 1336 | wo->wo_stat = (exit_code << 8) | 0x7f; 1337 | 1338 | infop = wo->wo_info; 1339 | if (infop) { 1340 | infop->cause = why; 1341 | infop->status = exit_code; 1342 | infop->pid = pid; 1343 | infop->uid = uid; 1344 | } 1345 | return pid; 1346 | } 1347 | 1348 | /* 1349 | * Handle do_wait work for one task in a live, non-stopped state. 1350 | * read_lock(&tasklist_lock) on entry. If we return zero, we still hold 1351 | * the lock and this task is uninteresting. If we return nonzero, we have 1352 | * released the lock and the system call should return. 1353 | */ 1354 | static int wait_task_continued(struct wait_opts *wo, struct task_struct *p) 1355 | { 1356 | struct waitid_info *infop; 1357 | pid_t pid; 1358 | uid_t uid; 1359 | 1360 | if (!unlikely(wo->wo_flags & WCONTINUED)) 1361 | return 0; 1362 | 1363 | if (!(p->signal->flags & SIGNAL_STOP_CONTINUED)) 1364 | return 0; 1365 | 1366 | spin_lock_irq(&p->sighand->siglock); 1367 | /* Re-check with the lock held. 
*/ 1368 | if (!(p->signal->flags & SIGNAL_STOP_CONTINUED)) { 1369 | spin_unlock_irq(&p->sighand->siglock); 1370 | return 0; 1371 | } 1372 | if (!unlikely(wo->wo_flags & WNOWAIT)) 1373 | p->signal->flags &= ~SIGNAL_STOP_CONTINUED; 1374 | uid = from_kuid_munged(current_user_ns(), task_uid(p)); 1375 | spin_unlock_irq(&p->sighand->siglock); 1376 | 1377 | pid = task_pid_vnr(p); 1378 | get_task_struct(p); 1379 | read_unlock(&tasklist_lock); 1380 | sched_annotate_sleep(); 1381 | if (wo->wo_rusage) 1382 | getrusage(p, RUSAGE_BOTH, wo->wo_rusage); 1383 | put_task_struct(p); 1384 | 1385 | infop = wo->wo_info; 1386 | if (!infop) { 1387 | wo->wo_stat = 0xffff; 1388 | } else { 1389 | infop->cause = CLD_CONTINUED; 1390 | infop->pid = pid; 1391 | infop->uid = uid; 1392 | infop->status = SIGCONT; 1393 | } 1394 | return pid; 1395 | } 1396 | 1397 | /* 1398 | * Consider @p for a wait by @parent. 1399 | * 1400 | * -ECHILD should be in ->notask_error before the first call. 1401 | * Returns nonzero for a final return, when we have unlocked tasklist_lock. 1402 | * Returns zero if the search for a child should continue; 1403 | * then ->notask_error is 0 if @p is an eligible child, 1404 | * or still -ECHILD. 1405 | */ 1406 | static int wait_consider_task(struct wait_opts *wo, int ptrace, 1407 | struct task_struct *p) 1408 | { 1409 | /* 1410 | * We can race with wait_task_zombie() from another thread. 1411 | * Ensure that EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE transition 1412 | * can't confuse the checks below. 1413 | */ 1414 | int exit_state = READ_ONCE(p->exit_state); 1415 | int ret; 1416 | 1417 | if (unlikely(exit_state == EXIT_DEAD)) 1418 | return 0; 1419 | 1420 | ret = eligible_child(wo, ptrace, p); 1421 | if (!ret) 1422 | return ret; 1423 | 1424 | if (unlikely(exit_state == EXIT_TRACE)) { 1425 | /* 1426 | * ptrace == 0 means we are the natural parent. In this case 1427 | * we should clear notask_error, debugger will notify us. 
1428 | */ 1429 | if (likely(!ptrace)) 1430 | wo->notask_error = 0; 1431 | return 0; 1432 | } 1433 | 1434 | if (likely(!ptrace) && unlikely(p->ptrace)) { 1435 | /* 1436 | * If it is traced by its real parent's group, just pretend 1437 | * the caller is ptrace_do_wait() and reap this child if it 1438 | * is zombie. 1439 | * 1440 | * This also hides group stop state from real parent; otherwise 1441 | * a single stop can be reported twice as group and ptrace stop. 1442 | * If a ptracer wants to distinguish these two events for its 1443 | * own children it should create a separate process which takes 1444 | * the role of real parent. 1445 | */ 1446 | if (!ptrace_reparented(p)) 1447 | ptrace = 1; 1448 | } 1449 | 1450 | /* slay zombie? */ 1451 | if (exit_state == EXIT_ZOMBIE) { 1452 | /* we don't reap group leaders with subthreads */ 1453 | if (!delay_group_leader(p)) { 1454 | /* 1455 | * A zombie ptracee is only visible to its ptracer. 1456 | * Notification and reaping will be cascaded to the 1457 | * real parent when the ptracer detaches. 1458 | */ 1459 | if (unlikely(ptrace) || likely(!p->ptrace)) 1460 | return wait_task_zombie(wo, p); 1461 | } 1462 | 1463 | /* 1464 | * Allow access to stopped/continued state via zombie by 1465 | * falling through. Clearing of notask_error is complex. 1466 | * 1467 | * When !@ptrace: 1468 | * 1469 | * If WEXITED is set, notask_error should naturally be 1470 | * cleared. If not, subset of WSTOPPED|WCONTINUED is set, 1471 | * so, if there are live subthreads, there are events to 1472 | * wait for. If all subthreads are dead, it's still safe 1473 | * to clear - this function will be called again in finite 1474 | * amount time once all the subthreads are released and 1475 | * will then return without clearing. 1476 | * 1477 | * When @ptrace: 1478 | * 1479 | * Stopped state is per-task and thus can't change once the 1480 | * target task dies. Only continued and exited can happen. 1481 | * Clear notask_error if WCONTINUED | WEXITED. 
1482 | */ 1483 | if (likely(!ptrace) || (wo->wo_flags & (WCONTINUED | WEXITED))) 1484 | wo->notask_error = 0; 1485 | } else { 1486 | /* 1487 | * @p is alive and it's gonna stop, continue or exit, so 1488 | * there always is something to wait for. 1489 | */ 1490 | wo->notask_error = 0; 1491 | } 1492 | 1493 | /* 1494 | * Wait for stopped. Depending on @ptrace, different stopped state 1495 | * is used and the two don't interact with each other. 1496 | */ 1497 | ret = wait_task_stopped(wo, ptrace, p); 1498 | if (ret) 1499 | return ret; 1500 | 1501 | /* 1502 | * Wait for continued. There's only one continued state and the 1503 | * ptracer can consume it which can confuse the real parent. Don't 1504 | * use WCONTINUED from ptracer. You don't need or want it. 1505 | */ 1506 | return wait_task_continued(wo, p); 1507 | } 1508 | 1509 | /* 1510 | * Do the work of do_wait() for one thread in the group, @tsk. 1511 | * 1512 | * -ECHILD should be in ->notask_error before the first call. 1513 | * Returns nonzero for a final return, when we have unlocked tasklist_lock. 1514 | * Returns zero if the search for a child should continue; then 1515 | * ->notask_error is 0 if there were any eligible children, 1516 | * or still -ECHILD. 
1517 | */ 1518 | static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk) 1519 | { 1520 | struct task_struct *p; 1521 | 1522 | list_for_each_entry(p, &tsk->children, sibling) { 1523 | int ret = wait_consider_task(wo, 0, p); 1524 | 1525 | if (ret) 1526 | return ret; 1527 | } 1528 | 1529 | return 0; 1530 | } 1531 | 1532 | static int ptrace_do_wait(struct wait_opts *wo, struct task_struct *tsk) 1533 | { 1534 | struct task_struct *p; 1535 | 1536 | list_for_each_entry(p, &tsk->ptraced, ptrace_entry) { 1537 | int ret = wait_consider_task(wo, 1, p); 1538 | 1539 | if (ret) 1540 | return ret; 1541 | } 1542 | 1543 | return 0; 1544 | } 1545 | 1546 | static int child_wait_callback(wait_queue_entry_t *wait, unsigned mode, 1547 | int sync, void *key) 1548 | { 1549 | struct wait_opts *wo = container_of(wait, struct wait_opts, 1550 | child_wait); 1551 | struct task_struct *p = key; 1552 | 1553 | if (!eligible_pid(wo, p)) 1554 | return 0; 1555 | 1556 | if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent) 1557 | return 0; 1558 | 1559 | return default_wake_function(wait, mode, sync, key); 1560 | } 1561 | 1562 | void __wake_up_parent(struct task_struct *p, struct task_struct *parent) 1563 | { 1564 | __wake_up_sync_key(&parent->signal->wait_chldexit, 1565 | TASK_INTERRUPTIBLE, 1, p); 1566 | } 1567 | 1568 | static long do_wait(struct wait_opts *wo) 1569 | { 1570 | struct task_struct *tsk; 1571 | int retval; 1572 | 1573 | trace_sched_process_wait(wo->wo_pid); 1574 | 1575 | init_waitqueue_func_entry(&wo->child_wait, child_wait_callback); 1576 | wo->child_wait.private = current; 1577 | add_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait); 1578 | repeat: 1579 | /* 1580 | * If there is nothing that can match our criteria, just get out. 1581 | * We will clear ->notask_error to zero if we see any child that 1582 | * might later match our criteria, even if we are not able to reap 1583 | * it yet. 
1584 | */ 1585 | wo->notask_error = -ECHILD; 1586 | if ((wo->wo_type < PIDTYPE_MAX) && 1587 | (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type]))) 1588 | goto notask; 1589 | 1590 | set_current_state(TASK_INTERRUPTIBLE); 1591 | read_lock(&tasklist_lock); 1592 | tsk = current; 1593 | do { 1594 | retval = do_wait_thread(wo, tsk); 1595 | if (retval) 1596 | goto end; 1597 | 1598 | retval = ptrace_do_wait(wo, tsk); 1599 | if (retval) 1600 | goto end; 1601 | 1602 | if (wo->wo_flags & __WNOTHREAD) 1603 | break; 1604 | } while_each_thread(current, tsk); 1605 | read_unlock(&tasklist_lock); 1606 | 1607 | notask: 1608 | retval = wo->notask_error; 1609 | if (!retval && !(wo->wo_flags & WNOHANG)) { 1610 | retval = -ERESTARTSYS; 1611 | if (!signal_pending(current)) { 1612 | schedule(); 1613 | goto repeat; 1614 | } 1615 | } 1616 | end: 1617 | __set_current_state(TASK_RUNNING); 1618 | remove_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait); 1619 | return retval; 1620 | } 1621 | 1622 | static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop, 1623 | int options, struct rusage *ru) 1624 | { 1625 | struct wait_opts wo; 1626 | struct pid *pid = NULL; 1627 | enum pid_type type; 1628 | long ret; 1629 | 1630 | if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED| 1631 | __WNOTHREAD|__WCLONE|__WALL)) 1632 | return -EINVAL; 1633 | if (!(options & (WEXITED|WSTOPPED|WCONTINUED))) 1634 | return -EINVAL; 1635 | 1636 | switch (which) { 1637 | case P_ALL: 1638 | type = PIDTYPE_MAX; 1639 | break; 1640 | case P_PID: 1641 | type = PIDTYPE_PID; 1642 | if (upid <= 0) 1643 | return -EINVAL; 1644 | break; 1645 | case P_PGID: 1646 | type = PIDTYPE_PGID; 1647 | if (upid <= 0) 1648 | return -EINVAL; 1649 | break; 1650 | default: 1651 | return -EINVAL; 1652 | } 1653 | 1654 | if (type < PIDTYPE_MAX) 1655 | pid = find_get_pid(upid); 1656 | 1657 | wo.wo_type = type; 1658 | wo.wo_pid = pid; 1659 | wo.wo_flags = options; 1660 | wo.wo_info = infop; 1661 | wo.wo_rusage = 
ru; 1662 | ret = do_wait(&wo); 1663 | 1664 | put_pid(pid); 1665 | return ret; 1666 | } 1667 | 1668 | SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *, 1669 | infop, int, options, struct rusage __user *, ru) 1670 | { 1671 | struct rusage r; 1672 | struct waitid_info info = {.status = 0}; 1673 | long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL); 1674 | int signo = 0; 1675 | 1676 | if (err > 0) { 1677 | signo = SIGCHLD; 1678 | err = 0; 1679 | if (ru && copy_to_user(ru, &r, sizeof(struct rusage))) 1680 | return -EFAULT; 1681 | } 1682 | if (!infop) 1683 | return err; 1684 | 1685 | if (!user_access_begin(infop, sizeof(*infop))) 1686 | return -EFAULT; 1687 | 1688 | unsafe_put_user(signo, &infop->si_signo, Efault); 1689 | unsafe_put_user(0, &infop->si_errno, Efault); 1690 | unsafe_put_user(info.cause, &infop->si_code, Efault); 1691 | unsafe_put_user(info.pid, &infop->si_pid, Efault); 1692 | unsafe_put_user(info.uid, &infop->si_uid, Efault); 1693 | unsafe_put_user(info.status, &infop->si_status, Efault); 1694 | user_access_end(); 1695 | return err; 1696 | Efault: 1697 | user_access_end(); 1698 | return -EFAULT; 1699 | } 1700 | 1701 | long kernel_wait4(pid_t upid, int __user *stat_addr, int options, 1702 | struct rusage *ru) 1703 | { 1704 | struct wait_opts wo; 1705 | struct pid *pid = NULL; 1706 | enum pid_type type; 1707 | long ret; 1708 | 1709 | if (options & ~(WNOHANG|WUNTRACED|WCONTINUED| 1710 | __WNOTHREAD|__WCLONE|__WALL)) 1711 | return -EINVAL; 1712 | 1713 | /* -INT_MIN is not defined */ 1714 | if (upid == INT_MIN) 1715 | return -ESRCH; 1716 | 1717 | if (upid == -1) 1718 | type = PIDTYPE_MAX; 1719 | else if (upid < 0) { 1720 | type = PIDTYPE_PGID; 1721 | pid = find_get_pid(-upid); 1722 | } else if (upid == 0) { 1723 | type = PIDTYPE_PGID; 1724 | pid = get_task_pid(current, PIDTYPE_PGID); 1725 | } else /* upid > 0 */ { 1726 | type = PIDTYPE_PID; 1727 | pid = find_get_pid(upid); 1728 | } 1729 | 1730 | wo.wo_type = type; 1731 
| wo.wo_pid = pid; 1732 | wo.wo_flags = options | WEXITED; 1733 | wo.wo_info = NULL; 1734 | wo.wo_stat = 0; 1735 | wo.wo_rusage = ru; 1736 | ret = do_wait(&wo); 1737 | put_pid(pid); 1738 | if (ret > 0 && stat_addr && put_user(wo.wo_stat, stat_addr)) 1739 | ret = -EFAULT; 1740 | 1741 | return ret; 1742 | } 1743 | 1744 | SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr, 1745 | int, options, struct rusage __user *, ru) 1746 | { 1747 | struct rusage r; 1748 | long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL); 1749 | 1750 | if (err > 0) { 1751 | if (ru && copy_to_user(ru, &r, sizeof(struct rusage))) 1752 | return -EFAULT; 1753 | } 1754 | return err; 1755 | } 1756 | 1757 | #ifdef __ARCH_WANT_SYS_WAITPID 1758 | 1759 | /* 1760 | * sys_waitpid() remains for compatibility. waitpid() should be 1761 | * implemented by calling sys_wait4() from libc.a. 1762 | */ 1763 | SYSCALL_DEFINE3(waitpid, pid_t, pid, int __user *, stat_addr, int, options) 1764 | { 1765 | return kernel_wait4(pid, stat_addr, options, NULL); 1766 | } 1767 | 1768 | #endif 1769 | 1770 | #ifdef CONFIG_COMPAT 1771 | COMPAT_SYSCALL_DEFINE4(wait4, 1772 | compat_pid_t, pid, 1773 | compat_uint_t __user *, stat_addr, 1774 | int, options, 1775 | struct compat_rusage __user *, ru) 1776 | { 1777 | struct rusage r; 1778 | long err = kernel_wait4(pid, stat_addr, options, ru ? &r : NULL); 1779 | if (err > 0) { 1780 | if (ru && put_compat_rusage(&r, ru)) 1781 | return -EFAULT; 1782 | } 1783 | return err; 1784 | } 1785 | 1786 | COMPAT_SYSCALL_DEFINE5(waitid, 1787 | int, which, compat_pid_t, pid, 1788 | struct compat_siginfo __user *, infop, int, options, 1789 | struct compat_rusage __user *, uru) 1790 | { 1791 | struct rusage ru; 1792 | struct waitid_info info = {.status = 0}; 1793 | long err = kernel_waitid(which, pid, &info, options, uru ? 
&ru : NULL); 1794 | int signo = 0; 1795 | if (err > 0) { 1796 | signo = SIGCHLD; 1797 | err = 0; 1798 | if (uru) { 1799 | /* kernel_waitid() overwrites everything in ru */ 1800 | if (COMPAT_USE_64BIT_TIME) 1801 | err = copy_to_user(uru, &ru, sizeof(ru)); 1802 | else 1803 | err = put_compat_rusage(&ru, uru); 1804 | if (err) 1805 | return -EFAULT; 1806 | } 1807 | } 1808 | 1809 | if (!infop) 1810 | return err; 1811 | 1812 | if (!user_access_begin(infop, sizeof(*infop))) 1813 | return -EFAULT; 1814 | 1815 | unsafe_put_user(signo, &infop->si_signo, Efault); 1816 | unsafe_put_user(0, &infop->si_errno, Efault); 1817 | unsafe_put_user(info.cause, &infop->si_code, Efault); 1818 | unsafe_put_user(info.pid, &infop->si_pid, Efault); 1819 | unsafe_put_user(info.uid, &infop->si_uid, Efault); 1820 | unsafe_put_user(info.status, &infop->si_status, Efault); 1821 | user_access_end(); 1822 | return err; 1823 | Efault: 1824 | user_access_end(); 1825 | return -EFAULT; 1826 | } 1827 | #endif 1828 | 1829 | __weak void abort(void) 1830 | { 1831 | BUG(); 1832 | 1833 | /* if that doesn't kill us, halt */ 1834 | panic("Oops failed to kill thread"); 1835 | } 1836 | EXPORT_SYMBOL(abort); 1837 | ``` 1838 | ----------------------------------------------------------------------------------------- 1839 | - Creator 1840 | https://github.com/torvalds/linux.git 1841 | 1842 | - Kernel doc 1843 | https://www.kernel.org/doc/html/v4.15/index.html# 1844 | 1845 | - How to patching your Kernel 1846 | https://www.kernel.org/doc/html/v4.10/process/applying-patches.html 1847 | 1848 | - Source: 1849 | https://www.kernel.org/ 1850 | 1851 | - - - Online: 1852 | 1853 | [![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/logo/Linux.png)](https://elixir.bootlin.com/linux/latest/source) 1854 | 1855 | # VV 1856 | -------------------------------------------------------------------------------- /SystemCall/README.MD: 
-------------------------------------------------------------------------------- 1 | # What is a system call? 2 | 3 | In the most literal sense, a system call (also called a "syscall") is an instruction, similar to the "add" instruction or the "jump" instruction. At a higher level, a system call is the way a user-level program asks the operating system to do something for it. If you're writing a program, and you need to read from a file, you use a system call to ask the operating system to read the file for you. 4 | 5 | ----------------------------------------------------------------------------------------------------------------------------- 6 | - System calls in detail 7 | 8 | Here's how a system call works. First, the user program sets up the arguments for the system call. One of the arguments is the system call number (more on that later). Note that all this is done automatically by library functions unless you are writing in assembly. After the arguments are all set up, the program executes the "system call" instruction. This instruction causes an exception: an event that causes the processor to jump to a new address and start executing the code there. 9 | 10 | The instructions at the new address save your user program's state, figure out what system call you want, call the function in the kernel that implements that system call, restore your user program's state, and return control to the user program. A system call is one way that the functions defined in a device driver end up being called. 11 | 12 | That was the whirlwind tour of how a system call works. Next, we'll go into minute detail for those who are curious about exactly how the kernel does all this. Don't worry if you don't quite understand all of the details - just remember that this is one way that a function in the kernel can end up being called, and that no magic is involved. You can trace the control flow all the way through the kernel - with difficulty sometimes, but you can do it.
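The user-space half of this round trip can be demonstrated with the `syscall(2)` wrapper from libc, which takes the system call number explicitly instead of hiding it behind a named wrapper. A minimal, Linux-specific sketch (the helper name `getpid_by_number` is ours; the `SYS_*` constants come from `<sys/syscall.h>`):

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Fetch our process ID by passing the system call number explicitly,
 * instead of going through the libc getpid() wrapper. Both paths end
 * up at the same entry in the kernel's syscall table. */
long getpid_by_number(void)
{
    return syscall(SYS_getpid);
}
```

Comparing the result with `getpid()` should always agree, since `SYS_getpid` indexes the very table entry the wrapper uses.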
13 | 14 | ----------------------------------------------------------------------------------------------------------------------------- 15 | - A system call example 16 | 17 | This is a good place to start showing code to go along with the theory. We'll follow the progress of a read() system call, starting from the moment the system call instruction is executed. The PowerPC architecture will be used as an example for the architecture-specific part of the code. On the PowerPC, when you execute a system call, the processor jumps to the address 0xc00. The code at that location is defined in the file: 18 | 19 | `arch/ppc/kernel/head.S` 20 | 21 | It looks something like this: 22 | 23 | ```asm 24 | /* System call */ 25 | . = 0xc00 26 | SystemCall: 27 | EXCEPTION_PROLOG 28 | EXC_XFER_EE_LITE(0xc00, DoSyscall) 29 | 30 | /* Single step - not used on 601 */ 31 | EXCEPTION(0xd00, SingleStep, SingleStepException, EXC_XFER_STD) 32 | EXCEPTION(0xe00, Trap_0e, UnknownException, EXC_XFER_EE) 33 | ``` 34 | 35 | What this code does is save some state and call another function called `DoSyscall`. Here's a more detailed explanation (feel free to skip this part): 36 | 37 | `EXCEPTION_PROLOG` is a macro that handles the switch from user to kernel space, which requires things like saving the register state of the user process. `EXC_XFER_EE_LITE` is called with the address of this routine and the address of the function `DoSyscall`. Eventually, some state will be saved and `DoSyscall` will be called. The next two lines set up two exception vectors at the addresses `0xd00` and `0xe00`.
38 | 39 | `EXC_XFER_EE_LITE` looks like this: 40 | 41 | ```c 42 | #define EXC_XFER_EE_LITE(n, hdlr) \ 43 | EXC_XFER_TEMPLATE(n, hdlr, n+1, COPY_EE, transfer_to_handler, \ 44 | ret_from_except) 45 | ``` 46 | 47 | `EXC_XFER_TEMPLATE` is another macro, and the code looks like this: 48 | 49 | ```c 50 | #define EXC_XFER_TEMPLATE(n, hdlr, trap, copyee, tfer, ret) \ 51 | li r10,trap; \ 52 | stw r10,TRAP(r11); \ 53 | li r10,MSR_KERNEL; \ 54 | copyee(r10, r9); \ 55 | bl tfer; \ 56 | ##n: \ 57 | .long hdlr; \ 58 | .long ret 59 | ``` 60 | 61 | `li` stands for "load immediate", which means that a constant value known at compile time is stored in a register. First, `trap` is loaded into the register `r10`. On the next line, that value is stored at the address given by `TRAP(r11)`. `TRAP(r11)` and the next two lines do some hardware-specific bit manipulation. After that we call the `tfer` function (i.e. the `transfer_to_handler` function), which does yet more housekeeping, and then transfers control to `hdlr` (i.e. `DoSyscall`). Note that `transfer_to_handler` loads the address of the handler from the link register, which is why you see `.long DoSyscall` instead of `bl DoSyscall`. 62 | 63 | Now, let's look at `DoSyscall`. It's in the file: 64 | 65 | `arch/ppc/kernel/entry.S` 66 | 67 | Eventually, this function loads up the address of the syscall table and indexes into it using the system call number. The syscall table is what the OS uses to translate from a system call number to a particular system call. The system call table is named `sys_call_table` and defined in: 68 | 69 | `arch/ppc/kernel/misc.S` 70 | 71 | The syscall table contains the addresses of the functions that implement each system call. For example, the `read()` system call function is named `sys_read`. The `read()` system call number is 3, so the address of `sys_read()` is in the 4th entry of the system call table (since we start numbering the system calls with 0).
We read the data from the address sys_call_table + (3 * word_size) and we get the address of `sys_read()`. 72 | 73 | After `DoSyscall` has looked up the correct system call address, it transfers control to that system call. Let's look at where `sys_read()` is defined, in the file: 74 | `fs/read_write.c` 75 | 76 | This function finds the file struct associated with the fd number you passed to the `read()` function. That structure contains a pointer to the function that should be used to read data from that particular kind of file. After doing some checks, it calls that file-specific read function in order to actually read the data from the file, and then returns. This file-specific function is defined somewhere else - the socket code, filesystem code, or device driver code, for example. This is one of the points at which a specific kernel subsystem finally interfaces with the rest of the kernel. After our read function finishes, we return from `sys_read()` back to `DoSyscall()`, which switches control to `ret_from_except`, which is defined in: 77 | 78 | `arch/ppc/kernel/entry.S` 79 | 80 | This checks for tasks that might need to be done before switching back to user mode. If nothing else needs to be done, we fall through to the `restore` function, which restores the user process's state and returns control back to the user program. There! Your `read()` call is done! If you're lucky, you even got your data back. 81 | 82 | You can explore syscalls further by putting printks at strategic places. Be sure to limit the amount of output from these printks. For example, if you add a `printk` to the `sys_read()` syscall, you should do something like this: 83 | 84 | ```c 85 | static int mycount = 0; 86 | 87 | if (mycount < 10) { 88 | printk ("sys_read called\n"); 89 | mycount++; 90 | } 91 | ``` 92 | # BR nu11secur1ty!
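The dispatch described above (index an array of function addresses by the syscall number, then call through it) can be modeled in plain C. Everything here is invented for illustration — the `toy_` names, the handlers, the table size — while the real table is `sys_call_table` in `arch/ppc/kernel/misc.S`:

```c
#include <stddef.h>

/* Toy handlers standing in for sys_read(), sys_exit(), etc. */
static long toy_sys_restart(long arg) { (void)arg; return 0; }
static long toy_sys_exit(long arg)    { return arg; }
static long toy_sys_read(long arg)    { return arg * 2; }

typedef long (*syscall_fn)(long);

/* A miniature sys_call_table: the syscall number is just an index. */
static const syscall_fn toy_sys_call_table[] = {
    toy_sys_restart,  /* 0 */
    toy_sys_exit,     /* 1 */
    NULL,             /* 2 (unimplemented in this toy) */
    toy_sys_read,     /* 3 - read() really is number 3 in the text above */
};

/* What DoSyscall does after saving state: bounds-check the number,
 * then jump through the table; unknown numbers yield -ENOSYS. */
long toy_do_syscall(size_t nr, long arg)
{
    size_t n = sizeof(toy_sys_call_table) / sizeof(toy_sys_call_table[0]);
    if (nr >= n || toy_sys_call_table[nr] == NULL)
        return -38; /* -ENOSYS */
    return toy_sys_call_table[nr](arg);
}
```

Calling `toy_do_syscall(3, x)` lands in `toy_sys_read`, just as issuing syscall number 3 lands in `sys_read()` in the kernel.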
93 | -------------------------------------------------------------------------------- /SystemCall/SystemCall+LKM+LSM.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/SystemCall/SystemCall+LKM+LSM.pdf -------------------------------------------------------------------------------- /TLB/README.MD: -------------------------------------------------------------------------------- 1 | ## Translation Lookaside Buffer (TLB) in Paging 2 | 3 | In an operating system that uses paging for memory management, a page table is created for each process; it holds the process's page table entries (PTEs). A PTE contains the frame number (the location in main memory we want to reference) and some other useful bits (e.g., the valid/invalid bit, dirty bit, and protection bits). In short, the PTE tells where in main memory the actual page resides. 4 | 5 | Now the question is where to place the page table so that the overall access (reference) time is as low as possible. 6 | 7 | The original problem was how to quickly access main memory content based on the address generated by the CPU (the logical/virtual address). At first, some designers thought of using registers to store the page table, since registers are high-speed memory and access time would be low. 8 | 9 | The idea was to place the page table entries in registers: each request generated by the CPU (a virtual address) is matched against the appropriate page number in the page table, which then tells where in main memory the corresponding page resides.
Everything seems right here, but the problem is that registers are small: in practice they can hold at most about 0.5K to 1K page table entries, while a process may be large and so need a large page table (say, 1M entries), so registers cannot hold all the PTEs of the page table. This is not a practical approach. 10 | To overcome this size issue, the entire page table was kept in main memory. But then two main-memory references are required: 11 | 12 | - To find the frame number 13 | 14 | - To go to the address specified by the frame number 15 | 16 | To overcome this problem, a high-speed cache for page table entries is set up, called a Translation Lookaside Buffer (TLB). The TLB is a special cache used to keep track of recently used translations; it contains the page table entries that have been used most recently. Given a virtual address, the processor examines the TLB: if the page table entry is present (a TLB hit), the frame number is retrieved and the real address is formed. If the entry is not found (a TLB miss), the page number is used to index the page table in main memory; if the page itself is not in main memory, a page fault is issued. The TLB is then updated to include the new page entry. 17 | 18 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/TLB/screen/tlb1.jpg) 19 | 20 | ## Steps in TLB hit: 21 | 22 | 23 | 1 . CPU generates a virtual (logical) address. 24 | 25 | 2 . It is checked in the TLB (present). 26 | 27 | 3 . The corresponding frame number is retrieved, which tells where in main memory the page lies. 28 | 29 | ## Steps in TLB miss: 30 | 31 | 32 | 1 . CPU generates a virtual (logical) address. 33 | 34 | 2 . It is checked in the TLB (not present). 35 | 36 | 3 . The page number is matched against the page table residing in main memory (assuming the page table contains all PTEs). 37 | 38 | 4 . 
The corresponding frame number is retrieved, which tells where in main memory the page lies. 39 | 40 | 5 . The TLB is updated with the new PTE (if there is no free slot, a replacement policy such as FIFO, LRU, or MFU comes into play). 41 | 42 | ``` 43 | Effective memory access time (EMAT): the TLB reduces the effective memory access time because it is a high-speed associative cache. 44 | EMAT = h*(c+m) + (1-h)*(c+2m) 45 | where, h = hit ratio of TLB 46 | m = Memory access time 47 | c = TLB access time 48 | ``` 49 | -------------------------------------------------------------------------------- /TLB/screen/screen: -------------------------------------------------------------------------------- 1 | screen 2 | -------------------------------------------------------------------------------- /TLB/screen/tlb1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/TLB/screen/tlb1.jpg -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/README.MD: -------------------------------------------------------------------------------- 1 | # Kconfig 2 | The Linux kernel config/build system, also known as Kconfig/kbuild, has been around for a long time, ever since the Linux kernel code migrated to Git. As supporting infrastructure, however, it is seldom in the spotlight; even kernel developers who use it in their daily work never really think about it. 3 | 4 | To explore how the Linux kernel is compiled, this article will dive into the Kconfig/kbuild internal process, explain how the .config file and the vmlinux/bzImage files are produced, and introduce a smart trick for dependency tracking. 5 | 6 | The first step in building a kernel is always configuration. Kconfig helps make the Linux kernel highly modular and customizable. 
Kconfig offers the user many config targets: 7 | 8 | - Kconfig 9 | ``` 10 | config - Update current config utilizing a line-oriented program 11 | nconfig - Update current config utilizing a ncurses menu-based program 12 | menuconfig - Update current config utilizing a menu-based program 13 | xconfig - Update current config utilizing a Qt-based frontend 14 | gconfig - Update current config utilizing a GTK+ based frontend 15 | oldconfig - Update current config utilizing a provided .config as base 16 | localmodconfig - Update current config disabling modules not loaded 17 | localyesconfig - Update current config converting local mods to core 18 | defconfig - New config with default from Arch-supplied defconfig 19 | savedefconfig - Save current config as ./defconfig (minimal config) 20 | allnoconfig - New config where all options are answered with 'no' 21 | allyesconfig - New config where all options are accepted with 'yes' 22 | allmodconfig - New config selecting modules when possible 23 | alldefconfig - New config with all symbols set to default 24 | randconfig - New config with a random answer to all options 25 | listnewconfig - List new options 26 | olddefconfig - Same as oldconfig but sets new symbols to their default value without prompting 27 | kvmconfig - Enable additional options for KVM guest kernel support 28 | xenconfig - Enable additional options for xen dom0 and guest kernel support 29 | tinyconfig - Configure the tiniest possible kernel 30 | ``` 31 | The menuconfig is the most popular of these targets. The targets are processed by different host programs, which are provided by the kernel and built during kernel building. Some targets have a GUI (for the user's convenience) while most don't. Kconfig-related tools and source code reside mainly under scripts/kconfig/ in the kernel source. As we can see from scripts/kconfig/Makefile, there are several host programs, including conf, mconf, and nconf. 
Except for conf, each of them is responsible for one of the GUI-based config targets, so, conf deals with most of them. 32 | 33 | Logically, Kconfig's infrastructure has two parts: one implements a new language to define the configuration items (see the Kconfig files under the kernel source), and the other parses the Kconfig language and deals with configuration actions. 34 | 35 | Most of the config targets have roughly the same internal process (shown below): 36 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/The-secrets-of-Kconfig/wall/1.png) 37 | 38 | 39 | - Note that all configuration items have a default value. 40 | 41 | The first step reads the Kconfig file under source root to construct an initial configuration database; then it updates the initial database by reading an existing configuration file according to this priority: 42 | ``` 43 | .config 44 | /lib/modules/$(shell,uname -r)/.config 45 | /etc/kernel-config 46 | /boot/config-$(shell,uname -r) 47 | ARCH_DEFCONFIG 48 | arch/$(ARCH)/defconfig 49 | ``` 50 | If you are doing GUI-based configuration via menuconfig or command-line-based configuration via oldconfig, the database is updated according to your customization. Finally, the configuration database is dumped into the .config file. 51 | 52 | But the .config file is not the final fodder for kernel building; this is why the syncconfig target exists. syncconfig used to be a config target called silentoldconfig, but it doesn't do what the old name says, so it was renamed. Also, because it is for internal use (not for users), it was dropped from the list. 53 | 54 | Here is an illustration of what syncconfig does: 55 | 56 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/The-secrets-of-Kconfig/wall/2.png) 57 | 58 | syncconfig takes .config as input and outputs many other files, which fall into three categories: 59 | 60 | - auto.conf & tristate.conf are used for makefile text processing. 
For example, you may see statements like this in a component's makefile: 61 | 62 | ```make 63 | obj-$(CONFIG_GENERIC_CALIBRATE_DELAY) += calibrate.o 64 | ``` 65 | - autoconf.h is used in C-language source files. 66 | - Empty header files under include/config/ are used for configuration-dependency tracking during kbuild, which is explained below. 67 | 68 | After configuration, we will know which files and code pieces are not compiled. 69 | 70 | -------------------------------------------------------------------------------- 71 | 72 | # kbuild 73 | 74 | Component-wise building, called recursive make, is a common way for GNU make to manage a large project. Kbuild is a good example of recursive make. By dividing source files into different modules/components, each component is managed by its own makefile. When you start building, a top makefile invokes each component's makefile in the proper order, builds the components, and collects them into the final executable. 75 | 76 | Kbuild refers to different kinds of makefiles: 77 | 78 | - `Makefile` is the top makefile located in the source root. 79 | - `.config` is the kernel configuration file. 80 | - `arch/$(ARCH)/Makefile` is the arch makefile, which is the supplement to the top makefile. 81 | - `scripts/Makefile.*` describes common rules for all kbuild makefiles. 82 | - Finally, there are about 500 `kbuild makefiles`. 83 | 84 | The top makefile includes the arch makefile, reads the .config file, descends into subdirectories, invokes `make` on each component's makefile with the help of routines defined in `scripts/Makefile.*`, builds up each intermediate object, and links all the intermediate objects into vmlinux. The kernel document Documentation/kbuild/makefiles.txt describes all aspects of these makefiles.
85 | 86 | As an example, let's look at how vmlinux is produced on x86-64: 87 | 88 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/The-secrets-of-Kconfig/wall/3.png) 89 | 90 | 91 | All the `.o` files that go into vmlinux first go into their own built-in.a, which is indicated via the variables `KBUILD_VMLINUX_INIT`, `KBUILD_VMLINUX_MAIN`, and `KBUILD_VMLINUX_LIBS`, and are then collected into the vmlinux file. 92 | 93 | Take a look at how recursive make is implemented in the Linux kernel, with the help of simplified makefile code: 94 | 95 | 96 | ```make 97 | # In top Makefile
vmlinux: scripts/link-vmlinux.sh $(vmlinux-deps)
	+$(call if_changed,link-vmlinux)

# Variable assignments
vmlinux-deps := $(KBUILD_LDS) $(KBUILD_VMLINUX_INIT) $(KBUILD_VMLINUX_MAIN) $(KBUILD_VMLINUX_LIBS)

export KBUILD_VMLINUX_INIT := $(head-y) $(init-y)
export KBUILD_VMLINUX_MAIN := $(core-y) $(libs-y2) $(drivers-y) $(net-y) $(virt-y)
export KBUILD_VMLINUX_LIBS := $(libs-y1)
export KBUILD_LDS := arch/$(SRCARCH)/kernel/vmlinux.lds

init-y := init/
drivers-y := drivers/ sound/ firmware/
net-y := net/
libs-y := lib/
core-y := usr/
virt-y := virt/

# Transform to corresponding built-in.a
init-y := $(patsubst %/, %/built-in.a, $(init-y))
core-y := $(patsubst %/, %/built-in.a, $(core-y))
drivers-y := $(patsubst %/, %/built-in.a, $(drivers-y))
net-y := $(patsubst %/, %/built-in.a, $(net-y))
libs-y1 := $(patsubst %/, %/lib.a, $(libs-y))
libs-y2 := $(patsubst %/, %/built-in.a, $(filter-out %.a, $(libs-y)))
virt-y := $(patsubst %/, %/built-in.a, $(virt-y))

# Setup the dependency. vmlinux-deps are all intermediate objects, vmlinux-dirs
# are phony targets, so every time comes to this rule, the recipe of vmlinux-dirs
# will be executed. Refer "4.6 Phony Targets" of `info make`
$(sort $(vmlinux-deps)): $(vmlinux-dirs) ;

# Variable vmlinux-dirs is the directory part of each built-in.a
vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
		$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
		$(net-y) $(net-m) $(libs-y) $(libs-m) $(virt-y)))

# The entry of recursive make
$(vmlinux-dirs):
	$(Q)$(MAKE) $(build)=$@ need-builtin=1 98 | ``` 99 | 100 | The recursive make recipe is expanded, for example: 101 | 102 | ```bash 103 | make -f scripts/Makefile.build obj=init need-builtin=1 104 | ``` 105 | 106 | 107 | This means make will go into scripts/Makefile.build to continue the work of building each built-in.a. With the help of scripts/link-vmlinux.sh, the vmlinux file finally appears under the source root. 108 | # Understanding vmlinux vs. bzImage 109 | 110 | Many Linux kernel developers may not be clear about the relationship between vmlinux and bzImage. For example, here is their relationship in x86-64: 111 | 112 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/The-secrets-of-Kconfig/wall/4.png) 113 | 114 | 115 | The source root vmlinux is stripped, compressed, put into `piggy.S`, then linked with other peer objects into `arch/x86/boot/compressed/vmlinux`. Meanwhile, a file called setup.bin is produced under `arch/x86/boot`. There may be an optional third file that has relocation info, depending on the configuration of `CONFIG_X86_NEED_RELOCS`. 116 | 117 | A host program called build, provided by the kernel, builds these two (or three) parts into the final bzImage file. 118 | 119 | - Dependency tracking 120 | 121 | Kbuild tracks three kinds of dependencies: 122 | 123 | 1. All prerequisite files (both *.c and *.h) 124 | 2. CONFIG_ options used in all prerequisite files 125 | 3. Command-line dependencies used to compile the target 126 | 127 | The first one is easy to understand, but what about the second and third? 
Kernel developers often see code pieces like this: 128 | 129 | ```c 130 | #ifdef CONFIG_SMP
	__boot_cpu_id = cpu;
#endif 131 | ``` 132 | 133 | When `CONFIG_SMP` changes, this piece of code should be recompiled. The command line for compiling a source file also matters, because different command lines may result in different object files. 134 | 135 | When a `.c` file uses a header file via a `#include` directive, you need to write a rule like this: 136 | 137 | ```make 138 | main.o: defs.h
	recipe... 139 | ``` 140 | When managing a large project, you need a lot of these kinds of rules; writing them all would be tedious and boring. Fortunately, most modern C compilers can write these rules for you by looking at the #include lines in the source file. For the GNU Compiler Collection (GCC), it is just a matter of adding the command-line parameter `-MD depfile`: 141 | 142 | 143 | ``` 144 | # In scripts/Makefile.lib
c_flags = -Wp,-MD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
	-include $(srctree)/include/linux/compiler_types.h \
	$(__c_flags) $(modkern_cflags) \
	$(basename_flags) $(modname_flags) 145 | ``` 146 | This would generate a `.d` file with content like: 147 | 148 | ``` 149 | init_task.o: init/init_task.c include/linux/kconfig.h \
	include/generated/autoconf.h include/linux/init_task.h \
	include/linux/rcupdate.h include/linux/types.h \
	... 150 | ``` 151 | 152 | Then the host program `fixdep` takes care of the other two dependencies by taking the `depfile` and command line as input, then outputting a `.<target>.cmd` file in makefile syntax, which records the command line and all the prerequisites (including the configuration) for a target. It looks like this: 153 | 154 | 155 | ``` 156 | # The command line used to compile the target
cmd_init/init_task.o := gcc -Wp,-MD,init/.init_task.o.d -nostdinc ...

# The dependency files
deps_init/init_task.o := \
	$(wildcard include/config/posix/timers.h) \
	$(wildcard include/config/arch/task/struct/on/stack.h) \
	$(wildcard include/config/thread/info/in/task.h) \
	... \
	include/uapi/linux/types.h \
	arch/x86/include/uapi/asm/types.h \
	include/uapi/asm-generic/types.h \
	... 157 | ``` 158 | 159 | 160 | A `.<target>.cmd` file will be included during recursive make, providing all the dependency info and helping to decide whether to rebuild a target or not. 161 | 162 | The secret behind this is that `fixdep` will parse the `depfile` (the `.d` file), then parse all the dependency files inside it, search the text for all the `CONFIG_` strings, convert them to the corresponding empty header files, and add them to the target's prerequisites. Every time the configuration changes, the corresponding empty header file will be updated, too, so kbuild can detect that change and rebuild the target that depends on it. Because the command line is also recorded, it is easy to compare the last and current compiling parameters. 
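The CONFIG_-string-to-header-path conversion that fixdep relies on can be sketched in a few lines of C: lowercase the suffix after `CONFIG_`, turn underscores into slashes, and append `.h`. This is only a toy for the old naming scheme shown in the example above; the real `scripts/basic/fixdep.c` also parses the depfile and the command line:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Map a CONFIG_ symbol to the empty-header path that kbuild touches when
 * that option changes, e.g. CONFIG_SMP -> include/config/smp.h.
 * Returns a pointer to a static buffer (fine for a demo, not reentrant),
 * or NULL if the symbol lacks the CONFIG_ prefix. */
const char *config_to_header(const char *sym)
{
    static char out[256];

    if (strncmp(sym, "CONFIG_", 7) != 0)
        return NULL;

    size_t n = (size_t)snprintf(out, sizeof(out), "include/config/");
    for (const char *p = sym + 7; *p && n + 3 < sizeof(out); p++, n++)
        out[n] = (*p == '_') ? '/' : (char)tolower((unsigned char)*p);

    out[n] = '\0';
    strcat(out, ".h");
    return out;
}
```

Feeding it `CONFIG_ARCH_TASK_STRUCT_ON_STACK` reproduces the `include/config/arch/task/struct/on/stack.h` path seen in the `deps_init/init_task.o` listing above.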
163 | -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/wall/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/The-secrets-of-Kconfig/wall/1.png -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/wall/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/The-secrets-of-Kconfig/wall/2.png -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/wall/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/The-secrets-of-Kconfig/wall/3.png -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/wall/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/The-secrets-of-Kconfig/wall/4.png -------------------------------------------------------------------------------- /Uninstalling/kernels.md: -------------------------------------------------------------------------------- 1 | # RPM-based distros – Red Hat/CentOS/Fedora Core/SUSE Linux 2 | 3 | First, find out all installed kernel versions with the following command: 4 | ```bash 5 | # rpm -qa | grep kernel-smp 6 | ``` 7 | - or 8 | ```bash 9 | # rpm -qa | grep kernel 10 | ``` 11 | - Output: 12 | ```bash 13 | kernel-smp-x.x.x-x.x.x.EL 14 | kernel-smp-x.x.x-x.x.x.EL 15 | kernel-smp-x.x.x-x.x.x.EL 16 | ``` 17 | I have three different kernels 
installed. To remove kernel-smp-x.x.x-x.x.x.EL, type the command: 18 | ```bash 19 | # rpm -e kernel-smp-x.x.x-x.x.x.EL 20 | ``` 21 | - OR 22 | ```bash 23 | # rpm -vv -e kernel-smp-x.x.x-x.x.x.EL 24 | ``` 25 | -------------------------------------------------------------------------------- /Userspace/README.MD: -------------------------------------------------------------------------------- 1 | # Read 2 | - Part1 3 | 4 | [![alt text](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Userspace/logo/red-hat-logo.png)](https://www.redhat.com/en/blog/architecting-containers-part-1-why-understanding-user-space-vs-kernel-space-matters "Linux-Userspace") 5 | 6 | - Part2 7 | 8 | [![alt text](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Userspace/logo/red-hat-logo.png)](https://www.redhat.com/en/blog/architecting-containers-part-2-why-user-space-matters "Linux-Userspace") 9 | 10 | - Part3 11 | 12 | [![alt text](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Userspace/logo/red-hat-logo.png)](https://www.redhat.com/en/blog/architecting-containers-part-3-how-user-space-affects-your-application "Linux-Userspace") 13 | 14 | 15 | -------------------------------------------------------------------------------- /Userspace/logo/red-hat-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Userspace/logo/red-hat-logo.png -------------------------------------------------------------------------------- /Userspace/what_mean.md: -------------------------------------------------------------------------------- 1 | This is Linus Torvalds' first rule of kernel development. This note explains it: https://lkml.org/lkml/2012/12/23/75. I.e., when maintaining the kernel, do not do something which breaks user programs/applications. 
In other words, when making kernel changes, it is very bad to cause problems in the user's application "space". That doesn't literally mean memory. That means anything that impacts the user applications in a way that negatively affects its behavior (causes the program to malfunction). The note I cite also indicates at least one example. 2 | 3 | 4 | ----------------------------------------------------- 5 | 6 | ```bash 7 | From Linus Torvalds <> 8 | Date Sun, 23 Dec 2012 09:36:15 -0800 9 | Subject Re: [Regression w/ patch] Media commit causes user space to misbahave (was: Re: Linux 3.8-rc1) 10 | On Sun, Dec 23, 2012 at 6:08 AM, Mauro Carvalho Chehab 11 | wrote: 12 | > 13 | > Are you saying that pulseaudio is entering on some weird loop if the 14 | > returned value is not -EINVAL? That seems a bug at pulseaudio. 15 | 16 | Mauro, SHUT THE FUCK UP! 17 | 18 | It's a bug alright - in the kernel. How long have you been a 19 | maintainer? And you *still* haven't learnt the first rule of kernel 20 | maintenance? 21 | 22 | If a change results in user programs breaking, it's a bug in the 23 | kernel. We never EVER blame the user programs. How hard can this be to 24 | understand? 25 | 26 | To make matters worse, commit f0ed2ce840b3 is clearly total and utter 27 | CRAP even if it didn't break applications. ENOENT is not a valid error 28 | return from an ioctl. Never has been, never will be. ENOENT means "No 29 | such file and directory", and is for path operations. ioctl's are done 30 | on files that have already been opened, there's no way in hell that 31 | ENOENT would ever be valid. 32 | 33 | > So, on a first glance, this doesn't sound like a regression, 34 | > but, instead, it looks tha pulseaudio/tumbleweed has some serious 35 | > bugs and/or regressions. 36 | 37 | Shut up, Mauro. And I don't _ever_ want to hear that kind of obvious 38 | garbage and idiocy from a kernel maintainer again. Seriously. 
39 | 40 | I'd wait for Rafael's patch to go through you, but I have another 41 | error report in my mailbox of all KDE media applications being broken 42 | by v3.8-rc1, and I bet it's the same kernel bug. And you've shown 43 | yourself to not be competent in this issue, so I'll apply it directly 44 | and immediately myself. 45 | 46 | WE DO NOT BREAK USERSPACE! 47 | 48 | Seriously. How hard is this rule to understand? We particularly don't 49 | break user space with TOTAL CRAP. I'm angry, because your whole email 50 | was so _horribly_ wrong, and the patch that broke things was so 51 | obviously crap. The whole patch is incredibly broken shit. It adds an 52 | insane error code (ENOENT), and then because it's so insane, it adds a 53 | few places to fix it up ("ret == -ENOENT ? -EINVAL : ret"). 54 | 55 | The fact that you then try to make *excuses* for breaking user space, 56 | and blaming some external program that *used* to work, is just 57 | shameful. It's not how we work. 58 | 59 | Fix your f*cking "compliance tool", because it is obviously broken. 60 | And fix your approach to kernel programming. 61 | 62 | Linus 63 | ``` 64 | -------------------------------------------------------------------------------- /file_system/File System in Linux Kernel.md: -------------------------------------------------------------------------------- 1 | # Introduction: 2 | A file system is one of the central OS subsystems. File systems continue to develop as operating systems evolve. Currently we have an entire heterogeneous zoo of file systems, from the old “classic” UFS, to the new exotic NILFS (though this idea isn’t new at all, look at LFS) and BTRFS. We aren’t going to try to take on monsters like ext3/4 and BTRFS. Our file system will be educational in nature, and we will use it to familiarize ourselves with the Linux kernel. 3 | 4 | # Environment Setup: 5 | Before we get into the kernel, let’s prepare all the necessary steps for building our OS module.
I use Ubuntu, so I’m going to set up within this environment. Fortunately, it’s not difficult at all. To begin with, we’re going to need a compiler and build tools: 6 | 7 | ``` 8 | sudo apt-get install gcc build-essential 9 | ``` 10 | Then we may need the kernel source code. We’ll go the easy route and won’t bother rebuilding the kernel from source. We’ll just install the kernel headers; this should be enough to write a loadable module. The headers can be installed the following way: 11 | 12 | Check for available linux headers: 13 | ``` 14 | apt-cache search linux-headers 15 | ``` 16 | Install them: 17 | ``` 18 | sudo apt-get install linux-headers-xx.x.x.x.x 19 | sudo apt-get install linux-headers-$(uname -r) 20 | ``` 21 | 22 | And now I’m going to jump onto my soap box. Rummaging in the kernel on a working machine isn’t the smartest idea, so I strongly recommend you perform all these actions within a virtual machine. We won’t do anything dangerous, so the stored data is safe. But if anything goes wrong, we’ll probably have to restart the system. Besides, it’s more comfortable to debug kernel modules in a virtual machine (such as QEMU), though this question won’t be considered in the article. 23 | 24 | # Environment Check Up: 25 | 26 | In order to check the environment we’ll write and load a kernel module which won’t do anything useful (Hello, World!). Let’s consider the module code. I named it super.c (super is derived from superblock): 27 | 28 | ``` 29 | #include <linux/module.h> 30 | #include <linux/init.h> 31 | 32 | static int __init aufs_init(void) 33 | { 34 | pr_debug("aufs module loaded\n"); 35 | return 0; 36 | } 37 | 38 | static void __exit aufs_fini(void) 39 | { 40 | pr_debug("aufs module unloaded\n"); 41 | } 42 | 43 | module_init(aufs_init); 44 | module_exit(aufs_fini); 45 | 46 | MODULE_LICENSE("GPL"); 47 | MODULE_AUTHOR("kmu"); 48 | ``` 49 | 50 | At the very beginning you can see two headers. They are an important part of any loadable module.
Then two functions, aufs_init and aufs_fini, follow. They will be called at module load and at module unload, respectively. Some of you may be confused by the __init label. __init is a hint to the kernel that the function is used during module initialization only. It means that after the module initialization it can be unloaded from memory. There is an analogous marker for data; the kernel is free to ignore these hints. Referencing __init functions and data from the main module code is a potential error. That’s why during the module build it is checked that there are no such references. If such a reference is found, the kernel build system will issue a warning. A similar check is carried out for __exit functions and data. If you want to know the details about __init and __exit, you can refer to the source code. 51 | 52 | Please note that aufs_init returns int. This is how the kernel finds out whether something went wrong during module initialization. If the function returns a non-zero value, it means that an error occurred during initialization. In order to tell the kernel which functions should be called at module loading and unloading, the two macros module_init and module_exit are used. To learn the details, refer to lxr, it’s really useful if you want to study the kernel. pr_debug is a kernel logging function (it’s a macro actually, but it doesn’t matter for now); it’s very similar to the printf family of functions, with some extensions (for example, for printing IP and MAC addresses). You will find a complete list of format specifiers in the kernel documentation. Together with pr_debug, there is an entire family of macros: pr_info, pr_warn, pr_err and others. If you are familiar with Linux module development you know about the printk function. The pr_* macros expand into printk calls, so you can use printk instead of them. 53 | 54 | Then there are macros carrying information about the module – a license and an author.
There are also other macros that allow you to store various information about the module, for example MODULE_VERSION, MODULE_INFO, MODULE_SUPPORTED_DEVICE and others. By the way, if you’re using a license other than the GPL, you won’t be able to use some functions that are available only to GPL modules. 55 | 56 | Now let’s build and load our module. We’ll write a Makefile for this. It will build our module: 57 | 58 | ``` 59 | obj-m := aufs.o 60 | aufs-objs := super.o 61 | 62 | CFLAGS_super.o := -DDEBUG 63 | 64 | all: 65 | make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules 66 | 67 | clean: 68 | make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean 69 | ``` 70 | 71 | Our Makefile invokes the kernel’s own Makefile, which should be located in the /lib/modules/$(shell uname -r)/build directory (uname -r is a command that returns the version of the running kernel). If the headers (or source code) of the kernel are located in another directory, you should adjust the path. 72 | 73 | obj-m states the name of the future module. In our case it will be named aufs.ko (ko stands for kernel object). aufs-objs indicates which source files the aufs module should be built from. In our case the super.c file will be used. You can also pass additional compiler flags, which will be used (in addition to those used by the kernel Makefile) when building the object files. In our case I pass the -DDEBUG flag when building super.c. If we don’t pass the flag, we won’t see pr_debug output in the system log. 74 | 75 | To build the module, execute the make command. If everything’s fine, the aufs.ko file will appear in the directory. It’s quite easy to load the module: 76 | 77 | ``` 78 | sudo insmod ./aufs.ko 79 | ``` 80 | To make sure that the module is loaded you can look at the lsmod command output: 81 | 82 | ``` 83 | lsmod | grep aufs 84 | ``` 85 | 86 | 87 | To see the system log, call the dmesg command. We’ll see messages from our module in it.
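As an aside, lsmod is essentially a formatted view of /proc/modules, so the same check can be done with no extra tools (the aufs name assumes the module from this article has been built and loaded):

```shell
# Each line of /proc/modules looks like: "name size refcount deps state address".
# Anchoring the pattern to the start of the line avoids matching other
# modules whose names merely contain "aufs".
if grep -q '^aufs ' /proc/modules; then
    echo "aufs is loaded"
else
    echo "aufs is not loaded"
fi
```

The same anchored pattern works on lsmod output as well, since lsmod prints the module name in the first column.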
It’s not difficult to unload the module: 88 | 89 | ``` 90 | sudo rmmod aufs 91 | ``` 92 | 93 | # Returning to the File System: 94 | 95 | Thus, the environment is set up and working. We know how to build the simplest module, load it and unload it. Now we should consider the file system itself. The design of a file system should begin “on paper”, with a thorough consideration of the data structures being used. But we’ll follow a simpler path and defer the details of how files and folders are stored on disk until next time. For now we’ll write a skeleton of our future file system. 96 | 97 | The life of a file system begins with registration. You register the file system by calling register_filesystem. We’ll register the file system in the module initialization function. There’s unregister_filesystem for unregistering the file system, and we’ll call it in the aufs_fini function of our module. 98 | 99 | Both functions accept a pointer to a file_system_type structure as a parameter. It “describes” the file system. Consider it as a class of the file system. There are quite a few fields in the structure, but we’re interested in only some of them: 100 | 101 | 102 | ``` 103 | static struct file_system_type aufs_type = { 104 | .owner = THIS_MODULE, 105 | .name = "aufs", 106 | .mount = aufs_mount, 107 | .kill_sb = kill_block_super, 108 | .fs_flags = FS_REQUIRES_DEV, 109 | }; 110 | ``` 111 | 112 | First of all, we are interested in the name field. It stores the file system name. This name will be used during mounting. 113 | 114 | mount & kill_sb — are two fields containing pointers to functions. The first function will be called when the file system is mounted, the second one when it is unmounted. It’s enough to implement only the first of them; for the second we’ll use kill_block_super, which is provided by the kernel. 115 | 116 | fs_flags — stores different flags.
In our case it stores the FS_REQUIRES_DEV flag, which says that our file system needs a disk to operate (though that’s not actually the case for the moment). You don’t have to set this flag if you don’t want to; everything will operate without it. Finally, the owner field is necessary in order to maintain a reference counter on the module. The reference counter is needed so that the module isn’t unloaded too soon. For example, if the file system is mounted, unloading the module could lead to a crash. The reference counter won’t allow the module to be unloaded while it is in use, i.e. until we unmount the file system. 117 | 118 | Now let’s consider the aufs_mount function. It should mount the device and return a structure describing the root directory of the file system. It sounds quite complicated, but, fortunately, the kernel will do most of this for us as well: 119 | 120 | ``` 121 | static struct dentry *aufs_mount(struct file_system_type *type, int flags, 122 | char const *dev, void *data) 123 | { 124 | struct dentry *const entry = mount_bdev(type, flags, dev, 125 | data, aufs_fill_sb); 126 | if (IS_ERR(entry)) 127 | pr_err("aufs mounting failed\n"); 128 | else 129 | pr_debug("aufs mounted\n"); 130 | return entry; 131 | } 132 | ``` 133 | 134 | The biggest part of the work happens inside the mount_bdev function. We are interested in its aufs_fill_sb parameter. It’s a pointer to a function (again), which will be called from mount_bdev in order to initialize the superblock. But before we move on to it, an important structure of the kernel’s file subsystem — dentry — should be considered. This structure represents a section of a path to a file name. For example, if we refer to the /usr/bin/vim file, we’ll have different instances of this structure representing the path sections / (the root directory), usr/, bin/ and vim. The kernel maintains a cache of these structures. It allows the kernel to quickly find an inode (another central structure) by the file (path) name.
The aufs_mount function should return the dentry that represents the root directory of our file system. The aufs_fill_sb function will create it. 135 | 136 | Thus, for the moment aufs_fill_sb is the most important function in our module, and it looks like the following: 137 | 138 | ``` 139 | static int aufs_fill_sb(struct super_block *sb, void *data, int silent) 140 | { 141 | struct inode *root = NULL; 142 | 143 | sb->s_magic = AUFS_MAGIC_NUMBER; 144 | sb->s_op = &aufs_super_ops; 145 | 146 | root = new_inode(sb); 147 | if (!root) 148 | { 149 | pr_err("inode allocation failed\n"); 150 | return -ENOMEM; 151 | } 152 | 153 | root->i_ino = 0; 154 | root->i_sb = sb; 155 | root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME; 156 | inode_init_owner(root, NULL, S_IFDIR); 157 | 158 | sb->s_root = d_make_root(root); 159 | if (!sb->s_root) 160 | { 161 | pr_err("root creation failed\n"); 162 | return -ENOMEM; 163 | } 164 | 165 | return 0; 166 | } 167 | ``` 168 | 169 | First of all, we fill the super_block structure. What kind of structure is it? Usually a file system stores, in a special place on the disk partition (the place is chosen by the file system), a set of file system parameters, such as the block size, the number of occupied/free blocks, the file system version, a “pointer” to the root directory, and a magic number by which a driver can check that the expected file system is really stored on the disk. This structure is named the superblock (look at the picture below). The super_block structure in the Linux kernel is mainly intended for similar goals; we save a magic number in it and the dentry for the root directory (the one returned by mount_bdev). 170 | 171 | Besides, in the s_op field of the super_block structure we store a pointer to a super_operations structure — these are the super_block “class methods”; it’s another structure storing a lot of pointers to functions. I’ll make another note here. The Linux kernel is written in C, without support for OOP features from the language side.
But we can structure a program following OOP ideas without support from the language, so structures storing a lot of pointers to functions are met quite often in the kernel. It’s a way of implementing virtual functions with the facilities at hand. 172 | 173 | Let’s get back to the super_block structure and its “methods”. We’re interested in its put_super field. We’ll save a “destructor” of our superblock in it: 174 | 175 | ``` 176 | static void aufs_put_super(struct super_block *sb) 177 | { 178 | pr_debug("aufs super block destroyed\n"); 179 | } 180 | 181 | static struct super_operations const aufs_super_ops = { 182 | .put_super = aufs_put_super, 183 | }; 184 | ``` 185 | 186 | While the aufs_put_super function does nothing useful, we use it just to print one more line to the system log. The aufs_put_super function will be called from within kill_block_super (see above) before the super_block structure is deleted, i.e. when the file system is unmounted. 187 | 188 | Now let’s return to the most important function, aufs_fill_sb. Before we create the dentry for the root directory, we should create an index node (inode) for the root directory. The inode structure is probably the most important one in a file system. Each file system object (a file, a folder, a special file, a journal, etc.) is identified by an inode. As with super_block, the inode structure reflects the way file systems are stored on a disk. The inode name comes from index node, meaning that it indexes files and folders on a disk. Usually an on-disk inode stores a pointer to the place where the file data are kept on disk (in which blocks the file content is stored), various access flags (read/write/execute), information about the file owner, timestamps, and other similar things. 189 | 190 | We can’t read from the disk yet, so we’ll fill the inode with dummy data.
For the timestamps we use the current time, and the kernel will assign the owner and access permissions (the inode_init_owner call). Finally, we create the dentry bound to the root inode. 191 | 192 | # Skeleton Check Up: 193 | 194 | The skeleton of our file system is ready. It’s time to check it. Building and loading the file system driver doesn’t differ from building and loading an ordinary module. We’ll use a loop device instead of a real disk for the experiments. It’s a “disk” driver which writes data not to a physical device but to a file (a disk image). Let’s create a disk image. It doesn’t need to store any data yet, so it’s simple: 195 | 196 | ``` 197 | touch image 198 | ``` 199 | 200 | We should also create a directory which will be the mount point (root) of our file system: 201 | 202 | ``` 203 | mkdir dir 204 | ``` 205 | Now, using this image, we’ll mount our file system: 206 | ``` 207 | sudo mount -o loop -t aufs ./image ./dir 208 | ``` 209 | 210 | 211 | If the operation ended successfully, we’ll see messages from our module in the system log. To unmount the file system we should: 212 | 213 | ``` 214 | sudo umount ./dir 215 | ``` 216 | 217 | Check the system log again. 218 | 219 | # Summary: 220 | 221 | We familiarized ourselves with the creation of loadable kernel modules and the main structures of the file subsystem. We also wrote a real file system, which for now can only be mounted and unmounted (it’s quite silly for the time being, but we’re going to fix that in the future). 222 | 223 | Next we’re going to consider reading data from the disk. To begin with we’ll define the way data will be stored on disk. We’ll also learn how to read the superblock and inodes from the disk.
224 | 225 | # see more [link](https://kukuruku.co/) 226 | 227 | ------------------------------------------------ 228 | 229 | # Source for development: 230 | [link](https://github.com/torvalds/linux/tree/master/fs/ext4) 231 | 232 | 233 | -------------------------------------------------------------------------------- /kernel-types-networknuts.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/kernel-types-networknuts.png -------------------------------------------------------------------------------- /logo/Linux.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/logo/Linux.png -------------------------------------------------------------------------------- /logo/logo: -------------------------------------------------------------------------------- 1 | logo 2 | -------------------------------------------------------------------------------- /modules/example-lkm-module/.clang-format: -------------------------------------------------------------------------------- 1 | --- 2 | BasedOnStyle: Mozilla 3 | AllowShortFunctionsOnASingleLine: Empty 4 | BreakBeforeBraces: Linux 5 | UseTab: ForIndentation 6 | IndentWidth: 8 7 | AlwaysBreakTemplateDeclarations: true 8 | AlignConsecutiveDeclarations: true 9 | AlignConsecutiveAssignments: true 10 | -------------------------------------------------------------------------------- /modules/example-lkm-module/.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | charset = utf-8 5 | end_of_line = LF 6 | insert_final_newline = true 7 | trim_trailing_whitespace = true 8 | indent_style = space 9 | indent_size = 8 10 | 11 | [Makefile] 12 | indent_size = 4 13 | 
indent_style = tab 14 | 15 | [*.md] 16 | trim_trailing_whitespace = false 17 | 18 | [*.go] 19 | indent_style = tab 20 | 21 | [*.yml] 22 | indent_style = space 23 | indent_size = 2 24 | 25 | [*.sh] 26 | indent_style = space 27 | indent_size = 2 28 | 29 | -------------------------------------------------------------------------------- /modules/example-lkm-module/.gitignore: -------------------------------------------------------------------------------- 1 | .tmp_versions 2 | .cache.mk 3 | .hello* 4 | Module.symvers 5 | *.ko 6 | *.o 7 | *.mod.o 8 | *.order 9 | *.mod.c 10 | -------------------------------------------------------------------------------- /modules/example-lkm-module/.ycm_extra_conf.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | 3 | kernel_release = subprocess.check_output(['uname', '-r']).decode().rstrip() 4 | 5 | def FlagsForFile( filename, **kwargs ): 6 | return { 'flags': [ 7 | '-I/usr/include', 8 | '-I/usr/local/include', 9 | '-I/usr/src/linux-headers-' + kernel_release + '/arch/x86/include', 10 | '-I/usr/src/linux-headers-' + kernel_release + '/arch/x86/include/uapi', 11 | '-I/usr/src/linux-headers-' + kernel_release + '/include', 12 | '-D__KERNEL__', 13 | '-std=gnu99', 14 | '-xc' 15 | ]} 16 | -------------------------------------------------------------------------------- /modules/example-lkm-module/Kbuild: -------------------------------------------------------------------------------- 1 | ccflags-y = -Wall -g 2 | 3 | obj-m = katil.o 4 | -------------------------------------------------------------------------------- /modules/example-lkm-module/Makefile: -------------------------------------------------------------------------------- 1 | KDIR = /lib/modules/`uname -r`/build 2 | 3 | kbuild: 4 | make -C $(KDIR) M=`pwd` 5 | 6 | # Formats any C-related file using the clang-format 7 | # definition at the root of the project.
8 | # 9 | # Make sure you have clang-format installed before 10 | # executing. 11 | fmt: 12 | find . -name "*.c" -o -name "*.h" | \ 13 | xargs clang-format -style=file -i 14 | 15 | 16 | # Removes any generated binaries. 17 | clean: 18 | find . -name "*.out" -type f -delete 19 | make -C $(KDIR) M=`pwd` clean 20 | -------------------------------------------------------------------------------- /modules/example-lkm-module/README.MD: -------------------------------------------------------------------------------- 1 | # Katil-lkm is a loadable kernel module 2 | 3 | - Usage: 4 | 5 | - 1. Install the necessary development headers 6 | ```bash 7 | apt update -y 8 | apt install -y libelf-dev linux-headers-`uname -r` 9 | ``` 10 | - 2. Build the kernel module 11 | ```bash 12 | make 13 | ``` 14 | - 3. Load the module 15 | ```bash 16 | insmod katil.ko 17 | ``` 18 | - 4. Check that the module has been inserted 19 | ```bash 20 | lsmod | grep katil 21 | ``` 22 | - 5. Check that the module has really been loaded 23 | ```bash 24 | dmesg 25 | ``` 26 | - 6. Remove the module 27 | ```bash 28 | rmmod katil 29 | ``` 30 | -------------------------------------------------------------------------------- /modules/example-lkm-module/katil.c: -------------------------------------------------------------------------------- 1 | /** 2 | * Author V.Varbanovski nu11secur1ty 3 | * katil - a module that does nothing more than 4 | * printing `katil` using `printk`. 5 | */ 6 | 7 | /** 8 | * `module.h` contains the most basic functionality needed for 9 | * us to create a loadable kernel module, including the `MODULE_*` 10 | * macros, `module_*` functions and including a bunch of other 11 | * relevant headers that provide useful functionality for us 12 | * (for instance, `printk`, which comes from `linux/printk.h`, 13 | * a header included by `linux/module.h`).
14 | */ 15 | #include <linux/module.h> 16 | 17 | /** 18 | * Following, we make use of several macros to properly provide 19 | * information about the kernel module that we're creating. 20 | * 21 | * The information supplied here is visible through tools like 22 | * `modinfo`. 23 | * 24 | * Note: the license you choose here **DOES AFFECT** other things - 25 | * by using a proprietary license your kernel will be "tainted". 26 | */ 27 | MODULE_LICENSE("GPL"); 28 | MODULE_AUTHOR("V.Varbanovski nu11secur1ty"); 29 | MODULE_DESCRIPTION("Katil LKM"); 30 | MODULE_VERSION("0.1"); 31 | 32 | /** hello_init - initializes the module 33 | * 34 | * The `hello_init` method defines the procedure that performs the setup 35 | * of our module. 36 | */ 37 | static int 38 | hello_init(void) 39 | { 40 | // By making use of `printk` here (in the initialization), 41 | // we can look at `dmesg` and verify that what we log here 42 | // appears there at the moment that we load the module with 43 | // `insmod`. 44 | printk(KERN_INFO "Ko staaa katil?\n"); 45 | return 0; 46 | } 47 | 48 | static void 49 | hello_exit(void) 50 | { 51 | // similar to `init`, but for the removal time. 52 | printk(KERN_INFO "Ae beee si igraesh!\n"); 53 | } 54 | 55 | // registers the `hello_init` method as the method to run at module 56 | // insertion time. 57 | module_init(hello_init); 58 | 59 | // similar, but for `removal` 60 | module_exit(hello_exit); 61 | -------------------------------------------------------------------------------- /scheme/kernels.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/scheme/kernels.png -------------------------------------------------------------------------------- /type of programming/types.md: -------------------------------------------------------------------------------- 1 | 1. - Imperative programming 2 | 2. - Functional Programming 3 | 3.
- Object oriented programming 4 | -------------------------------------------------------------------------------- /understanding the linux kernel/Understanding Linux Kernel.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/understanding the linux kernel/Understanding Linux Kernel.pdf -------------------------------------------------------------------------------- /wall/gV8hn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/wall/gV8hn.png -------------------------------------------------------------------------------- /wall/wall: -------------------------------------------------------------------------------- 1 | ..... 2 | --------------------------------------------------------------------------------