├── .github └── FUNDING.yml ├── Architecture ├── README.MD ├── architecture-of-linux.png ├── architecture-of-linux2.png └── architecture-of-linux3.png ├── Find a List of Built-in Kernel Modules └── command.md ├── Inodes ├── definition.md └── wall │ └── monster.jpg ├── Kernel and Types of kernels.md ├── Linux └── exit │ └── exit.c ├── Memory ├── README.md └── wall │ └── Memory.png ├── PTE ├── README.MD └── screen │ ├── Capture-24.png │ └── screen ├── Paging ├── README.MD └── screen │ ├── paging-2.jpg │ ├── paging-3.jpg │ ├── paging.jpg │ └── screen ├── Process execution priorities └── README.MD ├── README.md ├── SystemCall ├── README.MD └── SystemCall+LKM+LSM.pdf ├── TLB ├── README.MD └── screen │ ├── screen │ └── tlb1.jpg ├── The-secrets-of-Kconfig ├── README.MD └── wall │ ├── 1.png │ ├── 2.png │ ├── 3.png │ └── 4.png ├── Uninstalling └── kernels.md ├── Userspace ├── README.MD ├── logo │ └── red-hat-logo.png └── what_mean.md ├── file_system └── File System in Linux Kernel.md ├── kernel-types-networknuts.png ├── logo ├── Linux.png └── logo ├── modules └── example-lkm-module │ ├── .clang-format │ ├── .editorconfig │ ├── .gitignore │ ├── .ycm_extra_conf.py │ ├── Kbuild │ ├── Makefile │ ├── README.MD │ └── katil.c ├── scheme └── kernels.png ├── type of programming └── types.md ├── understanding the linux kernel └── Understanding Linux Kernel.pdf └── wall ├── gV8hn.png └── wall /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # Link for sponsoring 2 | custom: ["https://www.nu11secur1ty.com/", "https://github.com/nu11secur1ty"] 3 | custom: ["https://www.paypal.com/donate/?hosted_button_id=ZPQZT5XMC5RFY"] 4 | -------------------------------------------------------------------------------- /Architecture/README.MD: -------------------------------------------------------------------------------- 1 | ## Architecture 2 | 
![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Architecture/architecture-of-linux.png) 3 | 4 | 1. Kernel:- The kernel is the core component of an operating system and is responsible for all of the major operations of the Linux OS. It consists of distinct types of modules and interacts directly with the underlying hardware. The kernel provides the abstraction needed to hide low-level hardware details from application programs. Some of the important kernel types are mentioned below: 5 | ----------------------------------------- 6 | - Monolithic kernels 7 | - Microkernels 8 | - Exokernels 9 | - Hybrid kernels 10 | ----------------------------------------- 11 | 12 | 2. System Libraries:- These libraries are special functions used to implement the operating system's functionality; they do not need the code access rights of kernel modules. 13 | ----------------------------------------- 14 | 3. System Utility Programs:- These are responsible for performing specialized, individual tasks. 15 | ----------------------------------------- 16 | 4. Hardware layer:- The Linux operating system's hardware layer consists of peripheral devices such as the CPU, HDD, and RAM. 17 | ----------------------------------------- 18 | 5. Shell:- The shell is an interface between the kernel and the user. It exposes the services of the kernel: it takes commands from the user and runs the corresponding kernel functions. Shells fall into two categories: graphical shells and command-line shells. 19 | ----------------------------------------- 20 | 21 | Graphical shells provide a graphical user interface, while command-line shells provide a command-line interface. Both implement the same operations.
However, graphical shells tend to work more slowly than command-line shells. 22 | 23 | A few common types of shells are: 24 | 25 | - Korn shell 26 | - Bourne shell 27 | - C shell 28 | - POSIX shell 29 | 30 | ## Linux Operating System Features 31 | `Some of the primary features of Linux OS are as follows:` 32 | 33 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Architecture/architecture-of-linux2.png) 34 | 35 | - Portable: Linux can run on many different types of hardware, and the Linux kernel supports installation in any kind of hardware environment. 36 | - Open source: The Linux source code is freely available, and several teams work in collaboration to enhance the capabilities of the Linux OS. 37 | - Multiprogramming: Linux is a multiprogramming system, which means more than one application can be executing at the same time. 38 | - Multi-user: Linux is also a multi-user system, which means more than one user can use system resources such as application programs, memory, or RAM at the same time. 39 | - Hierarchical file system: Linux provides a standard file structure in which system files and user files are arranged. 40 | - Security: Linux provides user security through authentication features such as controlled access to specific files, password protection, and data encryption. 41 | - Shell: Linux provides a special interpreter program that can be used to execute operating system commands and to perform various tasks such as launching application programs.
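The multiprogramming feature above can be demonstrated from any shell. A minimal sketch (plain shell, nothing beyond standard utilities assumed): two processes run in the background at the same time while the shell, itself an ordinary user process, waits for both.

```bash
# Start two sleeps as separate background processes; both run concurrently.
sleep 1 &
sleep 1 &

# The shell blocks here until both child processes have exited.
wait
echo "both background processes finished"
```

Run one after the other, the two sleeps would take about two seconds; started as background jobs they finish in about one, because the kernel schedules both processes at once.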
42 | 43 | ## Drawbacks of Linux 44 | 45 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Architecture/architecture-of-linux3.png) 46 | 47 | - Hardware drivers: Many Linux users face driver issues. Hardware companies prefer to build drivers for Windows or Mac because those systems have more users than Linux, so Linux has fewer drivers for peripheral hardware than Windows does. 48 | - Software alternatives: Take Photoshop, a famous graphic editing tool, as an example. Photoshop exists for Windows but is not available on Linux. There are other photo editing tools, but Photoshop is more powerful than the alternatives. Another example is MS Office, which is not available for Linux users. 49 | - Learning curve: Linux isn't a very user-friendly operating system, so it can be confusing for beginners. Getting started with Windows is easy and efficient for many beginners, whereas understanding how Linux works is more complex. 50 | We have to understand the command-line interface, and finding new software is a little complex as well. When we face an issue in the OS, finding a solution can be problematic, and there are more experts for Mac and Windows than for Linux. 51 | - Games: Many games are developed for Windows but unfortunately not for Linux, because the Windows platform is so widely used that game developers are more interested in it. 52 | 53 | ## Linux Operating System Applications 54 | 55 | Linux is a billion-dollar industry nowadays. Thousands of companies and governments across the world use the Linux operating system because of its low cost, low licensing fees, and the time and money it saves. Linux is used inside many types of electronic devices that are readily available to users worldwide.
A few of the famous Linux-based electronic devices are listed below: 56 | 57 | - Yamaha Motive Keyboard 58 | - Volvo In-Car Navigation System 59 | - TiVo Digital Video Recorder 60 | - Sony Reader 61 | - Sony Bravia Television 62 | - One Laptop Per Child XO2 63 | - Motorola MotoRokr EM35 phone 64 | - Lenovo IdeaPad S9 65 | - HP Mini 1000 66 | - Google Android Dev Phone 1 67 | - Garmin Nuvi 860, 880, and 5000 68 | - Dell Inspiron Mini 9 and 12 69 | 70 | ## Linux Distribution 71 | 72 | A Linux distribution is an operating system built from a collection of software on top of the Linux kernel; in other words, a distribution includes the Linux kernel plus supporting software and libraries. We can obtain a Linux-based OS by downloading any Linux distribution. Distributions exist for many types of devices, such as personal computers and embedded devices. More than 600 Linux distributions exist; a few of the famous ones are listed as follows: 73 | 74 | - Deepin 75 | - OpenSUSE 76 | - Fedora 77 | - Solus 78 | - Debian 79 | - Ubuntu 80 | - Elementary 81 | - Linux Mint 82 | - Manjaro 83 | - MX Linux 84 | 85 | ## Are Ubuntu and Linux Different? 86 | 87 | `YES` 88 | The primary difference is that Ubuntu is a free and open-source OS, a Linux distribution based on Debian, whereas Linux is a large family of open-source OSes that are all built on the Linux kernel. 89 | 90 | In other words, Ubuntu is a distribution of Linux, and Linux is the core system. Ubuntu is developed by Canonical Ltd. and was first released in 2004, while Linux was created by Linus Torvalds and first released in 1991. 91 | 92 | 93 | ## User mode vs Kernel mode: 94 | 95 | The kernel's code runs in a privileged mode known as kernel mode, with complete access to every resource of the computer. It runs as a single process in a single address space and does not require context switching.
Hence, it is very fast and efficient. 96 | The kernel executes all processes, provides the various system services to them, and gives them secure access to the hardware. 97 | Support code that does not need to run in kernel mode lives in the system libraries. User programs and other system programs run in user mode, 98 | with no direct access to kernel mode or to the system hardware. User utilities/programs use the system libraries to access kernel functions and carry out low-level system tasks. 99 | 100 | -------------------------------------------------------------------------------- /Architecture/architecture-of-linux.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Architecture/architecture-of-linux.png -------------------------------------------------------------------------------- /Architecture/architecture-of-linux2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Architecture/architecture-of-linux2.png -------------------------------------------------------------------------------- /Architecture/architecture-of-linux3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Architecture/architecture-of-linux3.png -------------------------------------------------------------------------------- /Find a List of Built-in Kernel Modules/command.md: -------------------------------------------------------------------------------- 1 | # Find a List of Built-in Kernel Modules 2 | ```bash 3 | cat /lib/modules/$(uname -r)/modules.builtin 4 | ``` 5 |
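A quick count is often more useful than the full list. A hedged sketch along the same lines: the `modules.builtin` file only exists when module metadata for the running kernel is installed under `/lib/modules`, so the path is checked first.

```bash
# Count the built-in modules of the running kernel, one per line.
f="/lib/modules/$(uname -r)/modules.builtin"
if [ -r "$f" ]; then
    wc -l < "$f"
else
    echo "no modules.builtin found for kernel $(uname -r)"
fi
```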
-------------------------------------------------------------------------------- /Inodes/definition.md: -------------------------------------------------------------------------------- 1 | # Understanding disk inodes 2 | 3 | You try creating a file on a server and see this error message: 4 | ```bash 5 | No space left on device 6 | ``` 7 | ...but you've got plenty of space: 8 | 9 | ```bash 10 | df 11 | Filesystem 1K-blocks Used Available Use% Mounted on 12 | /dev/xvda1 10321208 3159012 6637908 33% / 13 | ``` 14 | 15 | Who is the invisible monster chewing up all of your space? 16 | 17 | Why, the inode monster of course! 18 | 19 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Inodes/wall/monster.jpg) 20 | 21 | # What are inodes? 22 | An index node (or inode) contains metadata (file size, file type, etc.) for a file system object (like a file or a directory). There is one inode per file system object. 23 | 24 | An inode doesn't store the file contents or the name: it simply points to a specific file or directory. 25 | 26 | # The problem with inodes 27 | The total number of inodes and the space reserved for them is set when the filesystem is first created. The inode limit can't be changed dynamically, and every file system object must have an inode. 28 | 29 | While it's unusual to run out of inodes before actual disk space, you are more likely to have inode shortages if: 30 | 31 | - You are creating lots of directories, symlinks, and small files. 32 | - You created your ext3 filesystem with smaller block sizes. The ext3 default block size is 4096 bytes. If you are using your filesystem for storing lots of very small files, you might create the filesystem with a block size of 1024 or 2048. This would let you use your disk space more efficiently, but raises the likelihood of running low on inodes. 33 | - Your server is containerized (Docker, LXC, OpenVZ, etc).
Containerized servers often share the same filesystem as the host node. For stability and security purposes, the container's resources, such as RAM, CPU, disk space, and inodes, are limited. In this situation the number of inodes allocated to your container is determined by the administrator of the host node, and it is very common to run into inode issues in containers with filesystems of this type. 34 | 35 | 36 | # Viewing inode usage 37 | 38 | - Use the `-i` flag to view inode usage: 39 | ```bash 40 | df -i 41 | Filesystem Inodes IUsed IFree IUse% Mounted on 42 | /dev/xvda1 1310720 1310720 0 100% / 43 | ``` 44 | 45 | You can view more detailed inode information with the following command: 46 | 47 | ```bash 48 | tune2fs -l /dev/xvda1 | grep -i inode 49 | ``` 50 | 51 | # Where are the small files? 52 | If you are using up all of your inode capacity, the next step is figuring out where all of those little files are. This is a bit of a manual process. 53 | 54 | The command below will output the number of files and directories under each entry at the top of your file system: 55 | 56 | ```bash 57 | for i in /*; do echo $i; find $i | wc -l; done 58 | ``` 59 | - Example output: 60 | ```bash 61 | /bin 62 | 119 63 | /sys 64 | 9293 65 | /tmp 66 | 1 67 | /usr 68 | 10583 69 | ``` 70 | The counts above include all nested directories. If you have a single directory with many files, the command above may stall; that's a good hint that it is the problem directory. Once you've narrowed it down to a specific directory, you can execute the previous command on that directory: 71 | 72 | 73 | ```bash 74 | for i in /usr/*; do echo $i; find $i | wc -l; done 75 | ``` 76 | 77 | Delete all of the small files in that directory. 78 | 79 | # I can't delete those files. Can I increase the inode limit? 80 | The bad news: on the ext* file systems, you can't simply increase the inode limit on an existing volume. You have two options: 81 | 82 | 1. If the disk is an LVM, increase the size of the volume. 83 | 84 | 2.
Back up your data and create a new file system, specifying a higher inode count: 85 | 86 | ```bash 87 | mke2fs -N <number-of-inodes> /dev/<device> 88 | ``` 89 | 90 | 91 | # Are there filesystems that don't have inode limits? 92 | Yes. Modern filesystems like `Btrfs` and `XFS` use dynamic inode allocation to avoid fixed inode limits, and `ZFS` allocates its equivalent structures (dnodes) dynamically as well. 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /Inodes/wall/monster.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Inodes/wall/monster.jpg -------------------------------------------------------------------------------- /Kernel and Types of kernels.md: -------------------------------------------------------------------------------- 1 | # 1. What Is a Kernel? 2 | 3 | A kernel is the central component of an operating system. It acts as an interface between the user applications and the hardware. 4 | The sole aim of the kernel is to manage the communication between the software (user-level applications) and the hardware (CPU, disk, memory, etc.). The main tasks of the kernel are: 5 | 6 | Process management 7 | Device management 8 | Memory management 9 | Interrupt handling 10 | I/O communication 11 | File system management, etc. 12 | 13 | # 2. Is LINUX A Kernel Or An Operating System? 14 | 15 | Well, there is a difference between a kernel and an OS. The kernel, as described above, is the heart of the OS and manages its core features, while an OS becomes a complete package once useful applications and utilities are added on top of the kernel. 16 | So, it can easily be said that an operating system consists of a kernel space and a user space.
17 | 18 | So, we can say that Linux is a kernel, as it does not include applications like file-system utilities, 19 | windowing systems and graphical desktops, system administrator commands, text editors, compilers, etc. 20 | So, various companies add these kinds of applications on top of the Linux kernel and provide their own operating systems, such as Ubuntu, SUSE, CentOS, Red Hat, etc. 21 | 22 | # 3. Types Of Kernels 23 | 24 | Kernels may be classified mainly into two categories: 25 | 26 | Monolithic 27 | Micro Kernel 28 | 29 | # 1 Monolithic Kernels 30 | 31 | In earlier versions of this kernel architecture, all the basic system services, like process and memory management, interrupt handling, etc., were packaged into a single module in kernel space. This architecture had some serious drawbacks: 1) the size of the kernel, which was huge, and 2) poor maintainability, meaning that a bug fix or the addition of a new feature required recompiling the whole kernel, which could consume hours. 32 | 33 | In the modern-day approach to monolithic architecture, the kernel consists of different modules which can be dynamically loaded and unloaded. This modular approach allows easy extension of the OS's capabilities. Maintainability becomes very easy, as only the concerned module needs to be loaded and unloaded when there is a change or bug fix in it, so there is no need to bring down and recompile the whole kernel for the smallest change. Also, stripping a kernel down for a particular platform (say, embedded devices) became very easy, as we can simply leave out the modules we do not want. 34 | 35 | Linux follows the monolithic modular approach. 36 | 37 | # 2 Microkernels 38 | 39 | This architecture mainly addresses the problem of the ever-growing size of kernel code, which could not be controlled in the monolithic approach.
This architecture allows some basic services, like device driver management, the protocol stack, the file system, etc., to run in user space. This reduces the kernel code size and also increases the security and stability of the OS, as only the bare minimum code runs in the kernel. If a basic service like the networking service crashes due to a buffer overflow, only the networking service's memory would be corrupted, leaving the rest of the system still functional. 40 | 41 | In this architecture, all the basic OS services which are made part of user space run as servers which are used by other programs in the system through inter-process communication (IPC). E.g., we have servers for device drivers, network protocol stacks, file systems, graphics, etc. Microkernel servers are essentially daemon programs like any others, except that the kernel grants some of them privileges to interact with parts of physical memory that are otherwise off limits to most programs. This allows some servers, particularly device drivers, to interact directly with hardware. These servers are started at system start-up. 42 | 43 | So, what is the bare minimum that the microkernel architecture recommends keeping in kernel space? 44 | 45 | Managing memory protection 46 | Process scheduling 47 | Inter-process communication (IPC) 48 | 49 | Apart from the above, all other basic services can be made part of user space and run in the form of servers. 50 | 51 | ----------------------------------------------------------------------------------------- 52 | 53 | # 1. Microkernels — In these, only the most elementary functions are implemented directly 54 | in a central kernel — the microkernel. All other functions are delegated to autonomous 55 | processes that communicate with the central kernel via clearly defined communication 56 | interfaces — for example, various filesystems, memory management, and so on.
(Of 57 | course, the most elementary level of memory management that controls communication 58 | with the system itself is in the microkernel. However, handling on the system call level is 59 | implemented in external servers.) Theoretically, this is a very elegant approach because 60 | the individual parts are clearly segregated from each other, and this forces programmers 61 | to use ‘‘clean‘‘ programming techniques. Other benefits of this approach are dynamic 62 | extensibility and the ability to swap important components at run time. However, owing 63 | to the additional CPU time needed to support complex communication between the 64 | components, microkernels have not really established themselves in practice although they 65 | have been the subject of active and varied research for some time now. 66 | 67 | # 2. Monolithic Kernels — They are the alternative, traditional concept. Here, the entire code 68 | of the kernel — including all its subsystems such as memory management, filesystems, or 69 | device drivers — is packed into a single file. Each function has access to all other parts of 70 | the kernel; this can result in elaborately nested source code if programming is not done with 71 | great care. 72 | Because, at the moment, the performance of monolithic kernels is still greater than that of microkernels, 73 | Linux was and still is implemented according to this paradigm. However, one major innovation has been 74 | introduced. Modules with kernel code that can be inserted or removed while the system is up-and-running 75 | support the dynamic addition of a whole range of functions to the kernel, thus compensating for some of 76 | the disadvantages of monolithic kernels. This is assisted by elaborate means of communication between 77 | the kernel and userland that allows for implementing hotplugging and dynamic loading of modules. 
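The dynamic loading and unloading of modules described above can be watched from user space. A minimal sketch, assuming a Linux system: listing the currently loaded modules via `/proc/modules` needs no privileges, while actually inserting or removing a module requires root, so those commands are shown only as comments.

```bash
# Each line of /proc/modules describes one dynamically loaded module:
# name, memory size, reference count, users, state, load address.
if [ -r /proc/modules ]; then
    head -n 3 /proc/modules
else
    echo "/proc/modules not available (modules may all be built in)"
fi

# Inserting and removing a module (root required; 'loop' is just an example):
#   modprobe loop   # load loop.ko together with any modules it depends on
#   rmmod loop      # unload it again once its reference count drops to zero
```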
78 | 79 | 80 | ![image](https://github.com/nu11secur1ty/pictures/blob/master/gV8hn.png) 81 | 82 | 83 | -------------------------------------------------------------------------------------------------- 84 | 85 | 86 | # Kernel Space Definition 87 | 88 | 89 | 90 | System memory in Linux can be divided into two distinct regions: kernel space and user space. Kernel space is where the kernel (i.e., the core of the operating system) executes (i.e., runs) and provides its services. 91 | 92 | Memory consists of RAM (random access memory) cells, whose contents can be accessed (i.e., read and written to) at extremely high speeds but are retained only temporarily (i.e., while in use or, at most, while the power supply remains on). Its purpose is to hold programs and data that are currently in use and thereby serve as a high speed intermediary between the CPU (central processing unit) and the much slower storage, which most commonly consists of one or more hard disk drives (HDDs). 93 | 94 | User space is that set of memory locations in which user processes (i.e., everything other than the kernel) run. A process is an executing instance of a program. One of the roles of the kernel is to manage individual user processes within this space and to prevent them from interfering with each other. 95 | 96 | Kernel space can be accessed by user processes only through the use of system calls. System calls are requests in a Unix-like operating system by an active process for a service performed by the kernel, such as input/output (I/O) or process creation. An active process is a process that is currently progressing in the CPU, as contrasted with a process that is waiting for its next turn in the CPU. I/O is any program, operation or device that transfers data to or from a CPU and to or from a peripheral device (such as disk drives, keyboards, mice and printers). 
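The system-call boundary described above is easy to observe from user space: reading a file under `/proc` issues ordinary open() and read() system calls, which the kernel answers with data from its own data structures. A minimal sketch, assuming a mounted `/proc`:

```bash
# /proc/self always refers to the process that opens it; reading it is a
# system call into the kernel, which replies with data about that process.
if [ -r /proc/self/status ]; then
    head -n 1 /proc/self/status   # first line names the current process
else
    echo "/proc is not mounted here"
fi
```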
97 | 98 | 99 | ------------------------------------------------------------------------------------------------------- 100 | 101 | 102 | # User Space Definition 103 | 104 | 105 | 106 | User space is that portion of system memory in which user processes run. This contrasts with kernel space, which is that portion of memory in which the kernel executes and provides its services. 107 | 108 | The contents of memory, which consists of dedicated RAM (random access memory) VLSI (very large scale integrated circuit) semiconductor chips, can be accessed (i.e., read and written to) at extremely high speeds but are retained only temporarily (i.e., while in use or, at most, while the power supply remains on). This contrasts with storage (e.g., disk drives), which has much slower access speeds but whose contents are retained after the power is turned off and which usually has a far greater capacity. 109 | 110 | A process is an executing (i.e., running) instance of a program. User processes are instances of all programs other than the kernel (i.e., utility and application programs). When a program is to be run, it is copied from storage into user space so that it can be accessed at high speed by the CPU (central processing unit). 111 | 112 | The kernel is a program that constitutes the central core of a computer operating system. It is not a process, but rather a controller of processes, and it has complete control over everything that occurs on the system. This includes managing individual user processes within user space and preventing them from interfering with each other. 113 | 114 | The division of system memory in Unix-like operating systems into user space and kernel space plays an important role in maintaining system stability and security. 
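The user-space memory locations described above can be inspected directly on Linux. A hedged sketch, assuming `/proc` is mounted: `/proc/self/maps` lists every region mapped into the reading process's own address space.

```bash
# Each line is one mapped region of this process's user-space address range:
# start-end addresses, permissions, offset, device, inode, and backing file.
if [ -r /proc/self/maps ]; then
    head -n 2 /proc/self/maps
else
    echo "/proc is not mounted here"
fi
```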
115 | 116 | ![image](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/kernel-types-networknuts.png) 117 | 118 | # [See also:](https://gist.github.com/nu11secur1ty/e7ad7ca9bd5391727c8e513250839eec) 119 | 120 | 121 | # Example: [See...](http://www.nu11secur1ty.com/2017/12/buildingkernelmoduleexample.html) 122 | -------------------------------------------------------------------------------- /Linux/exit/exit.c: -------------------------------------------------------------------------------- 1 | // SPDX-License-Identifier: GPL-2.0-only 2 | /* 3 | * linux/kernel/exit.c 4 | * 5 | * Copyright (C) 1991, 1992 Linus Torvalds 6 | */ 7 | 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | #include 23 | #include 24 | #include 25 | #include 26 | #include 27 | #include 28 | #include 29 | #include 30 | #include 31 | #include 32 | #include 33 | #include 34 | #include 35 | #include 36 | #include 37 | #include 38 | #include 39 | #include 40 | #include 41 | #include 42 | #include 43 | #include 44 | #include 45 | #include 46 | #include 47 | #include 48 | #include 49 | #include /* for audit_free() */ 50 | #include 51 | #include 52 | #include 53 | #include 54 | #include 55 | #include 56 | #include 57 | #include 58 | #include 59 | #include 60 | #include 61 | #include 62 | #include 63 | #include 64 | #include 65 | #include 66 | 67 | #include 68 | #include 69 | #include 70 | #include 71 | 72 | static void __unhash_process(struct task_struct *p, bool group_dead) 73 | { 74 | nr_threads--; 75 | detach_pid(p, PIDTYPE_PID); 76 | if (group_dead) { 77 | detach_pid(p, PIDTYPE_TGID); 78 | detach_pid(p, PIDTYPE_PGID); 79 | detach_pid(p, PIDTYPE_SID); 80 | 81 | list_del_rcu(&p->tasks); 82 | list_del_init(&p->sibling); 83 | __this_cpu_dec(process_counts); 84 | } 85 | list_del_rcu(&p->thread_group); 86 | list_del_rcu(&p->thread_node); 
87 | } 88 | 89 | /* 90 | * This function expects the tasklist_lock write-locked. 91 | */ 92 | static void __exit_signal(struct task_struct *tsk) 93 | { 94 | struct signal_struct *sig = tsk->signal; 95 | bool group_dead = thread_group_leader(tsk); 96 | struct sighand_struct *sighand; 97 | struct tty_struct *uninitialized_var(tty); 98 | u64 utime, stime; 99 | 100 | sighand = rcu_dereference_check(tsk->sighand, 101 | lockdep_tasklist_lock_is_held()); 102 | spin_lock(&sighand->siglock); 103 | 104 | #ifdef CONFIG_POSIX_TIMERS 105 | posix_cpu_timers_exit(tsk); 106 | if (group_dead) { 107 | posix_cpu_timers_exit_group(tsk); 108 | } else { 109 | /* 110 | * This can only happen if the caller is de_thread(). 111 | * FIXME: this is the temporary hack, we should teach 112 | * posix-cpu-timers to handle this case correctly. 113 | */ 114 | if (unlikely(has_group_leader_pid(tsk))) 115 | posix_cpu_timers_exit_group(tsk); 116 | } 117 | #endif 118 | 119 | if (group_dead) { 120 | tty = sig->tty; 121 | sig->tty = NULL; 122 | } else { 123 | /* 124 | * If there is any task waiting for the group exit 125 | * then notify it: 126 | */ 127 | if (sig->notify_count > 0 && !--sig->notify_count) 128 | wake_up_process(sig->group_exit_task); 129 | 130 | if (tsk == sig->curr_target) 131 | sig->curr_target = next_thread(tsk); 132 | } 133 | 134 | add_device_randomness((const void*) &tsk->se.sum_exec_runtime, 135 | sizeof(unsigned long long)); 136 | 137 | /* 138 | * Accumulate here the counters for all threads as they die. We could 139 | * skip the group leader because it is the last user of signal_struct, 140 | * but we want to avoid the race with thread_group_cputime() which can 141 | * see the empty ->thread_head list. 
	 */
	task_cputime(tsk, &utime, &stime);
	write_seqlock(&sig->stats_lock);
	sig->utime += utime;
	sig->stime += stime;
	sig->gtime += task_gtime(tsk);
	sig->min_flt += tsk->min_flt;
	sig->maj_flt += tsk->maj_flt;
	sig->nvcsw += tsk->nvcsw;
	sig->nivcsw += tsk->nivcsw;
	sig->inblock += task_io_get_inblock(tsk);
	sig->oublock += task_io_get_oublock(tsk);
	task_io_accounting_add(&sig->ioac, &tsk->ioac);
	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
	sig->nr_threads--;
	__unhash_process(tsk, group_dead);
	write_sequnlock(&sig->stats_lock);

	/*
	 * Do this under ->siglock, we can race with another thread
	 * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
	 */
	flush_sigqueue(&tsk->pending);
	tsk->sighand = NULL;
	spin_unlock(&sighand->siglock);

	__cleanup_sighand(sighand);
	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
	if (group_dead) {
		flush_sigqueue(&sig->shared_pending);
		tty_kref_put(tty);
	}
}

static void delayed_put_task_struct(struct rcu_head *rhp)
{
	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

	perf_event_delayed_put(tsk);
	trace_sched_process_free(tsk);
	put_task_struct(tsk);
}


void release_task(struct task_struct *p)
{
	struct task_struct *leader;
	int zap_leader;
repeat:
	/* don't need to get the RCU readlock here - the process is dead and
	 * can't be modifying its own credentials. But shut RCU-lockdep up */
	rcu_read_lock();
	atomic_dec(&__task_cred(p)->user->processes);
	rcu_read_unlock();

	proc_flush_task(p);
	cgroup_release(p);

	write_lock_irq(&tasklist_lock);
	ptrace_release_task(p);
	__exit_signal(p);

	/*
	 * If we are the last non-leader member of the thread
	 * group, and the leader is zombie, then notify the
	 * group leader's parent process. (if it wants notification.)
	 */
	zap_leader = 0;
	leader = p->group_leader;
	if (leader != p && thread_group_empty(leader)
			&& leader->exit_state == EXIT_ZOMBIE) {
		/*
		 * If we were the last child thread and the leader has
		 * exited already, and the leader's parent ignores SIGCHLD,
		 * then we are the one who should release the leader.
		 */
		zap_leader = do_notify_parent(leader, leader->exit_signal);
		if (zap_leader)
			leader->exit_state = EXIT_DEAD;
	}

	write_unlock_irq(&tasklist_lock);
	release_thread(p);
	call_rcu(&p->rcu, delayed_put_task_struct);

	p = leader;
	if (unlikely(zap_leader))
		goto repeat;
}

/*
 * Note that if this function returns a valid task_struct pointer (!NULL)
 * task->usage must remain >0 for the duration of the RCU critical section.
 */
struct task_struct *task_rcu_dereference(struct task_struct **ptask)
{
	struct sighand_struct *sighand;
	struct task_struct *task;

	/*
	 * We need to verify that release_task() was not called and thus
	 * delayed_put_task_struct() can't run and drop the last reference
	 * before rcu_read_unlock(). We check task->sighand != NULL,
	 * but we can read the already freed and reused memory.
	 */
retry:
	task = rcu_dereference(*ptask);
	if (!task)
		return NULL;

	probe_kernel_address(&task->sighand, sighand);

	/*
	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
	 * was already freed we can not miss the preceding update of this
	 * pointer.
	 */
	smp_rmb();
	if (unlikely(task != READ_ONCE(*ptask)))
		goto retry;

	/*
	 * We've re-checked that "task == *ptask", now we have two different
	 * cases:
	 *
	 * 1. This is actually the same task/task_struct. In this case
	 *    sighand != NULL tells us it is still alive.
	 *
	 * 2. This is another task which got the same memory for task_struct.
	 *    We can't know this of course, and we can not trust
	 *    sighand != NULL.
	 *
	 *    In this case we actually return a random value, but this is
	 *    correct.
	 *
	 *    If we return NULL - we can pretend that we actually noticed that
	 *    *ptask was updated when the previous task has exited. Or pretend
	 *    that probe_slab_address(&sighand) reads NULL.
	 *
	 *    If we return the new task (because sighand is not NULL for any
	 *    reason) - this is fine too. This (new) task can't go away before
	 *    another gp pass.
	 *
	 *    And note: We could even eliminate the false positive if re-read
	 *    task->sighand once again to avoid the falsely NULL. But this case
	 *    is very unlikely so we don't care.
	 */
	if (!sighand)
		return NULL;

	return task;
}

void rcuwait_wake_up(struct rcuwait *w)
{
	struct task_struct *task;

	rcu_read_lock();

	/*
	 * Order condition vs @task, such that everything prior to the load
	 * of @task is visible. This is the condition as to why the user called
	 * rcuwait_trywake() in the first place. Pairs with set_current_state()
	 * barrier (A) in rcuwait_wait_event().
	 *
	 *    WAIT                WAKE
	 *    [S] tsk = current   [S] cond = true
	 *        MB (A)              MB (B)
	 *    [L] cond            [L] tsk
	 */
	smp_mb(); /* (B) */

	/*
	 * Avoid using task_rcu_dereference() magic as long as we are careful,
	 * see comment in rcuwait_wait_event() regarding ->exit_state.
	 */
	task = rcu_dereference(w->task);
	if (task)
		wake_up_process(task);
	rcu_read_unlock();
}

/*
 * Determine if a process group is "orphaned", according to the POSIX
 * definition in 2.2.2.52. Orphaned process groups are not to be affected
 * by terminal-generated stop signals. Newly orphaned process groups are
 * to receive a SIGHUP and a SIGCONT.
 *
 * "I ask you, have you ever known what it is to be an orphan?"
 */
static int will_become_orphaned_pgrp(struct pid *pgrp,
					struct task_struct *ignored_task)
{
	struct task_struct *p;

	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
		if ((p == ignored_task) ||
		    (p->exit_state && thread_group_empty(p)) ||
		    is_global_init(p->real_parent))
			continue;

		if (task_pgrp(p->real_parent) != pgrp &&
		    task_session(p->real_parent) == task_session(p))
			return 0;
	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);

	return 1;
}

int is_current_pgrp_orphaned(void)
{
	int retval;

	read_lock(&tasklist_lock);
	retval = will_become_orphaned_pgrp(task_pgrp(current), NULL);
	read_unlock(&tasklist_lock);

	return retval;
}

static bool has_stopped_jobs(struct pid *pgrp)
{
	struct task_struct *p;

	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
		if (p->signal->flags & SIGNAL_STOP_STOPPED)
			return true;
	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);

	return false;
}

/*
 * Check to see if any process groups have become orphaned as
 * a result of our exiting, and if they have any stopped jobs,
 * send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
 */
static void
kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
{
	struct pid *pgrp = task_pgrp(tsk);
	struct task_struct *ignored_task = tsk;

	if (!parent)
		/* exit: our father is in a different pgrp than
		 * we are and we were the only connection outside.
		 */
		parent = tsk->real_parent;
	else
		/* reparent: our child is in a different pgrp than
		 * we are, and it was the only connection outside.
		 */
		ignored_task = NULL;

	if (task_pgrp(parent) != pgrp &&
	    task_session(parent) == task_session(tsk) &&
	    will_become_orphaned_pgrp(pgrp, ignored_task) &&
	    has_stopped_jobs(pgrp)) {
		__kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp);
		__kill_pgrp_info(SIGCONT, SEND_SIG_PRIV, pgrp);
	}
}

#ifdef CONFIG_MEMCG
/*
 * A task is exiting. If it owned this mm, find a new owner for the mm.
 */
void mm_update_next_owner(struct mm_struct *mm)
{
	struct task_struct *c, *g, *p = current;

retry:
	/*
	 * If the exiting or execing task is not the owner, it's
	 * someone else's problem.
	 */
	if (mm->owner != p)
		return;
	/*
	 * The current owner is exiting/execing and there are no other
	 * candidates. Do not leave the mm pointing to a possibly
	 * freed task structure.
	 */
	if (atomic_read(&mm->mm_users) <= 1) {
		WRITE_ONCE(mm->owner, NULL);
		return;
	}

	read_lock(&tasklist_lock);
	/*
	 * Search in the children
	 */
	list_for_each_entry(c, &p->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}

	/*
	 * Search in the siblings
	 */
	list_for_each_entry(c, &p->real_parent->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}

	/*
	 * Search through everything else, we should not get here often.
	 */
	for_each_process(g) {
		if (g->flags & PF_KTHREAD)
			continue;
		for_each_thread(g, c) {
			if (c->mm == mm)
				goto assign_new_owner;
			if (c->mm)
				break;
		}
	}
	read_unlock(&tasklist_lock);
	/*
	 * We found no owner yet mm_users > 1: this implies that we are
	 * most likely racing with swapoff (try_to_unuse()) or /proc or
	 * ptrace or page migration (get_task_mm()). Mark owner as NULL.
	 */
	WRITE_ONCE(mm->owner, NULL);
	return;

assign_new_owner:
	BUG_ON(c == p);
	get_task_struct(c);
	/*
	 * The task_lock protects c->mm from changing.
	 * We always want mm->owner->mm == mm
	 */
	task_lock(c);
	/*
	 * Delay read_unlock() till we have the task_lock()
	 * to ensure that c does not slip away underneath us
	 */
	read_unlock(&tasklist_lock);
	if (c->mm != mm) {
		task_unlock(c);
		put_task_struct(c);
		goto retry;
	}
	WRITE_ONCE(mm->owner, c);
	task_unlock(c);
	put_task_struct(c);
}
#endif /* CONFIG_MEMCG */

/*
 * Turn us into a lazy TLB process if we
 * aren't already..
 */
static void exit_mm(void)
{
	struct mm_struct *mm = current->mm;
	struct core_state *core_state;

	mm_release(current, mm);
	if (!mm)
		return;
	sync_mm_rss(mm);
	/*
	 * Serialize with any possible pending coredump.
	 * We must hold mmap_sem around checking core_state
	 * and clearing tsk->mm. The core-inducing thread
	 * will increment ->nr_threads for each thread in the
	 * group with ->mm != NULL.
	 */
	down_read(&mm->mmap_sem);
	core_state = mm->core_state;
	if (core_state) {
		struct core_thread self;

		up_read(&mm->mmap_sem);

		self.task = current;
		self.next = xchg(&core_state->dumper.next, &self);
		/*
		 * Implies mb(), the result of xchg() must be visible
		 * to core_state->dumper.
		 */
		if (atomic_dec_and_test(&core_state->nr_threads))
			complete(&core_state->startup);

		for (;;) {
			set_current_state(TASK_UNINTERRUPTIBLE);
			if (!self.task) /* see coredump_finish() */
				break;
			freezable_schedule();
		}
		__set_current_state(TASK_RUNNING);
		down_read(&mm->mmap_sem);
	}
	mmgrab(mm);
	BUG_ON(mm != current->active_mm);
	/* more a memory barrier than a real lock */
	task_lock(current);
	current->mm = NULL;
	up_read(&mm->mmap_sem);
	enter_lazy_tlb(mm, current);
	task_unlock(current);
	mm_update_next_owner(mm);
	mmput(mm);
	if (test_thread_flag(TIF_MEMDIE))
		exit_oom_victim();
}

static struct task_struct *find_alive_thread(struct task_struct *p)
{
	struct task_struct *t;

	for_each_thread(p, t) {
		if (!(t->flags & PF_EXITING))
			return t;
	}
	return NULL;
}

static struct task_struct *find_child_reaper(struct task_struct *father,
						struct list_head *dead)
	__releases(&tasklist_lock)
	__acquires(&tasklist_lock)
{
	struct pid_namespace *pid_ns = task_active_pid_ns(father);
	struct task_struct *reaper = pid_ns->child_reaper;
	struct task_struct *p, *n;

	if (likely(reaper != father))
		return reaper;

	reaper = find_alive_thread(father);
	if (reaper) {
		pid_ns->child_reaper = reaper;
		return reaper;
	}

	write_unlock_irq(&tasklist_lock);
	if (unlikely(pid_ns == &init_pid_ns)) {
		panic("Attempted to kill init! exitcode=0x%08x\n",
			father->signal->group_exit_code ?: father->exit_code);
	}

	list_for_each_entry_safe(p, n, dead, ptrace_entry) {
		list_del_init(&p->ptrace_entry);
		release_task(p);
	}

	zap_pid_ns_processes(pid_ns);
	write_lock_irq(&tasklist_lock);

	return father;
}

/*
 * When we die, we re-parent all our children, and try to:
 * 1. give them to another thread in our thread group, if such a member exists
 * 2. give it to the first ancestor process which prctl'd itself as a
 *    child_subreaper for its children (like a service manager)
 * 3. give it to the init process (PID 1) in our pid namespace
 */
static struct task_struct *find_new_reaper(struct task_struct *father,
					   struct task_struct *child_reaper)
{
	struct task_struct *thread, *reaper;

	thread = find_alive_thread(father);
	if (thread)
		return thread;

	if (father->signal->has_child_subreaper) {
		unsigned int ns_level = task_pid(father)->level;
		/*
		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
		 * We can't check reaper != child_reaper to ensure we do not
		 * cross the namespaces, the exiting parent could be injected
		 * by setns() + fork().
		 * We check pid->level, this is slightly more efficient than
		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).
		 */
		for (reaper = father->real_parent;
		     task_pid(reaper)->level == ns_level;
		     reaper = reaper->real_parent) {
			if (reaper == &init_task)
				break;
			if (!reaper->signal->is_child_subreaper)
				continue;
			thread = find_alive_thread(reaper);
			if (thread)
				return thread;
		}
	}

	return child_reaper;
}

/*
 * Any that need to be release_task'd are put on the @dead list.
 */
static void reparent_leader(struct task_struct *father, struct task_struct *p,
				struct list_head *dead)
{
	if (unlikely(p->exit_state == EXIT_DEAD))
		return;

	/* We don't want people slaying init. */
	p->exit_signal = SIGCHLD;

	/* If it has exited notify the new parent about this child's death. */
	if (!p->ptrace &&
	    p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) {
		if (do_notify_parent(p, p->exit_signal)) {
			p->exit_state = EXIT_DEAD;
			list_add(&p->ptrace_entry, dead);
		}
	}

	kill_orphaned_pgrp(p, father);
}

/*
 * This does two things:
 *
 * A.  Make init inherit all the child processes
 * B.  Check to see if any process groups have become orphaned
 *	as a result of our exiting, and if they have any stopped
 *	jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
 */
static void forget_original_parent(struct task_struct *father,
					struct list_head *dead)
{
	struct task_struct *p, *t, *reaper;

	if (unlikely(!list_empty(&father->ptraced)))
		exit_ptrace(father, dead);

	/* Can drop and reacquire tasklist_lock */
	reaper = find_child_reaper(father, dead);
	if (list_empty(&father->children))
		return;

	reaper = find_new_reaper(father, reaper);
	list_for_each_entry(p, &father->children, sibling) {
		for_each_thread(p, t) {
			t->real_parent = reaper;
			BUG_ON((!t->ptrace) != (t->parent == father));
			if (likely(!t->ptrace))
				t->parent = t->real_parent;
			if (t->pdeath_signal)
				group_send_sig_info(t->pdeath_signal,
						    SEND_SIG_NOINFO, t,
						    PIDTYPE_TGID);
		}
		/*
		 * If this is a threaded reparent there is no need to
		 * notify anyone anything has happened.
		 */
		if (!same_thread_group(reaper, father))
			reparent_leader(father, p, dead);
	}
	list_splice_tail_init(&father->children, &reaper->children);
}

/*
 * Send signals to all our closest relatives so that they know
 * to properly mourn us..
 */
static void exit_notify(struct task_struct *tsk, int group_dead)
{
	bool autoreap;
	struct task_struct *p, *n;
	LIST_HEAD(dead);

	write_lock_irq(&tasklist_lock);
	forget_original_parent(tsk, &dead);

	if (group_dead)
		kill_orphaned_pgrp(tsk->group_leader, NULL);

	if (unlikely(tsk->ptrace)) {
		int sig = thread_group_leader(tsk) &&
				thread_group_empty(tsk) &&
				!ptrace_reparented(tsk) ?
			tsk->exit_signal : SIGCHLD;
		autoreap = do_notify_parent(tsk, sig);
	} else if (thread_group_leader(tsk)) {
		autoreap = thread_group_empty(tsk) &&
			do_notify_parent(tsk, tsk->exit_signal);
	} else {
		autoreap = true;
	}

	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
	if (tsk->exit_state == EXIT_DEAD)
		list_add(&tsk->ptrace_entry, &dead);

	/* mt-exec, de_thread() is waiting for group leader */
	if (unlikely(tsk->signal->notify_count < 0))
		wake_up_process(tsk->signal->group_exit_task);
	write_unlock_irq(&tasklist_lock);

	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
		list_del_init(&p->ptrace_entry);
		release_task(p);
	}
}

#ifdef CONFIG_DEBUG_STACK_USAGE
static void check_stack_usage(void)
{
	static DEFINE_SPINLOCK(low_water_lock);
	static int lowest_to_date = THREAD_SIZE;
	unsigned long free;

	free = stack_not_used(current);

	if (free >= lowest_to_date)
		return;

	spin_lock(&low_water_lock);
	if (free < lowest_to_date) {
		pr_info("%s (%d) used greatest stack depth: %lu bytes left\n",
			current->comm, task_pid_nr(current), free);
		lowest_to_date = free;
	}
	spin_unlock(&low_water_lock);
}
#else
static inline void check_stack_usage(void) {}
#endif

void __noreturn do_exit(long code)
{
	struct task_struct *tsk = current;
	int group_dead;

	profile_task_exit(tsk);
	kcov_task_exit(tsk);

	WARN_ON(blk_needs_flush_plug(tsk));

	if (unlikely(in_interrupt()))
		panic("Aiee, killing interrupt handler!");
	if (unlikely(!tsk->pid))
		panic("Attempted to kill the idle task!");

	/*
	 * If do_exit is called because this processes oopsed, it's possible
	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
	 * continuing. Amongst other possible reasons, this is to prevent
	 * mm_release()->clear_child_tid() from writing to a user-controlled
	 * kernel address.
	 */
	set_fs(USER_DS);

	ptrace_event(PTRACE_EVENT_EXIT, code);

	validate_creds_for_do_exit(tsk);

	/*
	 * We're taking recursive faults here in do_exit. Safest is to just
	 * leave this task alone and wait for reboot.
	 */
	if (unlikely(tsk->flags & PF_EXITING)) {
		pr_alert("Fixing recursive fault but reboot is needed!\n");
		/*
		 * We can do this unlocked here. The futex code uses
		 * this flag just to verify whether the pi state
		 * cleanup has been done or not. In the worst case it
		 * loops once more. We pretend that the cleanup was
		 * done as there is no way to return. Either the
		 * OWNER_DIED bit is set by now or we push the blocked
		 * task into the wait for ever nirwana as well.
		 */
		tsk->flags |= PF_EXITPIDONE;
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule();
	}

	exit_signals(tsk);  /* sets PF_EXITING */
	/*
	 * Ensure that all new tsk->pi_lock acquisitions must observe
	 * PF_EXITING. Serializes against futex.c:attach_to_pi_owner().
	 */
	smp_mb();
	/*
	 * Ensure that we must observe the pi_state in exit_mm() ->
	 * mm_release() -> exit_pi_state_list().
	 */
	raw_spin_lock_irq(&tsk->pi_lock);
	raw_spin_unlock_irq(&tsk->pi_lock);

	if (unlikely(in_atomic())) {
		pr_info("note: %s[%d] exited with preempt_count %d\n",
			current->comm, task_pid_nr(current),
			preempt_count());
		preempt_count_set(PREEMPT_ENABLED);
	}

	/* sync mm's RSS info before statistics gathering */
	if (tsk->mm)
		sync_mm_rss(tsk->mm);
	acct_update_integrals(tsk);
	group_dead = atomic_dec_and_test(&tsk->signal->live);
	if (group_dead) {
#ifdef CONFIG_POSIX_TIMERS
		hrtimer_cancel(&tsk->signal->real_timer);
		exit_itimers(tsk->signal);
#endif
		if (tsk->mm)
			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
	}
	acct_collect(code, group_dead);
	if (group_dead)
		tty_audit_exit();
	audit_free(tsk);

	tsk->exit_code = code;
	taskstats_exit(tsk, group_dead);

	exit_mm();

	if (group_dead)
		acct_process();
	trace_sched_process_exit(tsk);

	exit_sem(tsk);
	exit_shm(tsk);
	exit_files(tsk);
	exit_fs(tsk);
	if (group_dead)
		disassociate_ctty(1);
	exit_task_namespaces(tsk);
	exit_task_work(tsk);
	exit_thread(tsk);
	exit_umh(tsk);

	/*
	 * Flush inherited counters to the parent - before the parent
	 * gets woken up by child-exit notifications.
	 *
	 * because of cgroup mode, must be called before cgroup_exit()
	 */
	perf_event_exit_task(tsk);

	sched_autogroup_exit_task(tsk);
	cgroup_exit(tsk);

	/*
	 * FIXME: do that only when needed, using sched_exit tracepoint
	 */
	flush_ptrace_hw_breakpoint(tsk);

	exit_tasks_rcu_start();
	exit_notify(tsk, group_dead);
	proc_exit_connector(tsk);
	mpol_put_task_policy(tsk);
#ifdef CONFIG_FUTEX
	if (unlikely(current->pi_state_cache))
		kfree(current->pi_state_cache);
#endif
	/*
	 * Make sure we are holding no locks:
	 */
	debug_check_no_locks_held();
	/*
	 * We can do this unlocked here. The futex code uses this flag
	 * just to verify whether the pi state cleanup has been done
	 * or not. In the worst case it loops once more.
	 */
	tsk->flags |= PF_EXITPIDONE;

	if (tsk->io_context)
		exit_io_context(tsk);

	if (tsk->splice_pipe)
		free_pipe_info(tsk->splice_pipe);

	if (tsk->task_frag.page)
		put_page(tsk->task_frag.page);

	validate_creds_for_do_exit(tsk);

	check_stack_usage();
	preempt_disable();
	if (tsk->nr_dirtied)
		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
	exit_rcu();
	exit_tasks_rcu_finish();

	lockdep_free_task(tsk);
	do_task_dead();
}
EXPORT_SYMBOL_GPL(do_exit);

void complete_and_exit(struct completion *comp, long code)
{
	if (comp)
		complete(comp);

	do_exit(code);
}
EXPORT_SYMBOL(complete_and_exit);

SYSCALL_DEFINE1(exit, int, error_code)
{
	do_exit((error_code&0xff)<<8);
}

/*
 * Take down every thread in the group. This is called by fatal signals
 * as well as by sys_exit_group (below).
 */
void
do_group_exit(int exit_code)
{
	struct signal_struct *sig = current->signal;

	BUG_ON(exit_code & 0x80); /* core dumps don't get here */

	if (signal_group_exit(sig))
		exit_code = sig->group_exit_code;
	else if (!thread_group_empty(current)) {
		struct sighand_struct *const sighand = current->sighand;

		spin_lock_irq(&sighand->siglock);
		if (signal_group_exit(sig))
			/* Another thread got here before we took the lock. */
			exit_code = sig->group_exit_code;
		else {
			sig->group_exit_code = exit_code;
			sig->flags = SIGNAL_GROUP_EXIT;
			zap_other_threads(current);
		}
		spin_unlock_irq(&sighand->siglock);
	}

	do_exit(exit_code);
	/* NOTREACHED */
}

/*
 * this kills every thread in the thread group. Note that any externally
 * wait4()-ing process will get the correct exit code - even if this
 * thread is not the thread group leader.
 */
SYSCALL_DEFINE1(exit_group, int, error_code)
{
	do_group_exit((error_code & 0xff) << 8);
	/* NOTREACHED */
	return 0;
}

struct waitid_info {
	pid_t pid;
	uid_t uid;
	int status;
	int cause;
};

struct wait_opts {
	enum pid_type		wo_type;
	int			wo_flags;
	struct pid		*wo_pid;

	struct waitid_info	*wo_info;
	int			wo_stat;
	struct rusage		*wo_rusage;

	wait_queue_entry_t	child_wait;
	int			notask_error;
};

static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
{
	return	wo->wo_type == PIDTYPE_MAX ||
		task_pid_type(p, wo->wo_type) == wo->wo_pid;
}

static int
eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
{
	if (!eligible_pid(wo, p))
		return 0;

	/*
	 * Wait for all children (clone and not) if __WALL is set or
	 * if it is traced by us.
	 */
	if (ptrace || (wo->wo_flags & __WALL))
		return 1;

	/*
	 * Otherwise, wait for clone children *only* if __WCLONE is set;
	 * otherwise, wait for non-clone children *only*.
	 *
	 * Note: a "clone" child here is one that reports to its parent
	 * using a signal other than SIGCHLD, or a non-leader thread which
	 * we can only see if it is traced by us.
	 */
	if ((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE))
		return 0;

	return 1;
}

/*
 * Handle sys_wait4 work for one task in state EXIT_ZOMBIE. We hold
 * read_lock(&tasklist_lock) on entry. If we return zero, we still hold
 * the lock and this task is uninteresting. If we return nonzero, we have
 * released the lock and the system call should return.
 */
static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
{
	int state, status;
	pid_t pid = task_pid_vnr(p);
	uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p));
	struct waitid_info *infop;

	if (!likely(wo->wo_flags & WEXITED))
		return 0;

	if (unlikely(wo->wo_flags & WNOWAIT)) {
		status = p->exit_code;
		get_task_struct(p);
		read_unlock(&tasklist_lock);
		sched_annotate_sleep();
		if (wo->wo_rusage)
			getrusage(p, RUSAGE_BOTH, wo->wo_rusage);
		put_task_struct(p);
		goto out_info;
	}
	/*
	 * Move the task's state to DEAD/TRACE, only one thread can do this.
	 */
	state = (ptrace_reparented(p) && thread_group_leader(p)) ?
		EXIT_TRACE : EXIT_DEAD;
	if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
		return 0;
	/*
	 * We own this thread, nobody else can reap it.
	 */
	read_unlock(&tasklist_lock);
	sched_annotate_sleep();

	/*
	 * Check thread_group_leader() to exclude the traced sub-threads.
	 */
	if (state == EXIT_DEAD && thread_group_leader(p)) {
		struct signal_struct *sig = p->signal;
		struct signal_struct *psig = current->signal;
		unsigned long maxrss;
		u64 tgutime, tgstime;

		/*
		 * The resource counters for the group leader are in its
		 * own task_struct. Those for dead threads in the group
		 * are in its signal_struct, as are those for the child
		 * processes it has previously reaped. All these
		 * accumulate in the parent's signal_struct c* fields.
		 *
		 * We don't bother to take a lock here to protect these
		 * p->signal fields because the whole thread group is dead
		 * and nobody can change them.
		 *
		 * psig->stats_lock also protects us from our sub-theads
		 * which can reap other children at the same time. Until
		 * we change k_getrusage()-like users to rely on this lock
		 * we have to take ->siglock as well.
		 *
		 * We use thread_group_cputime_adjusted() to get times for
		 * the thread group, which consolidates times for all threads
		 * in the group including the group leader.
		 */
		thread_group_cputime_adjusted(p, &tgutime, &tgstime);
		spin_lock_irq(&current->sighand->siglock);
		write_seqlock(&psig->stats_lock);
		psig->cutime += tgutime + sig->cutime;
		psig->cstime += tgstime + sig->cstime;
		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
		psig->cmin_flt +=
			p->min_flt + sig->min_flt + sig->cmin_flt;
		psig->cmaj_flt +=
			p->maj_flt + sig->maj_flt + sig->cmaj_flt;
		psig->cnvcsw +=
			p->nvcsw + sig->nvcsw + sig->cnvcsw;
		psig->cnivcsw +=
			p->nivcsw + sig->nivcsw + sig->cnivcsw;
		psig->cinblock +=
			task_io_get_inblock(p) +
			sig->inblock + sig->cinblock;
		psig->coublock +=
			task_io_get_oublock(p) +
			sig->oublock + sig->coublock;
		maxrss = max(sig->maxrss, sig->cmaxrss);
		if (psig->cmaxrss < maxrss)
			psig->cmaxrss = maxrss;
		task_io_accounting_add(&psig->ioac, &p->ioac);
		task_io_accounting_add(&psig->ioac, &sig->ioac);
		write_sequnlock(&psig->stats_lock);
		spin_unlock_irq(&current->sighand->siglock);
	}

	if (wo->wo_rusage)
		getrusage(p, RUSAGE_BOTH, wo->wo_rusage);
	status = (p->signal->flags & SIGNAL_GROUP_EXIT)
		? p->signal->group_exit_code : p->exit_code;
	wo->wo_stat = status;

	if (state == EXIT_TRACE) {
		write_lock_irq(&tasklist_lock);
		/* We dropped tasklist, ptracer could die and untrace */
		ptrace_unlink(p);

		/* If parent wants a zombie, don't release it now */
		state = EXIT_ZOMBIE;
		if (do_notify_parent(p, p->exit_signal))
			state = EXIT_DEAD;
		p->exit_state = state;
		write_unlock_irq(&tasklist_lock);
	}
	if (state == EXIT_DEAD)
		release_task(p);

out_info:
	infop = wo->wo_info;
	if (infop) {
		if ((status & 0x7f) == 0) {
			infop->cause = CLD_EXITED;
			infop->status = status >> 8;
		} else {
			infop->cause = (status & 0x80) ? CLD_DUMPED : CLD_KILLED;
			infop->status = status & 0x7f;
		}
		infop->pid = pid;
		infop->uid = uid;
	}

	return pid;
}

static int *task_stopped_code(struct task_struct *p, bool ptrace)
{
	if (ptrace) {
		if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING))
			return &p->exit_code;
	} else {
		if (p->signal->flags & SIGNAL_STOP_STOPPED)
			return &p->signal->group_exit_code;
	}
	return NULL;
}

/**
 * wait_task_stopped - Wait for %TASK_STOPPED or %TASK_TRACED
 * @wo: wait options
 * @ptrace: is the wait for ptrace
 * @p: task to wait for
 *
 * Handle sys_wait4() work for %p in state %TASK_STOPPED or %TASK_TRACED.
 *
 * CONTEXT:
 * read_lock(&tasklist_lock), which is released if return value is
 * non-zero. Also, grabs and releases @p->sighand->siglock.
 *
 * RETURNS:
 * 0 if wait condition didn't exist and search for other wait conditions
 * should continue. Non-zero return, -errno on failure and @p's pid on
 * success, implies that tasklist_lock is released and wait condition
 * search should terminate.
 */
static int wait_task_stopped(struct wait_opts *wo,
				int ptrace, struct task_struct *p)
{
	struct waitid_info *infop;
	int exit_code, *p_code, why;
	uid_t uid = 0; /* unneeded, required by compiler */
	pid_t pid;

	/*
	 * Traditionally we see ptrace'd stopped tasks regardless of options.
	 */
	if (!ptrace && !(wo->wo_flags & WUNTRACED))
		return 0;

	if (!task_stopped_code(p, ptrace))
		return 0;

	exit_code = 0;
	spin_lock_irq(&p->sighand->siglock);

	p_code = task_stopped_code(p, ptrace);
	if (unlikely(!p_code))
		goto unlock_sig;

	exit_code = *p_code;
	if (!exit_code)
		goto unlock_sig;

	if (!unlikely(wo->wo_flags & WNOWAIT))
		*p_code = 0;

	uid = from_kuid_munged(current_user_ns(), task_uid(p));
unlock_sig:
	spin_unlock_irq(&p->sighand->siglock);
	if (!exit_code)
		return 0;

	/*
	 * Now we are pretty sure this task is interesting.
	 * Make sure it doesn't get reaped out from under us while we
	 * give up the lock and then examine it below. We don't want to
	 * keep holding onto the tasklist_lock while we call getrusage and
	 * possibly take page faults for user memory.
	 */
	get_task_struct(p);
	pid = task_pid_vnr(p);
	why = ptrace ? CLD_TRAPPED : CLD_STOPPED;
	read_unlock(&tasklist_lock);
	sched_annotate_sleep();
	if (wo->wo_rusage)
		getrusage(p, RUSAGE_BOTH, wo->wo_rusage);
	put_task_struct(p);

	if (likely(!(wo->wo_flags & WNOWAIT)))
		wo->wo_stat = (exit_code << 8) | 0x7f;

	infop = wo->wo_info;
	if (infop) {
		infop->cause = why;
		infop->status = exit_code;
		infop->pid = pid;
		infop->uid = uid;
	}
	return pid;
}

/*
 * Handle do_wait work for one task in a live, non-stopped state.
 * read_lock(&tasklist_lock) on entry. If we return zero, we still hold
 * the lock and this task is uninteresting. If we return nonzero, we have
 * released the lock and the system call should return.
 */
static int wait_task_continued(struct wait_opts *wo, struct task_struct *p)
{
	struct waitid_info *infop;
	pid_t pid;
	uid_t uid;

	if (!unlikely(wo->wo_flags & WCONTINUED))
		return 0;

	if (!(p->signal->flags & SIGNAL_STOP_CONTINUED))
		return 0;

	spin_lock_irq(&p->sighand->siglock);
	/* Re-check with the lock held. */
	if (!(p->signal->flags & SIGNAL_STOP_CONTINUED)) {
		spin_unlock_irq(&p->sighand->siglock);
		return 0;
	}
	if (!unlikely(wo->wo_flags & WNOWAIT))
		p->signal->flags &= ~SIGNAL_STOP_CONTINUED;
	uid = from_kuid_munged(current_user_ns(), task_uid(p));
	spin_unlock_irq(&p->sighand->siglock);

	pid = task_pid_vnr(p);
	get_task_struct(p);
	read_unlock(&tasklist_lock);
	sched_annotate_sleep();
	if (wo->wo_rusage)
		getrusage(p, RUSAGE_BOTH, wo->wo_rusage);
	put_task_struct(p);

	infop = wo->wo_info;
	if (!infop) {
		wo->wo_stat = 0xffff;
	} else {
		infop->cause = CLD_CONTINUED;
		infop->pid = pid;
		infop->uid = uid;
		infop->status = SIGCONT;
	}
	return pid;
}

/*
 * Consider @p for a wait by @parent.
 *
 * -ECHILD should be in ->notask_error before the first call.
 * Returns nonzero for a final return, when we have unlocked tasklist_lock.
 * Returns zero if the search for a child should continue;
 * then ->notask_error is 0 if @p is an eligible child,
 * or still -ECHILD.
 */
static int wait_consider_task(struct wait_opts *wo, int ptrace,
				struct task_struct *p)
{
	/*
	 * We can race with wait_task_zombie() from another thread.
	 * Ensure that EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE transition
	 * can't confuse the checks below.
	 */
	int exit_state = READ_ONCE(p->exit_state);
	int ret;

	if (unlikely(exit_state == EXIT_DEAD))
		return 0;

	ret = eligible_child(wo, ptrace, p);
	if (!ret)
		return ret;

	if (unlikely(exit_state == EXIT_TRACE)) {
		/*
		 * ptrace == 0 means we are the natural parent. In this case
		 * we should clear notask_error, debugger will notify us.
		 */
		if (likely(!ptrace))
			wo->notask_error = 0;
		return 0;
	}

	if (likely(!ptrace) && unlikely(p->ptrace)) {
		/*
		 * If it is traced by its real parent's group, just pretend
		 * the caller is ptrace_do_wait() and reap this child if it
		 * is zombie.
		 *
		 * This also hides group stop state from real parent; otherwise
		 * a single stop can be reported twice as group and ptrace stop.
		 * If a ptracer wants to distinguish these two events for its
		 * own children it should create a separate process which takes
		 * the role of real parent.
		 */
		if (!ptrace_reparented(p))
			ptrace = 1;
	}

	/* slay zombie? */
	if (exit_state == EXIT_ZOMBIE) {
		/* we don't reap group leaders with subthreads */
		if (!delay_group_leader(p)) {
			/*
			 * A zombie ptracee is only visible to its ptracer.
			 * Notification and reaping will be cascaded to the
			 * real parent when the ptracer detaches.
			 */
			if (unlikely(ptrace) || likely(!p->ptrace))
				return wait_task_zombie(wo, p);
		}

		/*
		 * Allow access to stopped/continued state via zombie by
		 * falling through. Clearing of notask_error is complex.
		 *
		 * When !@ptrace:
		 *
		 * If WEXITED is set, notask_error should naturally be
		 * cleared. If not, subset of WSTOPPED|WCONTINUED is set,
		 * so, if there are live subthreads, there are events to
		 * wait for. If all subthreads are dead, it's still safe
		 * to clear - this function will be called again in finite
		 * amount time once all the subthreads are released and
		 * will then return without clearing.
		 *
		 * When @ptrace:
		 *
		 * Stopped state is per-task and thus can't change once the
		 * target task dies. Only continued and exited can happen.
		 * Clear notask_error if WCONTINUED | WEXITED.
1415 | */ 1416 | if (likely(!ptrace) || (wo->wo_flags & (WCONTINUED | WEXITED))) 1417 | wo->notask_error = 0; 1418 | } else { 1419 | /* 1420 | * @p is alive and it's gonna stop, continue or exit, so 1421 | * there always is something to wait for. 1422 | */ 1423 | wo->notask_error = 0; 1424 | } 1425 | 1426 | /* 1427 | * Wait for stopped. Depending on @ptrace, different stopped state 1428 | * is used and the two don't interact with each other. 1429 | */ 1430 | ret = wait_task_stopped(wo, ptrace, p); 1431 | if (ret) 1432 | return ret; 1433 | 1434 | /* 1435 | * Wait for continued. There's only one continued state and the 1436 | * ptracer can consume it which can confuse the real parent. Don't 1437 | * use WCONTINUED from ptracer. You don't need or want it. 1438 | */ 1439 | return wait_task_continued(wo, p); 1440 | } 1441 | 1442 | /* 1443 | * Do the work of do_wait() for one thread in the group, @tsk. 1444 | * 1445 | * -ECHILD should be in ->notask_error before the first call. 1446 | * Returns nonzero for a final return, when we have unlocked tasklist_lock. 1447 | * Returns zero if the search for a child should continue; then 1448 | * ->notask_error is 0 if there were any eligible children, 1449 | * or still -ECHILD. 
1450 | */ 1451 | static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk) 1452 | { 1453 | struct task_struct *p; 1454 | 1455 | list_for_each_entry(p, &tsk->children, sibling) { 1456 | int ret = wait_consider_task(wo, 0, p); 1457 | 1458 | if (ret) 1459 | return ret; 1460 | } 1461 | 1462 | return 0; 1463 | } 1464 | 1465 | static int ptrace_do_wait(struct wait_opts *wo, struct task_struct *tsk) 1466 | { 1467 | struct task_struct *p; 1468 | 1469 | list_for_each_entry(p, &tsk->ptraced, ptrace_entry) { 1470 | int ret = wait_consider_task(wo, 1, p); 1471 | 1472 | if (ret) 1473 | return ret; 1474 | } 1475 | 1476 | return 0; 1477 | } 1478 | 1479 | static int child_wait_callback(wait_queue_entry_t *wait, unsigned mode, 1480 | int sync, void *key) 1481 | { 1482 | struct wait_opts *wo = container_of(wait, struct wait_opts, 1483 | child_wait); 1484 | struct task_struct *p = key; 1485 | 1486 | if (!eligible_pid(wo, p)) 1487 | return 0; 1488 | 1489 | if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent) 1490 | return 0; 1491 | 1492 | return default_wake_function(wait, mode, sync, key); 1493 | } 1494 | 1495 | void __wake_up_parent(struct task_struct *p, struct task_struct *parent) 1496 | { 1497 | __wake_up_sync_key(&parent->signal->wait_chldexit, 1498 | TASK_INTERRUPTIBLE, 1, p); 1499 | } 1500 | 1501 | static long do_wait(struct wait_opts *wo) 1502 | { 1503 | struct task_struct *tsk; 1504 | int retval; 1505 | 1506 | trace_sched_process_wait(wo->wo_pid); 1507 | 1508 | init_waitqueue_func_entry(&wo->child_wait, child_wait_callback); 1509 | wo->child_wait.private = current; 1510 | add_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait); 1511 | repeat: 1512 | /* 1513 | * If there is nothing that can match our criteria, just get out. 1514 | * We will clear ->notask_error to zero if we see any child that 1515 | * might later match our criteria, even if we are not able to reap 1516 | * it yet. 
1517 | */ 1518 | wo->notask_error = -ECHILD; 1519 | if ((wo->wo_type < PIDTYPE_MAX) && 1520 | (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type]))) 1521 | goto notask; 1522 | 1523 | set_current_state(TASK_INTERRUPTIBLE); 1524 | read_lock(&tasklist_lock); 1525 | tsk = current; 1526 | do { 1527 | retval = do_wait_thread(wo, tsk); 1528 | if (retval) 1529 | goto end; 1530 | 1531 | retval = ptrace_do_wait(wo, tsk); 1532 | if (retval) 1533 | goto end; 1534 | 1535 | if (wo->wo_flags & __WNOTHREAD) 1536 | break; 1537 | } while_each_thread(current, tsk); 1538 | read_unlock(&tasklist_lock); 1539 | 1540 | notask: 1541 | retval = wo->notask_error; 1542 | if (!retval && !(wo->wo_flags & WNOHANG)) { 1543 | retval = -ERESTARTSYS; 1544 | if (!signal_pending(current)) { 1545 | schedule(); 1546 | goto repeat; 1547 | } 1548 | } 1549 | end: 1550 | __set_current_state(TASK_RUNNING); 1551 | remove_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait); 1552 | return retval; 1553 | } 1554 | 1555 | static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop, 1556 | int options, struct rusage *ru) 1557 | { 1558 | struct wait_opts wo; 1559 | struct pid *pid = NULL; 1560 | enum pid_type type; 1561 | long ret; 1562 | 1563 | if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED| 1564 | __WNOTHREAD|__WCLONE|__WALL)) 1565 | return -EINVAL; 1566 | if (!(options & (WEXITED|WSTOPPED|WCONTINUED))) 1567 | return -EINVAL; 1568 | 1569 | switch (which) { 1570 | case P_ALL: 1571 | type = PIDTYPE_MAX; 1572 | break; 1573 | case P_PID: 1574 | type = PIDTYPE_PID; 1575 | if (upid <= 0) 1576 | return -EINVAL; 1577 | break; 1578 | case P_PGID: 1579 | type = PIDTYPE_PGID; 1580 | if (upid <= 0) 1581 | return -EINVAL; 1582 | break; 1583 | default: 1584 | return -EINVAL; 1585 | } 1586 | 1587 | if (type < PIDTYPE_MAX) 1588 | pid = find_get_pid(upid); 1589 | 1590 | wo.wo_type = type; 1591 | wo.wo_pid = pid; 1592 | wo.wo_flags = options; 1593 | wo.wo_info = infop; 1594 | wo.wo_rusage = 
ru; 1595 | ret = do_wait(&wo); 1596 | 1597 | put_pid(pid); 1598 | return ret; 1599 | } 1600 | 1601 | SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *, 1602 | infop, int, options, struct rusage __user *, ru) 1603 | { 1604 | struct rusage r; 1605 | struct waitid_info info = {.status = 0}; 1606 | long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL); 1607 | int signo = 0; 1608 | 1609 | if (err > 0) { 1610 | signo = SIGCHLD; 1611 | err = 0; 1612 | if (ru && copy_to_user(ru, &r, sizeof(struct rusage))) 1613 | return -EFAULT; 1614 | } 1615 | if (!infop) 1616 | return err; 1617 | 1618 | if (!user_access_begin(infop, sizeof(*infop))) 1619 | return -EFAULT; 1620 | 1621 | unsafe_put_user(signo, &infop->si_signo, Efault); 1622 | unsafe_put_user(0, &infop->si_errno, Efault); 1623 | unsafe_put_user(info.cause, &infop->si_code, Efault); 1624 | unsafe_put_user(info.pid, &infop->si_pid, Efault); 1625 | unsafe_put_user(info.uid, &infop->si_uid, Efault); 1626 | unsafe_put_user(info.status, &infop->si_status, Efault); 1627 | user_access_end(); 1628 | return err; 1629 | Efault: 1630 | user_access_end(); 1631 | return -EFAULT; 1632 | } 1633 | 1634 | long kernel_wait4(pid_t upid, int __user *stat_addr, int options, 1635 | struct rusage *ru) 1636 | { 1637 | struct wait_opts wo; 1638 | struct pid *pid = NULL; 1639 | enum pid_type type; 1640 | long ret; 1641 | 1642 | if (options & ~(WNOHANG|WUNTRACED|WCONTINUED| 1643 | __WNOTHREAD|__WCLONE|__WALL)) 1644 | return -EINVAL; 1645 | 1646 | /* -INT_MIN is not defined */ 1647 | if (upid == INT_MIN) 1648 | return -ESRCH; 1649 | 1650 | if (upid == -1) 1651 | type = PIDTYPE_MAX; 1652 | else if (upid < 0) { 1653 | type = PIDTYPE_PGID; 1654 | pid = find_get_pid(-upid); 1655 | } else if (upid == 0) { 1656 | type = PIDTYPE_PGID; 1657 | pid = get_task_pid(current, PIDTYPE_PGID); 1658 | } else /* upid > 0 */ { 1659 | type = PIDTYPE_PID; 1660 | pid = find_get_pid(upid); 1661 | } 1662 | 1663 | wo.wo_type = type; 1664 
| wo.wo_pid = pid; 1665 | wo.wo_flags = options | WEXITED; 1666 | wo.wo_info = NULL; 1667 | wo.wo_stat = 0; 1668 | wo.wo_rusage = ru; 1669 | ret = do_wait(&wo); 1670 | put_pid(pid); 1671 | if (ret > 0 && stat_addr && put_user(wo.wo_stat, stat_addr)) 1672 | ret = -EFAULT; 1673 | 1674 | return ret; 1675 | } 1676 | 1677 | SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr, 1678 | int, options, struct rusage __user *, ru) 1679 | { 1680 | struct rusage r; 1681 | long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL); 1682 | 1683 | if (err > 0) { 1684 | if (ru && copy_to_user(ru, &r, sizeof(struct rusage))) 1685 | return -EFAULT; 1686 | } 1687 | return err; 1688 | } 1689 | 1690 | #ifdef __ARCH_WANT_SYS_WAITPID 1691 | 1692 | /* 1693 | * sys_waitpid() remains for compatibility. waitpid() should be 1694 | * implemented by calling sys_wait4() from libc.a. 1695 | */ 1696 | SYSCALL_DEFINE3(waitpid, pid_t, pid, int __user *, stat_addr, int, options) 1697 | { 1698 | return kernel_wait4(pid, stat_addr, options, NULL); 1699 | } 1700 | 1701 | #endif 1702 | 1703 | #ifdef CONFIG_COMPAT 1704 | COMPAT_SYSCALL_DEFINE4(wait4, 1705 | compat_pid_t, pid, 1706 | compat_uint_t __user *, stat_addr, 1707 | int, options, 1708 | struct compat_rusage __user *, ru) 1709 | { 1710 | struct rusage r; 1711 | long err = kernel_wait4(pid, stat_addr, options, ru ? &r : NULL); 1712 | if (err > 0) { 1713 | if (ru && put_compat_rusage(&r, ru)) 1714 | return -EFAULT; 1715 | } 1716 | return err; 1717 | } 1718 | 1719 | COMPAT_SYSCALL_DEFINE5(waitid, 1720 | int, which, compat_pid_t, pid, 1721 | struct compat_siginfo __user *, infop, int, options, 1722 | struct compat_rusage __user *, uru) 1723 | { 1724 | struct rusage ru; 1725 | struct waitid_info info = {.status = 0}; 1726 | long err = kernel_waitid(which, pid, &info, options, uru ? 
&ru : NULL); 1727 | int signo = 0; 1728 | if (err > 0) { 1729 | signo = SIGCHLD; 1730 | err = 0; 1731 | if (uru) { 1732 | /* kernel_waitid() overwrites everything in ru */ 1733 | if (COMPAT_USE_64BIT_TIME) 1734 | err = copy_to_user(uru, &ru, sizeof(ru)); 1735 | else 1736 | err = put_compat_rusage(&ru, uru); 1737 | if (err) 1738 | return -EFAULT; 1739 | } 1740 | } 1741 | 1742 | if (!infop) 1743 | return err; 1744 | 1745 | if (!user_access_begin(infop, sizeof(*infop))) 1746 | return -EFAULT; 1747 | 1748 | unsafe_put_user(signo, &infop->si_signo, Efault); 1749 | unsafe_put_user(0, &infop->si_errno, Efault); 1750 | unsafe_put_user(info.cause, &infop->si_code, Efault); 1751 | unsafe_put_user(info.pid, &infop->si_pid, Efault); 1752 | unsafe_put_user(info.uid, &infop->si_uid, Efault); 1753 | unsafe_put_user(info.status, &infop->si_status, Efault); 1754 | user_access_end(); 1755 | return err; 1756 | Efault: 1757 | user_access_end(); 1758 | return -EFAULT; 1759 | } 1760 | #endif 1761 | 1762 | __weak void abort(void) 1763 | { 1764 | BUG(); 1765 | 1766 | /* if that doesn't kill us, halt */ 1767 | panic("Oops failed to kill thread"); 1768 | } 1769 | EXPORT_SYMBOL(abort); 1770 | -------------------------------------------------------------------------------- /Memory/README.md: -------------------------------------------------------------------------------- 1 | # Video materials: 2 | [![alt text](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Memory/wall/Memory.png)](https://www.youtube.com/playlist?list=PLVEEYpRafmwEwdHqvk-81VkrFq8ZLkXQ7 "Linux-Memory") 3 | -------------------------------------------------------------------------------- /Memory/wall/Memory.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Memory/wall/Memory.png 
-------------------------------------------------------------------------------- /PTE/README.MD: -------------------------------------------------------------------------------- 1 | ## Page Table Entries in Page Table 2 | 3 | Prerequisite – Paging 4 | 5 | A page table consists of page table entries, and each entry stores a frame number plus optional status bits (such as protection bits). Many of these status bits are used by the virtual memory system. The most important field in a PTE is the frame number. 6 | 7 | A page table entry holds the following information – 8 | 9 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/PTE/screen/Capture-24.png) 10 | 11 | ## Frame Number – 12 | It gives the number of the frame in which the requested page resides. The number of bits required depends on the number of frames. The frame field is also known as the address translation field. 13 | ``` 14 | Number of bits for frame number = log2(Size of physical memory / Frame size) 15 | ``` 16 | ## Present/Absent bit – 17 | The present/absent bit says whether the page you are looking for is currently in memory. If it is not present, a page fault occurs. The bit is set to 0 when the corresponding page is not in memory; the operating system uses it to handle page faults and support virtual memory. This bit is also known as the valid/invalid bit. 18 | 19 | ## Protection bit – 20 | The protection bits specify what kind of access is permitted on the page frame (read, write, and so on). 21 | 22 | ## Referenced bit – 23 | The referenced bit records whether the page has been referenced recently (for example, in the last clock interval). It is set to 1 by hardware when the page is accessed. 24 | 25 | ## Caching enabled/disabled – 26 | Sometimes we need fresh data. Say the user is typing input on the keyboard and your program must respond to that input. In that case, the information comes into main memory.
Therefore main memory holds the latest information typed by the user. If that page were cached, the cache could still show stale data. So whenever freshness is required, we do not want to cache the page or go through multiple levels of the memory hierarchy: the data in the level closest to the CPU and the data in the level closest to the user might differ. We want the two to stay consistent, meaning the CPU should see the user's input as soon as possible; that is why caching is disabled in such cases. This bit, then, enables or disables caching of the page. 27 | 28 | ## Modified bit – 29 | The modified bit records whether the page has been written to. If a page has been modified, then before it can be replaced by another page its contents must be written back (saved) to the disk. The bit is set to 1 by hardware on a write access to the page, which lets the system skip the write-back when an unmodified page is swapped out. This modified bit is also called the 30 | ## Dirty bit. 31 | 32 | ## GATE CS Corner Questions 33 | 34 | Practicing the following questions will help you test your knowledge. All of them have been asked in GATE in previous years or in GATE mock tests, so it is highly recommended that you practice them. 35 | 36 | 37 | 1. GATE CS 2001, Question 46 38 | 39 | 2. GATE CS 2004, Question 66 40 | 41 | 3.
GATE CS 2015 (Set 1), Question 65 42 | 43 | -------------------------------------------------------------------------------- /PTE/screen/Capture-24.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/PTE/screen/Capture-24.png -------------------------------------------------------------------------------- /PTE/screen/screen: -------------------------------------------------------------------------------- 1 | screen 2 | -------------------------------------------------------------------------------- /Paging/README.MD: -------------------------------------------------------------------------------- 1 | ## Paging in Operating System 2 | 3 | Paging is a memory management scheme that eliminates the need for contiguous allocation of physical memory. This scheme permits the physical address space of a process to be non-contiguous. 4 | 5 | - Logical Address or Virtual Address (represented in bits): an address generated by the CPU 6 | 7 | - Logical Address Space or Virtual Address Space (represented in words or bytes): the set of all logical addresses generated by a program 8 | 9 | - Physical Address (represented in bits): an address actually available in the memory unit 10 | 11 | - Physical Address Space (represented in words or bytes): the set of all physical addresses corresponding to the logical addresses 12 | 13 | ## Example: 14 | 15 | - If Logical Address = 31 bits, then Logical Address Space = 2^31 words = 2 G words (1 G = 2^30) 16 | 17 | - If Logical Address Space = 128 M words = 2^7 * 2^20 words, then Logical Address = log2(2^27) = 27 bits 18 | 19 | - If Physical Address = 22 bits, then Physical Address Space = 2^22 words = 4 M words (1 M = 2^20) 20 | 21 | - If Physical Address Space = 16 M words = 2^4 * 2^20 words, then Physical Address = log2(2^24) = 24 bits 22 | 23 | The mapping from virtual to physical address is done by the
memory management unit (MMU), a hardware device; this mapping is known as the paging technique. 24 | 25 | The Physical Address Space is conceptually divided into a number of fixed-size blocks, called frames. 26 | The Logical Address Space is likewise split into fixed-size blocks, called pages. 27 | Page Size = Frame Size 28 | 29 | ## Let us consider an example: 30 | 31 | - Physical Address = 12 bits, then Physical Address Space = 4 K words 32 | - Logical Address = 13 bits, then Logical Address Space = 8 K words 33 | - Page size = frame size = 1 K words (assumption) 34 | 35 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Paging/screen/paging.jpg) 36 | 37 | ## Address generated by CPU is divided into 38 | 39 | - Page number (p): the number of bits required to represent the pages in the Logical Address Space, i.e. the page number 40 | - Page offset (d): the number of bits required to represent a particular word in a page, i.e. the word number within a page, or page offset 41 | 42 | ## Physical Address is divided into 43 | 44 | - Frame number (f): the number of bits required to represent the frames in the Physical Address Space, i.e. the frame number 45 | - Frame offset (d): the number of bits required to represent a particular word in a frame, i.e. the word number within a frame, or frame offset 46 | 47 | The page table can be implemented in hardware with dedicated registers, but registers are satisfactory only if the page table is small. If the page table contains a large number of entries, we can use a TLB (translation look-aside buffer), a special, small, fast hardware lookup cache. 48 | 49 | - The TLB is an associative, high-speed memory. 50 | - Each TLB entry consists of two parts: a tag and a value. 51 | - When this memory is searched, an item is compared with all tags simultaneously; if the item is found, the corresponding value is returned.
52 | 53 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Paging/screen/paging-2.jpg) 54 | 55 | Main memory access time = m 56 | If the page table is kept in main memory, 57 | Effective access time = m (to access the page table) + m (to access the required word) = 2m 58 | 59 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Paging/screen/paging-3.jpg) 60 | 61 | -------------------------------------------------------------------------------- /Paging/screen/paging-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Paging/screen/paging-2.jpg -------------------------------------------------------------------------------- /Paging/screen/paging-3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Paging/screen/paging-3.jpg -------------------------------------------------------------------------------- /Paging/screen/paging.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Paging/screen/paging.jpg -------------------------------------------------------------------------------- /Paging/screen/screen: -------------------------------------------------------------------------------- 1 | screen 2 | -------------------------------------------------------------------------------- /Process execution priorities/README.MD: -------------------------------------------------------------------------------- 1 | # Learn Linux: Process execution priorities 2 | 3 | - Keeping your eye on what's going on!!!
4 | 5 | Overview 6 | 7 | This tutorial grounds you in the basic Linux techniques for managing execution process priorities. Learn to: 8 | 9 | - Understand process priorities 10 | - Set process priorities 11 | - Change process priorities 12 | 13 | This tutorial helps you prepare for Objective 103.6 in Topic 103 of the Linux Server Professional (LPIC-1) exam 101. The objective has a weight of 2. 14 | # Linux task priorities 15 | 16 | Linux, like most modern operating systems, can run multiple processes. It does this by sharing the CPU and other resources among the processes. If one process uses 100 percent of the CPU, other processes may become unresponsive. We’ll introduce you to the way Linux assigns priorities for tasks. 17 | 18 | 19 | - Prerequisites 20 | 21 | To get the most from the tutorials in this series, you should have a basic knowledge of Linux and a working Linux system on which you can practice the commands covered in this tutorial. Different versions of a program sometimes format output differently, so your results may not always look exactly like the listings and figures shown here. The results in the examples shown here were obtained on an Ubuntu 15.04 distribution. 22 | 23 | - Knowing your priorities 24 | 25 | If you run the `top` command, by default it displays processes in decreasing order of CPU usage, as shown in Listing 1. A process that spends most of its time idle rather than using the CPU probably won’t make it onto top’s output list at all. 26 | 27 | 28 | # Listing 1. Typical output from top on a Linux workstation 29 | 30 | ```bash 31 | 32 | top ‑ 22:47:44 up 1 day, 12:44, 3 users, load average: 0.00, 0.01, 0.05 33 | Tasks: 188 total, 1 running, 187 sleeping, 0 stopped, 0 zombie 34 | %Cpu(s): 0.2 us, 0.0 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st 35 | KiB Mem: 8090144 total, 2145616 used, 5944528 free, 81880 buffers 36 | KiB Swap: 4095996 total, 100660 used, 3995336 free.
1464920 cached Mem 37 | 38 | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 39 | 2215 ian 20 0 1549252 162644 79992 S 0.7 2.0 16:11.61 compiz 40 | 9 root 20 0 0 0 0 S 0.3 0.0 0:08.04 rcuos/0 41 | 4918 ian 20 0 29184 3120 2612 R 0.3 0.0 0:00.47 top 42 | 1 root 20 0 182732 5392 3648 S 0.0 0.1 0:03.74 systemd 43 | 2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd 44 | 3 root 20 0 0 0 0 S 0.0 0.0 0:00.10 ksoftirqd/0 45 | 5 root 0 ‑20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H 46 | 7 root 20 0 0 0 0 S 0.0 0.0 0:08.42 rcu_sched 47 | 8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh 48 | 10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/0 49 | 11 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0 50 | 12 root rt 0 0 0 0 S 0.0 0.0 0:00.77 watchdog/0 51 | 13 root rt 0 0 0 0 S 0.0 0.0 0:00.78 watchdog/1 52 | 14 root rt 0 0 0 0 S 0.0 0.0 0:00.18 migration/1 53 | 15 root 20 0 0 0 0 S 0.0 0.0 0:00.11 ksoftirqd/1 54 | 17 root 0 ‑20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H 55 | 18 root 20 0 0 0 0 S 0.0 0.0 0:00.46 rcuos/1 56 | ``` 57 | 58 | Your system may have many commands that are capable of using lots of CPU. Examples include movie editing tools and programs that convert between different image types or different sound encodings, such as MP3 to Ogg. 59 | 60 | When you only have one or a limited number of CPUs, you need to decide how to share those limited CPU resources among several competing processes. This is generally done by selecting one process for execution and letting it run for a short period (called a timeslice), or until it needs to wait for some event, such as I/O completion. To ensure that important processes don’t get starved out by CPU hogs, the selection is based on a scheduling priority. The NI column in Listing 1 above shows the scheduling priority, or niceness, of each process. Niceness generally ranges from -20 to 19, with -20 being the most favorable or highest priority for scheduling and 19 being the least favorable or lowest priority.
61 | 62 | 63 | # Using ps to find niceness 64 | 65 | In addition to the top command, you can also display niceness values using the ps command. You can either customize the output as you saw in an earlier tutorial, or simply use ps -l; the output of ps -l is shown in Listing 2. As with top, look for the niceness value in the NI column. 66 | 67 | - Listing 2. Using ps to find niceness 68 | 69 | ```bash 70 | 71 | :~$ ps ‑l 72 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 73 | 0 S 1000 3850 3849 0 80 0 ‑ 6726 wait pts/5 00:00:00 bash 74 | 0 R 1000 4924 3850 0 80 0 ‑ 3561 ‑ pts/5 00:00:00 ps 75 | ``` 76 | 77 | # Default niceness 78 | 79 | You may have guessed from Listing 1 or Listing 2 that the default niceness, at least for processes started by regular users, is 0. This is usually the case on current Linux systems. You can verify the value for your shell and system by running the nice command with no parameters, as shown in Listing 3. 80 | 81 | 82 | - Listing 3. Checking default niceness 83 | 84 | ```bash 85 | :~$ nice 86 | 0 87 | ``` 88 | 89 | # Setting priorities 90 | 91 | Before we look at how to set or change niceness values, let’s build a little CPU-intensive script that will show how niceness really works. 92 | 93 | - A CPU-intensive script 94 | 95 | We’ll create a small script that just uses CPU and does little else. The script takes two inputs, a count and a label. It prints the label and the current date and time, then spins, decrementing the count till it reaches 0, and finally prints the label and the date again. The script, shown in Listing 4, has no error checking and is not very robust, but it illustrates our point. 96 | 97 | 98 | - Listing 4.
CPU-intensive script 99 | ```bash 100 | 101 | :~$ echo 'x="$1"'>count1.sh 102 | :~$ echo 'echo "$2" $(date)'>>count1.sh 103 | :~$ echo 'while [ $x ‑gt 0 ]; do x=$(( x‑1 ));done'>>count1.sh 104 | :~$ echo 'echo "$2" $(date)'>>count1.sh 105 | :~$ cat count1.sh 106 | x="$1" 107 | echo "$2" $(date) 108 | while [ $x ‑gt 0 ]; do x=$(( x‑1 ));done 109 | echo "$2" $(date) 110 | ``` 111 | 112 | If you run this on your own system, you might see output similar to Listing 5. Depending on the speed of your system, you may have to increase the count value to even see a difference in the times. This script uses lots of CPU, as we’ll see in a moment. If your default shell is not Bash, and if the script does not work for you, then use the second form of calling shown below. If you are not using your own workstation, make sure that it is okay to use lots of CPU before you run the script. 113 | 114 | - Listing 5. Running count1.sh 115 | 116 | ```bash 117 | 118 | :~$ sh count1.sh 10000 A 119 | A Thu Jul 16 23:13:07 EDT 2015 120 | A Thu Jul 16 23:13:07 EDT 2015 121 | :~$ bash count1.sh 99000 A 122 | A Thu Jul 16 23:13:53 EDT 2015 123 | A Thu Jul 16 23:13:54 EDT 2015 124 | ``` 125 | 126 | 127 | 128 | So far, so good. Now let’s create a command list to run the script in background and launch the top command to see how much CPU the script is using. The command list is shown in Listing 6 and the output from top in Listing 7. 129 | 130 | - Listing 6. Running count1.sh and top 131 | 132 | ```bash 133 | :~$ (sh count1.sh 5000000 A&);top 134 | ``` 135 | 136 | - Listing 7. Using lots of CPU 137 | 138 | ```bash 139 | 140 | top ‑ 23:19:30 up 1 day, 13:16, 3 users, load average: 0.15, 0.06, 0.05 141 | Tasks: 190 total, 2 running, 188 sleeping, 0 stopped, 0 zombie 142 | %Cpu(s): 25.1 us, 0.0 sy, 0.0 ni, 74.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st 143 | KiB Mem: 8090144 total, 2145600 used, 5944544 free, 82024 buffers 144 | KiB Swap: 4095996 total, 100644 used, 3995352 free. 
1464940 cached Mem 145 | 146 | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 147 | 4952 ian 20 0 4472 736 652 R 100.0 0.0 0:06.18 sh 148 | 2043 ian 20 0 552900 34544 24464 S 0.3 0.4 0:15.14 unity‑panel‑+ 149 | 2215 ian 20 0 1549252 162644 79992 S 0.3 2.0 16:28.20 compiz 150 | 1 root 20 0 182732 5392 3648 S 0.0 0.1 0:03.76 systemd 151 | 2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd 152 | 3 root 20 0 0 0 0 S 0.0 0.0 0:00.10 ksoftirqd/0 153 | ``` 154 | Not bad. We are using 100 percent of one of the CPUs on this system with just a simple script. If you want to stress multiple CPUs, you can add an extra invocation of count1.sh to the command list. If we had a long-running job such as this, we might find that it interfered with our ability (or the ability of other users) to do other work on our system. 155 | 156 | # Using nice to set priorities 157 | 158 | Now that we can keep a CPU busy for a while, we’ll see how to set a priority for a process. To summarize what we’ve learned so far: 159 | 160 | - Linux and UNIX® systems use a priority system with 40 priorities, ranging from -20 (highest priority) to 19 (lowest priority). 161 | - Processes started by regular users usually have priority 0. 162 | - The ps command can display the priority (the nice, or NI, level) using the -l option. 163 | - The nice command displays our default priority. 164 | 165 | The nice command can also be used to start a process with a different priority. You use the -n (or --adjustment) option with a positive value to increase the priority value and a negative value to decrease it. Remember that processes with the lowest priority value run at the highest scheduling priority, so think of increasing the priority value as being nice to other processes. Note that you usually need to be the superuser (root) to specify negative priority adjustments. In other words, regular users can usually only make their processes nicer.
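The adjustment behavior is easy to see with nice itself: run nice (which, with no arguments, prints the current niceness) under another nice. A quick check, assuming the shell is at the usual default niceness of 0:

```shell
# With no arguments, nice prints the shell's current niceness (usually 0).
nice
# Run nice itself under a +10 adjustment: the child reports its
# parent's niceness plus 10.
nice -n 10 nice
```

Adjustments accumulate relative to the parent, which is why renice (covered below) takes an absolute value instead.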
166 | 167 | To demonstrate the use of nice to set priorities, let’s start two copies of the count1.sh script in different subshells at the same time, but give one the maximum niceness of 19. After a second we’ll use ps -l to display the process status, including niceness. Finally, we’ll add an arbitrary 10-second sleep to ensure the command sequence finishes after the two subshells do. That way, we won’t get a new prompt while we’re still waiting for output. The result is shown in Listing 8. 168 | 169 | - Listing 8. Using nice to set priorities for a pair of processes 170 | 171 | ```bash 172 | 173 | :~$ (sh count1.sh 2000000 A&);(nice -n 19 sh count1.sh 2000000 B&);> sleep 1;ps -l;sleep 10 174 | A Fri Jul 17 17:10:33 EDT 2015 175 | B Fri Jul 17 17:10:33 EDT 2015 176 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 177 | 0 S 1000 3850 3849 0 80 0 - 6726 wait pts/5 00:00:00 bash 178 | 0 R 1000 5614 1 99 80 0 - 1118 - pts/5 00:00:01 sh 179 | 0 R 1000 5617 1 99 99 19 - 1118 - pts/5 00:00:01 sh 180 | 0 R 1000 5620 3850 0 80 0 - 3561 - pts/5 00:00:00 ps 181 | A Fri Jul 17 17:10:36 EDT 2015 182 | B Fri Jul 17 17:10:36 EDT 2015 183 | ``` 184 | 185 | Are you surprised that the two jobs finished at the same time? What happened to our priority setting? Remember that the script occupied one of our CPUs. This particular system runs on an Intel(R) Core(TM) i7 processor, which is very lightly loaded, so each core ran one process, and there wasn’t any need to prioritize them. 186 | 187 | So let’s try starting four processes at four different niceness levels (0, 6, 12, and 18) and see what happens. We’ll increase the busy count parameter for each so they run a little longer. Before you look at Listing 9, think about what you might expect, given what you’ve already seen. 188 | 189 | - Listing 9.
Using nice to set priorities for four processes 190 | 191 | ```bash 192 | 193 | :~$ (sh count1.sh 5000000 A&);(nice -n 6 sh count1.sh 5000000 B&); 194 | > (nice -n 12 sh count1.sh 5000000 C&);(nice -n 18 sh count1.sh 5000000 D&); 195 | > sleep 1;ps -l;sleep 30 196 | A Fri Jul 17 17:13:05 EDT 2015 197 | C Fri Jul 17 17:13:05 EDT 2015 198 | B Fri Jul 17 17:13:05 EDT 2015 199 | D Fri Jul 17 17:13:05 EDT 2015 200 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 201 | 0 S 1000 3850 3849 0 80 0 - 6726 wait pts/5 00:00:00 bash 202 | 0 R 1000 5626 1 99 80 0 - 1118 - pts/5 00:00:01 sh 203 | 0 R 1000 5629 1 99 86 6 - 1118 - pts/5 00:00:01 sh 204 | 0 R 1000 5631 1 90 92 12 - 1118 - pts/5 00:00:00 sh 205 | 0 R 1000 5633 1 58 98 18 - 1118 - pts/5 00:00:00 sh 206 | 0 R 1000 5638 3850 0 80 0 - 3561 - pts/5 00:00:00 ps 207 | A Fri Jul 17 17:13:15 EDT 2015 208 | B Fri Jul 17 17:13:21 EDT 2015 209 | C Fri Jul 17 17:13:26 EDT 2015 210 | D Fri Jul 17 17:13:27 EDT 2015 211 | ``` 212 | With four different priorities, we see the effect of the different niceness values as each job finishes in priority order. Try experimenting with different nice values to demonstrate the different possibilities for yourself. 213 | 214 | A final note on starting processes with nice: as with the nohup command, you cannot use a command list or a pipeline as the argument of nice. 215 | 216 | # Changing priorities 217 | - `renice` 218 | 219 | If you happen to start a process and realize that it should run at a different priority, there is a way to change it after it has started, using the renice command. You specify an absolute priority (and not an adjustment) for the process or processes to be changed as shown in Listing 10. 220 | 221 | - Listing 10.
Using renice to change priorities 222 | 223 | ```bash 224 | 225 | :~$ sh count1.sh 100000000 A& 226 | [1] 5724 227 | :~$ A Fri Jul 17 17:30:20 EDT 2015 228 | renice 1 5724;ps -l 5724 229 | 5724 (process ID) old priority 0, new priority 1 230 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 231 | 0 R 1000 5724 3850 99 81 1 - 1118 - pts/5 0:35 sh count1.sh 10000 232 | :~$ renice +3 5724;ps -l 5724 233 | 5724 (process ID) old priority 1, new priority 3 234 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 235 | 0 R 1000 5724 3850 99 83 3 - 1118 - pts/5 0:50 sh count1.sh 10000 236 | :~$ sudo renice -8 5724;ps -l 5724 237 | 5724 (process ID) old priority 3, new priority -8 238 | F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 239 | 0 R 1000 5724 3850 99 72 -8 - 1118 - pts/5 1:01 sh count1.sh 10000 240 | ``` 241 | 242 | Remember that you have to be the superuser to give your processes higher scheduling priority and make them less nice. 243 | 244 | You can find more information on nice and renice in the man pages. 245 | 246 | This concludes your introduction to process execution priorities.
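The same renice workflow can be scripted against any throwaway background job. The sketch below is hypothetical (the `sleep` job and the niceness value 5 are arbitrary, not taken from Listing 10), but it shows the pattern: capture the PID with `$!`, renice it, and confirm the new NI value with `ps`.

```shell
# Start a harmless background job and remember its PID.
sleep 60 &
pid=$!

# As a regular user we can only raise niceness; 5 is an arbitrary choice.
renice 5 -p "$pid"

# Confirm the change: the NI column for this PID should now read 5.
ps -o pid=,ni= -p "$pid"

# Clean up the throwaway job.
kill "$pid"
```

Because the priority given to renice is absolute rather than an adjustment, running the same `renice 5 -p "$pid"` twice leaves the niceness at 5, unlike stacking two `nice -n 5` wrappers at process start.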
247 | 248 | 249 | 250 | 251 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Kernel-and-Types-of-kernels 2 | ![image](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/wall/gV8hn.png) 3 | 4 | - Last update from nu11secur1ty: https://github.com/nu11secur1ty/kernel-4.19.0V.Varbanovski_lp150.12.22_default 5 | 6 | # Importance_of_Linux_partitions 7 | [Read...](https://github.com/nu11secur1ty/Linux_Deployment_Administration_Hacks/tree/master/Importance_of_Linux_partitions) 8 | 9 | # Kernel Space, User Space Interfaces 10 | [Read...](https://www.nu11secur1ty.com/2019/01/kernel-space-user-space-interfaces.html) 11 | 12 | ----------------------------------------------------------------------------------------------------------------- 13 | # Reference 14 | 15 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/scheme/kernels.png) 16 | 17 | The kernel is the core of an operating system. It is the software responsible for running programs and providing secure access to the machine's hardware. Since there are many programs, and resources are limited, the kernel also decides when and how long a program should run. This is called scheduling. Accessing the hardware directly can be very complex, since there are many different hardware designs for the same type of component. Kernels usually implement some level of hardware abstraction (a set of instructions universal to all devices of a certain type) to hide the underlying complexity from applications and provide a clean and uniform interface. This helps application programmers to develop programs without having to know how to program for specific devices. The kernel relies upon software drivers that translate the generic command into instructions specific to that device. 18 | 19 | An operating system kernel is not strictly needed to run a computer. 
Programs can be directly loaded and executed on the "bare metal" machine, provided that the authors of those programs are willing to do without any hardware abstraction or operating system support. This was the normal operating method of many early computers, which were reset and reloaded between the running of different programs. Eventually, small ancillary programs such as program loaders and debuggers were typically left in-core between runs, or loaded from read-only memory. As these were developed, they formed the basis of what became early operating system kernels. The "bare metal" approach is still used today on many video game consoles and embedded systems, but in general, newer systems use kernels and operating systems. 20 | 21 | Four broad categories of kernels: 22 | 23 | - Monolithic kernels provide rich and powerful abstractions of the underlying hardware. 24 | - Microkernels provide a small set of simple hardware abstractions and use applications called servers to provide more functionality. 25 | - Exokernels provide minimal abstractions, allowing low-level hardware access. In exokernel systems, library operating systems provide the abstractions typically present in monolithic kernels. 26 | - Hybrid (modified microkernels) are much like pure microkernels, except that they include some additional code in kernelspace to increase performance. 27 | 28 | 29 | 30 | ------------------------------------------------------------------------------------------------------------- 31 | # Monolithic 32 | 33 | The monolithic approach is to define a high-level virtual interface over the hardware, with a set of primitives or system calls to implement operating system services such as process management, concurrency, and memory management in several modules that run in supervisor mode.
34 | 35 | Even if every module servicing these operations is separate from the whole, the code integration is very tight and difficult to do correctly, and, since all the modules run in the same address space, a bug in one module can bring down the whole system. However, when the implementation is complete and trustworthy, the tight internal integration of components allows the low-level features of the underlying system to be effectively utilized, making a good monolithic kernel highly efficient. Proponents of the monolithic kernel approach make the case that if code is incorrect, it does not belong in a kernel, and if it is, there is little advantage in the microkernel approach. More modern monolithic kernels such as Linux, FreeBSD and Solaris can load executable modules at runtime, allowing easy extension of the kernel's capabilities as required, while helping to keep the amount of code running in kernel-space to a minimum. The monolithic kernel runs alone in supervisor mode. 36 | 37 | ----------------------------------------------------------------------------------------------------------------- 38 | 39 | # Microkernels 40 | The microkernel approach is to define a very simple abstraction over the hardware, with a set of primitives or system calls to implement minimal OS services such as thread management, address spaces and interprocess communication. All other services, those normally provided by the kernel such as networking, are implemented in user-space programs referred to as servers. Servers are programs like any others, allowing the operating system to be modified simply by starting and stopping programs. For a small machine without networking support, for instance, the networking server simply isn't started. Under a traditional system this would require the kernel to be recompiled, something well beyond the capabilities of the average end-user.
In theory the system is also more stable, because a failing server simply stops a single program, rather than causing the kernel itself to crash. 41 | 42 | However, part of the system state is lost with the failing server, and it is generally difficult to continue execution of applications, or even of other servers, with a fresh copy. For example, if a (hypothetical) server responsible for TCP/IP connections is restarted, applications could be told the connection was "lost" and reconnect, going through the new instance of the server. However, other system objects, like files, do not have these convenient semantics: they are supposed to be reliable, not to become unavailable randomly, and to keep all the information written to them previously. So, database techniques like transactions, replication and checkpointing need to be used between servers in order to preserve essential state across single server restarts. 43 | 44 | Microkernels generally underperform traditional designs, sometimes dramatically. This is due in large part to the overhead of moving in and out of the kernel, a context switch, in order to move data between the various applications and servers. It was originally believed that careful tuning could reduce this overhead dramatically, but by the mid-90s most researchers had given up. In more recent times newer microkernels, designed for performance first, have addressed these problems to a very large degree. Nevertheless, the market for existing operating systems is so entrenched that little work continues on microkernel design. 45 | 46 | ------------------------------------------------------------------------------------------------------------------ 47 | 48 | # Exokernel 49 | 50 | - General 51 | An exokernel is a type of operating system where the kernel is limited to extending resources to sub operating systems called LibOS's, resulting in a very small, fast kernel environment.
The theory behind this method is that by providing as few abstractions as possible, programs are able to do exactly what they want in a controlled environment, much as MS-DOS achieved through real mode, except with paging and other modern programming techniques. 52 | - LibOS 53 | LibOS's provide a way for the programmer of an exokernel type system to easily program cross-platform programs using familiar interfaces, instead of having to write their own. Moreover, they provide an additional advantage over monolithic kernels in that by having multiple LibOS's running at the same time, one can theoretically run programs from Linux, Windows, and Mac (provided that there is a LibOS for that system) all at the same time, on the same OS, and with no performance issues. 54 | 55 | -------------------------------------------------------------------------------------------------------------------- 56 | 57 | # Hybrid 58 | 59 | A hybrid kernel is one that combines aspects of both micro and monolithic kernels, but there is no exact definition. Often, "hybrid kernel" means that the kernel is highly modular, but all runs in the same address space. This allows the kernel to avoid the overhead of a complicated message passing system within the kernel, while still retaining some microkernel-like features.
60 | 61 | -------------------------------------------------------------------------------------------------------------------- 62 | 63 | # INIT and zombie 64 | [see the rule](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Linux/exit/exit.c#L652) 65 | 66 | - Or all code: 67 | ```c 68 | // SPDX-License-Identifier: GPL-2.0-only 69 | /* 70 | * linux/kernel/exit.c 71 | * 72 | * Copyright (C) 1991, 1992 Linus Torvalds 73 | */ 74 | 75 | #include 76 | #include 77 | #include 78 | #include 79 | #include 80 | #include 81 | #include 82 | #include 83 | #include 84 | #include 85 | #include 86 | #include 87 | #include 88 | #include 89 | #include 90 | #include 91 | #include 92 | #include 93 | #include 94 | #include 95 | #include 96 | #include 97 | #include 98 | #include 99 | #include 100 | #include 101 | #include 102 | #include 103 | #include 104 | #include 105 | #include 106 | #include 107 | #include 108 | #include 109 | #include 110 | #include 111 | #include 112 | #include 113 | #include 114 | #include 115 | #include 116 | #include /* for audit_free() */ 117 | #include 118 | #include 119 | #include 120 | #include 121 | #include 122 | #include 123 | #include 124 | #include 125 | #include 126 | #include 127 | #include 128 | #include 129 | #include 130 | #include 131 | #include 132 | #include 133 | 134 | #include 135 | #include 136 | #include 137 | #include 138 | 139 | static void __unhash_process(struct task_struct *p, bool group_dead) 140 | { 141 | nr_threads--; 142 | detach_pid(p, PIDTYPE_PID); 143 | if (group_dead) { 144 | detach_pid(p, PIDTYPE_TGID); 145 | detach_pid(p, PIDTYPE_PGID); 146 | detach_pid(p, PIDTYPE_SID); 147 | 148 | list_del_rcu(&p->tasks); 149 | list_del_init(&p->sibling); 150 | __this_cpu_dec(process_counts); 151 | } 152 | list_del_rcu(&p->thread_group); 153 | list_del_rcu(&p->thread_node); 154 | } 155 | 156 | /* 157 | * This function expects the tasklist_lock write-locked. 
158 | */ 159 | static void __exit_signal(struct task_struct *tsk) 160 | { 161 | struct signal_struct *sig = tsk->signal; 162 | bool group_dead = thread_group_leader(tsk); 163 | struct sighand_struct *sighand; 164 | struct tty_struct *uninitialized_var(tty); 165 | u64 utime, stime; 166 | 167 | sighand = rcu_dereference_check(tsk->sighand, 168 | lockdep_tasklist_lock_is_held()); 169 | spin_lock(&sighand->siglock); 170 | 171 | #ifdef CONFIG_POSIX_TIMERS 172 | posix_cpu_timers_exit(tsk); 173 | if (group_dead) { 174 | posix_cpu_timers_exit_group(tsk); 175 | } else { 176 | /* 177 | * This can only happen if the caller is de_thread(). 178 | * FIXME: this is the temporary hack, we should teach 179 | * posix-cpu-timers to handle this case correctly. 180 | */ 181 | if (unlikely(has_group_leader_pid(tsk))) 182 | posix_cpu_timers_exit_group(tsk); 183 | } 184 | #endif 185 | 186 | if (group_dead) { 187 | tty = sig->tty; 188 | sig->tty = NULL; 189 | } else { 190 | /* 191 | * If there is any task waiting for the group exit 192 | * then notify it: 193 | */ 194 | if (sig->notify_count > 0 && !--sig->notify_count) 195 | wake_up_process(sig->group_exit_task); 196 | 197 | if (tsk == sig->curr_target) 198 | sig->curr_target = next_thread(tsk); 199 | } 200 | 201 | add_device_randomness((const void*) &tsk->se.sum_exec_runtime, 202 | sizeof(unsigned long long)); 203 | 204 | /* 205 | * Accumulate here the counters for all threads as they die. We could 206 | * skip the group leader because it is the last user of signal_struct, 207 | * but we want to avoid the race with thread_group_cputime() which can 208 | * see the empty ->thread_head list. 
209 | */ 210 | task_cputime(tsk, &utime, &stime); 211 | write_seqlock(&sig->stats_lock); 212 | sig->utime += utime; 213 | sig->stime += stime; 214 | sig->gtime += task_gtime(tsk); 215 | sig->min_flt += tsk->min_flt; 216 | sig->maj_flt += tsk->maj_flt; 217 | sig->nvcsw += tsk->nvcsw; 218 | sig->nivcsw += tsk->nivcsw; 219 | sig->inblock += task_io_get_inblock(tsk); 220 | sig->oublock += task_io_get_oublock(tsk); 221 | task_io_accounting_add(&sig->ioac, &tsk->ioac); 222 | sig->sum_sched_runtime += tsk->se.sum_exec_runtime; 223 | sig->nr_threads--; 224 | __unhash_process(tsk, group_dead); 225 | write_sequnlock(&sig->stats_lock); 226 | 227 | /* 228 | * Do this under ->siglock, we can race with another thread 229 | * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals. 230 | */ 231 | flush_sigqueue(&tsk->pending); 232 | tsk->sighand = NULL; 233 | spin_unlock(&sighand->siglock); 234 | 235 | __cleanup_sighand(sighand); 236 | clear_tsk_thread_flag(tsk, TIF_SIGPENDING); 237 | if (group_dead) { 238 | flush_sigqueue(&sig->shared_pending); 239 | tty_kref_put(tty); 240 | } 241 | } 242 | 243 | static void delayed_put_task_struct(struct rcu_head *rhp) 244 | { 245 | struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); 246 | 247 | perf_event_delayed_put(tsk); 248 | trace_sched_process_free(tsk); 249 | put_task_struct(tsk); 250 | } 251 | 252 | 253 | void release_task(struct task_struct *p) 254 | { 255 | struct task_struct *leader; 256 | int zap_leader; 257 | repeat: 258 | /* don't need to get the RCU readlock here - the process is dead and 259 | * can't be modifying its own credentials. 
But shut RCU-lockdep up */ 260 | rcu_read_lock(); 261 | atomic_dec(&__task_cred(p)->user->processes); 262 | rcu_read_unlock(); 263 | 264 | proc_flush_task(p); 265 | cgroup_release(p); 266 | 267 | write_lock_irq(&tasklist_lock); 268 | ptrace_release_task(p); 269 | __exit_signal(p); 270 | 271 | /* 272 | * If we are the last non-leader member of the thread 273 | * group, and the leader is zombie, then notify the 274 | * group leader's parent process. (if it wants notification.) 275 | */ 276 | zap_leader = 0; 277 | leader = p->group_leader; 278 | if (leader != p && thread_group_empty(leader) 279 | && leader->exit_state == EXIT_ZOMBIE) { 280 | /* 281 | * If we were the last child thread and the leader has 282 | * exited already, and the leader's parent ignores SIGCHLD, 283 | * then we are the one who should release the leader. 284 | */ 285 | zap_leader = do_notify_parent(leader, leader->exit_signal); 286 | if (zap_leader) 287 | leader->exit_state = EXIT_DEAD; 288 | } 289 | 290 | write_unlock_irq(&tasklist_lock); 291 | release_thread(p); 292 | call_rcu(&p->rcu, delayed_put_task_struct); 293 | 294 | p = leader; 295 | if (unlikely(zap_leader)) 296 | goto repeat; 297 | } 298 | 299 | /* 300 | * Note that if this function returns a valid task_struct pointer (!NULL) 301 | * task->usage must remain >0 for the duration of the RCU critical section. 302 | */ 303 | struct task_struct *task_rcu_dereference(struct task_struct **ptask) 304 | { 305 | struct sighand_struct *sighand; 306 | struct task_struct *task; 307 | 308 | /* 309 | * We need to verify that release_task() was not called and thus 310 | * delayed_put_task_struct() can't run and drop the last reference 311 | * before rcu_read_unlock(). We check task->sighand != NULL, 312 | * but we can read the already freed and reused memory. 
313 | */ 314 | retry: 315 | task = rcu_dereference(*ptask); 316 | if (!task) 317 | return NULL; 318 | 319 | probe_kernel_address(&task->sighand, sighand); 320 | 321 | /* 322 | * Pairs with atomic_dec_and_test() in put_task_struct(). If this task 323 | * was already freed we can not miss the preceding update of this 324 | * pointer. 325 | */ 326 | smp_rmb(); 327 | if (unlikely(task != READ_ONCE(*ptask))) 328 | goto retry; 329 | 330 | /* 331 | * We've re-checked that "task == *ptask", now we have two different 332 | * cases: 333 | * 334 | * 1. This is actually the same task/task_struct. In this case 335 | * sighand != NULL tells us it is still alive. 336 | * 337 | * 2. This is another task which got the same memory for task_struct. 338 | * We can't know this of course, and we can not trust 339 | * sighand != NULL. 340 | * 341 | * In this case we actually return a random value, but this is 342 | * correct. 343 | * 344 | * If we return NULL - we can pretend that we actually noticed that 345 | * *ptask was updated when the previous task has exited. Or pretend 346 | * that probe_slab_address(&sighand) reads NULL. 347 | * 348 | * If we return the new task (because sighand is not NULL for any 349 | * reason) - this is fine too. This (new) task can't go away before 350 | * another gp pass. 351 | * 352 | * And note: We could even eliminate the false positive if re-read 353 | * task->sighand once again to avoid the falsely NULL. But this case 354 | * is very unlikely so we don't care. 355 | */ 356 | if (!sighand) 357 | return NULL; 358 | 359 | return task; 360 | } 361 | 362 | void rcuwait_wake_up(struct rcuwait *w) 363 | { 364 | struct task_struct *task; 365 | 366 | rcu_read_lock(); 367 | 368 | /* 369 | * Order condition vs @task, such that everything prior to the load 370 | * of @task is visible. This is the condition as to why the user called 371 | * rcuwait_trywake() in the first place. Pairs with set_current_state() 372 | * barrier (A) in rcuwait_wait_event(). 
373 | * 374 | * WAIT WAKE 375 | * [S] tsk = current [S] cond = true 376 | * MB (A) MB (B) 377 | * [L] cond [L] tsk 378 | */ 379 | smp_mb(); /* (B) */ 380 | 381 | /* 382 | * Avoid using task_rcu_dereference() magic as long as we are careful, 383 | * see comment in rcuwait_wait_event() regarding ->exit_state. 384 | */ 385 | task = rcu_dereference(w->task); 386 | if (task) 387 | wake_up_process(task); 388 | rcu_read_unlock(); 389 | } 390 | 391 | /* 392 | * Determine if a process group is "orphaned", according to the POSIX 393 | * definition in 2.2.2.52. Orphaned process groups are not to be affected 394 | * by terminal-generated stop signals. Newly orphaned process groups are 395 | * to receive a SIGHUP and a SIGCONT. 396 | * 397 | * "I ask you, have you ever known what it is to be an orphan?" 398 | */ 399 | static int will_become_orphaned_pgrp(struct pid *pgrp, 400 | struct task_struct *ignored_task) 401 | { 402 | struct task_struct *p; 403 | 404 | do_each_pid_task(pgrp, PIDTYPE_PGID, p) { 405 | if ((p == ignored_task) || 406 | (p->exit_state && thread_group_empty(p)) || 407 | is_global_init(p->real_parent)) 408 | continue; 409 | 410 | if (task_pgrp(p->real_parent) != pgrp && 411 | task_session(p->real_parent) == task_session(p)) 412 | return 0; 413 | } while_each_pid_task(pgrp, PIDTYPE_PGID, p); 414 | 415 | return 1; 416 | } 417 | 418 | int is_current_pgrp_orphaned(void) 419 | { 420 | int retval; 421 | 422 | read_lock(&tasklist_lock); 423 | retval = will_become_orphaned_pgrp(task_pgrp(current), NULL); 424 | read_unlock(&tasklist_lock); 425 | 426 | return retval; 427 | } 428 | 429 | static bool has_stopped_jobs(struct pid *pgrp) 430 | { 431 | struct task_struct *p; 432 | 433 | do_each_pid_task(pgrp, PIDTYPE_PGID, p) { 434 | if (p->signal->flags & SIGNAL_STOP_STOPPED) 435 | return true; 436 | } while_each_pid_task(pgrp, PIDTYPE_PGID, p); 437 | 438 | return false; 439 | } 440 | 441 | /* 442 | * Check to see if any process groups have become orphaned as 443 | * a result 
of our exiting, and if they have any stopped jobs, 444 | * send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2) 445 | */ 446 | static void 447 | kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent) 448 | { 449 | struct pid *pgrp = task_pgrp(tsk); 450 | struct task_struct *ignored_task = tsk; 451 | 452 | if (!parent) 453 | /* exit: our father is in a different pgrp than 454 | * we are and we were the only connection outside. 455 | */ 456 | parent = tsk->real_parent; 457 | else 458 | /* reparent: our child is in a different pgrp than 459 | * we are, and it was the only connection outside. 460 | */ 461 | ignored_task = NULL; 462 | 463 | if (task_pgrp(parent) != pgrp && 464 | task_session(parent) == task_session(tsk) && 465 | will_become_orphaned_pgrp(pgrp, ignored_task) && 466 | has_stopped_jobs(pgrp)) { 467 | __kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp); 468 | __kill_pgrp_info(SIGCONT, SEND_SIG_PRIV, pgrp); 469 | } 470 | } 471 | 472 | #ifdef CONFIG_MEMCG 473 | /* 474 | * A task is exiting. If it owned this mm, find a new owner for the mm. 475 | */ 476 | void mm_update_next_owner(struct mm_struct *mm) 477 | { 478 | struct task_struct *c, *g, *p = current; 479 | 480 | retry: 481 | /* 482 | * If the exiting or execing task is not the owner, it's 483 | * someone else's problem. 484 | */ 485 | if (mm->owner != p) 486 | return; 487 | /* 488 | * The current owner is exiting/execing and there are no other 489 | * candidates. Do not leave the mm pointing to a possibly 490 | * freed task structure. 
491 | */ 492 | if (atomic_read(&mm->mm_users) <= 1) { 493 | WRITE_ONCE(mm->owner, NULL); 494 | return; 495 | } 496 | 497 | read_lock(&tasklist_lock); 498 | /* 499 | * Search in the children 500 | */ 501 | list_for_each_entry(c, &p->children, sibling) { 502 | if (c->mm == mm) 503 | goto assign_new_owner; 504 | } 505 | 506 | /* 507 | * Search in the siblings 508 | */ 509 | list_for_each_entry(c, &p->real_parent->children, sibling) { 510 | if (c->mm == mm) 511 | goto assign_new_owner; 512 | } 513 | 514 | /* 515 | * Search through everything else, we should not get here often. 516 | */ 517 | for_each_process(g) { 518 | if (g->flags & PF_KTHREAD) 519 | continue; 520 | for_each_thread(g, c) { 521 | if (c->mm == mm) 522 | goto assign_new_owner; 523 | if (c->mm) 524 | break; 525 | } 526 | } 527 | read_unlock(&tasklist_lock); 528 | /* 529 | * We found no owner yet mm_users > 1: this implies that we are 530 | * most likely racing with swapoff (try_to_unuse()) or /proc or 531 | * ptrace or page migration (get_task_mm()). Mark owner as NULL. 532 | */ 533 | WRITE_ONCE(mm->owner, NULL); 534 | return; 535 | 536 | assign_new_owner: 537 | BUG_ON(c == p); 538 | get_task_struct(c); 539 | /* 540 | * The task_lock protects c->mm from changing. 541 | * We always want mm->owner->mm == mm 542 | */ 543 | task_lock(c); 544 | /* 545 | * Delay read_unlock() till we have the task_lock() 546 | * to ensure that c does not slip away underneath us 547 | */ 548 | read_unlock(&tasklist_lock); 549 | if (c->mm != mm) { 550 | task_unlock(c); 551 | put_task_struct(c); 552 | goto retry; 553 | } 554 | WRITE_ONCE(mm->owner, c); 555 | task_unlock(c); 556 | put_task_struct(c); 557 | } 558 | #endif /* CONFIG_MEMCG */ 559 | 560 | /* 561 | * Turn us into a lazy TLB process if we 562 | * aren't already.. 
563 | */ 564 | static void exit_mm(void) 565 | { 566 | struct mm_struct *mm = current->mm; 567 | struct core_state *core_state; 568 | 569 | mm_release(current, mm); 570 | if (!mm) 571 | return; 572 | sync_mm_rss(mm); 573 | /* 574 | * Serialize with any possible pending coredump. 575 | * We must hold mmap_sem around checking core_state 576 | * and clearing tsk->mm. The core-inducing thread 577 | * will increment ->nr_threads for each thread in the 578 | * group with ->mm != NULL. 579 | */ 580 | down_read(&mm->mmap_sem); 581 | core_state = mm->core_state; 582 | if (core_state) { 583 | struct core_thread self; 584 | 585 | up_read(&mm->mmap_sem); 586 | 587 | self.task = current; 588 | self.next = xchg(&core_state->dumper.next, &self); 589 | /* 590 | * Implies mb(), the result of xchg() must be visible 591 | * to core_state->dumper. 592 | */ 593 | if (atomic_dec_and_test(&core_state->nr_threads)) 594 | complete(&core_state->startup); 595 | 596 | for (;;) { 597 | set_current_state(TASK_UNINTERRUPTIBLE); 598 | if (!self.task) /* see coredump_finish() */ 599 | break; 600 | freezable_schedule(); 601 | } 602 | __set_current_state(TASK_RUNNING); 603 | down_read(&mm->mmap_sem); 604 | } 605 | mmgrab(mm); 606 | BUG_ON(mm != current->active_mm); 607 | /* more a memory barrier than a real lock */ 608 | task_lock(current); 609 | current->mm = NULL; 610 | up_read(&mm->mmap_sem); 611 | enter_lazy_tlb(mm, current); 612 | task_unlock(current); 613 | mm_update_next_owner(mm); 614 | mmput(mm); 615 | if (test_thread_flag(TIF_MEMDIE)) 616 | exit_oom_victim(); 617 | } 618 | 619 | static struct task_struct *find_alive_thread(struct task_struct *p) 620 | { 621 | struct task_struct *t; 622 | 623 | for_each_thread(p, t) { 624 | if (!(t->flags & PF_EXITING)) 625 | return t; 626 | } 627 | return NULL; 628 | } 629 | 630 | static struct task_struct *find_child_reaper(struct task_struct *father, 631 | struct list_head *dead) 632 | __releases(&tasklist_lock) 633 | __acquires(&tasklist_lock) 634 | { 
635 | struct pid_namespace *pid_ns = task_active_pid_ns(father); 636 | struct task_struct *reaper = pid_ns->child_reaper; 637 | struct task_struct *p, *n; 638 | 639 | if (likely(reaper != father)) 640 | return reaper; 641 | 642 | reaper = find_alive_thread(father); 643 | if (reaper) { 644 | pid_ns->child_reaper = reaper; 645 | return reaper; 646 | } 647 | 648 | write_unlock_irq(&tasklist_lock); 649 | if (unlikely(pid_ns == &init_pid_ns)) { 650 | panic("Attempted to kill init! exitcode=0x%08x\n", 651 | father->signal->group_exit_code ?: father->exit_code); 652 | } 653 | 654 | list_for_each_entry_safe(p, n, dead, ptrace_entry) { 655 | list_del_init(&p->ptrace_entry); 656 | release_task(p); 657 | } 658 | 659 | zap_pid_ns_processes(pid_ns); 660 | write_lock_irq(&tasklist_lock); 661 | 662 | return father; 663 | } 664 | 665 | /* 666 | * When we die, we re-parent all our children, and try to: 667 | * 1. give them to another thread in our thread group, if such a member exists 668 | * 2. give it to the first ancestor process which prctl'd itself as a 669 | * child_subreaper for its children (like a service manager) 670 | * 3. give it to the init process (PID 1) in our pid namespace 671 | */ 672 | static struct task_struct *find_new_reaper(struct task_struct *father, 673 | struct task_struct *child_reaper) 674 | { 675 | struct task_struct *thread, *reaper; 676 | 677 | thread = find_alive_thread(father); 678 | if (thread) 679 | return thread; 680 | 681 | if (father->signal->has_child_subreaper) { 682 | unsigned int ns_level = task_pid(father)->level; 683 | /* 684 | * Find the first ->is_child_subreaper ancestor in our pid_ns. 685 | * We can't check reaper != child_reaper to ensure we do not 686 | * cross the namespaces, the exiting parent could be injected 687 | * by setns() + fork(). 688 | * We check pid->level, this is slightly more efficient than 689 | * task_active_pid_ns(reaper) != task_active_pid_ns(father). 
690 | */ 691 | for (reaper = father->real_parent; 692 | task_pid(reaper)->level == ns_level; 693 | reaper = reaper->real_parent) { 694 | if (reaper == &init_task) 695 | break; 696 | if (!reaper->signal->is_child_subreaper) 697 | continue; 698 | thread = find_alive_thread(reaper); 699 | if (thread) 700 | return thread; 701 | } 702 | } 703 | 704 | return child_reaper; 705 | } 706 | 707 | /* 708 | * Any that need to be release_task'd are put on the @dead list. 709 | */ 710 | static void reparent_leader(struct task_struct *father, struct task_struct *p, 711 | struct list_head *dead) 712 | { 713 | if (unlikely(p->exit_state == EXIT_DEAD)) 714 | return; 715 | 716 | /* We don't want people slaying init. */ 717 | p->exit_signal = SIGCHLD; 718 | 719 | /* If it has exited notify the new parent about this child's death. */ 720 | if (!p->ptrace && 721 | p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) { 722 | if (do_notify_parent(p, p->exit_signal)) { 723 | p->exit_state = EXIT_DEAD; 724 | list_add(&p->ptrace_entry, dead); 725 | } 726 | } 727 | 728 | kill_orphaned_pgrp(p, father); 729 | } 730 | 731 | /* 732 | * This does two things: 733 | * 734 | * A. Make init inherit all the child processes 735 | * B. Check to see if any process groups have become orphaned 736 | * as a result of our exiting, and if they have any stopped 737 | * jobs, send them a SIGHUP and then a SIGCONT. 
(POSIX 3.2.2.2) 738 | */ 739 | static void forget_original_parent(struct task_struct *father, 740 | struct list_head *dead) 741 | { 742 | struct task_struct *p, *t, *reaper; 743 | 744 | if (unlikely(!list_empty(&father->ptraced))) 745 | exit_ptrace(father, dead); 746 | 747 | /* Can drop and reacquire tasklist_lock */ 748 | reaper = find_child_reaper(father, dead); 749 | if (list_empty(&father->children)) 750 | return; 751 | 752 | reaper = find_new_reaper(father, reaper); 753 | list_for_each_entry(p, &father->children, sibling) { 754 | for_each_thread(p, t) { 755 | t->real_parent = reaper; 756 | BUG_ON((!t->ptrace) != (t->parent == father)); 757 | if (likely(!t->ptrace)) 758 | t->parent = t->real_parent; 759 | if (t->pdeath_signal) 760 | group_send_sig_info(t->pdeath_signal, 761 | SEND_SIG_NOINFO, t, 762 | PIDTYPE_TGID); 763 | } 764 | /* 765 | * If this is a threaded reparent there is no need to 766 | * notify anyone anything has happened. 767 | */ 768 | if (!same_thread_group(reaper, father)) 769 | reparent_leader(father, p, dead); 770 | } 771 | list_splice_tail_init(&father->children, &reaper->children); 772 | } 773 | 774 | /* 775 | * Send signals to all our closest relatives so that they know 776 | * to properly mourn us.. 777 | */ 778 | static void exit_notify(struct task_struct *tsk, int group_dead) 779 | { 780 | bool autoreap; 781 | struct task_struct *p, *n; 782 | LIST_HEAD(dead); 783 | 784 | write_lock_irq(&tasklist_lock); 785 | forget_original_parent(tsk, &dead); 786 | 787 | if (group_dead) 788 | kill_orphaned_pgrp(tsk->group_leader, NULL); 789 | 790 | if (unlikely(tsk->ptrace)) { 791 | int sig = thread_group_leader(tsk) && 792 | thread_group_empty(tsk) && 793 | !ptrace_reparented(tsk) ? 
794 | tsk->exit_signal : SIGCHLD; 795 | autoreap = do_notify_parent(tsk, sig); 796 | } else if (thread_group_leader(tsk)) { 797 | autoreap = thread_group_empty(tsk) && 798 | do_notify_parent(tsk, tsk->exit_signal); 799 | } else { 800 | autoreap = true; 801 | } 802 | 803 | tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE; 804 | if (tsk->exit_state == EXIT_DEAD) 805 | list_add(&tsk->ptrace_entry, &dead); 806 | 807 | /* mt-exec, de_thread() is waiting for group leader */ 808 | if (unlikely(tsk->signal->notify_count < 0)) 809 | wake_up_process(tsk->signal->group_exit_task); 810 | write_unlock_irq(&tasklist_lock); 811 | 812 | list_for_each_entry_safe(p, n, &dead, ptrace_entry) { 813 | list_del_init(&p->ptrace_entry); 814 | release_task(p); 815 | } 816 | } 817 | 818 | #ifdef CONFIG_DEBUG_STACK_USAGE 819 | static void check_stack_usage(void) 820 | { 821 | static DEFINE_SPINLOCK(low_water_lock); 822 | static int lowest_to_date = THREAD_SIZE; 823 | unsigned long free; 824 | 825 | free = stack_not_used(current); 826 | 827 | if (free >= lowest_to_date) 828 | return; 829 | 830 | spin_lock(&low_water_lock); 831 | if (free < lowest_to_date) { 832 | pr_info("%s (%d) used greatest stack depth: %lu bytes left\n", 833 | current->comm, task_pid_nr(current), free); 834 | lowest_to_date = free; 835 | } 836 | spin_unlock(&low_water_lock); 837 | } 838 | #else 839 | static inline void check_stack_usage(void) {} 840 | #endif 841 | 842 | void __noreturn do_exit(long code) 843 | { 844 | struct task_struct *tsk = current; 845 | int group_dead; 846 | 847 | profile_task_exit(tsk); 848 | kcov_task_exit(tsk); 849 | 850 | WARN_ON(blk_needs_flush_plug(tsk)); 851 | 852 | if (unlikely(in_interrupt())) 853 | panic("Aiee, killing interrupt handler!"); 854 | if (unlikely(!tsk->pid)) 855 | panic("Attempted to kill the idle task!"); 856 | 857 | /* 858 | * If do_exit is called because this processes oopsed, it's possible 859 | * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before 860 | * 
continuing. Amongst other possible reasons, this is to prevent 861 | * mm_release()->clear_child_tid() from writing to a user-controlled 862 | * kernel address. 863 | */ 864 | set_fs(USER_DS); 865 | 866 | ptrace_event(PTRACE_EVENT_EXIT, code); 867 | 868 | validate_creds_for_do_exit(tsk); 869 | 870 | /* 871 | * We're taking recursive faults here in do_exit. Safest is to just 872 | * leave this task alone and wait for reboot. 873 | */ 874 | if (unlikely(tsk->flags & PF_EXITING)) { 875 | pr_alert("Fixing recursive fault but reboot is needed!\n"); 876 | /* 877 | * We can do this unlocked here. The futex code uses 878 | * this flag just to verify whether the pi state 879 | * cleanup has been done or not. In the worst case it 880 | * loops once more. We pretend that the cleanup was 881 | * done as there is no way to return. Either the 882 | * OWNER_DIED bit is set by now or we push the blocked 883 | * task into the wait for ever nirwana as well. 884 | */ 885 | tsk->flags |= PF_EXITPIDONE; 886 | set_current_state(TASK_UNINTERRUPTIBLE); 887 | schedule(); 888 | } 889 | 890 | exit_signals(tsk); /* sets PF_EXITING */ 891 | /* 892 | * Ensure that all new tsk->pi_lock acquisitions must observe 893 | * PF_EXITING. Serializes against futex.c:attach_to_pi_owner(). 894 | */ 895 | smp_mb(); 896 | /* 897 | * Ensure that we must observe the pi_state in exit_mm() -> 898 | * mm_release() -> exit_pi_state_list(). 
899 | */ 900 | raw_spin_lock_irq(&tsk->pi_lock); 901 | raw_spin_unlock_irq(&tsk->pi_lock); 902 | 903 | if (unlikely(in_atomic())) { 904 | pr_info("note: %s[%d] exited with preempt_count %d\n", 905 | current->comm, task_pid_nr(current), 906 | preempt_count()); 907 | preempt_count_set(PREEMPT_ENABLED); 908 | } 909 | 910 | /* sync mm's RSS info before statistics gathering */ 911 | if (tsk->mm) 912 | sync_mm_rss(tsk->mm); 913 | acct_update_integrals(tsk); 914 | group_dead = atomic_dec_and_test(&tsk->signal->live); 915 | if (group_dead) { 916 | #ifdef CONFIG_POSIX_TIMERS 917 | hrtimer_cancel(&tsk->signal->real_timer); 918 | exit_itimers(tsk->signal); 919 | #endif 920 | if (tsk->mm) 921 | setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm); 922 | } 923 | acct_collect(code, group_dead); 924 | if (group_dead) 925 | tty_audit_exit(); 926 | audit_free(tsk); 927 | 928 | tsk->exit_code = code; 929 | taskstats_exit(tsk, group_dead); 930 | 931 | exit_mm(); 932 | 933 | if (group_dead) 934 | acct_process(); 935 | trace_sched_process_exit(tsk); 936 | 937 | exit_sem(tsk); 938 | exit_shm(tsk); 939 | exit_files(tsk); 940 | exit_fs(tsk); 941 | if (group_dead) 942 | disassociate_ctty(1); 943 | exit_task_namespaces(tsk); 944 | exit_task_work(tsk); 945 | exit_thread(tsk); 946 | exit_umh(tsk); 947 | 948 | /* 949 | * Flush inherited counters to the parent - before the parent 950 | * gets woken up by child-exit notifications. 
951 | * 952 | * because of cgroup mode, must be called before cgroup_exit() 953 | */ 954 | perf_event_exit_task(tsk); 955 | 956 | sched_autogroup_exit_task(tsk); 957 | cgroup_exit(tsk); 958 | 959 | /* 960 | * FIXME: do that only when needed, using sched_exit tracepoint 961 | */ 962 | flush_ptrace_hw_breakpoint(tsk); 963 | 964 | exit_tasks_rcu_start(); 965 | exit_notify(tsk, group_dead); 966 | proc_exit_connector(tsk); 967 | mpol_put_task_policy(tsk); 968 | #ifdef CONFIG_FUTEX 969 | if (unlikely(current->pi_state_cache)) 970 | kfree(current->pi_state_cache); 971 | #endif 972 | /* 973 | * Make sure we are holding no locks: 974 | */ 975 | debug_check_no_locks_held(); 976 | /* 977 | * We can do this unlocked here. The futex code uses this flag 978 | * just to verify whether the pi state cleanup has been done 979 | * or not. In the worst case it loops once more. 980 | */ 981 | tsk->flags |= PF_EXITPIDONE; 982 | 983 | if (tsk->io_context) 984 | exit_io_context(tsk); 985 | 986 | if (tsk->splice_pipe) 987 | free_pipe_info(tsk->splice_pipe); 988 | 989 | if (tsk->task_frag.page) 990 | put_page(tsk->task_frag.page); 991 | 992 | validate_creds_for_do_exit(tsk); 993 | 994 | check_stack_usage(); 995 | preempt_disable(); 996 | if (tsk->nr_dirtied) 997 | __this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied); 998 | exit_rcu(); 999 | exit_tasks_rcu_finish(); 1000 | 1001 | lockdep_free_task(tsk); 1002 | do_task_dead(); 1003 | } 1004 | EXPORT_SYMBOL_GPL(do_exit); 1005 | 1006 | void complete_and_exit(struct completion *comp, long code) 1007 | { 1008 | if (comp) 1009 | complete(comp); 1010 | 1011 | do_exit(code); 1012 | } 1013 | EXPORT_SYMBOL(complete_and_exit); 1014 | 1015 | SYSCALL_DEFINE1(exit, int, error_code) 1016 | { 1017 | do_exit((error_code&0xff)<<8); 1018 | } 1019 | 1020 | /* 1021 | * Take down every thread in the group. This is called by fatal signals 1022 | * as well as by sys_exit_group (below). 
1023 | */ 1024 | void 1025 | do_group_exit(int exit_code) 1026 | { 1027 | struct signal_struct *sig = current->signal; 1028 | 1029 | BUG_ON(exit_code & 0x80); /* core dumps don't get here */ 1030 | 1031 | if (signal_group_exit(sig)) 1032 | exit_code = sig->group_exit_code; 1033 | else if (!thread_group_empty(current)) { 1034 | struct sighand_struct *const sighand = current->sighand; 1035 | 1036 | spin_lock_irq(&sighand->siglock); 1037 | if (signal_group_exit(sig)) 1038 | /* Another thread got here before we took the lock. */ 1039 | exit_code = sig->group_exit_code; 1040 | else { 1041 | sig->group_exit_code = exit_code; 1042 | sig->flags = SIGNAL_GROUP_EXIT; 1043 | zap_other_threads(current); 1044 | } 1045 | spin_unlock_irq(&sighand->siglock); 1046 | } 1047 | 1048 | do_exit(exit_code); 1049 | /* NOTREACHED */ 1050 | } 1051 | 1052 | /* 1053 | * this kills every thread in the thread group. Note that any externally 1054 | * wait4()-ing process will get the correct exit code - even if this 1055 | * thread is not the thread group leader. 
1056 | */ 1057 | SYSCALL_DEFINE1(exit_group, int, error_code) 1058 | { 1059 | do_group_exit((error_code & 0xff) << 8); 1060 | /* NOTREACHED */ 1061 | return 0; 1062 | } 1063 | 1064 | struct waitid_info { 1065 | pid_t pid; 1066 | uid_t uid; 1067 | int status; 1068 | int cause; 1069 | }; 1070 | 1071 | struct wait_opts { 1072 | enum pid_type wo_type; 1073 | int wo_flags; 1074 | struct pid *wo_pid; 1075 | 1076 | struct waitid_info *wo_info; 1077 | int wo_stat; 1078 | struct rusage *wo_rusage; 1079 | 1080 | wait_queue_entry_t child_wait; 1081 | int notask_error; 1082 | }; 1083 | 1084 | static int eligible_pid(struct wait_opts *wo, struct task_struct *p) 1085 | { 1086 | return wo->wo_type == PIDTYPE_MAX || 1087 | task_pid_type(p, wo->wo_type) == wo->wo_pid; 1088 | } 1089 | 1090 | static int 1091 | eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p) 1092 | { 1093 | if (!eligible_pid(wo, p)) 1094 | return 0; 1095 | 1096 | /* 1097 | * Wait for all children (clone and not) if __WALL is set or 1098 | * if it is traced by us. 1099 | */ 1100 | if (ptrace || (wo->wo_flags & __WALL)) 1101 | return 1; 1102 | 1103 | /* 1104 | * Otherwise, wait for clone children *only* if __WCLONE is set; 1105 | * otherwise, wait for non-clone children *only*. 1106 | * 1107 | * Note: a "clone" child here is one that reports to its parent 1108 | * using a signal other than SIGCHLD, or a non-leader thread which 1109 | * we can only see if it is traced by us. 1110 | */ 1111 | if ((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE)) 1112 | return 0; 1113 | 1114 | return 1; 1115 | } 1116 | 1117 | /* 1118 | * Handle sys_wait4 work for one task in state EXIT_ZOMBIE. We hold 1119 | * read_lock(&tasklist_lock) on entry. If we return zero, we still hold 1120 | * the lock and this task is uninteresting. If we return nonzero, we have 1121 | * released the lock and the system call should return. 
1122 | */ 1123 | static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) 1124 | { 1125 | int state, status; 1126 | pid_t pid = task_pid_vnr(p); 1127 | uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p)); 1128 | struct waitid_info *infop; 1129 | 1130 | if (!likely(wo->wo_flags & WEXITED)) 1131 | return 0; 1132 | 1133 | if (unlikely(wo->wo_flags & WNOWAIT)) { 1134 | status = p->exit_code; 1135 | get_task_struct(p); 1136 | read_unlock(&tasklist_lock); 1137 | sched_annotate_sleep(); 1138 | if (wo->wo_rusage) 1139 | getrusage(p, RUSAGE_BOTH, wo->wo_rusage); 1140 | put_task_struct(p); 1141 | goto out_info; 1142 | } 1143 | /* 1144 | * Move the task's state to DEAD/TRACE, only one thread can do this. 1145 | */ 1146 | state = (ptrace_reparented(p) && thread_group_leader(p)) ? 1147 | EXIT_TRACE : EXIT_DEAD; 1148 | if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE) 1149 | return 0; 1150 | /* 1151 | * We own this thread, nobody else can reap it. 1152 | */ 1153 | read_unlock(&tasklist_lock); 1154 | sched_annotate_sleep(); 1155 | 1156 | /* 1157 | * Check thread_group_leader() to exclude the traced sub-threads. 1158 | */ 1159 | if (state == EXIT_DEAD && thread_group_leader(p)) { 1160 | struct signal_struct *sig = p->signal; 1161 | struct signal_struct *psig = current->signal; 1162 | unsigned long maxrss; 1163 | u64 tgutime, tgstime; 1164 | 1165 | /* 1166 | * The resource counters for the group leader are in its 1167 | * own task_struct. Those for dead threads in the group 1168 | * are in its signal_struct, as are those for the child 1169 | * processes it has previously reaped. All these 1170 | * accumulate in the parent's signal_struct c* fields. 1171 | * 1172 | * We don't bother to take a lock here to protect these 1173 | * p->signal fields because the whole thread group is dead 1174 | * and nobody can change them. 
1175 | * 1176 | * psig->stats_lock also protects us from our sub-theads 1177 | * which can reap other children at the same time. Until 1178 | * we change k_getrusage()-like users to rely on this lock 1179 | * we have to take ->siglock as well. 1180 | * 1181 | * We use thread_group_cputime_adjusted() to get times for 1182 | * the thread group, which consolidates times for all threads 1183 | * in the group including the group leader. 1184 | */ 1185 | thread_group_cputime_adjusted(p, &tgutime, &tgstime); 1186 | spin_lock_irq(¤t->sighand->siglock); 1187 | write_seqlock(&psig->stats_lock); 1188 | psig->cutime += tgutime + sig->cutime; 1189 | psig->cstime += tgstime + sig->cstime; 1190 | psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime; 1191 | psig->cmin_flt += 1192 | p->min_flt + sig->min_flt + sig->cmin_flt; 1193 | psig->cmaj_flt += 1194 | p->maj_flt + sig->maj_flt + sig->cmaj_flt; 1195 | psig->cnvcsw += 1196 | p->nvcsw + sig->nvcsw + sig->cnvcsw; 1197 | psig->cnivcsw += 1198 | p->nivcsw + sig->nivcsw + sig->cnivcsw; 1199 | psig->cinblock += 1200 | task_io_get_inblock(p) + 1201 | sig->inblock + sig->cinblock; 1202 | psig->coublock += 1203 | task_io_get_oublock(p) + 1204 | sig->oublock + sig->coublock; 1205 | maxrss = max(sig->maxrss, sig->cmaxrss); 1206 | if (psig->cmaxrss < maxrss) 1207 | psig->cmaxrss = maxrss; 1208 | task_io_accounting_add(&psig->ioac, &p->ioac); 1209 | task_io_accounting_add(&psig->ioac, &sig->ioac); 1210 | write_sequnlock(&psig->stats_lock); 1211 | spin_unlock_irq(¤t->sighand->siglock); 1212 | } 1213 | 1214 | if (wo->wo_rusage) 1215 | getrusage(p, RUSAGE_BOTH, wo->wo_rusage); 1216 | status = (p->signal->flags & SIGNAL_GROUP_EXIT) 1217 | ? 
p->signal->group_exit_code : p->exit_code; 1218 | wo->wo_stat = status; 1219 | 1220 | if (state == EXIT_TRACE) { 1221 | write_lock_irq(&tasklist_lock); 1222 | /* We dropped tasklist, ptracer could die and untrace */ 1223 | ptrace_unlink(p); 1224 | 1225 | /* If parent wants a zombie, don't release it now */ 1226 | state = EXIT_ZOMBIE; 1227 | if (do_notify_parent(p, p->exit_signal)) 1228 | state = EXIT_DEAD; 1229 | p->exit_state = state; 1230 | write_unlock_irq(&tasklist_lock); 1231 | } 1232 | if (state == EXIT_DEAD) 1233 | release_task(p); 1234 | 1235 | out_info: 1236 | infop = wo->wo_info; 1237 | if (infop) { 1238 | if ((status & 0x7f) == 0) { 1239 | infop->cause = CLD_EXITED; 1240 | infop->status = status >> 8; 1241 | } else { 1242 | infop->cause = (status & 0x80) ? CLD_DUMPED : CLD_KILLED; 1243 | infop->status = status & 0x7f; 1244 | } 1245 | infop->pid = pid; 1246 | infop->uid = uid; 1247 | } 1248 | 1249 | return pid; 1250 | } 1251 | 1252 | static int *task_stopped_code(struct task_struct *p, bool ptrace) 1253 | { 1254 | if (ptrace) { 1255 | if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING)) 1256 | return &p->exit_code; 1257 | } else { 1258 | if (p->signal->flags & SIGNAL_STOP_STOPPED) 1259 | return &p->signal->group_exit_code; 1260 | } 1261 | return NULL; 1262 | } 1263 | 1264 | /** 1265 | * wait_task_stopped - Wait for %TASK_STOPPED or %TASK_TRACED 1266 | * @wo: wait options 1267 | * @ptrace: is the wait for ptrace 1268 | * @p: task to wait for 1269 | * 1270 | * Handle sys_wait4() work for %p in state %TASK_STOPPED or %TASK_TRACED. 1271 | * 1272 | * CONTEXT: 1273 | * read_lock(&tasklist_lock), which is released if return value is 1274 | * non-zero. Also, grabs and releases @p->sighand->siglock. 1275 | * 1276 | * RETURNS: 1277 | * 0 if wait condition didn't exist and search for other wait conditions 1278 | * should continue. 
Non-zero return, -errno on failure and @p's pid on 1279 | * success, implies that tasklist_lock is released and wait condition 1280 | * search should terminate. 1281 | */ 1282 | static int wait_task_stopped(struct wait_opts *wo, 1283 | int ptrace, struct task_struct *p) 1284 | { 1285 | struct waitid_info *infop; 1286 | int exit_code, *p_code, why; 1287 | uid_t uid = 0; /* unneeded, required by compiler */ 1288 | pid_t pid; 1289 | 1290 | /* 1291 | * Traditionally we see ptrace'd stopped tasks regardless of options. 1292 | */ 1293 | if (!ptrace && !(wo->wo_flags & WUNTRACED)) 1294 | return 0; 1295 | 1296 | if (!task_stopped_code(p, ptrace)) 1297 | return 0; 1298 | 1299 | exit_code = 0; 1300 | spin_lock_irq(&p->sighand->siglock); 1301 | 1302 | p_code = task_stopped_code(p, ptrace); 1303 | if (unlikely(!p_code)) 1304 | goto unlock_sig; 1305 | 1306 | exit_code = *p_code; 1307 | if (!exit_code) 1308 | goto unlock_sig; 1309 | 1310 | if (!unlikely(wo->wo_flags & WNOWAIT)) 1311 | *p_code = 0; 1312 | 1313 | uid = from_kuid_munged(current_user_ns(), task_uid(p)); 1314 | unlock_sig: 1315 | spin_unlock_irq(&p->sighand->siglock); 1316 | if (!exit_code) 1317 | return 0; 1318 | 1319 | /* 1320 | * Now we are pretty sure this task is interesting. 1321 | * Make sure it doesn't get reaped out from under us while we 1322 | * give up the lock and then examine it below. We don't want to 1323 | * keep holding onto the tasklist_lock while we call getrusage and 1324 | * possibly take page faults for user memory. 1325 | */ 1326 | get_task_struct(p); 1327 | pid = task_pid_vnr(p); 1328 | why = ptrace ? 
CLD_TRAPPED : CLD_STOPPED; 1329 | read_unlock(&tasklist_lock); 1330 | sched_annotate_sleep(); 1331 | if (wo->wo_rusage) 1332 | getrusage(p, RUSAGE_BOTH, wo->wo_rusage); 1333 | put_task_struct(p); 1334 | 1335 | if (likely(!(wo->wo_flags & WNOWAIT))) 1336 | wo->wo_stat = (exit_code << 8) | 0x7f; 1337 | 1338 | infop = wo->wo_info; 1339 | if (infop) { 1340 | infop->cause = why; 1341 | infop->status = exit_code; 1342 | infop->pid = pid; 1343 | infop->uid = uid; 1344 | } 1345 | return pid; 1346 | } 1347 | 1348 | /* 1349 | * Handle do_wait work for one task in a live, non-stopped state. 1350 | * read_lock(&tasklist_lock) on entry. If we return zero, we still hold 1351 | * the lock and this task is uninteresting. If we return nonzero, we have 1352 | * released the lock and the system call should return. 1353 | */ 1354 | static int wait_task_continued(struct wait_opts *wo, struct task_struct *p) 1355 | { 1356 | struct waitid_info *infop; 1357 | pid_t pid; 1358 | uid_t uid; 1359 | 1360 | if (!unlikely(wo->wo_flags & WCONTINUED)) 1361 | return 0; 1362 | 1363 | if (!(p->signal->flags & SIGNAL_STOP_CONTINUED)) 1364 | return 0; 1365 | 1366 | spin_lock_irq(&p->sighand->siglock); 1367 | /* Re-check with the lock held. 
*/ 1368 | if (!(p->signal->flags & SIGNAL_STOP_CONTINUED)) { 1369 | spin_unlock_irq(&p->sighand->siglock); 1370 | return 0; 1371 | } 1372 | if (!unlikely(wo->wo_flags & WNOWAIT)) 1373 | p->signal->flags &= ~SIGNAL_STOP_CONTINUED; 1374 | uid = from_kuid_munged(current_user_ns(), task_uid(p)); 1375 | spin_unlock_irq(&p->sighand->siglock); 1376 | 1377 | pid = task_pid_vnr(p); 1378 | get_task_struct(p); 1379 | read_unlock(&tasklist_lock); 1380 | sched_annotate_sleep(); 1381 | if (wo->wo_rusage) 1382 | getrusage(p, RUSAGE_BOTH, wo->wo_rusage); 1383 | put_task_struct(p); 1384 | 1385 | infop = wo->wo_info; 1386 | if (!infop) { 1387 | wo->wo_stat = 0xffff; 1388 | } else { 1389 | infop->cause = CLD_CONTINUED; 1390 | infop->pid = pid; 1391 | infop->uid = uid; 1392 | infop->status = SIGCONT; 1393 | } 1394 | return pid; 1395 | } 1396 | 1397 | /* 1398 | * Consider @p for a wait by @parent. 1399 | * 1400 | * -ECHILD should be in ->notask_error before the first call. 1401 | * Returns nonzero for a final return, when we have unlocked tasklist_lock. 1402 | * Returns zero if the search for a child should continue; 1403 | * then ->notask_error is 0 if @p is an eligible child, 1404 | * or still -ECHILD. 1405 | */ 1406 | static int wait_consider_task(struct wait_opts *wo, int ptrace, 1407 | struct task_struct *p) 1408 | { 1409 | /* 1410 | * We can race with wait_task_zombie() from another thread. 1411 | * Ensure that EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE transition 1412 | * can't confuse the checks below. 1413 | */ 1414 | int exit_state = READ_ONCE(p->exit_state); 1415 | int ret; 1416 | 1417 | if (unlikely(exit_state == EXIT_DEAD)) 1418 | return 0; 1419 | 1420 | ret = eligible_child(wo, ptrace, p); 1421 | if (!ret) 1422 | return ret; 1423 | 1424 | if (unlikely(exit_state == EXIT_TRACE)) { 1425 | /* 1426 | * ptrace == 0 means we are the natural parent. In this case 1427 | * we should clear notask_error, debugger will notify us. 
1428 | */ 1429 | if (likely(!ptrace)) 1430 | wo->notask_error = 0; 1431 | return 0; 1432 | } 1433 | 1434 | if (likely(!ptrace) && unlikely(p->ptrace)) { 1435 | /* 1436 | * If it is traced by its real parent's group, just pretend 1437 | * the caller is ptrace_do_wait() and reap this child if it 1438 | * is zombie. 1439 | * 1440 | * This also hides group stop state from real parent; otherwise 1441 | * a single stop can be reported twice as group and ptrace stop. 1442 | * If a ptracer wants to distinguish these two events for its 1443 | * own children it should create a separate process which takes 1444 | * the role of real parent. 1445 | */ 1446 | if (!ptrace_reparented(p)) 1447 | ptrace = 1; 1448 | } 1449 | 1450 | /* slay zombie? */ 1451 | if (exit_state == EXIT_ZOMBIE) { 1452 | /* we don't reap group leaders with subthreads */ 1453 | if (!delay_group_leader(p)) { 1454 | /* 1455 | * A zombie ptracee is only visible to its ptracer. 1456 | * Notification and reaping will be cascaded to the 1457 | * real parent when the ptracer detaches. 1458 | */ 1459 | if (unlikely(ptrace) || likely(!p->ptrace)) 1460 | return wait_task_zombie(wo, p); 1461 | } 1462 | 1463 | /* 1464 | * Allow access to stopped/continued state via zombie by 1465 | * falling through. Clearing of notask_error is complex. 1466 | * 1467 | * When !@ptrace: 1468 | * 1469 | * If WEXITED is set, notask_error should naturally be 1470 | * cleared. If not, subset of WSTOPPED|WCONTINUED is set, 1471 | * so, if there are live subthreads, there are events to 1472 | * wait for. If all subthreads are dead, it's still safe 1473 | * to clear - this function will be called again in finite 1474 | * amount time once all the subthreads are released and 1475 | * will then return without clearing. 1476 | * 1477 | * When @ptrace: 1478 | * 1479 | * Stopped state is per-task and thus can't change once the 1480 | * target task dies. Only continued and exited can happen. 1481 | * Clear notask_error if WCONTINUED | WEXITED. 
1482 | */ 1483 | if (likely(!ptrace) || (wo->wo_flags & (WCONTINUED | WEXITED))) 1484 | wo->notask_error = 0; 1485 | } else { 1486 | /* 1487 | * @p is alive and it's gonna stop, continue or exit, so 1488 | * there always is something to wait for. 1489 | */ 1490 | wo->notask_error = 0; 1491 | } 1492 | 1493 | /* 1494 | * Wait for stopped. Depending on @ptrace, different stopped state 1495 | * is used and the two don't interact with each other. 1496 | */ 1497 | ret = wait_task_stopped(wo, ptrace, p); 1498 | if (ret) 1499 | return ret; 1500 | 1501 | /* 1502 | * Wait for continued. There's only one continued state and the 1503 | * ptracer can consume it which can confuse the real parent. Don't 1504 | * use WCONTINUED from ptracer. You don't need or want it. 1505 | */ 1506 | return wait_task_continued(wo, p); 1507 | } 1508 | 1509 | /* 1510 | * Do the work of do_wait() for one thread in the group, @tsk. 1511 | * 1512 | * -ECHILD should be in ->notask_error before the first call. 1513 | * Returns nonzero for a final return, when we have unlocked tasklist_lock. 1514 | * Returns zero if the search for a child should continue; then 1515 | * ->notask_error is 0 if there were any eligible children, 1516 | * or still -ECHILD. 
1517 | */ 1518 | static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk) 1519 | { 1520 | struct task_struct *p; 1521 | 1522 | list_for_each_entry(p, &tsk->children, sibling) { 1523 | int ret = wait_consider_task(wo, 0, p); 1524 | 1525 | if (ret) 1526 | return ret; 1527 | } 1528 | 1529 | return 0; 1530 | } 1531 | 1532 | static int ptrace_do_wait(struct wait_opts *wo, struct task_struct *tsk) 1533 | { 1534 | struct task_struct *p; 1535 | 1536 | list_for_each_entry(p, &tsk->ptraced, ptrace_entry) { 1537 | int ret = wait_consider_task(wo, 1, p); 1538 | 1539 | if (ret) 1540 | return ret; 1541 | } 1542 | 1543 | return 0; 1544 | } 1545 | 1546 | static int child_wait_callback(wait_queue_entry_t *wait, unsigned mode, 1547 | int sync, void *key) 1548 | { 1549 | struct wait_opts *wo = container_of(wait, struct wait_opts, 1550 | child_wait); 1551 | struct task_struct *p = key; 1552 | 1553 | if (!eligible_pid(wo, p)) 1554 | return 0; 1555 | 1556 | if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent) 1557 | return 0; 1558 | 1559 | return default_wake_function(wait, mode, sync, key); 1560 | } 1561 | 1562 | void __wake_up_parent(struct task_struct *p, struct task_struct *parent) 1563 | { 1564 | __wake_up_sync_key(&parent->signal->wait_chldexit, 1565 | TASK_INTERRUPTIBLE, 1, p); 1566 | } 1567 | 1568 | static long do_wait(struct wait_opts *wo) 1569 | { 1570 | struct task_struct *tsk; 1571 | int retval; 1572 | 1573 | trace_sched_process_wait(wo->wo_pid); 1574 | 1575 | init_waitqueue_func_entry(&wo->child_wait, child_wait_callback); 1576 | wo->child_wait.private = current; 1577 | add_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait); 1578 | repeat: 1579 | /* 1580 | * If there is nothing that can match our criteria, just get out. 1581 | * We will clear ->notask_error to zero if we see any child that 1582 | * might later match our criteria, even if we are not able to reap 1583 | * it yet. 
1584 | */ 1585 | wo->notask_error = -ECHILD; 1586 | if ((wo->wo_type < PIDTYPE_MAX) && 1587 | (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type]))) 1588 | goto notask; 1589 | 1590 | set_current_state(TASK_INTERRUPTIBLE); 1591 | read_lock(&tasklist_lock); 1592 | tsk = current; 1593 | do { 1594 | retval = do_wait_thread(wo, tsk); 1595 | if (retval) 1596 | goto end; 1597 | 1598 | retval = ptrace_do_wait(wo, tsk); 1599 | if (retval) 1600 | goto end; 1601 | 1602 | if (wo->wo_flags & __WNOTHREAD) 1603 | break; 1604 | } while_each_thread(current, tsk); 1605 | read_unlock(&tasklist_lock); 1606 | 1607 | notask: 1608 | retval = wo->notask_error; 1609 | if (!retval && !(wo->wo_flags & WNOHANG)) { 1610 | retval = -ERESTARTSYS; 1611 | if (!signal_pending(current)) { 1612 | schedule(); 1613 | goto repeat; 1614 | } 1615 | } 1616 | end: 1617 | __set_current_state(TASK_RUNNING); 1618 | remove_wait_queue(¤t->signal->wait_chldexit, &wo->child_wait); 1619 | return retval; 1620 | } 1621 | 1622 | static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop, 1623 | int options, struct rusage *ru) 1624 | { 1625 | struct wait_opts wo; 1626 | struct pid *pid = NULL; 1627 | enum pid_type type; 1628 | long ret; 1629 | 1630 | if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED| 1631 | __WNOTHREAD|__WCLONE|__WALL)) 1632 | return -EINVAL; 1633 | if (!(options & (WEXITED|WSTOPPED|WCONTINUED))) 1634 | return -EINVAL; 1635 | 1636 | switch (which) { 1637 | case P_ALL: 1638 | type = PIDTYPE_MAX; 1639 | break; 1640 | case P_PID: 1641 | type = PIDTYPE_PID; 1642 | if (upid <= 0) 1643 | return -EINVAL; 1644 | break; 1645 | case P_PGID: 1646 | type = PIDTYPE_PGID; 1647 | if (upid <= 0) 1648 | return -EINVAL; 1649 | break; 1650 | default: 1651 | return -EINVAL; 1652 | } 1653 | 1654 | if (type < PIDTYPE_MAX) 1655 | pid = find_get_pid(upid); 1656 | 1657 | wo.wo_type = type; 1658 | wo.wo_pid = pid; 1659 | wo.wo_flags = options; 1660 | wo.wo_info = infop; 1661 | wo.wo_rusage = 
ru; 1662 | ret = do_wait(&wo); 1663 | 1664 | put_pid(pid); 1665 | return ret; 1666 | } 1667 | 1668 | SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *, 1669 | infop, int, options, struct rusage __user *, ru) 1670 | { 1671 | struct rusage r; 1672 | struct waitid_info info = {.status = 0}; 1673 | long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL); 1674 | int signo = 0; 1675 | 1676 | if (err > 0) { 1677 | signo = SIGCHLD; 1678 | err = 0; 1679 | if (ru && copy_to_user(ru, &r, sizeof(struct rusage))) 1680 | return -EFAULT; 1681 | } 1682 | if (!infop) 1683 | return err; 1684 | 1685 | if (!user_access_begin(infop, sizeof(*infop))) 1686 | return -EFAULT; 1687 | 1688 | unsafe_put_user(signo, &infop->si_signo, Efault); 1689 | unsafe_put_user(0, &infop->si_errno, Efault); 1690 | unsafe_put_user(info.cause, &infop->si_code, Efault); 1691 | unsafe_put_user(info.pid, &infop->si_pid, Efault); 1692 | unsafe_put_user(info.uid, &infop->si_uid, Efault); 1693 | unsafe_put_user(info.status, &infop->si_status, Efault); 1694 | user_access_end(); 1695 | return err; 1696 | Efault: 1697 | user_access_end(); 1698 | return -EFAULT; 1699 | } 1700 | 1701 | long kernel_wait4(pid_t upid, int __user *stat_addr, int options, 1702 | struct rusage *ru) 1703 | { 1704 | struct wait_opts wo; 1705 | struct pid *pid = NULL; 1706 | enum pid_type type; 1707 | long ret; 1708 | 1709 | if (options & ~(WNOHANG|WUNTRACED|WCONTINUED| 1710 | __WNOTHREAD|__WCLONE|__WALL)) 1711 | return -EINVAL; 1712 | 1713 | /* -INT_MIN is not defined */ 1714 | if (upid == INT_MIN) 1715 | return -ESRCH; 1716 | 1717 | if (upid == -1) 1718 | type = PIDTYPE_MAX; 1719 | else if (upid < 0) { 1720 | type = PIDTYPE_PGID; 1721 | pid = find_get_pid(-upid); 1722 | } else if (upid == 0) { 1723 | type = PIDTYPE_PGID; 1724 | pid = get_task_pid(current, PIDTYPE_PGID); 1725 | } else /* upid > 0 */ { 1726 | type = PIDTYPE_PID; 1727 | pid = find_get_pid(upid); 1728 | } 1729 | 1730 | wo.wo_type = type; 1731 
| wo.wo_pid = pid; 1732 | wo.wo_flags = options | WEXITED; 1733 | wo.wo_info = NULL; 1734 | wo.wo_stat = 0; 1735 | wo.wo_rusage = ru; 1736 | ret = do_wait(&wo); 1737 | put_pid(pid); 1738 | if (ret > 0 && stat_addr && put_user(wo.wo_stat, stat_addr)) 1739 | ret = -EFAULT; 1740 | 1741 | return ret; 1742 | } 1743 | 1744 | SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr, 1745 | int, options, struct rusage __user *, ru) 1746 | { 1747 | struct rusage r; 1748 | long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL); 1749 | 1750 | if (err > 0) { 1751 | if (ru && copy_to_user(ru, &r, sizeof(struct rusage))) 1752 | return -EFAULT; 1753 | } 1754 | return err; 1755 | } 1756 | 1757 | #ifdef __ARCH_WANT_SYS_WAITPID 1758 | 1759 | /* 1760 | * sys_waitpid() remains for compatibility. waitpid() should be 1761 | * implemented by calling sys_wait4() from libc.a. 1762 | */ 1763 | SYSCALL_DEFINE3(waitpid, pid_t, pid, int __user *, stat_addr, int, options) 1764 | { 1765 | return kernel_wait4(pid, stat_addr, options, NULL); 1766 | } 1767 | 1768 | #endif 1769 | 1770 | #ifdef CONFIG_COMPAT 1771 | COMPAT_SYSCALL_DEFINE4(wait4, 1772 | compat_pid_t, pid, 1773 | compat_uint_t __user *, stat_addr, 1774 | int, options, 1775 | struct compat_rusage __user *, ru) 1776 | { 1777 | struct rusage r; 1778 | long err = kernel_wait4(pid, stat_addr, options, ru ? &r : NULL); 1779 | if (err > 0) { 1780 | if (ru && put_compat_rusage(&r, ru)) 1781 | return -EFAULT; 1782 | } 1783 | return err; 1784 | } 1785 | 1786 | COMPAT_SYSCALL_DEFINE5(waitid, 1787 | int, which, compat_pid_t, pid, 1788 | struct compat_siginfo __user *, infop, int, options, 1789 | struct compat_rusage __user *, uru) 1790 | { 1791 | struct rusage ru; 1792 | struct waitid_info info = {.status = 0}; 1793 | long err = kernel_waitid(which, pid, &info, options, uru ? 
&ru : NULL); 1794 | int signo = 0; 1795 | if (err > 0) { 1796 | signo = SIGCHLD; 1797 | err = 0; 1798 | if (uru) { 1799 | /* kernel_waitid() overwrites everything in ru */ 1800 | if (COMPAT_USE_64BIT_TIME) 1801 | err = copy_to_user(uru, &ru, sizeof(ru)); 1802 | else 1803 | err = put_compat_rusage(&ru, uru); 1804 | if (err) 1805 | return -EFAULT; 1806 | } 1807 | } 1808 | 1809 | if (!infop) 1810 | return err; 1811 | 1812 | if (!user_access_begin(infop, sizeof(*infop))) 1813 | return -EFAULT; 1814 | 1815 | unsafe_put_user(signo, &infop->si_signo, Efault); 1816 | unsafe_put_user(0, &infop->si_errno, Efault); 1817 | unsafe_put_user(info.cause, &infop->si_code, Efault); 1818 | unsafe_put_user(info.pid, &infop->si_pid, Efault); 1819 | unsafe_put_user(info.uid, &infop->si_uid, Efault); 1820 | unsafe_put_user(info.status, &infop->si_status, Efault); 1821 | user_access_end(); 1822 | return err; 1823 | Efault: 1824 | user_access_end(); 1825 | return -EFAULT; 1826 | } 1827 | #endif 1828 | 1829 | __weak void abort(void) 1830 | { 1831 | BUG(); 1832 | 1833 | /* if that doesn't kill us, halt */ 1834 | panic("Oops failed to kill thread"); 1835 | } 1836 | EXPORT_SYMBOL(abort); 1837 | ``` 1838 | ----------------------------------------------------------------------------------------- 1839 | - Creator 1840 | https://github.com/torvalds/linux.git 1841 | 1842 | - Kernel doc 1843 | https://www.kernel.org/doc/html/v4.15/index.html# 1844 | 1845 | - How to patching your Kernel 1846 | https://www.kernel.org/doc/html/v4.10/process/applying-patches.html 1847 | 1848 | - Source: 1849 | https://www.kernel.org/ 1850 | 1851 | - - - Online: 1852 | 1853 | [![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/logo/Linux.png)](https://elixir.bootlin.com/linux/latest/source) 1854 | 1855 | # VV 1856 | -------------------------------------------------------------------------------- /SystemCall/README.MD: 
-------------------------------------------------------------------------------- 1 | # What is a system call? 2 | 3 | In the most literal sense, a system call (also called a "syscall") is an instruction, similar to the "add" instruction or the "jump" instruction. At a higher level, a system call is the way a user-level program asks the operating system to do something for it. If you're writing a program, and you need to read from a file, you use a system call to ask the operating system to read the file for you. 4 | 5 | ----------------------------------------------------------------------------------------------------------------------------- 6 | - System calls in detail 7 | 8 | Here's how a system call works. First, the user program sets up the arguments for the system call. One of the arguments is the system call number (more on that later). Note that all this is done automatically by library functions unless you are writing in assembly. After the arguments are all set up, the program executes the "system call" instruction. This instruction causes an exception: an event that causes the processor to jump to a new address and start executing the code there. 9 | 10 | The instructions at the new address save your user program's state, figure out what system call you want, call the function in the kernel that implements that system call, restore your user program's state, and return control to the user program. A system call is one way that the functions defined in a device driver end up being called. 11 | 12 | That was the whirlwind tour of how a system call works. Next, we'll go into minute detail for those who are curious about exactly how the kernel does all this. Don't worry if you don't quite understand all of the details - just remember that this is one way that a function in the kernel can end up being called, and that no magic is involved. You can trace the control flow all the way through the kernel - with difficulty sometimes, but you can do it.
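The user-space half of this round trip can be demonstrated with the `syscall(2)` wrapper from libc, which takes the system call number explicitly instead of hiding it behind a named wrapper. A minimal, Linux-specific sketch (the helper name `getpid_by_number` is ours; the `SYS_*` constants come from `<sys/syscall.h>`):

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Fetch our process ID by passing the system call number explicitly,
 * instead of going through the libc getpid() wrapper. Both paths end
 * up at the same entry in the kernel's syscall table. */
long getpid_by_number(void)
{
    return syscall(SYS_getpid);
}
```

Comparing the result with `getpid()` should always agree, since `SYS_getpid` indexes the very table entry the wrapper uses.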
13 | 14 | ----------------------------------------------------------------------------------------------------------------------------- 15 | - A system call example 16 | 17 | This is a good place to start showing code to go along with the theory. We'll follow the progress of a read() system call, starting from the moment the system call instruction is executed. The PowerPC architecture will be used as an example for the architecture-specific part of the code. On the PowerPC, when you execute a system call, the processor jumps to the address 0xc00. The code at that location is defined in the file: 18 | 19 | `arch/ppc/kernel/head.S` 20 | 21 | It looks something like this: 22 | 23 | ```asm 24 | /* System call */ 25 | . = 0xc00 26 | SystemCall: 27 | EXCEPTION_PROLOG 28 | EXC_XFER_EE_LITE(0xc00, DoSyscall) 29 | 30 | /* Single step - not used on 601 */ 31 | EXCEPTION(0xd00, SingleStep, SingleStepException, EXC_XFER_STD) 32 | EXCEPTION(0xe00, Trap_0e, UnknownException, EXC_XFER_EE) 33 | ``` 34 | 35 | What this code does is save some state and call another function called `DoSyscall`. Here's a more detailed explanation (feel free to skip this part): 36 | 37 | `EXCEPTION_PROLOG` is a macro that handles the switch from user to kernel space, which requires things like saving the register state of the user process. `EXC_XFER_EE_LITE` is called with the address of this routine and the address of the function `DoSyscall`. Eventually, some state will be saved and `DoSyscall` will be called. The next two lines set up two exception vectors at the addresses `0xd00` and `0xe00`.
38 | 39 | `EXC_XFER_EE_LITE` looks like this: 40 | 41 | ```c 42 | #define EXC_XFER_EE_LITE(n, hdlr) \ 43 | EXC_XFER_TEMPLATE(n, hdlr, n+1, COPY_EE, transfer_to_handler, \ 44 | ret_from_except) 45 | ``` 46 | 47 | `EXC_XFER_TEMPLATE` is another macro, and the code looks like this: 48 | 49 | ```c 50 | #define EXC_XFER_TEMPLATE(n, hdlr, trap, copyee, tfer, ret) \ 51 | li r10,trap; \ 52 | stw r10,TRAP(r11); \ 53 | li r10,MSR_KERNEL; \ 54 | copyee(r10, r9); \ 55 | bl tfer; \ 56 | ##n: \ 57 | .long hdlr; \ 58 | .long ret 59 | ``` 60 | 61 | `li` stands for "load immediate", which means that a constant value known at compile time is stored in a register. First, `trap` is loaded into the register `r10`. On the next line, that value is stored at the address given by `TRAP(r11)`. `TRAP(r11)` and the next two lines do some hardware-specific bit manipulation. After that we call the `tfer` function (i.e. the `transfer_to_handler` function), which does yet more housekeeping, and then transfers control to `hdlr` (i.e. `DoSyscall`). Note that `transfer_to_handler` loads the address of the handler from the link register, which is why you see `.long DoSyscall` instead of `bl DoSyscall`. 62 | 63 | Now, let's look at `DoSyscall`. It's in the file: 64 | 65 | `arch/ppc/kernel/entry.S` 66 | 67 | Eventually, this function loads up the address of the syscall table and indexes into it using the system call number. The syscall table is what the OS uses to translate from a system call number to a particular system call. The system call table is named `sys_call_table` and defined in: 68 | 69 | `arch/ppc/kernel/misc.S` 70 | 71 | The syscall table contains the addresses of the functions that implement each system call. For example, the `read()` system call function is named `sys_read`. The `read()` system call number is 3, so the address of `sys_read()` is in the 4th entry of the system call table (since we start numbering the system calls with 0).
We read the data from the address sys_call_table + (3 * word_size) and we get the address of `sys_read()`. 72 | 73 | After `DoSyscall` has looked up the correct system call address, it transfers control to that system call. Let's look at where `sys_read()` is defined, in the file: 74 | `fs/read_write.c` 75 | 76 | This function finds the file struct associated with the fd number you passed to the `read()` function. That structure contains a pointer to the function that should be used to read data from that particular kind of file. After doing some checks, it calls that file-specific read function in order to actually read the data from the file, and then returns. This file-specific function is defined somewhere else - the socket code, filesystem code, or device driver code, for example. This is one of the points at which a specific kernel subsystem finally interfaces with the rest of the kernel. After our read function finishes, we return from `sys_read()` back to `DoSyscall()`, which switches control to `ret_from_except`, which is defined in: 77 | 78 | `arch/ppc/kernel/entry.S` 79 | 80 | This checks for tasks that might need to be done before switching back to user mode. If nothing else needs to be done, we fall through to the `restore` function, which restores the user process's state and returns control back to the user program. There! Your `read()` call is done! If you're lucky, you even got your data back. 81 | 82 | You can explore syscalls further by putting printks at strategic places. Be sure to limit the amount of output from these printks. For example, if you add a `printk` to the `sys_read()` syscall, you should do something like this: 83 | 84 | ```c 85 | static int mycount = 0; 86 | 87 | if (mycount < 10) { 88 | printk ("sys_read called\n"); 89 | mycount++; 90 | } 91 | ``` 92 | # BR nu11secur1ty!
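The dispatch described above (index an array of function addresses by the syscall number, then call through it) can be modeled in plain C. Everything here is invented for illustration — the `toy_` names, the handlers, the table size — while the real table is `sys_call_table` in `arch/ppc/kernel/misc.S`:

```c
#include <stddef.h>

/* Toy handlers standing in for sys_read(), sys_exit(), etc. */
static long toy_sys_restart(long arg) { (void)arg; return 0; }
static long toy_sys_exit(long arg)    { return arg; }
static long toy_sys_read(long arg)    { return arg * 2; }

typedef long (*syscall_fn)(long);

/* A miniature sys_call_table: the syscall number is just an index. */
static const syscall_fn toy_sys_call_table[] = {
    toy_sys_restart,  /* 0 */
    toy_sys_exit,     /* 1 */
    NULL,             /* 2 (unimplemented in this toy) */
    toy_sys_read,     /* 3 - read() really is number 3 in the text above */
};

/* What DoSyscall does after saving state: bounds-check the number,
 * then jump through the table; unknown numbers yield -ENOSYS. */
long toy_do_syscall(size_t nr, long arg)
{
    size_t n = sizeof(toy_sys_call_table) / sizeof(toy_sys_call_table[0]);
    if (nr >= n || toy_sys_call_table[nr] == NULL)
        return -38; /* -ENOSYS */
    return toy_sys_call_table[nr](arg);
}
```

Calling `toy_do_syscall(3, x)` lands in `toy_sys_read`, just as issuing syscall number 3 lands in `sys_read()` in the kernel.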
93 | -------------------------------------------------------------------------------- /SystemCall/SystemCall+LKM+LSM.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/SystemCall/SystemCall+LKM+LSM.pdf -------------------------------------------------------------------------------- /TLB/README.MD: -------------------------------------------------------------------------------- 1 | ## Translation Lookaside Buffer (TLB) in Paging 2 | 3 | In an operating system that uses paging for memory management, a page table is created for each process; it holds the process's page table entries (PTEs). A PTE contains the frame number (the location in main memory we want to reference) and some other useful bits (e.g., the valid/invalid bit, dirty bit, and protection bits). In short, the PTE tells where in main memory the actual page resides. 4 | 5 | Now the question is where to place the page table so that the overall access (reference) time is as low as possible. 6 | 7 | The original problem was how to quickly access main memory content based on the address generated by the CPU (the logical/virtual address). At first, some designers thought of using registers to store the page table, since registers are high-speed memory and access time would be low. 8 | 9 | The idea was to place the page table entries in registers: each request generated by the CPU (a virtual address) is matched against the appropriate page number in the page table, which then tells where in main memory the corresponding page resides.
Everything seems right here, but the problem is that registers are small: in practice they can hold at most about 0.5K to 1K page table entries, while a process may be large and so need a large page table (say, 1M entries), so registers cannot hold all the PTEs of the page table. This is not a practical approach. 10 | To overcome this size issue, the entire page table was kept in main memory. But then two main-memory references are required: 11 | 12 | - To find the frame number 13 | 14 | - To go to the address specified by the frame number 15 | 16 | To overcome this problem, a high-speed cache for page table entries is set up, called a Translation Lookaside Buffer (TLB). The TLB is a special cache used to keep track of recently used translations; it contains the page table entries that have been used most recently. Given a virtual address, the processor examines the TLB: if the page table entry is present (a TLB hit), the frame number is retrieved and the real address is formed. If the entry is not found (a TLB miss), the page number is used to index the page table in main memory; if the page itself is not in main memory, a page fault is issued. The TLB is then updated to include the new page entry. 17 | 18 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/TLB/screen/tlb1.jpg) 19 | 20 | ## Steps in TLB hit: 21 | 22 | 23 | 1 . CPU generates a virtual (logical) address. 24 | 25 | 2 . It is checked in the TLB (present). 26 | 27 | 3 . The corresponding frame number is retrieved, which tells where in main memory the page lies. 28 | 29 | ## Steps in TLB miss: 30 | 31 | 32 | 1 . CPU generates a virtual (logical) address. 33 | 34 | 2 . It is checked in the TLB (not present). 35 | 36 | 3 . The page number is matched against the page table residing in main memory (assuming the page table contains all PTEs). 37 | 38 | 4 . 
The corresponding frame number is retrieved, which tells where in main memory the page lies. 39 | 40 | 5 . The TLB is updated with the new PTE (if there is no free slot, a replacement policy such as FIFO, LRU, or MFU comes into play). 41 | 42 | ``` 43 | Effective memory access time (EMAT): the TLB reduces the effective memory access time because it is a high-speed associative cache. 44 | EMAT = h*(c+m) + (1-h)*(c+2m) 45 | where, h = hit ratio of TLB 46 | m = Memory access time 47 | c = TLB access time 48 | ``` 49 | -------------------------------------------------------------------------------- /TLB/screen/screen: -------------------------------------------------------------------------------- 1 | screen 2 | -------------------------------------------------------------------------------- /TLB/screen/tlb1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/TLB/screen/tlb1.jpg -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/README.MD: -------------------------------------------------------------------------------- 1 | # Kconfig 2 | The Linux kernel config/build system, also known as Kconfig/kbuild, has been around for a long time, ever since the Linux kernel code migrated to Git. As supporting infrastructure, however, it is seldom in the spotlight; even kernel developers who use it in their daily work never really think about it. 3 | 4 | To explore how the Linux kernel is compiled, this article will dive into the Kconfig/kbuild internal process, explain how the .config file and the vmlinux/bzImage files are produced, and introduce a smart trick for dependency tracking. 5 | 6 | The first step in building a kernel is always configuration. Kconfig helps make the Linux kernel highly modular and customizable. 
Kconfig offers the user many config targets: 7 | 8 | - Kconfig 9 | ``` 10 | config - Update current config utilizing a line-oriented program 11 | nconfig - Update current config utilizing a ncurses menu-based program 12 | menuconfig - Update current config utilizing a menu-based program 13 | xconfig - Update current config utilizing a Qt-based frontend 14 | gconfig - Update current config utilizing a GTK+ based frontend 15 | oldconfig - Update current config utilizing a provided .config as base 16 | localmodconfig - Update current config disabling modules not loaded 17 | localyesconfig - Update current config converting local mods to core 18 | defconfig - New config with default from Arch-supplied defconfig 19 | savedefconfig - Save current config as ./defconfig (minimal config) 20 | allnoconfig - New config where all options are answered with 'no' 21 | allyesconfig - New config where all options are accepted with 'yes' 22 | allmodconfig - New config selecting modules when possible 23 | alldefconfig - New config with all symbols set to default 24 | randconfig - New config with a random answer to all options 25 | listnewconfig - List new options 26 | olddefconfig - Same as oldconfig but sets new symbols to their default value without prompting 27 | kvmconfig - Enable additional options for KVM guest kernel support 28 | xenconfig - Enable additional options for xen dom0 and guest kernel support 29 | tinyconfig - Configure the tiniest possible kernel 30 | ``` 31 | The menuconfig is the most popular of these targets. The targets are processed by different host programs, which are provided by the kernel and built during kernel building. Some targets have a GUI (for the user's convenience) while most don't. Kconfig-related tools and source code reside mainly under scripts/kconfig/ in the kernel source. As we can see from scripts/kconfig/Makefile, there are several host programs, including conf, mconf, and nconf. 
Except for conf, each of them is responsible for one of the GUI-based config targets, so, conf deals with most of them. 32 | 33 | Logically, Kconfig's infrastructure has two parts: one implements a new language to define the configuration items (see the Kconfig files under the kernel source), and the other parses the Kconfig language and deals with configuration actions. 34 | 35 | Most of the config targets have roughly the same internal process (shown below): 36 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/The-secrets-of-Kconfig/wall/1.png) 37 | 38 | 39 | - Note that all configuration items have a default value. 40 | 41 | The first step reads the Kconfig file under source root to construct an initial configuration database; then it updates the initial database by reading an existing configuration file according to this priority: 42 | ``` 43 | .config 44 | /lib/modules/$(shell,uname -r)/.config 45 | /etc/kernel-config 46 | /boot/config-$(shell,uname -r) 47 | ARCH_DEFCONFIG 48 | arch/$(ARCH)/defconfig 49 | ``` 50 | If you are doing GUI-based configuration via menuconfig or command-line-based configuration via oldconfig, the database is updated according to your customization. Finally, the configuration database is dumped into the .config file. 51 | 52 | But the .config file is not the final fodder for kernel building; this is why the syncconfig target exists. syncconfig used to be a config target called silentoldconfig, but it doesn't do what the old name says, so it was renamed. Also, because it is for internal use (not for users), it was dropped from the list. 53 | 54 | Here is an illustration of what syncconfig does: 55 | 56 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/The-secrets-of-Kconfig/wall/2.png) 57 | 58 | syncconfig takes .config as input and outputs many other files, which fall into three categories: 59 | 60 | - auto.conf & tristate.conf are used for makefile text processing. 
For example, you may see statements like this in a component's makefile: 61 | 62 | ```make 63 | obj-$(CONFIG_GENERIC_CALIBRATE_DELAY) += calibrate.o 64 | ``` 65 | - autoconf.h is used in C-language source files. 66 | - Empty header files under include/config/ are used for configuration-dependency tracking during kbuild, which is explained below. 67 | 68 | After configuration, we will know which files and code pieces are not compiled. 69 | 70 | -------------------------------------------------------------------------------- 71 | 72 | # kbuild 73 | 74 | Component-wise building, called recursive make, is a common way for GNU make to manage a large project. Kbuild is a good example of recursive make. By dividing source files into different modules/components, each component is managed by its own makefile. When you start building, a top makefile invokes each component's makefile in the proper order, builds the components, and collects them into the final executable. 75 | 76 | Kbuild refers to different kinds of makefiles: 77 | 78 | - `Makefile` is the top makefile located in the source root. 79 | - `.config` is the kernel configuration file. 80 | - `arch/$(ARCH)/Makefile` is the arch makefile, which is the supplement to the top makefile. 81 | - `scripts/Makefile.*` describes common rules for all kbuild makefiles. 82 | - Finally, there are about 500 `kbuild makefiles`. 83 | 84 | The top makefile includes the arch makefile, reads the .config file, descends into subdirectories, invokes `make` on each component's makefile with the help of routines defined in `scripts/Makefile.*`, builds up each intermediate object, and links all the intermediate objects into vmlinux. The kernel document Documentation/kbuild/makefiles.txt describes all aspects of these makefiles.
85 | 86 | As an example, let's look at how vmlinux is produced on x86-64: 87 | 88 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/The-secrets-of-Kconfig/wall/3.png) 89 | 90 | 91 | All the `.o` files that go into vmlinux first go into their own built-in.a, which is indicated via the variables `KBUILD_VMLINUX_INIT`, `KBUILD_VMLINUX_MAIN`, and `KBUILD_VMLINUX_LIBS`, and are then collected into the vmlinux file. 92 | 93 | Take a look at how recursive make is implemented in the Linux kernel, with the help of simplified makefile code: 94 | 95 | 96 | ```make 97 | # In top Makefile
vmlinux: scripts/link-vmlinux.sh $(vmlinux-deps)
	+$(call if_changed,link-vmlinux)

# Variable assignments
vmlinux-deps := $(KBUILD_LDS) $(KBUILD_VMLINUX_INIT) $(KBUILD_VMLINUX_MAIN) $(KBUILD_VMLINUX_LIBS)

export KBUILD_VMLINUX_INIT := $(head-y) $(init-y)
export KBUILD_VMLINUX_MAIN := $(core-y) $(libs-y2) $(drivers-y) $(net-y) $(virt-y)
export KBUILD_VMLINUX_LIBS := $(libs-y1)
export KBUILD_LDS := arch/$(SRCARCH)/kernel/vmlinux.lds

init-y := init/
drivers-y := drivers/ sound/ firmware/
net-y := net/
libs-y := lib/
core-y := usr/
virt-y := virt/

# Transform to corresponding built-in.a
init-y := $(patsubst %/, %/built-in.a, $(init-y))
core-y := $(patsubst %/, %/built-in.a, $(core-y))
drivers-y := $(patsubst %/, %/built-in.a, $(drivers-y))
net-y := $(patsubst %/, %/built-in.a, $(net-y))
libs-y1 := $(patsubst %/, %/lib.a, $(libs-y))
libs-y2 := $(patsubst %/, %/built-in.a, $(filter-out %.a, $(libs-y)))
virt-y := $(patsubst %/, %/built-in.a, $(virt-y))

# Setup the dependency. vmlinux-deps are all intermediate objects, vmlinux-dirs
# are phony targets, so every time comes to this rule, the recipe of vmlinux-dirs
# will be executed. Refer "4.6 Phony Targets" of `info make`
$(sort $(vmlinux-deps)): $(vmlinux-dirs) ;

# Variable vmlinux-dirs is the directory part of each built-in.a
vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
		$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
		$(net-y) $(net-m) $(libs-y) $(libs-m) $(virt-y)))

# The entry of recursive make
$(vmlinux-dirs):
	$(Q)$(MAKE) $(build)=$@ need-builtin=1 98 | ``` 99 | 100 | The recursive make recipe is expanded, for example: 101 | 102 | ```bash 103 | make -f scripts/Makefile.build obj=init need-builtin=1 104 | ``` 105 | 106 | 107 | This means make will go into scripts/Makefile.build to continue the work of building each built-in.a. With the help of scripts/link-vmlinux.sh, the vmlinux file finally appears under the source root. 108 | # Understanding vmlinux vs. bzImage 109 | 110 | Many Linux kernel developers may not be clear about the relationship between vmlinux and bzImage. For example, here is their relationship in x86-64: 111 | 112 | ![](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/The-secrets-of-Kconfig/wall/4.png) 113 | 114 | 115 | The source root vmlinux is stripped, compressed, put into `piggy.S`, then linked with other peer objects into `arch/x86/boot/compressed/vmlinux`. Meanwhile, a file called setup.bin is produced under `arch/x86/boot`. There may be an optional third file that has relocation info, depending on the configuration of `CONFIG_X86_NEED_RELOCS`. 116 | 117 | A host program called build, provided by the kernel, builds these two (or three) parts into the final bzImage file. 118 | 119 | - Dependency tracking 120 | 121 | Kbuild tracks three kinds of dependencies: 122 | 123 | 1. All prerequisite files (both *.c and *.h) 124 | 2. CONFIG_ options used in all prerequisite files 125 | 3. Command-line dependencies used to compile the target 126 | 127 | The first one is easy to understand, but what about the second and third? 
Kernel developers often see code pieces like this: 128 | 129 | ```c 130 | #ifdef CONFIG_SMP
	__boot_cpu_id = cpu;
#endif 131 | ``` 132 | 133 | When `CONFIG_SMP` changes, this piece of code should be recompiled. The command line for compiling a source file also matters, because different command lines may result in different object files. 134 | 135 | When a `.c` file uses a header file via a `#include` directive, you need to write a rule like this: 136 | 137 | ```make 138 | main.o: defs.h
	recipe... 139 | ``` 140 | When managing a large project, you need a lot of these kinds of rules; writing them all would be tedious and boring. Fortunately, most modern C compilers can write these rules for you by looking at the #include lines in the source file. For the GNU Compiler Collection (GCC), it is just a matter of adding the command-line parameter `-MD depfile`: 141 | 142 | 143 | ``` 144 | # In scripts/Makefile.lib
c_flags = -Wp,-MD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
	-include $(srctree)/include/linux/compiler_types.h \
	$(__c_flags) $(modkern_cflags) \
	$(basename_flags) $(modname_flags) 145 | ``` 146 | This would generate a `.d` file with content like: 147 | 148 | ``` 149 | init_task.o: init/init_task.c include/linux/kconfig.h \
	include/generated/autoconf.h include/linux/init_task.h \
	include/linux/rcupdate.h include/linux/types.h \
	... 150 | ``` 151 | 152 | Then the host program `fixdep` takes care of the other two dependencies by taking the `depfile` and command line as input, then outputting a `.<target>.cmd` file in makefile syntax, which records the command line and all the prerequisites (including the configuration) for a target. It looks like this: 153 | 154 | 155 | ``` 156 | # The command line used to compile the target
cmd_init/init_task.o := gcc -Wp,-MD,init/.init_task.o.d -nostdinc ...

# The dependency files
deps_init/init_task.o := \
	$(wildcard include/config/posix/timers.h) \
	$(wildcard include/config/arch/task/struct/on/stack.h) \
	$(wildcard include/config/thread/info/in/task.h) \
	... \
	include/uapi/linux/types.h \
	arch/x86/include/uapi/asm/types.h \
	include/uapi/asm-generic/types.h \
	... 157 | ``` 158 | 159 | 160 | A `.<target>.cmd` file will be included during recursive make, providing all the dependency info and helping to decide whether to rebuild a target or not. 161 | 162 | The secret behind this is that `fixdep` will parse the `depfile` (the `.d` file), then parse all the dependency files inside it, search the text for all the `CONFIG_` strings, convert them to the corresponding empty header files, and add them to the target's prerequisites. Every time the configuration changes, the corresponding empty header file will be updated, too, so kbuild can detect that change and rebuild the target that depends on it. Because the command line is also recorded, it is easy to compare the last and current compiling parameters. 
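The CONFIG_-string-to-header-path conversion that fixdep relies on can be sketched in a few lines of C: lowercase the suffix after `CONFIG_`, turn underscores into slashes, and append `.h`. This is only a toy for the old naming scheme shown in the example above; the real `scripts/basic/fixdep.c` also parses the depfile and the command line:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Map a CONFIG_ symbol to the empty-header path that kbuild touches when
 * that option changes, e.g. CONFIG_SMP -> include/config/smp.h.
 * Returns a pointer to a static buffer (fine for a demo, not reentrant),
 * or NULL if the symbol lacks the CONFIG_ prefix. */
const char *config_to_header(const char *sym)
{
    static char out[256];

    if (strncmp(sym, "CONFIG_", 7) != 0)
        return NULL;

    size_t n = (size_t)snprintf(out, sizeof(out), "include/config/");
    for (const char *p = sym + 7; *p && n + 3 < sizeof(out); p++, n++)
        out[n] = (*p == '_') ? '/' : (char)tolower((unsigned char)*p);

    out[n] = '\0';
    strcat(out, ".h");
    return out;
}
```

Feeding it `CONFIG_ARCH_TASK_STRUCT_ON_STACK` reproduces the `include/config/arch/task/struct/on/stack.h` path seen in the `deps_init/init_task.o` listing above.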
163 | -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/wall/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/The-secrets-of-Kconfig/wall/1.png -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/wall/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/The-secrets-of-Kconfig/wall/2.png -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/wall/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/The-secrets-of-Kconfig/wall/3.png -------------------------------------------------------------------------------- /The-secrets-of-Kconfig/wall/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/The-secrets-of-Kconfig/wall/4.png -------------------------------------------------------------------------------- /Uninstalling/kernels.md: -------------------------------------------------------------------------------- 1 | # RPM-based distros – Red Hat/CentOS/Fedora Core/SUSE Linux 2 | 3 | First, find out all installed kernel versions with the following command: 4 | ```bash 5 | # rpm -qa | grep kernel-smp 6 | ``` 7 | - or 8 | ```bash 9 | # rpm -qa | grep kernel 10 | ``` 11 | - Output: 12 | ```bash 13 | kernel-smp-x.x.x-x.x.x.EL 14 | kernel-smp-x.x.x-x.x.x.EL 15 | kernel-smp-x.x.x-x.x.x.EL 16 | ``` 17 | I have three different kernels 
installed. To remove kernel-smp-x.x.x-x.x.x.EL, type the command: 18 | ```bash 19 | # rpm -e kernel-smp-x.x.x-x.x.x.EL 20 | ``` 21 | - OR 22 | ```bash 23 | # rpm -vv -e kernel-smp-x.x.x-x.x.x.EL 24 | ``` 25 | -------------------------------------------------------------------------------- /Userspace/README.MD: -------------------------------------------------------------------------------- 1 | # Read 2 | - Part1 3 | 4 | [![alt text](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Userspace/logo/red-hat-logo.png)](https://www.redhat.com/en/blog/architecting-containers-part-1-why-understanding-user-space-vs-kernel-space-matters "Linux-Userspace") 5 | 6 | - Part2 7 | 8 | [![alt text](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Userspace/logo/red-hat-logo.png)](https://www.redhat.com/en/blog/architecting-containers-part-2-why-user-space-matters "Linux-Userspace") 9 | 10 | - Part3 11 | 12 | [![alt text](https://github.com/nu11secur1ty/Kernel-and-Types-of-kernels/blob/master/Userspace/logo/red-hat-logo.png)](https://www.redhat.com/en/blog/architecting-containers-part-3-how-user-space-affects-your-application "Linux-Userspace") 13 | 14 | 15 | -------------------------------------------------------------------------------- /Userspace/logo/red-hat-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/Userspace/logo/red-hat-logo.png -------------------------------------------------------------------------------- /Userspace/what_mean.md: -------------------------------------------------------------------------------- 1 | This is Linus Torvalds' first rule of kernel development. This note explains it: https://lkml.org/lkml/2012/12/23/75. I.e., when maintaining the kernel, do not do something which breaks user programs/applications. 
In other words, when making kernel changes, it is very bad to cause problems in the user's application "space". That doesn't literally mean memory. That means anything that impacts the user applications in a way that negatively affects its behavior (causes the program to malfunction). The note I cite also indicates at least one example. 2 | 3 | 4 | ----------------------------------------------------- 5 | 6 | ```bash 7 | From Linus Torvalds <> 8 | Date Sun, 23 Dec 2012 09:36:15 -0800 9 | Subject Re: [Regression w/ patch] Media commit causes user space to misbahave (was: Re: Linux 3.8-rc1) 10 | On Sun, Dec 23, 2012 at 6:08 AM, Mauro Carvalho Chehab 11 | wrote: 12 | > 13 | > Are you saying that pulseaudio is entering on some weird loop if the 14 | > returned value is not -EINVAL? That seems a bug at pulseaudio. 15 | 16 | Mauro, SHUT THE FUCK UP! 17 | 18 | It's a bug alright - in the kernel. How long have you been a 19 | maintainer? And you *still* haven't learnt the first rule of kernel 20 | maintenance? 21 | 22 | If a change results in user programs breaking, it's a bug in the 23 | kernel. We never EVER blame the user programs. How hard can this be to 24 | understand? 25 | 26 | To make matters worse, commit f0ed2ce840b3 is clearly total and utter 27 | CRAP even if it didn't break applications. ENOENT is not a valid error 28 | return from an ioctl. Never has been, never will be. ENOENT means "No 29 | such file and directory", and is for path operations. ioctl's are done 30 | on files that have already been opened, there's no way in hell that 31 | ENOENT would ever be valid. 32 | 33 | > So, on a first glance, this doesn't sound like a regression, 34 | > but, instead, it looks tha pulseaudio/tumbleweed has some serious 35 | > bugs and/or regressions. 36 | 37 | Shut up, Mauro. And I don't _ever_ want to hear that kind of obvious 38 | garbage and idiocy from a kernel maintainer again. Seriously. 
39 | 40 | I'd wait for Rafael's patch to go through you, but I have another 41 | error report in my mailbox of all KDE media applications being broken 42 | by v3.8-rc1, and I bet it's the same kernel bug. And you've shown 43 | yourself to not be competent in this issue, so I'll apply it directly 44 | and immediately myself. 45 | 46 | WE DO NOT BREAK USERSPACE! 47 | 48 | Seriously. How hard is this rule to understand? We particularly don't 49 | break user space with TOTAL CRAP. I'm angry, because your whole email 50 | was so _horribly_ wrong, and the patch that broke things was so 51 | obviously crap. The whole patch is incredibly broken shit. It adds an 52 | insane error code (ENOENT), and then because it's so insane, it adds a 53 | few places to fix it up ("ret == -ENOENT ? -EINVAL : ret"). 54 | 55 | The fact that you then try to make *excuses* for breaking user space, 56 | and blaming some external program that *used* to work, is just 57 | shameful. It's not how we work. 58 | 59 | Fix your f*cking "compliance tool", because it is obviously broken. 60 | And fix your approach to kernel programming. 61 | 62 | Linus 63 | ``` 64 | -------------------------------------------------------------------------------- /file_system/File System in Linux Kernel.md: -------------------------------------------------------------------------------- 1 | # Introduction: 2 | A file system is one of the central OS subsystems. File systems continue to develop as operating systems evolve. Currently we have an entire heterogeneous zoo of file systems, from the old “classic” UFS, to the new exotic NILFS (though this idea isn’t new at all, look at LFS) and BTRFS. We aren’t going to try to take on monsters like ext3/4 and BTRFS. Our file system will be educational in nature, and we will use it to familiarize ourselves with the Linux kernel. 3 | 4 | # Environment Setup: 5 | Before we get into the kernel, let’s prepare all the necessary steps for building our OS module.
I use Ubuntu, so I’m going to set up within this environment. Fortunately, it’s not difficult at all. To begin with, we’re going to need a compiler and build tools: 6 | 7 | ``` 8 | sudo apt-get install gcc build-essential 9 | ``` 10 | Then we may need the kernel source code. We’ll go the easy route and won’t bother rebuilding the kernel from source. We’ll just install the kernel headers; this should be enough to write a loadable module. The headers can be installed the following way: 11 | 12 | Check for available linux headers: 13 | ``` 14 | apt-cache search linux-headers 15 | ``` 16 | Install them: 17 | ``` 18 | sudo apt-get install linux-headers-xx.x.x.x.x 19 | sudo apt-get install linux-headers-$(uname -r) 20 | ``` 21 | 22 | And now I’m going to jump onto my soap box. Rummaging in the kernel on a working machine isn’t the smartest idea, so I strongly recommend you perform all these actions within a virtual machine. We won’t do anything dangerous, so the stored data is safe. But if anything goes wrong, we’ll probably have to restart the system. Besides, it’s more comfortable to debug kernel modules in a virtual machine (such as QEMU), though this question won’t be considered in the article. 23 | 24 | # Environment Check Up: 25 | 26 | In order to check the environment we’ll write and load a kernel module which won’t do anything useful (Hello, World!). Let’s consider the module code. I named it super.c (super is derived from superblock): 27 | 28 | ``` 29 | #include <linux/module.h> 30 | #include <linux/init.h> 31 | 32 | static int __init aufs_init(void) 33 | { 34 | pr_debug("aufs module loaded\n"); 35 | return 0; 36 | } 37 | 38 | static void __exit aufs_fini(void) 39 | { 40 | pr_debug("aufs module unloaded\n"); 41 | } 42 | 43 | module_init(aufs_init); 44 | module_exit(aufs_fini); 45 | 46 | MODULE_LICENSE("GPL"); 47 | MODULE_AUTHOR("kmu"); 48 | ``` 49 | 50 | At the very beginning you can see two headers. They are an important part of any loadable module.
Then two functions, aufs_init and aufs_fini, follow. They will be called at module load and at module unload, respectively. Some of you may be confused by the __init label. __init is a hint to the kernel that the function is used during module initialization only. It means that after the module initialization it can be unloaded from memory. There is an analogous marker for data; the kernel is free to ignore these hints. Referencing __init functions and data from the main module code is a potential error. That’s why during the module build it is checked that there are no such references. If such a reference is found, the kernel build system will issue a warning. A similar check is carried out for __exit functions and data. If you want to know the details about __init and __exit, you can refer to the source code. 51 | 52 | Please note that aufs_init returns int. This is how the kernel finds out whether something went wrong during module initialization. If the function returns a non-zero value, it means that an error occurred during initialization. In order to tell the kernel which functions should be called at module loading and unloading, the two macros module_init and module_exit are used. To learn the details, refer to lxr, it’s really useful if you want to study the kernel. pr_debug is a kernel logging function (it’s a macro actually, but it doesn’t matter for now); it’s very similar to the printf family of functions, with some extensions (for example, for printing IP and MAC addresses). You will find a complete list of format specifiers in the kernel documentation. Together with pr_debug, there is an entire family of macros: pr_info, pr_warn, pr_err and others. If you are familiar with Linux module development you know about the printk function. The pr_* macros expand into printk calls, so you can use printk instead of them. 53 | 54 | Then there are macros carrying information about the module – a license and an author.
There are also other macros that allow you to store various information about the module, for example MODULE_VERSION, MODULE_INFO, MODULE_SUPPORTED_DEVICE and others. By the way, if you’re using a license other than the GPL, you won’t be able to use some functions that are available only to GPL modules. 55 | 56 | Now let’s build and load our module. We’ll write a Makefile for this. It will build our module: 57 | 58 | ``` 59 | obj-m := aufs.o 60 | aufs-objs := super.o 61 | 62 | CFLAGS_super.o := -DDEBUG 63 | 64 | all: 65 | make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules 66 | 67 | clean: 68 | make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean 69 | ``` 70 | 71 | Our Makefile invokes the kernel’s own Makefile, which should be located in the /lib/modules/$(shell uname -r)/build directory (uname -r is a command that returns the version of the running kernel). If the headers (or source code) of the kernel are located in another directory, you should adjust the path. 72 | 73 | obj-m states the name of the future module. In our case it will be named aufs.ko (ko stands for kernel object). aufs-objs indicates which source files the aufs module should be built from. In our case the super.c file will be used. You can also pass additional compiler flags, which will be used (in addition to those used by the kernel Makefile) when building the object files. In our case I pass the -DDEBUG flag when building super.c. If we don’t pass the flag, we won’t see pr_debug output in the system log. 74 | 75 | To build the module, execute the make command. If everything’s fine, the aufs.ko file will appear in the directory. It’s quite easy to load the module: 76 | 77 | ``` 78 | sudo insmod ./aufs.ko 79 | ``` 80 | To make sure that the module is loaded you can look at the lsmod command output: 81 | 82 | ``` 83 | lsmod | grep aufs 84 | ``` 85 | 86 | 87 | To see the system log, call the dmesg command. We’ll see messages from our module in it.
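As an aside, lsmod is essentially a formatted view of /proc/modules, so the same check can be done with no extra tools (the aufs name assumes the module from this article has been built and loaded):

```shell
# Each line of /proc/modules looks like: "name size refcount deps state address".
# Anchoring the pattern to the start of the line avoids matching other
# modules whose names merely contain "aufs".
if grep -q '^aufs ' /proc/modules; then
    echo "aufs is loaded"
else
    echo "aufs is not loaded"
fi
```

The same anchored pattern works on lsmod output as well, since lsmod prints the module name in the first column.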
It’s not difficult to unload the module: 88 | 89 | ``` 90 | sudo rmmod aufs 91 | ``` 92 | 93 | # Returning to the File System: 94 | 95 | Thus, the environment is set up and working. We know how to build the simplest module, load it and unload it. Now we should consider the file system itself. The design of a file system should begin “on paper”, with a thorough consideration of the data structures being used. But we’ll follow a simpler path and defer the details of how files and folders are stored on disk until next time. For now we’ll write a skeleton of our future file system. 96 | 97 | The life of a file system begins with registration. You register the file system by calling register_filesystem. We’ll register the file system in the module initialization function. There’s unregister_filesystem for unregistering the file system, and we’ll call it in the aufs_fini function of our module. 98 | 99 | Both functions accept a pointer to a file_system_type structure as a parameter. It “describes” the file system. Consider it as a class of the file system. There are quite a few fields in the structure, but we’re interested in only some of them: 100 | 101 | 102 | ``` 103 | static struct file_system_type aufs_type = { 104 | .owner = THIS_MODULE, 105 | .name = "aufs", 106 | .mount = aufs_mount, 107 | .kill_sb = kill_block_super, 108 | .fs_flags = FS_REQUIRES_DEV, 109 | }; 110 | ``` 111 | 112 | First of all, we are interested in the name field. It stores the file system name. This name will be used during mounting. 113 | 114 | mount & kill_sb — are two fields containing pointers to functions. The first function will be called when the file system is mounted, the second one when it is unmounted. It’s enough to implement only the first of them; for the second we’ll use kill_block_super, which is provided by the kernel. 115 | 116 | fs_flags — stores different flags.
In our case it stores the FS_REQUIRES_DEV flag, which says that our file system needs a disk to operate (though that’s not actually the case for the moment). You don’t have to set this flag if you don’t want to; everything will operate without it. Finally, the owner field is necessary in order to maintain a reference counter on the module. The reference counter is needed so that the module isn’t unloaded too soon. For example, if the file system is mounted, unloading the module could lead to a crash. The reference counter won’t allow the module to be unloaded while it is in use, i.e. until we unmount the file system. 117 | 118 | Now let’s consider the aufs_mount function. It should mount the device and return a structure describing the root directory of the file system. It sounds quite complicated, but, fortunately, the kernel will do most of this for us as well: 119 | 120 | ``` 121 | static struct dentry *aufs_mount(struct file_system_type *type, int flags, 122 | char const *dev, void *data) 123 | { 124 | struct dentry *const entry = mount_bdev(type, flags, dev, 125 | data, aufs_fill_sb); 126 | if (IS_ERR(entry)) 127 | pr_err("aufs mounting failed\n"); 128 | else 129 | pr_debug("aufs mounted\n"); 130 | return entry; 131 | } 132 | ``` 133 | 134 | The biggest part of the work happens inside the mount_bdev function. We are interested in its aufs_fill_sb parameter. It’s a pointer to a function (again), which will be called from mount_bdev in order to initialize the superblock. But before we move on to it, an important structure of the kernel’s file subsystem — dentry — should be considered. This structure represents a section of a path to a file name. For example, if we refer to the /usr/bin/vim file, we’ll have different instances of this structure representing the path sections / (the root directory), usr/, bin/ and vim. The kernel maintains a cache of these structures. It allows the kernel to quickly find an inode (another central structure) by the file (path) name.
The aufs_mount function should return the dentry that represents the root directory of our file system. The aufs_fill_sb function will create it. 135 | 136 | Thus, for the moment aufs_fill_sb is the most important function in our module, and it looks like the following: 137 | 138 | ``` 139 | static int aufs_fill_sb(struct super_block *sb, void *data, int silent) 140 | { 141 | struct inode *root = NULL; 142 | 143 | sb->s_magic = AUFS_MAGIC_NUMBER; 144 | sb->s_op = &aufs_super_ops; 145 | 146 | root = new_inode(sb); 147 | if (!root) 148 | { 149 | pr_err("inode allocation failed\n"); 150 | return -ENOMEM; 151 | } 152 | 153 | root->i_ino = 0; 154 | root->i_sb = sb; 155 | root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME; 156 | inode_init_owner(root, NULL, S_IFDIR); 157 | 158 | sb->s_root = d_make_root(root); 159 | if (!sb->s_root) 160 | { 161 | pr_err("root creation failed\n"); 162 | return -ENOMEM; 163 | } 164 | 165 | return 0; 166 | } 167 | ``` 168 | 169 | First of all, we fill the super_block structure. What kind of structure is it? Usually a file system stores, in a special place on the disk partition (the place is chosen by the file system), a set of file system parameters, such as the block size, the number of occupied/free blocks, the file system version, a “pointer” to the root directory, and a magic number by which a driver can check that the expected file system is really stored on the disk. This structure is named the superblock (look at the picture below). The super_block structure in the Linux kernel is mainly intended for similar goals; we save a magic number in it and the dentry for the root directory (the one returned by mount_bdev). 170 | 171 | Besides, in the s_op field of the super_block structure we store a pointer to a super_operations structure — these are the super_block “class methods”; it’s another structure storing a lot of pointers to functions. I’ll make another note here. The Linux kernel is written in C, without support for OOP features from the language side.
But we can structure a program following OOP ideas without support from the language, so structures storing a lot of pointers to functions are met quite often in the kernel. It’s a way of implementing virtual functions with the facilities at hand. 172 | 173 | Let’s get back to the super_block structure and its “methods”. We’re interested in its put_super field. We’ll save a “destructor” of our superblock in it: 174 | 175 | ``` 176 | static void aufs_put_super(struct super_block *sb) 177 | { 178 | pr_debug("aufs super block destroyed\n"); 179 | } 180 | 181 | static struct super_operations const aufs_super_ops = { 182 | .put_super = aufs_put_super, 183 | }; 184 | ``` 185 | 186 | While the aufs_put_super function does nothing useful, we use it just to print one more line to the system log. The aufs_put_super function will be called from within kill_block_super (see above) before the super_block structure is deleted, i.e. when the file system is unmounted. 187 | 188 | Now let’s return to the most important function, aufs_fill_sb. Before we create the dentry for the root directory, we should create an index node (inode) for the root directory. The inode structure is probably the most important one in a file system. Each file system object (a file, a folder, a special file, a journal, etc.) is identified by an inode. As with super_block, the inode structure reflects the way file systems are stored on a disk. The inode name comes from index node, meaning that it indexes files and folders on a disk. Usually an on-disk inode stores a pointer to the place where the file data are kept on disk (in which blocks the file content is stored), various access flags (read/write/execute), information about the file owner, timestamps, and other similar things. 189 | 190 | We can’t read from the disk yet, so we’ll fill the inode with dummy data.
For the timestamps we use the current time, and the kernel will assign the owner and access permissions (the inode_init_owner call). Finally, we create the dentry bound to the root inode. 191 | 192 | # Skeleton Check Up: 193 | 194 | The skeleton of our file system is ready. It’s time to check it. Building and loading the file system driver doesn’t differ from building and loading an ordinary module. We’ll use a loop device instead of a real disk for the experiments. It’s a “disk” driver which writes data not to a physical device but to a file (a disk image). Let’s create a disk image. It doesn’t need to store any data yet, so it’s simple: 195 | 196 | ``` 197 | touch image 198 | ``` 199 | 200 | We should also create a directory which will be the mount point (root) of our file system: 201 | 202 | ``` 203 | mkdir dir 204 | ``` 205 | Now, using this image, we’ll mount our file system: 206 | ``` 207 | sudo mount -o loop -t aufs ./image ./dir 208 | ``` 209 | 210 | 211 | If the operation ended successfully, we’ll see messages from our module in the system log. To unmount the file system we should: 212 | 213 | ``` 214 | sudo umount ./dir 215 | ``` 216 | 217 | Check the system log again. 218 | 219 | # Summary: 220 | 221 | We familiarized ourselves with the creation of loadable kernel modules and the main structures of the file subsystem. We also wrote a real file system, which for now can only be mounted and unmounted (it’s quite silly for the time being, but we’re going to fix that in the future). 222 | 223 | Next we’re going to consider reading data from the disk. To begin with we’ll define the way data will be stored on disk. We’ll also learn how to read the superblock and inodes from the disk.
224 | 225 | # see more [link](https://kukuruku.co/) 226 | 227 | ------------------------------------------------ 228 | 229 | # Source for development: 230 | [link](https://github.com/torvalds/linux/tree/master/fs/ext4) 231 | 232 | 233 | -------------------------------------------------------------------------------- /kernel-types-networknuts.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/kernel-types-networknuts.png -------------------------------------------------------------------------------- /logo/Linux.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/logo/Linux.png -------------------------------------------------------------------------------- /logo/logo: -------------------------------------------------------------------------------- 1 | logo 2 | -------------------------------------------------------------------------------- /modules/example-lkm-module/.clang-format: -------------------------------------------------------------------------------- 1 | --- 2 | BasedOnStyle: Mozilla 3 | AllowShortFunctionsOnASingleLine: Empty 4 | BreakBeforeBraces: Linux 5 | UseTab: ForIndentation 6 | IndentWidth: 8 7 | AlwaysBreakTemplateDeclarations: true 8 | AlignConsecutiveDeclarations: true 9 | AlignConsecutiveAssignments: true 10 | -------------------------------------------------------------------------------- /modules/example-lkm-module/.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | charset = utf-8 5 | end_of_line = LF 6 | insert_final_newline = true 7 | trim_trailing_whitespace = true 8 | indent_style = space 9 | indent_size = 8 10 | 11 | [Makefile] 12 | indent_size = 4 13 | 
indent_style = tab 14 | 15 | [*.md] 16 | trim_trailing_whitespace = false 17 | 18 | [*.go] 19 | indent_style = tab 20 | 21 | [*.yml] 22 | indent_style = space 23 | indent_size = 2 24 | 25 | [*.sh] 26 | indent_style = space 27 | indent_size = 2 28 | 29 | -------------------------------------------------------------------------------- /modules/example-lkm-module/.gitignore: -------------------------------------------------------------------------------- 1 | .tmp_versions 2 | .cache.mk 3 | .hello* 4 | Module.symvers 5 | *.ko 6 | *.o 7 | *.mod.o 8 | *.order 9 | *.mod.c 10 | -------------------------------------------------------------------------------- /modules/example-lkm-module/.ycm_extra_conf.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | 3 | kernel_release = subprocess.check_output(['uname', '-r']).decode().rstrip() 4 | 5 | def FlagsForFile( filename, **kwargs ): 6 | return { 'flags': [ 7 | '-I/usr/include', 8 | '-I/usr/local/include', 9 | '-I/usr/src/linux-headers-' + kernel_release + '/arch/x86/include', 10 | '-I/usr/src/linux-headers-' + kernel_release + '/arch/x86/include/uapi', 11 | '-I/usr/src/linux-headers-' + kernel_release + '/include', 12 | '-D__KERNEL__', 13 | '-std=gnu99', 14 | '-xc' 15 | ]} 16 | -------------------------------------------------------------------------------- /modules/example-lkm-module/Kbuild: -------------------------------------------------------------------------------- 1 | ccflags-y = -Wall -g 2 | 3 | obj-m = katil.o 4 | -------------------------------------------------------------------------------- /modules/example-lkm-module/Makefile: -------------------------------------------------------------------------------- 1 | KDIR = /lib/modules/`uname -r`/build 2 | 3 | kbuild: 4 | make -C $(KDIR) M=`pwd` 5 | 6 | # Formats any C-related file using the clang-format 7 | # definition at the root of the project.
8 | # 9 | # Make sure you have clang-format installed before 10 | # executing. 11 | fmt: 12 | find . -name "*.c" -o -name "*.h" | \ 13 | xargs clang-format -style=file -i 14 | 15 | 16 | # Removes any generated binaries. 17 | clean: 18 | find . -name "*.out" -type f -delete 19 | make -C $(KDIR) M=`pwd` clean 20 | -------------------------------------------------------------------------------- /modules/example-lkm-module/README.MD: -------------------------------------------------------------------------------- 1 | # Katil-lkm is a loadable kernel module 2 | 3 | - Usage: 4 | 5 | - 1. Install the necessary development headers 6 | ```bash 7 | apt update -y 8 | apt install -y libelf-dev linux-headers-`uname -r` 9 | ``` 10 | - 2. Build the kernel module 11 | ```bash 12 | make 13 | ``` 14 | - 3. Load the module 15 | ```bash 16 | insmod katil.ko 17 | ``` 18 | - 4. Check that the module has been inserted 19 | ```bash 20 | lsmod | grep katil 21 | ``` 22 | - 5. Check that the module has really been loaded 23 | ```bash 24 | dmesg 25 | ``` 26 | - 6. Remove the module 27 | ```bash 28 | rmmod katil 29 | ``` 30 | -------------------------------------------------------------------------------- /modules/example-lkm-module/katil.c: -------------------------------------------------------------------------------- 1 | /** 2 | * Author V.Varbanovski nu11secur1ty 3 | * katil - a module that does nothing more than 4 | * printing `katil` using `printk`. 5 | */ 6 | 7 | /** 8 | * `module.h` contains the most basic functionality needed for 9 | * us to create a loadable kernel module, including the `MODULE_*` 10 | * macros, `module_*` functions and including a bunch of other 11 | * relevant headers that provide useful functionality for us 12 | * (for instance, `printk`, which comes from `linux/printk.h`, 13 | * a header included by `linux/module.h`).
14 | */ 15 | #include <linux/module.h> 16 | 17 | /** 18 | * Following, we make use of several macros to properly provide 19 | * information about the kernel module that we're creating. 20 | * 21 | * The information supplied here is visible through tools like 22 | * `modinfo`. 23 | * 24 | * Note: the license you choose here **DOES AFFECT** other things - 25 | * by using a proprietary license your kernel will be "tainted". 26 | */ 27 | MODULE_LICENSE("GPL"); 28 | MODULE_AUTHOR("V.Varbanovski nu11secur1ty"); 29 | MODULE_DESCRIPTION("Katil LKM"); 30 | MODULE_VERSION("0.1"); 31 | 32 | /** hello_init - initializes the module 33 | * 34 | * The `hello_init` method defines the procedure that performs the setup 35 | * of our module. 36 | */ 37 | static int 38 | hello_init(void) 39 | { 40 | // By making use of `printk` here (in the initialization), 41 | // we can look at `dmesg` and verify that what we log here 42 | // appears there at the moment that we load the module with 43 | // `insmod`. 44 | printk(KERN_INFO "Ko staaa katil?\n"); 45 | return 0; 46 | } 47 | 48 | static void 49 | hello_exit(void) 50 | { 51 | // similar to `init`, but for the removal time. 52 | printk(KERN_INFO "Ae beee si igraesh!\n"); 53 | } 54 | 55 | // registers the `hello_init` method as the method to run at module 56 | // insertion time. 57 | module_init(hello_init); 58 | 59 | // similar, but for `removal` 60 | module_exit(hello_exit); 61 | -------------------------------------------------------------------------------- /scheme/kernels.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/scheme/kernels.png -------------------------------------------------------------------------------- /type of programming/types.md: -------------------------------------------------------------------------------- 1 | 1. - Imperative programming 2 | 2. - Functional Programming 3 | 3.
- Object oriented programming 4 | -------------------------------------------------------------------------------- /understanding the linux kernel/Understanding Linux Kernel.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/understanding the linux kernel/Understanding Linux Kernel.pdf -------------------------------------------------------------------------------- /wall/gV8hn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nu11secur1ty/Kernel-and-Types-of-kernels/905842f99dacb065481c2a78d6b31a849bfb958f/wall/gV8hn.png -------------------------------------------------------------------------------- /wall/wall: -------------------------------------------------------------------------------- 1 | ..... 2 | --------------------------------------------------------------------------------