├── .gitignore ├── assembly-and-processor.md ├── communication-between-kernel-module-and-process.md ├── container.md ├── data-structure.md ├── from-protocol-handler-to-socket-interface.md ├── ftrace.md ├── get-kernel-state-using-sysrq.md ├── how-debugger-works.md ├── interrupt-exception.md ├── io-cache.md ├── io-consumption-controller.md ├── io-scheduler.md ├── jvm-trouble-shooting.md ├── kernel-mode-stack.md ├── kernel-panic-oops-call-trace.md ├── memory-management-unit.md ├── memory-statistics.md ├── network-congestion-management.md ├── network-data-structure.md ├── nfs-client-cache-coherence.md ├── observe-kernel-state-system-tap.md ├── observe-kernel-state-using-crash.md ├── page-coloring.md ├── paging-unit.md ├── preemption.md ├── process-execution.md ├── process-scheduler.md ├── process-session-process-group-thread-group.md ├── process-state.md ├── processing-input-ip-packet.md ├── reading-source-code.md ├── receive-data-from-socket.md ├── receive-frame-from-network.md ├── samples └── timeslice.c ├── segmentation.md ├── send-frame-to-network.md ├── signals.md ├── slab-allocator.md ├── socket-file-api-implementation.md ├── spin-lock.md ├── strace.md ├── synchronization.md ├── syscall.md ├── sysfs-procfs.md ├── top-half.md ├── tss.md └── udev.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.out 2 | *.o 3 | -------------------------------------------------------------------------------- /assembly-and-processor.md: -------------------------------------------------------------------------------- 1 | # Assembly language and how processor work 2 | 3 | Assuming that segment descriptor for code, data, stack are stored in cs, ds and respectively ss registers. 4 | 5 | **Instruction pointer** 6 | 7 | 1. value is in EIP register (RIP for 64 bits) 8 | 2. processor will execute an instruction stored in the address specified by the EIP 9 | 3. jump loc1 will change EIP to loc1 10 | 11 | **Stack pointer ESP** 12 | 13 | Point to top address of the stack 14 | 15 | 1. value is in ESP register (RSP for 64 bits) 16 | 2. push/pop instruction will store/get a value from an address specified by ESP and decrease/increase ESP by a corresponding size 17 | 3. stack grows from higher to lower memory address 18 | 4. Kernel can directly modify ESP during context switch 19 | 20 | call loc1 will push memory location after the instruction (EIP content) into current stack (memory location specified by ESP register) 21 | and jump to loc1. 22 | 23 | ret will change EIP to a value in an address specified by ESP and increase ESP by a corresponding size 24 | 25 | **Stack base pointer EBP** 26 | 27 | holds address of current stack frame, a range of stack holding passed parameters, return address and local parameters 28 | 29 | 1. value is in EBP register (RBP for 64 bits) 30 | 2. EBP contains address in stack where return address is stored. 31 | 32 | **x86 Function Call** 33 | 34 | ESP and EBP: The Stack Pointer and the Base Pointer 35 | 36 | When a block of code calls a function, it pushes the parameters and the return address on the stack. 37 | Once inside, function sets the base pointer equal to the stack pointer and then places its own internal variables on the stack. 38 | From that point on, the function refers to its parameters and variables relative to the base pointer rather than the stack pointer. 39 | 40 | Why not the stack pointer? For some reason, the stack pointer lousy addressing modes. In 16-bit mode, it cannot be a square-bracket memory offset at all. 
In 32-bit mode, it can be appear in square brackets only by adding an expensive SIB byte to the opcode. 41 | 42 | In your code, there is never a reason to use the stack pointer for anything other than the stack. 43 | 44 | The base pointer, however, is up for grabs. If your routines pass parameters by register instead of by stack (they should), 45 | there is no reason to copy the stack pointer into the base pointer. The base pointer becomes a free register for whatever you need. 46 | 47 | **AT&T assembly syntax** 48 | 49 | example assembly code generated by hotspot jvm 50 | 51 | # {method} 'setA' '(I)V' in 'TestVolatile' 52 | # this: ecx = 'TestVolatile' 53 | # parm0: edx = int 54 | # [sp+0x10] (sp of caller) 55 | 0xb4f48980: cmp 0x4(%ecx),%eax ;...3b4104 56 | 0xb4f48983: jne 0xb4f2cea0 ;...0f851745 feff 57 | ; {runtime_call} 58 | 0xb4f48989: xchg %ax,%ax ;...666690 59 | [Verified Entry Point] 60 | 0xb4f4898c: push %ebp ;...55 61 | 0xb4f4898d: sub $0x8,%esp ;...81ec0800 0000 62 | 0xb4f48993: mov %edx,0x8(%ecx) ;...895108 63 | 0xb4f48996: lock addl $0x0,(%esp) ;...f0830424 00 64 | ;*synchronization entry 65 | ; - TestVolatile::setA@-1 (line 14) 66 | 0xb4f4899b: add $0x8,%esp ;...83c408 67 | 0xb4f4899e: pop %ebp ;...5d 68 | 0xb4f4899f: test %eax,0xb7f62000 ;...85050020 f6b7 69 | ; {poll_return} 70 | 0xb4f489a5: ret ;...c3 71 | 72 | 73 | * register: %ax, %eax 74 | * constant: $0x18, $0x4 75 | * access memory location, its address is in register: (%ecx) 76 | * access memory location, its address is in register + offset: 0x4(%ecx) 77 | * move data from source (left operand) to destination (right operand): mov %edx,0x8(%ecx) 78 | 79 | **References** 80 | 81 | 1. http://www.intel.com/Assets/PDF/manual/253665.pdf 82 | 2. http://sig9.com/articles/att-syntax 83 | 3. http://asm.sourceforge.net/articles/linasm.html#Syntax 84 | 4. http://www.swansontec.com/sregisters.html 85 | -------------------------------------------------------------------------------- /communication-between-kernel-module-and-process.md: -------------------------------------------------------------------------------- 1 | ## Communication between kernel module and user process 2 | 3 | **Driver** 4 | 5 | Driver is installed in kernel. In order for process in userspace to access it, we need to create special device file with major 6 | number in file system. The driver module in turn register file system functions with the major number. 7 | 8 | result = register_chrdev(memory_major, "memory", &memory_fops); 9 | 10 | The result is pseudo file `/proc/devices/memory` . `udev` then create corresponding device file traditional `/dev/memory` 11 | 12 | **procfs file** 13 | 14 | Other way is to register using proc filesystem 15 | 16 | create_proc_read_entry("hello_world", 0, NULL, hello_read_proc, NULL) 17 | 18 | So when user process read, write to the special file, kernel invokes relevant registered file system functions. 19 | 20 | See example in 21 | 22 | * http://www.tldp.org/LDP/lkmpg/2.6/html/lkmpg.html 23 | * http://www.opensourceforu.com/2012/06/some-nifty-udev-rules-and-examples/ 24 | -------------------------------------------------------------------------------- /container.md: -------------------------------------------------------------------------------- 1 | ## Process container 2 | 3 | The concept process container in linux refers to isolated light weight OS environment for running processes (also called OS level virtualization). 
A process container appears as a complete OS (with its own root filesystem, network and hostname) dedicated to a process.

Compared to a virtual machine, a process container (lxc, docker, rocket) provides reasonably good isolation at much lower cost (in terms of time and resources).

The Linux kernel provides the features that container tools use to create and manage containers.

These are

* [namespace related syscalls](http://man7.org/linux/man-pages/man7/namespaces.7.html): prevent collisions when using named resources, e.g. process ids, ipc, network ports, filesystems, hostnames.
* [resource management cgroups](http://en.wikipedia.org/wiki/Cgroups): allocate physical resources (cpu, memory, io) to each container.

**Docker**

Docker is a set of tools and infrastructure enabling the creation and management of containers. Docker is written in golang and uses libcontainer (the previous version used lxc) to access the Linux kernel namespace syscalls and cgroups.

**References**

* http://blog.dotcloud.com/under-the-hood-linux-kernels-on-dotcloud-part
* http://blog.dotcloud.com/kernel-secrets-from-the-paas-garage-part-24-c
* http://www.zdnet.com/article/docker-libcontainer-unifies-linux-container-powers/
--------------------------------------------------------------------------------
/data-structure.md:
--------------------------------------------------------------------------------
## Kernel data structure

References
1. http://isis.poly.edu/kulesh/stuff/src/klist/
--------------------------------------------------------------------------------
/from-protocol-handler-to-socket-interface.md:
--------------------------------------------------------------------------------
# From protocol handler to Socket Interface

The interface between a user space program and the kernel network stack is the socket. We access a socket using

* the socket api, e.g. `connect`, `accept`, `bind`, `listen`, `send`, `receive`, or
* the file api, e.g. `read`, `write`, `close`

The socket api is complete, while the file api offers only a few operations and alone is not enough to make a socket work (there is no `open` for sockets), so applications end up using the socket api.

It is important to know that the syscalls related to sockets use a file descriptor associated with the kernel socket data structure.
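To make that concrete, here is a minimal, hedged user-space sketch (error handling mostly omitted; the target address and port are made up) showing that the descriptor returned by `socket()` works with both the socket api and the file api:

    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* the returned descriptor refers to a struct file wired to the kernel struct socket */
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8080);                 /* hypothetical local service */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {   /* socket api */
            const char *msg = "hello\n";
            send(fd, msg, strlen(msg), 0);           /* socket api */
            write(fd, msg, strlen(msg));             /* file api on the same descriptor */
        }
        close(fd);                                   /* file api */
        return 0;
    }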
13 | 14 | **Socket data structures** 15 | 16 | Socket 17 | 18 | struct socket { 19 | socket_state state; 20 | unsigned long flags; 21 | const struct proto_ops *ops; 22 | struct fasync_struct *fasync_list; 23 | struct file *file; 24 | struct sock *sk; 25 | wait_queue_head_t wait; 26 | short type; 27 | }; 28 | 29 | Sock structure encapsulate internal implementation of socket 30 | 31 | struct sock { 32 | /* 33 | * Now struct inet_timewait_sock also uses sock_common, so please just 34 | * don't add nothing before this first member (__sk_common) --acme 35 | */ 36 | struct sock_common __sk_common; 37 | #define sk_family __sk_common.skc_family 38 | #define sk_state __sk_common.skc_state 39 | #define sk_reuse __sk_common.skc_reuse 40 | #define sk_bound_dev_if __sk_common.skc_bound_dev_if 41 | #define sk_node __sk_common.skc_node 42 | #define sk_bind_node __sk_common.skc_bind_node 43 | #define sk_refcnt __sk_common.skc_refcnt 44 | #define sk_hash __sk_common.skc_hash 45 | #define sk_prot __sk_common.skc_prot 46 | unsigned char sk_shutdown : 2, 47 | sk_no_check : 2, 48 | sk_userlocks : 4; 49 | unsigned char sk_protocol; 50 | unsigned short sk_type; 51 | int sk_rcvbuf; 52 | socket_lock_t sk_lock; 53 | wait_queue_head_t *sk_sleep; 54 | struct dst_entry *sk_dst_cache; 55 | struct xfrm_policy *sk_policy[2]; 56 | rwlock_t sk_dst_lock; 57 | atomic_t sk_rmem_alloc; 58 | atomic_t sk_wmem_alloc; 59 | atomic_t sk_omem_alloc; 60 | struct sk_buff_head sk_receive_queue; 61 | struct sk_buff_head sk_write_queue; 62 | struct sk_buff_head sk_async_wait_queue; 63 | int sk_wmem_queued; 64 | int sk_forward_alloc; 65 | gfp_t sk_allocation; 66 | int sk_sndbuf; 67 | int sk_route_caps; 68 | int sk_gso_type; 69 | int sk_rcvlowat; 70 | unsigned long sk_flags; 71 | unsigned long sk_lingertime; 72 | /* 73 | * The backlog queue is special, it is always used with 74 | * the per-socket spinlock held and requires low latency 75 | * access. Therefore we special case it's implementation. 76 | */ 77 | struct { 78 | struct sk_buff *head; 79 | struct sk_buff *tail; 80 | } sk_backlog; 81 | struct sk_buff_head sk_error_queue; 82 | struct proto *sk_prot_creator; 83 | rwlock_t sk_callback_lock; 84 | int sk_err, 85 | sk_err_soft; 86 | unsigned short sk_ack_backlog; 87 | unsigned short sk_max_ack_backlog; 88 | __u32 sk_priority; 89 | struct ucred sk_peercred; 90 | long sk_rcvtimeo; 91 | long sk_sndtimeo; 92 | struct sk_filter *sk_filter; 93 | void *sk_protinfo; 94 | struct timer_list sk_timer; 95 | struct timeval sk_stamp; 96 | struct socket *sk_socket; 97 | void *sk_user_data; 98 | struct page *sk_sndmsg_page; 99 | struct sk_buff *sk_send_head; 100 | __u32 sk_sndmsg_off; 101 | int sk_write_pending; 102 | void *sk_security; 103 | void (*sk_state_change)(struct sock *sk); 104 | void (*sk_data_ready)(struct sock *sk, int bytes); 105 | void (*sk_write_space)(struct sock *sk); 106 | void (*sk_error_report)(struct sock *sk); 107 | int (*sk_backlog_rcv)(struct sock *sk, 108 | struct sk_buff *skb); 109 | void (*sk_create_child)(struct sock *sk, struct sock *newsk); 110 | void (*sk_destruct)(struct sock *sk); 111 | }; 112 | 113 | 114 | **Create a socket** 115 | 116 | asmlinkage long sys_socket(int family, int type, int protocol) 117 | { 118 | int retval; 119 | struct socket *sock; 120 | 121 | retval = sock_create(family, type, protocol, &sock); 122 | if (retval < 0) 123 | goto out; 124 | 125 | retval = sock_map_fd(sock); /* assign function pointers for file interface*/ 126 | ... 
127 | return retval; 128 | } 129 | 130 | The syscall return a file descriptor associated to the newly created socket 131 | 132 | static int __sock_create(int family, int type, int protocol, struct socket **res, int kern) 133 | { 134 | … 135 | struct socket *sock; 136 | ... 137 | if (!(sock = sock_alloc())) { 138 | ... 139 | if ((err = net_families[family]->create(sock, protocol)) < 0) { 140 | ... 141 | *res = sock; 142 | … 143 | 144 | **initialization of socket function pointers** 145 | 146 | There is a function responsible for initializing socket for each socket family/domain 147 | 148 | struct net_proto_family { 149 | int family; 150 | int (*create)(struct socket *sock, int protocol); 151 | ... 152 | }; 153 | 154 | 155 | int sock_register(struct net_proto_family *ops) 156 | { 157 | ... 158 | if (net_families[ops->family] == NULL) { 159 | net_families[ops->family]=ops; 160 | err = 0; 161 | } 162 | … 163 | 164 | The socket family holding initialization function get registered at the system boot 165 | 166 | static int __init inet_init(void) 167 | { 168 | … 169 | (void)sock_register(&inet_family_ops); 170 | … 171 | 172 | 173 | E.g. of structure of socket family for IPv4 Internet protocols 174 | 175 | static struct net_proto_family inet_family_ops = { 176 | .family = PF_INET, 177 | .create = inet_create, 178 | .owner = THIS_MODULE, 179 | }; 180 | 181 | 182 | static int inet_create(struct socket *sock, int protocol) 183 | { 184 | struct sock *sk; 185 | 186 | 187 | ... 188 | lookup_protocol: 189 | err = -ESOCKTNOSUPPORT; 190 | rcu_read_lock(); 191 | list_for_each_rcu(p, &inetsw[sock->type]) { 192 | answer = list_entry(p, struct inet_protosw, list); 193 | 194 | 195 | /* Check the non-wild match. */ 196 | if (protocol == answer->protocol) { 197 | if (protocol != IPPROTO_IP) 198 | break; 199 | ... 200 | sock->ops = answer->ops; /* the socket operation is initialized from one registered in global variable inetsw */ 201 | answer_prot = answer->prot; /* this will be used internally when ops actually delegate its work to a functions specified here*/ 202 | ... 203 | sk = sk_alloc(PF_INET, GFP_KERNEL, answer_prot, 1); 204 | … 205 | sock_init_data(sock, sk) 206 | …. 207 | if (sk->sk_prot->init) { 208 | err = sk->sk_prot->init(sk); 209 | 210 | 211 | The operation functions are specific to protocol (normally 0 as there is only one protocol) of the family (local, IPv4) 212 | and type of the socket (datagram, stream or raw) 213 | 214 | static int __init inet_init(void) 215 | { 216 | … 217 | for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q) 218 | inet_register_protosw(q); 219 | ... 220 | 221 | static struct inet_protosw inetsw_array[] = 222 | { 223 | { 224 | .type = SOCK_STREAM, 225 | .protocol = IPPROTO_TCP, 226 | .prot = &tcp_prot, 227 | .ops = &inet_stream_ops, 228 | .capability = -1, 229 | .no_check = 0, 230 | .flags = INET_PROTOSW_PERMANENT | 231 | INET_PROTOSW_ICSK, 232 | }, 233 | 234 | 235 | { 236 | .type = SOCK_DGRAM, 237 | .protocol = IPPROTO_UDP, 238 | .prot = &udp_prot, 239 | .ops = &inet_dgram_ops, /*functions here will call those specified in .prot via internal socket*/ 240 | .capability = -1, 241 | .no_check = UDP_CSUM_DEFAULT, 242 | .flags = INET_PROTOSW_PERMANENT, 243 | }, 244 | ... 245 | 246 | 247 | const struct proto_ops inet_dgram_ops = { 248 | .family = PF_INET, 249 | .owner = THIS_MODULE, 250 | .release = inet_release, 251 | .bind = inet_bind, 252 | .connect = inet_dgram_connect, 253 | ... 254 | .recvmsg = sock_common_recvmsg, 255 | ... 
256 | 257 | 258 | void inet_register_protosw(struct inet_protosw *p) 259 | { 260 | ... 261 | answer = NULL; 262 | last_perm = &inetsw[p->type]; 263 | list_for_each(lh, &inetsw[p->type]) { 264 | answer = list_entry(lh, struct inet_protosw, list); 265 | /* Check only the non-wild match. */ 266 | if (INET_PROTOSW_PERMANENT & answer->flags) { 267 | if (protocol == answer->protocol) 268 | break; 269 | last_perm = lh; 270 | } 271 | … 272 | list_add_rcu(&p->list, last_perm); 273 | 274 | 275 | void sock_init_data(struct socket *sock, struct sock *sk) 276 | { 277 | ... 278 | if(sock) 279 | { 280 | sk->sk_type = sock->type; 281 | sk->sk_sleep = &sock->wait; 282 | sock->sk = sk; /* wiring internal socket with socket*/ 283 | } else 284 | sk->sk_sleep = NULL; 285 | .. 286 | sk->sk_state_change = sock_def_wakeup; 287 | sk->sk_data_ready = sock_def_readable; 288 | 289 | 290 | struct sock *sk_alloc(int family, gfp_t priority, 291 | struct proto *prot, int zero_it) 292 | { 293 | struct sock *sk = NULL; 294 | ... 295 | sk->sk_prot = sk->sk_prot_creator = prot; 296 | 297 | struct proto udp_prot = { 298 | .name = "UDP", 299 | .owner = THIS_MODULE, 300 | .close = udp_close, 301 | .connect = ip4_datagram_connect, 302 | .disconnect = udp_disconnect, 303 | .ioctl = udp_ioctl, 304 | .destroy = udp_destroy_sock, 305 | .setsockopt = udp_setsockopt, 306 | .getsockopt = udp_getsockopt, 307 | .sendmsg = udp_sendmsg, 308 | .recvmsg = udp_recvmsg, 309 | .sendpage = udp_sendpage, 310 | .backlog_rcv = __udp_queue_rcv_skb, 311 | .hash = udp_v4_hash, 312 | .unhash = udp_v4_unhash, 313 | .get_port = udp_v4_get_port, 314 | .memory_allocated = &udp_memory_allocated, 315 | .sysctl_mem = sysctl_udp_mem, 316 | .sysctl_wmem = &sysctl_udp_wmem_min, 317 | .sysctl_rmem = &sysctl_udp_rmem_min, 318 | .obj_size = sizeof(struct udp_sock), 319 | #ifdef CONFIG_COMPAT 320 | .compat_setsockopt = compat_udp_setsockopt, 321 | .compat_getsockopt = compat_udp_getsockopt, 322 | #endif 323 | }; 324 | 325 | 326 | References 327 | 328 | 1. http://www.ibm.com/developerworks/linux/library/l-hisock/index.html 329 | 2. http://www.haifux.org/lectures/217/netLec5.pdf 330 | -------------------------------------------------------------------------------- /ftrace.md: -------------------------------------------------------------------------------- 1 | ## FTrace 2 | 3 | Ftrace is kernel built in feature that enables tracing kernel function calls. 4 | 5 | **Quick start** 6 | 7 | change directory to `debugfs` 8 | 9 | # cd /sys/kernel/debug/tracing 10 | 11 | check `current_tracer` 12 | 13 | # cat current_tracer 14 | nop 15 | 16 | enable function trace 17 | 18 | # echo function > current_tracer 19 | 20 | view trace 21 | 22 | # cat trace | less 23 | ... 24 | bash-8329 [000] d... 80604.787465: account_system_time <-__vtime_account_system 25 | bash-8329 [000] d... 80604.787466: cpuacct_account_field <-account_system_time 26 | 27 | disable trace 28 | 29 | # echo nop > current_tracer 30 | 31 | **References** 32 | 33 | * http://lwn.net/Articles/365835/ 34 | * http://lwn.net/Articles/366796/ 35 | -------------------------------------------------------------------------------- /get-kernel-state-using-sysrq.md: -------------------------------------------------------------------------------- 1 | # Get kernel state using sysrq 2 | 3 | 4 | Enable 5 | 6 | 7 | # echo 1 > /proc/sys/kernel/sysrq 8 | 9 | 10 | Generate dump 11 | 12 | 13 | # echo 'm' > /proc/sysrq-trigger 14 | 15 | 16 | Trigger 17 | 18 | 1. m - dump information about memory allocation 19 | 2. 
t - dump thread state information
3. p - dump current CPU registers and flags
4. c - intentionally crash the system (useful for forcing a disk or netdump)
5. s - immediately sync all mounted filesystems
6. u - immediately remount all filesystems read-only
7. b - immediately reboot the machine
8. o - immediately power off the machine (if configured and supported)

View dump

    # vi /var/log/messages

References

1. https://access.redhat.com/kb/docs/DOC-2024
--------------------------------------------------------------------------------
/how-debugger-works.md:
--------------------------------------------------------------------------------
## How debugger works

Debuggers are implemented on top of the system call `ptrace`.

The debugger (in this case the tracer) establishes a parent/child relationship with the process being traced (for the tracee, real_parent and parent then differ) and calls `ptrace` with the pid of the tracee. The tracer can then receive (by calling the system call `wait`/`waitpid`) any signal (except KILL) sent to the tracee.

The debugger asks the kernel to single-step (`PTRACE_SINGLESTEP`) or continue execution of (`PTRACE_CONT`) the traced process.

**breakpoint**

The debugger sets a breakpoint by overwriting the instruction at the desired location (`PTRACE_POKETEXT`) with INT 3 (which raises `SIGTRAP`). It can then examine and modify the contents of memory and registers using `PTRACE_PEEKTEXT`, `PTRACE_PEEKDATA`, `PTRACE_GETREGS`. Finally it rewinds the instruction pointer (`regs.eip -= 1`, `PTRACE_SETREGS`), restores the original instruction and continues execution (`PTRACE_CONT`).

**References**

1. http://eli.thegreenplace.net/2011/01/23/how-debuggers-work-part-1/
2. http://www.linuxjournal.com/article/6100?page=0,2
3. http://linux.die.net/man/2/ptrace
--------------------------------------------------------------------------------
/interrupt-exception.md:
--------------------------------------------------------------------------------
## Interrupt and exception

Exceptions, aka "synchronous interrupts", are produced by the CPU control unit while executing an instruction; the control unit issues them after the instruction terminates. Examples are divide by zero and traps. Programmed exceptions are used to implement syscalls and debugging.

Interrupts, aka "asynchronous interrupts", are generated by other hardware devices at arbitrary times with respect to the CPU clock signals and are routed through the APIC - Advanced Programmable Interrupt Controller.

There is a table of 256 interrupt descriptors, 8 bytes each. Each one holds the address of an interrupt or exception handler. The Interrupt Descriptor Table (IDT) can be located anywhere in memory; its address and length are stored in the idtr register, which is loaded with the lidt assembly instruction.

There are 3 types of descriptors

1. task gate
2. interrupt gate
3. trap gate

**References**

* http://linux.linti.unlp.edu.ar/images/0/0c/ULK3-CAPITULO4-UNNOBA.pdf
--------------------------------------------------------------------------------
/io-cache.md:
--------------------------------------------------------------------------------
## IO cache

Reads and writes to block io (disk) go through the page cache, which is implemented using an LRU/2 scheme. Two lists of pages are maintained: active and inactive.
The Page Frame Reclaiming Algorithm (PFRA) manages cache eviction, which happens only on the inactive list. A page is moved from the inactive to the active list when it is accessed.

Various parameters and the pdflush daemon ensure that cache is evicted to free memory when needed (`dirty_background_ratio`, `dirty_ratio`) and that unsynchronized data does not stay in memory too long (`dirty_expire_centisecs`).

Command to drop cache

    # free && sync && echo 3 > /proc/sys/vm/drop_caches && free

References

1. http://www.westnet.com/~gsmith/content/linux-pdflush.htm
--------------------------------------------------------------------------------
/io-consumption-controller.md:
--------------------------------------------------------------------------------
## IO consumption controller

This is the mechanism used to control the upper limit of IO resources consumed by each process.

References

1. http://stuff.mit.edu:8001/afs/sipb.mit.edu/contrib/linux/Documentation/cgroups/blkio-controller.txt
--------------------------------------------------------------------------------
/io-scheduler.md:
--------------------------------------------------------------------------------
## IO Scheduler

The responsibility of the IO Scheduler is to schedule pending IO requests (reads or writes) against block devices (usually hard disks) "efficiently" with respect to the physical characteristics of the devices. The term schedule in this context means ordering and merging requests and deciding which one goes to the device first. The IO Scheduler has to balance latency against throughput.

Typical IO schedulers are

* Elevator : default in 2.4, no longer used.
* `deadline` : replaces Elevator, guarantees a start service time for a request.
* `noop` : simplest scheduler, using a single queue.
* Anticipatory : idles after a read request to anticipate the next close-by read request. Replaced by CFQ.
* `cfq` - Completely Fair Queuing : default scheduler.

The active IO scheduler is marked with `[]` for each block device

    /home/vagrant# cat /sys/block/sda/queue/scheduler
    noop [deadline] cfq
    /home/vagrant# echo cfq > /sys/block/sda/queue/scheduler
    /home/vagrant# cat /sys/block/sda/queue/scheduler
    noop deadline [cfq]

**Elevator**

It sorts requests by sector number, maintaining a queue of sorted requests and serving them like an elevator to maximize throughput. When new requests with lower sector numbers are constantly inserted at the head of the queue, requests at the end of the queue starve.

**Deadline**

In addition to a queue of requests sorted by block number, it maintains 2 deadline FIFO queues, one for read requests and one for write requests. The deadline for the read queue is 500 ms and for the write queue 5 secs. The IO scheduler looks at the oldest requests; if a deadline has expired, it serves requests from the deadline FIFO queue instead of the sorted queue. The deadline I/O scheduler can outperform the CFQ I/O scheduler for certain multithreaded, database workloads.

**noop**

Noop uses a simple FIFO queue to merge and serve requests; no reordering based on sector number is performed. It assumes that the OS has no productive way to optimize request order due to lack of information about the physical devices.
Examples are SSD disks, where seek time does not depend on the sector number, and Network Attached Storage, RAID and Tagged Command Queuing (TCQ) devices, where the device manages the request queue by itself.

**Anticipatory**

The anticipatory scheduler tries to idle for a short period (e.g. a few ms) after a read operation in anticipation of another close-by read request. It was removed because the same goal can be achieved by tuning `slice_idle` of CFQ.

**CFQ**

Maintains a queue of synchronous requests per process. It allocates a timeslice (based on the IO priority of a process, which is set by the syscall `ioprio_set` or the `ionice` command) for each queue to access the disk. The I/O scheduler visits each queue in a round-robin fashion, servicing requests from the queue until the queue's timeslice is exhausted, or until no more requests remain. In the latter case, the CFQ I/O scheduler will sit idle for a brief period specified by `slice_idle` (by default 10 ms), waiting for a new request on the queue. If the anticipation pays off, the I/O scheduler avoids seeking. If not, the waiting was in vain, and the scheduler moves on to the next process' queue.

Asynchronous requests for all processes are batched together in fewer queues, one per priority.

The CFQ IO scheduler works in a similar way to the CPU scheduler. The details are described in [http://en.wikipedia.org/wiki/Completely_Fair_Scheduler]; it prevents both starvation and hogging.

**References**

* http://en.wikipedia.org/wiki/Noop_scheduler
* http://en.wikipedia.org/wiki/Deadline_scheduler
* http://en.wikipedia.org/wiki/Anticipatory_scheduling
* http://en.wikipedia.org/wiki/CFQ
* http://man7.org/linux/man-pages/man2/ioprio_set.2.html
* http://dom.as/2014/03/31/mongo-io/
* http://www.electricmonk.nl/log/2012/07/30/setting-io-priorities-on-linux/
--------------------------------------------------------------------------------
/jvm-trouble-shooting.md:
--------------------------------------------------------------------------------
# Example of troubleshooting JVM

Identify the JVM process and thread (LWP) that causes a problem, e.g. high CPU usage

    ps -eLo pid,lwp,pcpu,vsz,comm

Run gdb, attach to the process and display the stacktrace of the problematic thread (remember its lwp)

    gdb -p <pid>
    gdb> info threads
    # identify the thread number whose lwp matches the problematic lwp from the ps command
    gdb> thread <thread number> # switch to that thread
    gdb> bt # get stacktrace
    …
--------------------------------------------------------------------------------
/kernel-mode-stack.md:
--------------------------------------------------------------------------------
# Kernel mode stack

Every process (LWP) has its own stack when running in kernel mode. The size of the kernel mode stack is usually 4 or 8 KB (configured at the kernel's compile time). The thread_info structure (usually 56 bytes) is allocated at the bottom of the stack. Note that the stack grows from top to bottom, which means pushing data onto the stack decreases the address in the stack pointer, while popping data from the stack increases it.

One way to get the kernel mode stack size is to use crash

    crash> print sizeof(struct thread_info)
    $5 = 56
    crash> print sizeof(union thread_union)
    $6 = 4096
    crash>

thread_info is stored on the kernel mode stack for reasons of efficiency.
There are only few assembly instructions 21 | required to get address of process which is first field in thread_info struct 22 | 23 | References 24 | 25 | 1. http://notoveryet.wordpress.com/2009/07/09/linux-kernel-bits-which-i-feel-excited-about/ -------------------------------------------------------------------------------- /kernel-panic-oops-call-trace.md: -------------------------------------------------------------------------------- 1 | ## Kernel panic, oops, call trace 2 | 3 | 4 | Unlike kernel panic, which can not be recovered and reboot the system, when the kernel detects a oops’s problem, it 5 | prints kernel state including call trace and kills any offending process. 6 | 7 | 8 | E.g of call trace 9 | 10 | 11 | [] __alloc_pages+0x29f/0x2b1 12 | [] find_or_create_page+0x39/0x72 13 | [] grow_dev_page+0x2a/0x1eb 14 | [] __getblk_slow+0x12e/0x159 15 | [] __getblk+0x3f/0x49 16 | 17 | 18 | Description of each field 19 | 20 | 1. address of function in System.map, kernel use this address to figure out the function name 21 | 2. function name 22 | 3. the offset from start of the function in bytes, we use this to figure out which line was currently running 23 | 4. the size of the function in bytes 24 | 25 | To find the line of function, we need to install crash 26 | 27 | crash> bt 28 | PID: 2489 TASK: d3f86aa0 CPU: 0 COMMAND: "top" 29 | #0 [cd264af4] schedule at c061f412 30 | #1 [cd264b6c] schedule_timeout at c061fb54 31 | #2 [cd264b90] do_select at c0488ac3 32 | #3 [cd264e34] core_sys_select at c0488dc6 33 | #4 [cd264f74] sys_select at c048938d 34 | #5 [cd264fb8] system_call at c0404f44 35 | EAX: ffffffda EBX: 00000001 ECX: bfde8ddc EDX: 00000000 36 | DS: 007b ESI: 00000000 ES: 007b EDI: 08056a00 37 | SS: 007b ESP: bfde8b3c EBP: bfde9488 38 | CS: 0073 EIP: 002ee402 ERR: 0000008e EFLAGS: 00000246 39 | crash> sym c048938d 40 | c048938d (T) sys_select+149 ../debug/kernel-2.6.18/linux-2.6.18.i686/fs/select.c: 407 41 | crash> 42 | 43 | Description of each field 44 | 45 | 1. address of stack at the time of invoking next upper function, by looking at this we know the parameters passing to 46 | 2. name of kernel function 47 | 3. address of an instruction of the function 48 | 49 | system_call & sys_select means the task make call to select the user stack trace would be 50 | 51 | sh-3.2# pstack 2489 52 | #0 0x00d51402 in __kernel_vsyscall () 53 | #1 0x007d9c3d in ___newselect_nocancel () from /lib/libc.so.6 54 | #2 0x08051372 in signal_name_to_number () 55 | #3 0x00724e9c in __libc_start_main () from /lib/libc.so.6 56 | #4 0x08049741 in signal_name_to_number () 57 | 58 | 59 | References 60 | 61 | 1. http://en.wikipedia.org/wiki/System.map 62 | 2. http://madwifi-project.org/wiki/DevDocs/KernelOops 63 | 3. http://magazine.redhat.com/2007/08/15/a-quick-overview-of-linux-kernel-crash-dump-analysis/ 64 | 4. https://access.redhat.com/kb/docs/DOC-2024 -------------------------------------------------------------------------------- /memory-management-unit.md: -------------------------------------------------------------------------------- 1 | ## Memory Management Unit 2 | 3 | 4 | The MMU is hardware that translate logical memory address (used in machine instruction) to the physical one. There are 5 | two step from logical address to linear address by segmentation unit and from linear address to physical address by 6 | paging unit. 7 | 8 | 9 | OS Memory Management Subsystem manages memory by 10 | 11 | 1. 
setup and maintains a metadata about physical memory’ pages (global page descriptor table mem_map) so it will know if 12 | a memory’ page is in used or free, is modified i.e. dirty or clean. 13 | 2. set up/manipulate relevant virtual to physical memory translation tables (per process page directory and table) 14 | 15 | 16 | Linux does not make use of segmentation so the translation table for segmentation is in fact common for all processes. 17 | However the translation table of paging unit is per process so changes each time at context switch. 18 | 19 | 20 | 32 bit linux kernel can use more than 4GB RAM (one process still has limitation of 4 GB, 1GB for kernel mode, 3GB for 21 | user mode) if PAE (Physical Address Extension) is enabled in BIOS 22 | 23 | grep -i PAE /proc/cpuinfo 24 | 25 | and kernel 26 | 27 | grep -i HIGHMEM /boot/config-’uname -r’ 28 | 29 | 30 | Kernel maintain a meta information about memory pages. Due limitation of some hardware architectures (e.g. x86), pages 31 | are grouped in 3 memory zones DMA, NORMAL, HIGH. To transfer data with devices kernel must use DMA zone. 32 | 33 | 34 | Due to limitation of permanent mapping, unlike NORMAL zone, the memory in HIGH zone is not permanently mapped (put into 35 | page global directory) into kernel address space, i.e. is not possible to access directly but it requires first to 36 | allocate then map to kernel using kmap, which returns address of the allocated page. After usage this should be un map 37 | using kunmap. 38 | The common pattern for using high memory is e.g. from reading file 39 | 40 | 41 | int file_read_actor() 42 | ... 43 | kaddr = kmap(page); 44 | left = __copy_to_user(desc->arg.buf, kaddr + offset, size); 45 | kunmap(page); 46 | ... 47 | 48 | or from reading socket 49 | 50 | int skb_copy_datagram_iovec() 51 | ... 52 | vaddr = kmap(page); 53 | err = memcpy_toiovec(to, vaddr + frag->page_offset + 54 | offset - start, copy); 55 | kunmap(page); 56 | … 57 | 58 | 59 | The 64 bit OS version does not have this issue, all memory pages are permanently mapped into NORMAL zone and kmap just return already mapped linear address while kunmap do nothing 60 | 61 | 62 | void *kmap(struct page *page) 63 | { 64 | might_sleep(); 65 | if (!PageHighMem(page)) 66 | return page_address(page); 67 | return kmap_high(page); 68 | } 69 | 70 | 71 | void kunmap(struct page *page) 72 | { 73 | if (in_interrupt()) 74 | BUG(); 75 | if (!PageHighMem(page)) 76 | return; 77 | kunmap_high(page); 78 | } 79 | 80 | 81 | References 82 | 83 | 1. http://www.spack.org/wiki/LinuxRamLimits 84 | 2. http://www.redhat.com/magazine/001nov04/features/vm/ 85 | 3. http://blogs.oracle.com/gverma/2008/03/redhat_linux_kernels_and_proce_1.html 86 | 4. http://lists.us.dell.com/pipermail/linux-poweredge/2005-August/022327.html 87 | 5. 
http://en.wikipedia.org/wiki/Physical_Address_Extension -------------------------------------------------------------------------------- /memory-statistics.md: -------------------------------------------------------------------------------- 1 | # Memory statistics 2 | 3 | cat /proc/meminfo 4 | 5 | MemTotal: 514916 kB 6 | MemFree: 142676 kB 7 | Buffers: 57068 kB 8 | Cached: 214008 kB # Page cache 9 | SwapCached: 4 kB # Cache of swap file 10 | Active: 217180 kB # Active/inactive pages used by PFCA 11 | Inactive: 99832 kB # 12 | HighTotal: 0 kB 13 | HighFree: 0 kB 14 | LowTotal: 514916 kB 15 | LowFree: 142676 kB 16 | SwapTotal: 1048568 kB 17 | SwapFree: 1048552 kB 18 | Dirty: 0 kB # Cached file is modified and need to write to disk 19 | Writeback: 0 kB # Being written to disk 20 | AnonPages: 45932 kB 21 | Mapped: 13576 kB 22 | Slab: 47588 kB 23 | PageTables: 1676 kB 24 | NFS_Unstable: 0 kB 25 | Bounce: 0 kB 26 | CommitLimit: 1306024 kB 27 | Committed_AS: 170724 kB 28 | VmallocTotal: 507896 kB 29 | VmallocUsed: 5440 kB 30 | VmallocChunk: 502304 kB 31 | HugePages_Total: 0 32 | HugePages_Free: 0 33 | HugePages_Rsvd: 0 34 | Hugepagesize: 4096 kB 35 | 36 | 37 | **Cache** 38 | 39 | physical memory used to cache files usually result from calling mmap 40 | 41 | fs/proc/proc_misc.c 42 | ... 43 | cached = global_page_state(NR_FILE_PAGES) - 44 | total_swapcache_pages - i.bufferram; 45 | ... 46 | 47 | mm/filemap.c 48 | ... 49 | int add_to_page_cache(struct page *page, struct address_space *mapping, 50 | pgoff_t offset, gfp_t gfp_mask) 51 | { 52 | ... 53 | __inc_zone_page_state(page, NR_FILE_PAGES); 54 | 55 | 56 | **SwapCached** 57 | 58 | When a page is swapped out, it goes through the same process as when writing a block of file to disk. 59 | So first go to cache and then the pdflush wil write it to block device. The benefit is possible merging 60 | many pages into single write (less IO operation). 61 | 62 | **Active/Inactive** 63 | 64 | Page Frame Reclaiming Algorithm maintains LRU lists of inactive and active page frames (physical memory page) 65 | that can be either assigned to processes or used as cache (excluding free pages). 66 | 67 | fs/proc/proc_misc.c 68 | ... 69 | get_zone_counts(&active, &inactive, &free); 70 | 71 | **Dirty** 72 | 73 | Amount of memory modified by processes and need to be written to files at some point of time. After calling sync, 74 | this should be zero 75 | 76 | fs/proc/proc_misc.c 77 | ... 78 | K(global_page_state(NR_FILE_DIRTY)), 79 | 80 | 81 | **Writeback** 82 | 83 | Amount of memory being written to files. Should be zero, if the disk is fast enough. 84 | 85 | fs/proc/proc_misc.c 86 | ... 87 | K(global_page_state(NR_WRITEBACK)), 88 | 89 | 90 | fs/buffer.c 91 | ... 92 | static int __block_write_full_page(struct inode *inode, struct page *page, 93 | get_block_t *get_block, struct writeback_control *wbc) 94 | ... 95 | set_page_writeback(page); 96 | 97 | **References** 98 | 99 | 1. http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/4/html/Reference_Guie/s2-proc-meminfo.html 100 | 2. fs/proc/proc_misc.c 101 | 3. mm/filemap.c 102 | -------------------------------------------------------------------------------- /network-congestion-management.md: -------------------------------------------------------------------------------- 1 | ## Network congestion management for incoming frames 2 | 3 | There are two technique applied to network congestion management mainly to reduce CPU load. 4 | 5 | 1. 
reduce the number of interrupts: this is done in the driver's interrupt handler by processing multiple frames in one interrupt activation and by polling

2. discard incoming frames as soon as possible: if the driver uses NAPI, which means it manages its own queue of incoming frames, it is up to the driver to handle this. Otherwise the kernel handles it in the `netif_rx` function by watching the average length of the per-CPU queue.
--------------------------------------------------------------------------------
/network-data-structure.md:
--------------------------------------------------------------------------------
# Network data structure

The network implementation uses two fundamental data structures: the socket buffer `sk_buff` and `net_device`.

`sk_buff` is where packets are stored and processed, while `net_device` keeps information about the hardware device and pointers to the driver's functions that put data into an sk_buff as well as get data out of it.

Another important structure is `softnet_data`, which is per CPU and contains the information needed for communication between the top half and the bottom half.

`softnet_data` has the following noticeable fields

1. poll_list is the list of devices that are polled because they have a nonempty receive queue
2. output_queue is the list of devices that have something to transmit
--------------------------------------------------------------------------------
/nfs-client-cache-coherence.md:
--------------------------------------------------------------------------------

## NFS client cache

The NFS client caches both data (the contents of a file) and metadata (file names and directories). When mounted with options that disable the data and metadata caches, every operation on the NFS file system sends a request to the NFS server and blocks until the server notifies completion, so operations basically behave as they would on a local file system.

**close-to-open cache consistency**

Enabling the cache makes things faster, so by default an NFS file system is mounted with caching enabled. Unless we store database files on NFS or allow many NFS clients to read/write the same file concurrently, we usually leave caching enabled. The question is: does the NFS client implementation guarantee cache coherence in any form? It turns out that the NFS v3 client introduced the concept of close-to-open consistency. It works in a scenario like this:

A client creates a new file, closes it and tells another client the file name (by some other means). The other client may or may not see the file, but if it sees the file and reads it, it gets a complete file. That is pretty good behavior.

The NFS v3 client achieves this by flushing the data and metadata caches when closing the file. The reader fetches the file's metadata from the NFS server to get the most up-to-date version and invalidates any cached data when it sees that the metadata has changed (perhaps by looking at the modification time).
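For experimentation, the caching behaviour can be adjusted from the client side with mount options; a hedged sketch (option names are from nfs(5), the export path is hypothetical):

    # default mount: attribute caching and close-to-open consistency enabled
    mount -t nfs server:/export /mnt/nfs
    # disable attribute caching entirely; every access revalidates with the server (slow but strict)
    mount -t nfs -o noac server:/export /mnt/nfs
    # keep caching but shorten the attribute cache timeout to 3 seconds
    mount -t nfs -o actimeo=3 server:/export /mnt/nfs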
13 | 14 | **References** 15 | 16 | * https://www.avidandrew.com/understanding-nfs-caching.html 17 | * https://www.sebastien-han.fr/blog/2012/12/18/noac-performance-impact-on-web-applications/ 18 | * https://serverfault.com/questions/611044/linux-read-disk-cache-and-nfs 19 | * http://nfs.sourceforge.net/ 20 | -------------------------------------------------------------------------------- /observe-kernel-state-system-tap.md: -------------------------------------------------------------------------------- 1 | ## Observe Kernel state - systemtap 2 | 3 | stap - System Tap is tracing tool for both user and kernel. System Tap's goal is to provide full system 4 | observability on production systems. On redhat System Tap requires kernel development, debug info and gcc compiler. 5 | 6 | Example of tracing syscall open 7 | 8 | [root@localhost tap]# cat open.stap 9 | probe syscall.open # whenever any process make syscall open then do at the beginning 10 | { 11 | printf("%s(%d) open(%s)\n", execname(),pid(),argstr) # print name, pid and arguments passed to open 12 | } 13 | probe timer.ms(4000) # watch only 4 seconds 14 | { 15 | exit() 16 | } 17 | 18 | [root@localhost tap]# stap open.stp 19 | pcscd(2487) open("/proc/bus/usb", O_RDONLY|O_DIRECTORY|O_LARGEFILE|O_NONBLOCK) 20 | pcscd(2487) open("/proc/bus/usb/002", O_RDONLY|O_DIRECTORY|O_LARGEFILE|O_NONBLOCK) 21 | pcscd(2487) open("/proc/bus/usb/002/001", O_RDWR) 22 | pcscd(2487) open("/proc/bus/usb/002/001", O_RDWR) 23 | 24 | 25 | The different between stap and strace is the strace perform for one program while stap for all programs. 26 | 27 | References 28 | 29 | 1. http://sourceware.org/systemtap/tutorial/Introduction.html 30 | 2. http://www.redbooks.ibm.com/redpapers/pdfs/redp4469.pdf 31 | 3. http://lwn.net/Articles/315022/ 32 | 33 | -------------------------------------------------------------------------------- /observe-kernel-state-using-crash.md: -------------------------------------------------------------------------------- 1 | ## Observe Kernel state using crash 2 | 3 | The most valuable tool for observing kernel state is crash, which can be used display kernel call trace, 4 | kernel memory structure, signal, irq, net socket, files 5 | 6 | Example of examine kernel stack trace 7 | 8 | crash>ps 9 | PID PPID CPU TASK ST %MEM VSZ RSS COMM 10 | 0 0 0 c068e3c0 RU 0.0 0 0 [swapper] 11 | 1 0 0 dfc01aa0 IN 0.1 2160 688 init 12 | 2 1 0 dfc01550 IN 0.0 0 0 [migration/0] 13 | 3 1 0 dfc01000 IN 0.0 0 0 [ksoftirqd/0] 14 | 15 | 16 | crash> set 1 17 | PID: 1 18 | COMMAND: "init" 19 | TASK: dfc01aa0 [THREAD_INFO: dfc02000] 20 | CPU: 0 21 | STATE: TASK_INTERRUPTIBLE 22 | crash> bt 23 | PID: 1 TASK: dfc01aa0 CPU: 0 COMMAND: "init" 24 | #0 [dfc02af4] schedule at c061f412 25 | #1 [dfc02b6c] schedule_timeout at c061fb54 26 | #2 [dfc02b90] do_select at c0488ac3 27 | #3 [dfc02e34] core_sys_select at c0488dc6 28 | #4 [dfc02f74] sys_select at c048938d 29 | #5 [dfc02fb8] system_call at c0404f44 30 | EAX: 0000008e EBX: 0000000b ECX: bfc8e240 EDX: 00000000 31 | DS: 007b ESI: 00000000 ES: 007b EDI: bfc8e370 32 | SS: 007b ESP: bfc8e20c EBP: bfc8e508 33 | CS: 0073 EIP: 00909402 ERR: 0000008e EFLAGS: 00000246 34 | 35 | 36 | crash>task 37 | PID: 1 TASK: dfc01aa0 CPU: 0 COMMAND: "init" 38 | struct task_struct { 39 | state = 1, 40 | thread_info = 0xdfc02000, 41 | usage = { 42 | counter = 2 43 | }, 44 | flags = 4194560, 45 | lock_depth = -1, 46 | load_weight = 128, 47 | prio = 115, 48 | static_prio = 120, 49 | normal_prio = 115, 50 | run_list = { 51 | next = 0x100100, 52 | prev = 
0x200200 53 | }, 54 | 55 | 56 | crash> struct thread_info 0xdfc02000 57 | struct thread_info { 58 | task = 0xdfc01aa0, 59 | exec_domain = 0xc0693660, 60 | flags = 0, 61 | status = 0, 62 | cpu = 0, 63 | preempt_count = 0, 64 | addr_limit = { 65 | seg = 3221225472 66 | }, 67 | sysenter_return = 0x909410, 68 | 69 | 70 | References 71 | 72 | 1. http://people.redhat.com/anderson/crash_whitepaper/ 73 | 2. http://codeascraft.etsy.com/2012/03/30/kernel-debugging-101/ - very practical useful using of crash 74 | -------------------------------------------------------------------------------- /page-coloring.md: -------------------------------------------------------------------------------- 1 | # Page coloring 2 | 3 | Page coloring is a performance optimization designed to ensure that accesses to contiguous pages 4 | in virtual memory make the best use of the processor cache. 5 | 6 | Memory allocation unit of OS makes effort to avoid page aliasing (same page frame is placed more than one 7 | in cache because different virtual pages are mapped to the same page frame). 8 | 9 | References 10 | 11 | 1. http://en.wikipedia.org/wiki/Cache_coloring 12 | 2. http://en.wikipedia.org/wiki/CPU_cache 13 | 3. http://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/page-coloring-optimizations.html 14 | 4. http://lwn.net/Articles/252125/ 15 | -------------------------------------------------------------------------------- /paging-unit.md: -------------------------------------------------------------------------------- 1 | ## Paging unit 2 | 3 | A memory page can be stored on disk or in a page frame (i.e. physical memory page) of physical memory. 4 | Paging unit of the processor works with fixed size page frame. Paging unit does translation linear address 5 | into physical address using Page table that must be initialize by the OS before enabling Paging unit. 6 | 7 | The 32 bit linear address consists of 8 | 9 | 1. Directory: The most significant 10 bits 10 | 2. Table: The intermediate 10 bits 11 | 3. Offset: The least significant 12 bits - imply that pagesize is 4 KB 12 | 13 | Paging unit it is enabled by setting the PG flag of a control register named cr0. When PG = 0, linear 14 | addresses are interpreted as physical addresses. 15 | 16 | The physical address of Page directory is stored in cr3 register. As each OS process see memory as contiguous 17 | region dedicated to it, the process needs its own Page directory, which is in mm field of task struct. 18 | 19 | crash> task | grep mm 20 | mm = 0xcd742e40, 21 | active_mm = 0xcd742e40, 22 | … 23 | struct mm_struct 0xcd742e40 24 | struct mm_struct { 25 | mmap = 0xce29d128, 26 | mm_rb = { 27 | rb_node = 0xcd64323c 28 | }, 29 | ... 30 | free_area_cache = 3086761984, 31 | pgd = 0xcd743000, 32 | ... 33 | 34 | The Page directory of a process map linear address in both user mode and kernel mode. It is divided into two parts 35 | 36 | 1. user part that maps logical address from 0x0-0xBFFFFFFF is different to each process 37 | 2. kernel part that maps logical address from 0xC0000000 to 0xFFFFFFFF is the same for all processes and equals 38 | master kernel Page Global Directory. 39 | 40 | When a process is created the OS make sure that master kernel Page Global Directory is propagated in the Page 41 | directory of the new created process. 42 | The code running in user mode can only access the first part while running in kernel mode can access both parts. 
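To make the 10/10/12 address split above concrete, here is a minimal, hedged user-space sketch (plain 32-bit, non-PAE layout assumed; illustration code, not kernel code) that extracts the page directory index, page table index and offset from a linear address:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t linear = 0xC0123456;              /* arbitrary example address  */
        uint32_t dir    = (linear >> 22) & 0x3FF;  /* most significant 10 bits   */
        uint32_t table  = (linear >> 12) & 0x3FF;  /* intermediate 10 bits       */
        uint32_t offset =  linear        & 0xFFF;  /* least significant 12 bits  */

        printf("pgd index=%u pte index=%u offset=0x%x\n", dir, table, offset);
        return 0;
    }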
43 | 44 | Because kernel processes does not need to access to the user part, it can use Page table of any processes so 45 | switch to kernel thread does not need to change Page directory table. This is a reason why task struct has 46 | both mm and active_mm pointers. An kernel thread always has mm=NULL while in user process mm equals active_mm. 47 | 48 | crash> task 49 | PID: 0 TASK: c068e3c0 CPU: 0 COMMAND: "swapper" 50 | struct task_struct { 51 | state = 0, 52 | thread_info = 0xc0708000, 53 | usage = { 54 | counter = 2 55 | }, 56 | ... 57 | mm = 0x0, 58 | active_mm = 0x0, 59 | ... 60 | 61 | Beside Address Translation Table, kernel must know the status of physical page frame whether a page is free, 62 | used by any process or has been modified. It keep these information in page descriptor table mem_map. 63 | Each page frame is represented by a 32 bytes descriptor, which mean around 1% of physical memory is waste 64 | (Linux use 4KB size page frame). 65 | 66 | Linear memory accessible from each process can be viewed by looking at its mapping in proc file system. 67 | Each segment has start and end and permission. 68 | 69 | [root@localhost kernel-internal-stuffs]# cat /proc/self/maps 70 | 008e8000-00903000 r-xp 00000000 fd:00 1737604 /lib/ld-2.5.so 71 | 00903000-00904000 r-xp 0001a000 fd:00 1737604 /lib/ld-2.5.so 72 | 00904000-00905000 rwxp 0001b000 fd:00 1737604 /lib/ld-2.5.so 73 | 00907000-00a5a000 r-xp 00000000 fd:00 1737605 /lib/libc-2.5.so 74 | 00a5a000-00a5c000 r-xp 00153000 fd:00 1737605 /lib/libc-2.5.so 75 | 00a5c000-00a5d000 rwxp 00155000 fd:00 1737605 /lib/libc-2.5.so 76 | 00a5d000-00a60000 rwxp 00a5d000 00:00 0 77 | 00a82000-00a83000 r-xp 00a82000 00:00 0 [vdso] 78 | 08048000-0804d000 r-xp 00000000 fd:00 1678405 /bin/cat 79 | 0804d000-0804e000 rw-p 00004000 fd:00 1678405 /bin/cat 80 | 093b7000-093d8000 rw-p 093b7000 00:00 0 [heap] 81 | b7d3c000-b7f3c000 r--p 00000000 fd:00 714328 /usr/lib/locale/locale-archive 82 | b7f3c000-b7f3e000 rw-p b7f3c000 00:00 0 83 | bf820000-bf835000 rw-p bffe9000 00:00 0 [stack] 84 | 85 | or using pmap pid 86 | 87 | [root@localhost kernel-internal-stuffs]# pmap -x 2982 88 | 2982: cscope 89 | Address Kbytes RSS Dirty Mode Mapping 90 | 004af000 4 4 0 r-x-- [ anon ] 91 | 008e8000 108 84 0 r-x-- ld-2.5.so 92 | 00903000 4 4 4 r-x-- ld-2.5.so 93 | 00904000 4 4 4 rwx-- ld-2.5.so 94 | 00907000 1356 440 0 r-x-- libc-2.5.so 95 | 00a5a000 8 8 4 r-x-- libc-2.5.so 96 | 00a5c000 4 4 4 rwx-- libc-2.5.so 97 | 00a5d000 12 12 12 rwx-- [ anon ] 98 | 00a8d000 12 8 0 r-x-- libdl-2.5.so 99 | 00a90000 4 4 0 r-x-- libdl-2.5.so 100 | 00a91000 4 4 4 rwx-- libdl-2.5.so 101 | 05d01000 256 152 0 r-x-- libncurses.so.5.5 102 | 05d41000 32 12 4 rwx-- libncurses.so.5.5 103 | 05d49000 4 4 4 rwx-- [ anon ] 104 | 08048000 300 92 0 r-x-- cscope 105 | 08093000 4 4 4 rw--- cscope 106 | 08094000 104 44 44 rw--- [ anon ] 107 | 09d55000 1208 1208 1208 rw--- [ anon ] 108 | b7fc7000 8 8 8 rw--- [ anon ] 109 | b7fd5000 12 12 12 rw--- [ anon ] 110 | bfc78000 84 32 32 rw--- [ stack ] 111 | -------- ------- ------- ------- ------- 112 | total kB 3532 - - - 113 | 114 | References 115 | 116 | 1. https://patchwork.kernel.org/patch/17020/ 117 | -------------------------------------------------------------------------------- /preemption.md: -------------------------------------------------------------------------------- 1 | ## Preemption 2 | 3 | **User preemption** 4 | 5 | User preemption can occur 6 | 7 | 1. When returning to user-space from a system call 8 | 2. 
When returning to user-space from an interrupt handler 9 | 10 | the kernel provides the `need_resched` flag (one bit in thread_info.flags) to signify whether a reschedule 11 | should be performed. This flag is set by `scheduler_tick()` when a process should be preempted, and by 12 | `try_to_wake_up()` when a process that has a higher priority than the currently running process is awakened. 13 | 14 | static inline int need_resched(void) 15 | { 16 | return unlikely(test_thread_flag(TIF_NEED_RESCHED)); 17 | } 18 | 19 | Upon returning to user-space or returning from an interrupt, the `need_resched` flag is checked. If it is set, 20 | the kernel invokes the `schedule()` which may result in switching context to other task (i.e. other task runs on CPU). 21 | 22 | The flag is per-process, and not simply global, because it is faster to access a value in the process 23 | descriptor (because of the speed of current and high probability of it being cache hot) than a global variable. 24 | 25 | Historically, the flag was global before the 2.2 kernel. In 2.2 and 2.4, the flag was an int inside the 26 | task_struct. In 2.6, it was moved into a single bit of a special flag variable inside the thread_info structure. 27 | 28 | crash> struct task_struct { 29 | state = 0, 30 | thread_info = 0xc2546000, 31 | ... 32 | crash> struct thread_info 0xc2546000 33 | struct thread_info { 34 | task = 0xd1447550, 35 | exec_domain = 0xc0693660, 36 | flags = 128, 37 | ... 38 | 39 | **Kernel Preemption** 40 | 41 | 42 | Non preemptive kernel does not switch a task when it is in kernel mode. Task context switch only happens when the task 43 | voluntarily call `schedule()` (i.e. cooperative kernel) or upon return from kernel mode to user mode (from system call or 44 | interrupt handler) 45 | 46 | Preemptive kernel however can preempt a task at kernel mode if it is safe to reschedule, which usually means the task 47 | holding no lock. The task `preempt_count` increases by 1 when a lock is acquired by a task and decrements by 1 when a 48 | lock is released. 49 | 50 | crash> struct task_struct { 51 | state = 0, 52 | thread_info = 0xc2546000, 53 | ... 54 | 55 | crash> struct thread_info 0xc2546000 56 | struct thread_info { 57 | task = 0xd1447550, 58 | exec_domain = 0xc0693660, 59 | flags = 128, 60 | status = 0, 61 | cpu = 0, 62 | preempt_count = 0, 63 | ... 64 | 65 | 66 | Kernel preemption can occur 67 | 68 | 1. When an interrupt handler exits, before returning to kernel-space. This is a case of an interrupt arises during a syscall 69 | 2. When kernel code becomes preemptible again, which means all the locks that the current task is holding are released, preempt_count returns to zero. The macro preempt_enable() which is called to check whether need_resched is set. If so, the `schedule()` is invoked. 70 | 3. If a task in the kernel explicitly calls `schedule()` 71 | 4. If a task in the kernel blocks which results in a call to `schedule()` 72 | -------------------------------------------------------------------------------- /process-execution.md: -------------------------------------------------------------------------------- 1 | ## Process execution 2 | 3 | The processor can execute instruction of a process in either user mode (Current Privilege Level CPL=3 in 2 bits of cs) 4 | or kernel mode (CPL=0). The task struct holds information about process context in both user mode and kernel mode. 5 | 6 | The process context switch can occur only in kernel mode. 
The contents of all registers used by a process in User Mode 7 | have already been saved on the Kernel Mode stack before performing process switching. This includes the contents of the 8 | ss and esp pair that specifies the User Mode stack pointer address 9 | 10 | The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called 11 | the hardware context . The hardware context is a subset of the process execution context, which includes all information 12 | needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process 13 | descriptor, while the remaining part is saved in the Kernel Mode stack 14 | 15 | When the hardware interrupt raises, processor save hardware context in current stack (can be either in user mode or 16 | kernel mode of any processes) and processor jump to interrupt handler upon return it restore hardware context from current 17 | stack and continue its work. 18 | 19 | Interrupt can be nested, which mean during an interrupt handler other interrupt can occur. Because of that an interrupt 20 | handler must never block, that is, no process switch can take place while an interrupt handler is running. 21 | 22 | In fact, all the data needed to resume a nested kernel control path is stored in the Kernel Mode stack, which is 23 | tightly bound to the current process. So if the kernel switch to other process , Kernel mode stack will be changed 24 | and the code of the interrupt handler messes up. 25 | 26 | References 27 | 28 | 1. http://stackoverflow.com/questions/4732409/context-switch-in-interrupt-handlers 29 | 2. http://stackoverflow.com/questions/1053572/why-kernel-code-thread-executing-in-interrupt-context-cannot-sleep 30 | -------------------------------------------------------------------------------- /process-scheduler.md: -------------------------------------------------------------------------------- 1 | ## Process's scheduler 2 | 3 | Process scheduler is responsible for finding the next eligible task and switching to the context of that task. It has to balance between responsiveness and efficency. The cost of context switch is not negligible. 4 | 5 | The option for tuning in case of `CFS` is `/proc/sys/kernel/sched_min_granularity_ns` and `/proc/sys/kernel/sched_latency_ns`. 6 | 7 | # cat /proc/sys/kernel/sched_min_granularity_ns 8 | 750000 9 | # cat /proc/sys/kernel/sched_latency_ns 10 | 6000000 11 | 12 | Scheduler maintains per CPU run queue and picks next task for particular CPU from its run queue. When a task calls `scheduler()` function, the scheduler code processes run queue associated with the CPU of the calling task. 13 | 14 | **Scheduluer and policy** 15 | 16 | There are mutiple schedulers, each is responsible for a type of tasks. 17 | 18 | stop_sched_class → rt_sched_class → fair_sched_class → idle_sched_class 19 | 20 | * `stop_sched_class` is for kernel tasks that add/remove CPUs dynamically. 21 | * `rt_sched_class` is for real time tasks. 22 | * `fair_sched_class` is for ordinary tasks. Most of tasks are of this type 23 | * `ide_sched_class` is for lowest priority tasks that only run when there are no other runable tasks. 24 | 25 | Linux provide syscall for assign a process to a specific scheduler policy `sched_setscheduler`. 
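For example, a process can move itself into the batch class with that syscall (a minimal sketch; for non-realtime policies `sched_priority` must be 0, and error handling is kept short):

    #define _GNU_SOURCE         /* SCHED_BATCH is Linux specific */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param param = { .sched_priority = 0 };

        if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {   /* 0 = calling process */
            perror("sched_setscheduler");
            return 1;
        }
        printf("policy is now %d\n", sched_getscheduler(0));
        return 0;
    }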
The following are policies applicable for non realtime scheduling 26 | 27 | * `SCHED_OTHER`: the standard round-robin time-sharing policy, this is served by `fair_sched_class` 28 | * `SCHED_BATCH`: for "batch" style execution of processes, this is served by `fair_sched_class` 29 | * `SCHED_IDLE`: for running very low priority background jobs, this is served by `idle_sched_class` 30 | 31 | The realtime scheduling has following policies 32 | 33 | * `SCHED_FIFO`: fifo scheduling 34 | * `SCHED_RR`: round robind scheduling 35 | 36 | Scheduler class assigned to a process can be viewed using `class` field in `ps -o` command 37 | 38 | # ps -e -o pid,cmd,class 39 | PID CMD CLS 40 | 1 /sbin/init TS 41 | 2 [kthreadd] TS 42 | ... 43 | 7 [watchdog/0] FF 44 | 45 | where 46 | 47 | TS SCHED_OTHER B SCHED_BATCH 48 | FF SCHED_FIFO ISO SCHED_ISO 49 | RR SCHED_RR IDL SCHED_IDLE 50 | 51 | We can change scheduler class using command `chrt` 52 | 53 | # ps -o pid,cmd,class,priority -p 1003 54 | PID CMD CLS PRI 55 | 1003 cron TS 20 56 | # chrt -p -b 0 1003 57 | # ps -o pid,cmd,class,priority -p 1003 58 | PID CMD CLS PRI 59 | 1003 cron B 20 60 | 61 | `chrt` can'nt change priority for non realtime scheduler's class. We need to use `renice` 62 | 63 | # renice 5 -p 1003 64 | 1003 (process ID) old priority 10, new priority 5 65 | # ps -o pid,cmd,class,priority,nice -p 1003 66 | PID CMD CLS PRI NI 67 | 1003 cron B 25 5 68 | 69 | **CFS** 70 | 71 | Linux kernel from version 2.6.23 use Comppletly Faire Scheduler - CFS for process scheduling see [http://en.wikipedia.org/wiki/Completely_Fair_Scheduler]. 72 | 73 | `CFS` maintains per-task `vruntime`. As task is on CPU, this value get increased by time spending on CPU. The scheduler function check and pick task with minimum `vruntime` and assign CPU to it. 74 | 75 | `priority` (i.e `nice`) is used as weight when adjusting `vruntime`. 76 | 77 | **References** 78 | 79 | * http://www.linuxjournal.com/magazine/completely-fair-scheduler 80 | * https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt 81 | -------------------------------------------------------------------------------- /process-session-process-group-thread-group.md: -------------------------------------------------------------------------------- 1 | ## Relationship between processes, Session, process group and thread group 2 | 3 | Kernel use this information for delivering signals to a collection of processes/threads. 4 | 5 | When a process is created, it inherits group id of its parent. It can later change its group or create new 6 | group and becomes the group leader by calling setpgid. The group leader’s group id equals its pid. 7 | 8 | `kill(pid_t pid, int sig)` with negative `pid` will send signal to all members of group `-pid`. Similarly `waitpid` 9 | with negative `pid` wil wait for any member of group `-pid`. 10 | 11 | Each group belong to a unique session. First process of a session (also group leader) is the session leader, 12 | its `pid` equals session id. The session can be changed by calling `setsid` but is usually called in login process. 13 | Session has controlling tty (`/dev/tty`), signals generated by `CTRL-C/Z` are delivered to these processes of the 14 | foreground group of that session. 15 | 16 | When a process is created, its ordinary thread id equals process id. When additional thread is created with 17 | CLONE_THREAD it has different thread id but same process id, which is called thread group id. The `gettid` returns 18 | thread id while getpid returns thread group id. 
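A small illustration of the difference (a sketch assuming a glibc without the `gettid()` wrapper, hence the raw syscall):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        /* same getpid() (thread group id), different gettid() */
        printf("worker: pid=%d tid=%ld\n", getpid(), syscall(SYS_gettid));
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        printf("main:   pid=%d tid=%ld\n", getpid(), syscall(SYS_gettid));
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }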
`tkill` is used to send signal to a thread. 19 | 20 | References 21 | 22 | 1. http://www.win.tue.nl/~aeb/linux/lk/lk-10.html#ss10.2 23 | 2. http://linux.die.net/man/2/kill 24 | 3. http://linux.die.net/man/2/waitpid 25 | 4. http://linux.die.net/man/2/clone 26 | -------------------------------------------------------------------------------- /process-state.md: -------------------------------------------------------------------------------- 1 | ## Process state 2 | 3 | Process state can be observed by `ps` command 4 | 5 | root@gerrit01:/home/vagrant# ps auxf 6 | USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 7 | root 2 0.0 0.0 0 0 ? S Mar19 0:00 [kthreadd] 8 | root 3 0.0 0.0 0 0 ? S Mar19 0:01 \_ [ksoftirqd/0] 9 | root 5 0.0 0.0 0 0 ? S< Mar19 0:00 \_ [kworker/0:0H] 10 | 11 | Process state can be one of the following 12 | 13 | PROCESS STATE CODES 14 | D uninterruptible sleep (usually IO) 15 | R running or runnable (on run queue) 16 | S interruptible sleep (waiting for an event to complete) 17 | T stopped, either by a job control signal or because it is being traced 18 | W paging (not valid since the 2.6.xx kernel) 19 | X dead (should never be seen) 20 | Z defunct ("zombie") process, terminated but not reaped by its parent 21 | For BSD formats and when the stat keyword is used, additional characters may be displayed: 22 | < high-priority (not nice to other users) 23 | N low-priority (nice to other users) 24 | L has pages locked into memory (for real-time and custom IO) 25 | s is a session leader 26 | l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do) 27 | + is in the foreground process group 28 | 29 | The most interesting is uninterruptible sleep `D`, which can be watched using 30 | 31 | $while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 1; done 32 | 33 | Process in in uninterruptible sleep, means that it runs in kernel mode and can't be preempted (see [preemption](preemption.md)). 34 | The CPU is not used but can't be reused to run other tasks because kernel either is holding lock or 35 | its data structure is not protected. It is a sign of deficiency of certain parts of the kernel and 36 | should be avoided as much as possible. 37 | 38 | References 39 | 40 | * http://stackoverflow.com/questions/223644/what-is-an-uninterruptable-process 41 | -------------------------------------------------------------------------------- /processing-input-ip-packet.md: -------------------------------------------------------------------------------- 1 | # processing input IPv4 packet 2 | 3 | Frame carrying IP packet is handled by ip_rcv function, which is registered during kernel initialization 4 | 5 | static struct packet_type ip_packet_type = { 6 | .type = __constant_htons(ETH_P_IP), 7 | .func = ip_rcv, 8 | .gso_send_check = inet_gso_send_check, 9 | .gso_segment = inet_gso_segment, 10 | }; 11 | 12 | static int __init inet_init(void) 13 | { 14 | … 15 | dev_add_pack(&ip_packet_type); 16 | 17 | 18 | void dev_add_pack(struct packet_type *pt) 19 | { 20 | int hash; 21 | 22 | 23 | spin_lock_bh(&ptype_lock); 24 | if (pt->type == htons(ETH_P_ALL)) { 25 | netdev_nit++; 26 | list_add_rcu(&pt->list, &ptype_all); 27 | } else { 28 | hash = ntohs(pt->type) & 15; 29 | list_add_rcu(&pt->list, &ptype_base[hash]); 30 | } 31 | spin_unlock_bh(&ptype_lock); 32 | } 33 | 34 | 35 | `ip_packet_type` get registered in global variable `ptype_base`, which is then used in function 36 | 37 | int netif_receive_skb(struct sk_buff *skb) 38 | { 39 | ... 
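        /*
         * Handlers were registered above via dev_add_pack(): ETH_P_ALL taps go
         * to ptype_all, everything else is hashed by the low four bits of the
         * protocol type into one of the 16 ptype_base[] buckets scanned below.
         */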
40 | type = skb->protocol; 41 | /* normal protocol handlers are registered in ptype_base */ 42 | list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) { 43 | … 44 | ret = deliver_skb(skb, pt_prev, orig_dev); 45 | 46 | 47 | Due to presence of netfilter firewall component, the `ip_rcv` however just perform sanity check 48 | (e.g. ip version, checksum, drop frame to other host), then invoke netfilter hook if passed, `ip_rcv_finish` 49 | will perform the main of work (decide of local delivery vs. forward, parse ip options). 50 | 51 | int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 52 | { 53 | struct iphdr *iph; 54 | u32 len; 55 | 56 | 57 | /* When the interface is in promisc. mode, drop all the crap 58 | * that it receives, do not try to analyse it. 59 | */ 60 | if (skb->pkt_type == PACKET_OTHERHOST) 61 | goto drop; 62 | 63 | 64 | IP_INC_STATS_BH(IPSTATS_MIB_INRECEIVES); 65 | 66 | 67 | if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL) { 68 | IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS); 69 | goto out; 70 | } 71 | 72 | 73 | if (!pskb_may_pull(skb, sizeof(struct iphdr))) 74 | goto inhdr_error; 75 | 76 | 77 | iph = skb->nh.iph; 78 | 79 | 80 | /* 81 | * RFC1122: 3.1.2.2 MUST silently discard any IP frame that fails the checksum. 82 | * 83 | * Is the datagram acceptable? 84 | * 85 | * 1. Length at least the size of an ip header 86 | * 2. Version of 4 87 | * 3. Checksums correctly. [Speed optimisation for later, skip loopback checksums] 88 | * 4. Doesn't have a bogus length 89 | */ 90 | 91 | 92 | if (iph->ihl < 5 || iph->version != 4) 93 | goto inhdr_error; 94 | 95 | 96 | if (!pskb_may_pull(skb, iph->ihl*4)) 97 | goto inhdr_error; 98 | 99 | 100 | iph = skb->nh.iph; 101 | 102 | 103 | if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl))) 104 | goto inhdr_error; 105 | 106 | 107 | len = ntohs(iph->tot_len); 108 | if (skb->len < len) { 109 | IP_INC_STATS_BH(IPSTATS_MIB_INTRUNCATEDPKTS); 110 | goto drop; 111 | } else if (len < (iph->ihl*4)) 112 | goto inhdr_error; 113 | 114 | 115 | /* Our transport medium may have padded the buffer out. Now we know it 116 | * is IP we can trim to the true length of the frame. 117 | * Note this now means skb->len holds ntohs(iph->tot_len). 118 | */ 119 | if (pskb_trim_rcsum(skb, len)) { 120 | IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS); 121 | goto drop; 122 | } 123 | 124 | 125 | /* Remove any debris in the socket control block */ 126 | memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); 127 | 128 | 129 | return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, 130 | ip_rcv_finish); 131 | 132 | 133 | inhdr_error: 134 | IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS); 135 | drop: 136 | kfree_skb(skb); 137 | out: 138 | return NET_RX_DROP; 139 | } 140 | 141 | 142 | static inline int ip_rcv_finish(struct sk_buff *skb) 143 | { 144 | struct iphdr *iph = skb->nh.iph; 145 | struct rtable *rt; 146 | 147 | 148 | /* 149 | * Initialise the virtual path cache for the packet. It describes 150 | * how the packet travels inside Linux networking. 
151 | */ 152 | if (skb->dst == NULL) { 153 | int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, 154 | skb->dev); 155 | if (unlikely(err)) { 156 | if (err == -EHOSTUNREACH) 157 | IP_INC_STATS_BH(IPSTATS_MIB_INADDRERRORS); 158 | else if (err == -ENETUNREACH) 159 | IP_INC_STATS_BH(IPSTATS_MIB_INNOROUTES); 160 | goto drop; 161 | } 162 | } 163 | 164 | 165 | #ifdef CONFIG_NET_CLS_ROUTE 166 | if (unlikely(skb->dst->tclassid)) { 167 | struct ip_rt_acct *st = ip_rt_acct + 256*smp_processor_id(); 168 | u32 idx = skb->dst->tclassid; 169 | st[idx&0xFF].o_packets++; 170 | st[idx&0xFF].o_bytes+=skb->len; 171 | st[(idx>>16)&0xFF].i_packets++; 172 | st[(idx>>16)&0xFF].i_bytes+=skb->len; 173 | } 174 | #endif 175 | 176 | 177 | if (iph->ihl > 5 && ip_rcv_options(skb)) 178 | goto drop; 179 | 180 | 181 | rt = (struct rtable*)skb->dst; 182 | if (rt->rt_type == RTN_MULTICAST) 183 | IP_INC_STATS_BH(IPSTATS_MIB_INMCASTPKTS); 184 | else if (rt->rt_type == RTN_BROADCAST) 185 | IP_INC_STATS_BH(IPSTATS_MIB_INBCASTPKTS); 186 | 187 | 188 | return dst_input(skb); 189 | 190 | 191 | drop: 192 | kfree_skb(skb); 193 | return NET_RX_DROP; 194 | } 195 | 196 | The function ip_route_input intefaces with routing system to decide of whether local delivery or forward 197 | 198 | int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr, 199 | u8 tos, struct net_device *dev) 200 | { 201 | … 202 | return ip_route_input_slow(skb, daddr, saddr, tos, dev); 203 | } 204 | 205 | 206 | static int ip_route_input_slow(struct sk_buff *skb, u32 daddr, u32 saddr, 207 | u8 tos, struct net_device *dev) 208 | { 209 | struct fib_result res; 210 | struct in_device *in_dev = in_dev_get(dev); 211 | 212 | 213 | … 214 | err = ip_mkroute_input(skb, &res, &fl, in_dev, daddr, saddr, tos); /* this path leads to packet forward */ 215 | ... 216 | 217 | 218 | local_input: 219 | rth = dst_alloc(&ipv4_dst_ops); 220 | if (!rth) 221 | 222 | 223 | … 224 | rth->u.dst.input= ip_local_deliver; /* assign function for local deliver the ip packet */ 225 | 226 | 227 | 228 | 229 | static inline int ip_mkroute_input(struct sk_buff *skb, 230 | struct fib_result* res, 231 | const struct flowi *fl, 232 | struct in_device *in_dev, 233 | u32 daddr, u32 saddr, u32 tos) 234 | … 235 | err = __mkroute_input(skb, res, in_dev, daddr, saddr, tos, 236 | &rth); 237 | ... 238 | 239 | 240 | 241 | 242 | static inline int __mkroute_input(struct sk_buff *skb, 243 | struct fib_result* res, 244 | struct in_device *in_dev, 245 | u32 daddr, u32 saddr, u32 tos, 246 | struct rtable **result) 247 | … 248 | rth->u.dst.input = ip_forward; /* assign function for forward ip packet*/ 249 | ... 
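So routing stores either `ip_local_deliver` or `ip_forward` in `u.dst.input`, and the `dst_input(skb)` call at the end of `ip_rcv_finish` simply dispatches through that pointer. A simplified sketch of the idea (not the exact kernel source):

    static inline int dst_input(struct sk_buff *skb)
    {
        /* invoke whichever handler routing selected:
         * ip_local_deliver for local traffic, ip_forward otherwise */
        return skb->dst->input(skb);
    }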
250 | 251 | 252 | The `ip_local_delivery` delivers packet to L4 protocol handler, which is registered in global variable `inet_protos` 253 | 254 | struct net_protocol *inet_protos[MAX_INET_PROTOS]; 255 | 256 | 257 | int inet_add_protocol(struct net_protocol *prot, unsigned char protocol) 258 | { 259 | int hash, ret; 260 | 261 | 262 | hash = protocol & (MAX_INET_PROTOS - 1); 263 | 264 | 265 | spin_lock_bh(&inet_proto_lock); 266 | if (inet_protos[hash]) { 267 | ret = -1; 268 | } else { 269 | inet_protos[hash] = prot; 270 | ret = 0; 271 | } 272 | spin_unlock_bh(&inet_proto_lock); 273 | 274 | 275 | return ret; 276 | } 277 | 278 | 279 | static struct net_protocol icmp_protocol = { 280 | .handler = icmp_rcv, }; 281 | 282 | 283 | static int __init inet_init(void) 284 | { 285 | … 286 | if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0) 287 | printk(KERN_CRIT "inet_init: Cannot add ICMP protocol\n"); 288 | 289 | 290 | The `ip_local_deliver` use netfilter hook to check firewall rule before calling `ip_local_delivery_finish` 291 | 292 | 293 | int ip_local_deliver(struct sk_buff *skb) 294 | { 295 | /* 296 | * Reassemble IP fragments. 297 | */ 298 | 299 | 300 | if (skb->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) { 301 | skb = ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER); 302 | if (!skb) 303 | return 0; 304 | } 305 | 306 | 307 | return NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb, skb->dev, NULL, 308 | ip_local_deliver_finish); 309 | } 310 | 311 | 312 | `ip_local_deliver_finish` find L4 protocol in L3 protocol (ip) header 313 | 314 | static inline int ip_local_deliver_finish(struct sk_buff *skb) 315 | { 316 | … 317 | int protocol = skb->nh.iph->protocol; 318 | … 319 | hash = protocol & (MAX_INET_PROTOS - 1); 320 | … 321 | if ((ipprot = rcu_dereference(inet_protos[hash])) != NULL) { 322 | int ret; 323 | 324 | 325 | if (!ipprot->no_policy) { 326 | if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) { 327 | kfree_skb(skb); 328 | goto out; 329 | } 330 | nf_reset(skb); 331 | } 332 | ret = ipprot->handler(skb); /*invoke L4 protocol handler */ 333 | if (ret < 0) { 334 | protocol = -ret; 335 | goto resubmit; 336 | } 337 | IP_INC_STATS_BH(IPSTATS_MIB_INDELIVERS); 338 | 339 | 340 | static struct net_protocol udp_protocol = { 341 | .handler = udp_rcv, 342 | .err_handler = udp_err, 343 | .no_policy = 1, 344 | }; 345 | 346 | 347 | static int __init inet_init(void) 348 | { 349 | … 350 | if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0) 351 | printk(KERN_CRIT "inet_init: Cannot add UDP protocol\n"); 352 | ... 353 | 354 | 355 | int udp_rcv(struct sk_buff *skb) 356 | { 357 | … 358 | struct sock *sk; /* internal representation of socket*/ 359 | 360 | 361 | ... 362 | sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex); 363 | 364 | 365 | if (sk != NULL) { 366 | int ret = udp_queue_rcv_skb(sk, skb); 367 | sock_put(sk); 368 | 369 | 370 | static __inline__ struct sock *udp_v4_lookup(u32 saddr, u16 sport, 371 | u32 daddr, u16 dport, int dif) 372 | { 373 | struct sock *sk; 374 | 375 | 376 | read_lock(&udp_hash_lock); 377 | sk = udp_v4_lookup_longway(saddr, sport, daddr, dport, dif); 378 | if (sk) 379 | sock_hold(sk); 380 | read_unlock(&udp_hash_lock); 381 | return sk; 382 | } 383 | 384 | 385 | This method try to find sock that satisfies condition of the packet in term of source , destination address and port 386 | 387 | 388 | static struct sock *udp_v4_lookup_longway(u32 saddr, u16 sport, 389 | u32 daddr, u16 dport, int dif) 390 | { 391 | struct sock *sk, *result = NULL; 392 | ... 
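        /*
         * Walk the hash bucket chosen by the local (destination) port and score
         * every candidate sock on how many of saddr/sport/daddr/dport and the
         * device index it matches; the best-scoring sock is returned.
         */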
393 | 394 | 395 | sk_for_each(sk, node, &udp_hash[hnum & (UDP_HTABLE_SIZE - 1)]) { 396 | struct inet_sock *inet = inet_sk(sk); 397 | … 398 | 399 | The method that place packet to socket queue for user space program 400 | 401 | static int udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb) 402 | { 403 | … 404 | struct udp_sock *up = udp_sk(sk); 405 | … 406 | if (!sock_owned_by_user(sk)) 407 | rc = __udp_queue_rcv_skb(sk, skb); 408 | else 409 | sk_add_backlog(sk, skb); 410 | 411 | 412 | static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) 413 | { 414 | int rc; 415 | 416 | 417 | if ((rc = sock_queue_rcv_skb(sk, skb)) < 0) { 418 | /* Note that an ENOMEM error is charged twice */ 419 | if (rc == -ENOMEM) 420 | UDP_INC_STATS_BH(UDP_MIB_INERRORS); 421 | goto drop; 422 | } 423 | … 424 | 425 | int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) 426 | { 427 | … 428 | skb_queue_tail(&sk->sk_receive_queue, skb); 429 | if (!sock_flag(sk, SOCK_DEAD)) 430 | sk->sk_data_ready(sk, skb_len); 431 | 432 | The sk_data_ready is initialized when creating socket and its internal representation 433 | 434 | struct sock { 435 | ... 436 | void (*sk_data_ready)(struct sock *sk, int bytes); 437 | ... 438 | 439 | 440 | void sock_init_data(struct socket *sock, struct sock *sk) 441 | { 442 | ... 443 | sk->sk_state_change = sock_def_wakeup; 444 | sk->sk_data_ready = sock_def_readable; 445 | 446 | 447 | 448 | 449 | This function wake up processes sleeping in socket queue 450 | 451 | static void sock_def_readable(struct sock *sk, int len) 452 | { 453 | read_lock(&sk->sk_callback_lock); 454 | if (sk_has_sleeper(sk)) 455 | wake_up_interruptible(sk->sk_sleep); 456 | sk_wake_async(sk,1,POLL_IN); 457 | read_unlock(&sk->sk_callback_lock); 458 | } 459 | 460 | References 461 | 462 | 1. http://www.cyberciti.biz/files/linux-kernel/Documentation/networking/ip-sysctl.txt 463 | -------------------------------------------------------------------------------- /reading-source-code.md: -------------------------------------------------------------------------------- 1 | ## Reading source code 2 | 3 | Kernel use extensively c extension and macro e.g. `typeof/__typeof__` , `__attribute__` 4 | 5 | #define __get_cpu_var(var) per_cpu__##var 6 | 7 | static DEFINE_PER_CPU(struct socket *, __icmp_socket) = NULL; 8 | 9 | /* Separate out the type, so (int[3], foo) works. */ 10 | #define DEFINE_PER_CPU(type, name) \ 11 | __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name 12 | 13 | crash> per_cpu____icmp_socket 14 | PER-CPU DATA TYPE: 15 | struct socket *per_cpu____icmp_socket; 16 | PER-CPU ADDRESSES: 17 | [0]: c1406ac0 18 | 19 | crash> struct socket per_cpu____icmp_socket 20 | struct socket { 21 | state = 3518432990, 22 | flags = 2251442003, 23 | ops = 0xe9518e15, 24 | fasync_list = 0x671341b9, 25 | file = 0x41b13d1, 26 | sk = 0xf3e74312, 27 | wait = { 28 | lock = { 29 | raw_lock = { 30 | slock = 1482566028 31 | } 32 | }, 33 | task_list = { 34 | next = 0xa9ea9e0f, 35 | prev = 0x5b6959e4 36 | } 37 | }, 38 | type = -24496 39 | } 40 | 41 | `__attribute__(__section__)` is used to place global/static memory variable into non-default segment so the linker later will map them to appropriate desired memory location. An example is a variable that maps to DMA location of device so read/write to that variable mean get/send data to the device. 
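A hypothetical user-space example of the same construct (the section name `.mydata` is made up for illustration):

    /* keep this variable out of the default .data/.bss sections so a linker
     * script can map the .mydata section to a chosen address */
    static unsigned long frames_seen __attribute__((__section__(".mydata"))) = 0;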
42 | 43 | unsigned long long denotes 64 bit integer in 32 bit architecture 44 | 45 | fastcall indicates that compiler will try to use registers for passing parameters to the function instead of stack as normal. 46 | 47 | Some code of kernel is platform dependent so are written in assembly language 48 | 49 | References 50 | 51 | 1. http://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html#C-Extensions 52 | 2. http://en.wikipedia.org/wiki/X86_assembly_language 53 | 3. http://www.ibm.com/developerworks/library/l-gas-nasm.html 54 | 4. http://lxr.linux.no/ 55 | -------------------------------------------------------------------------------- /receive-data-from-socket.md: -------------------------------------------------------------------------------- 1 | # Receive data from socket 2 | 3 | asmlinkage long sys_recvfrom(int fd, void __user * ubuf, size_t size, unsigned flags, 4 | struct sockaddr __user *addr, int __user *addr_len) 5 | { 6 | … 7 | sock_file = fget_light(fd, &fput_needed); 8 | if (!sock_file) 9 | return -EBADF; 10 | ... 11 | sock = sock_from_file(sock_file, &err); 12 | if (!sock) 13 | goto out; 14 | … 15 | err=sock_recvmsg(sock, &msg, size, flags); 16 | … 17 | 18 | 19 | The syscall first look for a socket from file descriptor and call relevent function using the socket data structure 20 | 21 | int sock_recvmsg(struct socket *sock, struct msghdr *msg, 22 | size_t size, int flags) 23 | { 24 | ... 25 | ret = __sock_recvmsg(&iocb, sock, msg, size, flags); 26 | ... 27 | } 28 | 29 | 30 | static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock, 31 | struct msghdr *msg, size_t size, int flags) 32 | { 33 | ... 34 | err = sock->ops->recvmsg(iocb, sock, msg, size, flags); 35 | … 36 | 37 | 38 | ops is protocol operation struct holding pointers to functions operating on socket file descriptor. 39 | When creating socket this field is filled depending mostly on domain/family and type 40 | 41 | int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock, 42 | struct msghdr *msg, size_t size, int flags) 43 | { 44 | struct sock *sk = sock->sk; 45 | int addr_len = 0; 46 | int err; 47 | /* this will delegate its work to internal sock*/ 48 | err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT, 49 | flags & ~MSG_DONTWAIT, &addr_len); 50 | if (err >= 0) 51 | msg->msg_namelen = addr_len; 52 | return err; 53 | } 54 | 55 | static int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, 56 | size_t len, int noblock, int flags, int *addr_len) 57 | { 58 | struct inet_sock *inet = inet_sk(sk); 59 | struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name; 60 | struct sk_buff *skb; 61 | int copied, err; 62 | int peeked; 63 | 64 | 65 | /* 66 | * Check any passed addresses 67 | */ 68 | if (addr_len) 69 | *addr_len=sizeof(*sin); 70 | 71 | 72 | if (flags & MSG_ERRQUEUE) 73 | return ip_recv_error(sk, msg, len); 74 | 75 | 76 | try_again: 77 | skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0), 78 | &peeked, &err); 79 | ... 80 | 81 | 82 | struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags, 83 | int *peeked, int *err) 84 | { 85 | ... 86 | 87 | 88 | timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); 89 | … 90 | do { 91 | /* Again only user level code calls this function, so nothing 92 | * interrupt level will suddenly eat the receive_queue. 93 | * 94 | * Look at current nfs client by the way... 95 | * However, this function was corrent in any case. 
8) 96 | */ 97 | 98 | 99 | unsigned long cpu_flags; 100 | 101 | 102 | if (flags & MSG_PEEK) { 103 | spin_lock_irqsave(&sk->sk_receive_queue.lock, 104 | cpu_flags); 105 | skb = skb_peek(&sk->sk_receive_queue); 106 | if (skb) { 107 | *peeked = skb->peeked; 108 | skb->peeked = 1; 109 | atomic_inc(&skb->users); 110 | } 111 | spin_unlock_irqrestore(&sk->sk_receive_queue.lock, 112 | cpu_flags); 113 | } else { 114 | skb = skb_dequeue(&sk->sk_receive_queue); 115 | if (skb) 116 | *peeked = skb->peeked; 117 | } 118 | if (skb) 119 | return skb; 120 | 121 | 122 | /* User doesn't want to wait */ 123 | error = -EAGAIN; 124 | if (!timeo) 125 | goto no_packet; 126 | 127 | 128 | } while (!wait_for_packet(sk, err, &timeo)); 129 | … 130 | 131 | 132 | static int wait_for_packet(struct sock *sk, int *err, long *timeo_p) 133 | { 134 | int error; 135 | DEFINE_WAIT(wait); 136 | 137 | 138 | prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE); 139 | ... 140 | *timeo_p = schedule_timeout(*timeo_p); 141 | 142 | 143 | 144 | References 145 | 146 | 1. http://www.ibm.com/developerworks/linux/library/l-hisock/index.html 147 | 2. http://www.haifux.org/lectures/217/netLec5.pdf 148 | -------------------------------------------------------------------------------- /receive-frame-from-network.md: -------------------------------------------------------------------------------- 1 | # Receive frame from network 2 | 3 | The driver code need to be written in such way to ensure minimize frame’s lost at the same 4 | time giving CPU to other processes under high network load. 5 | 6 | There are two techniques used by kernel to know about incoming frame a) interrupt and b) polling. 7 | While polling (when timer expires) is non efficient under low traffic, interrupt is non efficient under high traffic. 8 | 9 | Linux kernel NAPI - New API combines both techniques to archive low latency and fairness. However there are still 10 | many drivers that use old API employing only interrupt technique. 11 | 12 | When a frame is received in NIC, it raises interrupt that is delivered to one of CPU. With old API, 13 | the interrupt handler (implemented by driver e.g. `vortex_rx`) then transfer the frame from NIC buffer placing it into 14 | host memory by calling `netif_rx`. 15 | 16 | static irqreturn_t 17 | vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs) 18 | { 19 | … 20 | if (status & RxComplete) 21 | vortex_rx(dev); 22 | ... 23 | 24 | 25 | static int vortex_rx(struct net_device *dev) 26 | { 27 | … 28 | skb = dev_alloc_skb(pkt_len + 5); 29 | … 30 | skb->protocol = eth_type_trans(skb, dev); /* figure out protocol handler */ 31 | netif_rx(skb); 32 | ... 33 | 34 | 35 | int netif_rx(struct sk_buff *skb) 36 | { 37 | ... 38 | queue = &__get_cpu_var(softnet_data); 39 | ... 40 | enqueue: 41 | dev_hold(skb->dev); 42 | __skb_queue_tail(&queue->input_pkt_queue, skb); 43 | ... 44 | netif_rx_schedule(&queue->backlog_dev); /* this is fake dev*/ 45 | 46 | 47 | The `netif_rx` adds frame to per CPU input queue `softnet_data->input_pkt_queue` and call `netif_rx_schedule` to put 48 | a signal for bottom haft `netif_rx_action` by adding a fake `backlog_dev` device in to `softnet_data->poll_list`. 
49 | 50 | void __netif_rx_schedule(struct net_device *dev) 51 | { 52 | unsigned long flags; 53 | 54 | 55 | local_irq_save(flags); 56 | dev_hold(dev); 57 | list_add_tail(&dev->poll_list, &__get_cpu_var(softnet_data).poll_list); 58 | if (dev->quota < 0) 59 | dev->quota += dev->weight; 60 | else 61 | dev->quota = dev->weight; 62 | __raise_softirq_irqoff(NET_RX_SOFTIRQ); /* trigger net_rx_action */ 63 | local_irq_restore(flags); 64 | } 65 | 66 | 67 | The list structure in kernel is kind of embedded container inside data structure, so adding `&dev->poll_list` 68 | into `softnet_data->poll_list` means add the dev into the list. 69 | 70 | Example of NAPI is tg3_rx of broadcom driver, which copy frame from NIC to the input queue managed by the driver itself. 71 | 72 | static irqreturn_t tg3_interrupt(int irq, void *dev_id, struct pt_regs *regs) 73 | { 74 | … 75 | netif_rx_schedule(tnapi->dummy_netdev); 76 | ... 77 | 78 | 79 | static int tg3_rx(struct tg3_napi *tnapi, int budget) 80 | { 81 | … 82 | netif_rx_schedule(tp->napi[1].dummy_netdev); 83 | 84 | 85 | static int tg3_poll(struct net_device *netdev, int *budget) 86 | { 87 | ... 88 | work_done = tg3_rx(tnapi, orig_budget); 89 | ... 90 | 91 | 92 | The `softnet_data->poll_list` seems does not contain a list of devices that has some incoming frames waiting 93 | for process but just a fake device with sufficient information for net_rx_action to know how to poll frame from input queue. 94 | 95 | In the old API, there is one input queue per CPU, its size is specified in parameter. The kernel drops incoming frame 96 | if the queue reaches its size. 97 | 98 | [root@localhost tap]# cat /proc/sys/net/core/netdev_max_backlog 99 | 1000 100 | 101 | In the NAPI input queue is per device and it is up to the driver to handle it. 102 | 103 | The top haft is a function of a specific driver while the bottom haft is always kernel function `net_rx_action`. 104 | The `net_rx_action` get the frames either from `softnet_data->input_pkt_queue` or by calling NAPI driver `net_device->poll`. 105 | 106 | The default net_device->poll is set to process_backlog 107 | static int __init net_dev_init(void) 108 | { 109 | … 110 | for_each_possible_cpu(i) { 111 | struct softnet_data *queue; 112 | ... 113 | queue->backlog_dev.poll = process_backlog; 114 | 115 | This generic `backlog_dev` was put into `softnet_data->poll_list` in `netif_rx` in previous stage. 116 | The `net_rx_action` check `poll_list` of devices and invoke poll of each device 117 | 118 | static void net_rx_action(struct softirq_action *h) 119 | { 120 | … 121 | while (!list_empty(&queue->poll_list)) { 122 | ... 123 | dev = list_entry(queue->poll_list.next, 124 | struct net_device, poll_list); 125 | … 126 | poll_result = dev->poll(dev, &budget); 127 | 128 | 129 | The process_backlog used to dequeue frame for old API driver and send it up to protocol handler by calling `netif_receive_skb` 130 | 131 | static int process_backlog(struct net_device *backlog_dev, int *budget) 132 | { 133 | ... 134 | for (;;) { 135 | ... 136 | skb = __skb_dequeue(&queue->input_pkt_queue); 137 | ... 138 | netif_receive_skb(skb); 139 | 140 | In the NAPI tg3 driver of broadcom NIC, the `net_device->poll` is set to `tg3_poll` 141 | 142 | static int __devinit tg3_init_one(struct pci_dev *pdev, 143 | const struct pci_device_id *ent) 144 | { 145 | … 146 | dev->poll = tg3_poll; 147 | 148 | 149 | static int tg3_poll(struct net_device *netdev, int *budget) 150 | { 151 | ... 
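        /* NAPI poll callback: drain at most *budget frames from the NIC ring
         * via tg3_rx(), which hands each frame to netif_receive_skb() */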
152 | work_done = tg3_rx(tnapi, orig_budget); 153 | 154 | 155 | static int tg3_rx(struct tg3_napi *tnapi, int budget) 156 | { 157 | … 158 | netif_receive_skb(skb); 159 | 160 | The `netif_receive_skb` then deliver the frame to tap and protocol handler 161 | 162 | int netif_receive_skb(struct sk_buff *skb) 163 | { 164 | … 165 | /* protocol sniffers - ETH_P_ALL are registered in ptype_all */ 166 | list_for_each_entry_rcu(ptype, &ptype_all, list) { 167 | ... 168 | ret = deliver_skb(skb, pt_prev, orig_dev); 169 | … 170 | 171 | 172 | type = skb->protocol; 173 | /* normal protocol handlers are registered in ptype_base */ 174 | list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) { 175 | … 176 | ret = deliver_skb(skb, pt_prev, orig_dev); 177 | ... 178 | 179 | 180 | 181 | 182 | static __inline__ int deliver_skb(struct sk_buff *skb, 183 | struct packet_type *pt_prev, 184 | struct net_device *orig_dev) 185 | { 186 | atomic_inc(&skb->users); 187 | return pt_prev->func(skb, skb->dev, pt_prev, orig_dev); 188 | } 189 | 190 | References 191 | 192 | 1. http://en.wikipedia.org/wiki/New_API 193 | 2. http://knol.google.com/k/napi-linux-new-api# 194 | 3. http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf 195 | 4. http://www.cs.clemson.edu/~westall/853/notes/devrecv.pdf 196 | 5. http://isis.poly.edu/kulesh/stuff/src/klist/ 197 | -------------------------------------------------------------------------------- /samples/timeslice.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | int main(int argc,char* argv[]){ 5 | struct timespec tp; 6 | int status; 7 | status = sched_rr_get_interval(0, &tp); 8 | if( status == 0 ) 9 | printf("Timeslice is %d nano secs\n",tp.tv_nsec); 10 | else 11 | printf("fail status is %d",status); 12 | } 13 | 14 | -------------------------------------------------------------------------------- /segmentation.md: -------------------------------------------------------------------------------- 1 | ## Segmentation 2 | 3 | 4 | Segmentation is the part of memory management unit that translate logical address to linear address. 5 | 6 | In Intel processor design 7 | 8 | 1. there is one Global Descriptor Table (GDT) of segment’s descriptors per CPU. The address and length 9 | of GDT is stored in register gdtr. 10 | 2. each OS process can maintain it own Local Descriptor Table (LDT). The address and length of LDT is 11 | stored in register ldtr. 12 | 13 | Each segment’s descriptor contains information about start address, size, type (whether it is user segment 14 | e.g. code, data or system segment e.g Task State Segment -TSS ) of the segment as well as access privilege 15 | (Descriptor Privilege Level - DPL). 16 | 17 | Logical memory address comprises of segment and offset within it. Register cs,ss,ds,gs,fs store segment 18 | identifier (also called segment selector), which has two parts 19 | 20 | 1. TI - Type Identification specify whether it is about segment descriptor in GDT or LDT 21 | 2. pointer to the corresponding segment descriptor in GDT or LDT 22 | 23 | Linux made limited usage of segmentation. Linux setups one GDT per CPU. Inside GDT Linux maintains only one 24 | pair of descriptors for code and data/stack segment in user mode (`__USER_CS`,` __USER_DS`) and other pair in 25 | kernel mode (`__KERNEL_CS`,`__KERNEL_DS`) crossing all processes. All these segment’s descriptors have same 26 | start address (0x0), limit(0xFFFFF). 
The only different between kernel and it’s user counter part are privileges. 27 | 28 | When OS switch from user mode to kernel mode, it load `__KERNEL_CS` into cs register, `__KERNEL_DS` into ds register. 29 | It loads `__USER_CS` into cs, `__USER_DS` into ds when doing reverse. `__USER_DS`/`__KERNEL_DS` descriptor is used for 30 | both data and stack segments. 31 | 32 | Contents of all these descriptors are specified in cpu_gdt_table and do not change once they are initialized 33 | in memory by cpu_init. 34 | 35 | References 36 | 37 | 1. http://web.cs.wpi.edu/~cs3013/c07/lectures/Section09.1-Intel.pdf 38 | -------------------------------------------------------------------------------- /send-frame-to-network.md: -------------------------------------------------------------------------------- 1 | # Send a frame to network 2 | 3 | Upper layer put output frame into driver buffer `struct Qdisc *q = dev->qdisc` using `dev_queue_xmit` 4 | 5 | int dev_queue_xmit(struct sk_buff *skb) 6 | { 7 | struct net_device *dev = skb->dev; 8 | 9 | 10 | … 11 | q = rcu_dereference(dev->qdisc); 12 | #ifdef CONFIG_NET_CLS_ACT 13 | skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); 14 | #endif 15 | if (q->enqueue) { 16 | /* Grab device queue */ 17 | spin_lock(&dev->queue_lock); 18 | q = dev->qdisc; 19 | if (q->enqueue) { 20 | rc = q->enqueue(skb, q); 21 | qdisc_run(dev); 22 | spin_unlock(&dev->queue_lock); 23 | 24 | 25 | rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc; 26 | goto out; 27 | } 28 | spin_unlock(&dev->queue_lock); 29 | } 30 | 31 | then call `qdisc_run` to send frames to the NIC. 32 | 33 | static inline void qdisc_run(struct net_device *dev) { 34 | while (!netif_queue_stopped(dev) && qdisc_restart(dev) < 0) 35 | /* NOTHING */; 36 | } 37 | 38 | Due to various reasons (queue is stopped because of not enough memory of NIC, someone else is transfering), 39 | the immediate transfer may not success and have to be postponed to later time. 40 | 41 | int qdisc_restart(struct net_device *dev) { 42 | 43 | 44 | if (!nolock) { 45 | if (!netif_tx_trylock(dev)) { 46 | collision: 47 | /* So, someone grabbed the driver. */ 48 | … 49 | goto requeue; 50 | } 51 | } 52 | ... 53 | requeue: 54 | if (skb->next) 55 | dev->gso_skb = skb; 56 | else 57 | q->ops->requeue(skb, q); 58 | netif_schedule(dev); /* schedule a transfer for later time */ 59 | return 1; 60 | } 61 | 62 | The soft interrupt `net_tx_action` and `netif_schedule` are used to facilitate that. 63 | 64 | static inline void _ _netif_schedule(struct net_device *dev) { 65 | if (!test_and_set_bit(_ _LINK_STATE_SCHED, &dev->state)) { 66 | unsigned long flags; 67 | struct softnet_data *sd; 68 | local_irq_save(flags); 69 | sd = &_ _get_cpu_var(softnet_data); 70 | dev->next_sched = sd->output_queue; 71 | sd->output_queue = dev; 72 | raise_softirq_irqoff(cpu, NET_TX_SOFTIRQ); /* trigger net_tx_action */ 73 | local_irq_restore(flags); 74 | } 75 | } 76 | 77 | 78 | A soft interrupt `net_tx_action` is responsible for sending frame from driver buffer (kernel memory) to NIC. 79 | 80 | static void net_tx_action(struct softirq_action *h) 81 | { 82 | ... 
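        /* NET_TX_SOFTIRQ handler: drain the per-CPU output_queue that
         * __netif_schedule() filled and retry qdisc_run() on each device */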
83 | 84 | if (sd->output_queue) { 85 | struct net_device *head; 86 | local_irq_disable( ); 87 | head = sd->output_queue; 88 | sd->output_queue = NULL; 89 | local_irq_enable( ); 90 | while (head) { 91 | struct net_device *dev = head; 92 | head = head->next_sched; 93 | smp_mb_ _before_clear_bit( ); 94 | clear_bit(_ _LINK_STATE_SCHED, &dev->state); 95 | if (spin_trylock(&dev->queue_lock)) { 96 | qdisc_run(dev); 97 | spin_unlock(&dev->queue_lock); 98 | } else { 99 | netif_schedule(dev); 100 | } 101 | } 102 | } 103 | 104 | int qdisc_restart(struct net_device *dev) { 105 | struct Qdisc1 *q = dev->qdisc; 106 | struct sk_buff *skb; 107 | if ((skb = q->dequeue(q)) != NULL) { 108 | … 109 | if (!spin_trylock(&dev->xmit_lock)) { 110 | … 111 | } 112 | { 113 | if (!netif_queue_stopped(dev)) { 114 | int ret; 115 | if (netdev_nit) { 116 | dev_queue_xmit_nit(skb, dev); /* send frame to protocol sniffer*/ 117 | ret = dev->hard_start_xmit(skb, dev); /* send frame to NIC */ 118 | if (ret == NETDEV_TX_OK) { 119 | if (!nolock) { 120 | dev->xmit_lock_owner = -1; 121 | spin_unlock(&dev->xmit_lock); 122 | } 123 | spin_lock(&dev->queue_lock); 124 | return -1; 125 | } 126 | if (ret == NETDEV_TX_LOCKED && nolock) { 127 | spin_lock(&dev->queue_lock); goto collision; 128 | } 129 | } 130 | } 131 | 132 | 133 | In `dev->hard_start_xmit`, when the device driver realizes that it does not have enough space to store a 134 | frame of maximum size (MTU), it stops the egress queue with netif_stop_queue to avoid wasting resources with future 135 | transmissions that the kernel already knows will fail. 136 | 137 | The following example of this throttling at work is taken from vortex_start_xmit (the hard_start_xmit method used 138 | by the drivers/net/3c59x.c driver): 139 | 140 | vortex_start_xmit(struct sk_buff *skb, struct net_device *dev) 141 | { 142 | … 143 | outsl(ioaddr + TX_FIFO, skb->data, (skb->len + 3) >> 2); 144 | dev_kfree_skb (skb); 145 | if (inw(ioaddr + TxFree) > 1536) { 146 | netif_start_queue (dev); /* AKPM: redundant? */ 147 | } else { 148 | /* stop the queue */ 149 | netif_stop_queue(dev); 150 | /* Interrupt us when the FIFO has room for max-sized packet. */ 151 | outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD); 152 | } 153 | 154 | 155 | static void vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs) { 156 | … 157 | if (status & TxAvailable) { 158 | if (vortex_debug > 5) 159 | printk(KERN_DEBUG " TX room bit was handled.\n"); 160 | /* There's room in the FIFO for a full-sized packet. */ 161 | outw(AckIntr | TxAvailable, ioaddr + EL3_CMD); 162 | /* wake queue up to start transfer again when there is space in NIC*/ 163 | netif_wake_queue (dev); 164 | } 165 | … 166 | } 167 | 168 | 169 | static inline void netif_wake_queue(struct net_device *dev) 170 | { 171 | #ifdef CONFIG_NETPOLL_TRAP 172 | if (netpoll_trap()) 173 | return; 174 | #endif 175 | if (test_and_clear_bit(__LINK_STATE_XOFF, &dev->state)) 176 | __netif_schedule(dev); 177 | } 178 | -------------------------------------------------------------------------------- /signals.md: -------------------------------------------------------------------------------- 1 | ## Signals 2 | 3 | Signal is used to notify a process about external events (page fault, interrupt from keyboard) . Except real time signals, normal signals are not queued, kernel keep track of pending signal for each process as bit mask. Also no other data is associated with signal. 4 | 5 | Except KILL and STOP, signals can be blocked, default signal handlers can be ignored and overwritten. 
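For example, the default action of SIGTERM can be replaced with `sigaction` (a minimal user-space sketch):

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void on_term(int sig)
    {
        /* only async-signal-safe calls belong in a handler */
        write(STDOUT_FILENO, "got SIGTERM\n", 12);
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_term;
        sigemptyset(&sa.sa_mask);
        sigaddset(&sa.sa_mask, SIGINT);   /* SIGINT is blocked while the handler runs */
        sigaction(SIGTERM, &sa, NULL);

        for (;;)
            pause();                      /* sleep until a signal arrives */
    }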
Signals can be generated in program using kill, tkill, send_sig. 6 | 7 | To suspend a process run e.g. 8 | 9 | kill -SIGSTOP 3878 10 | 11 | State of process can be observed by 12 | 13 | ps -eo pid,state,cmd 14 | 15 | To resume the suspended process run 16 | 17 | kill -SIGCONT 3878 18 | 19 | **Default signals** 20 | 21 | The default signal sent out by `kill` is `TERM` mean terminate the process. Each process reacts to each signal differently, Java process e.g. dump stacktrace of all threads to stdout unpon receiving `QUIT`. 22 | 23 | **Delivery of a signal** 24 | 25 | When sending a signal to a running process (on CPU), the signal is pending until the process is preempted, at that time the signal is delivered to the process and a signal handler is called. 26 | 27 | When sending a signal to process in an `interupptible` waiting state, the process is waked up, the signal is delivered to the process and signal a handler is called. 28 | 29 | When sending a signal to process in ready state, the signal is delivered to the process and signal a handler is called whenever the process get CPU. 30 | 31 | -------------------------------------------------------------------------------- /slab-allocator.md: -------------------------------------------------------------------------------- 1 | ## Slab allocator 2 | 3 | Kernel allocate objects using slap layer, which provide a interface for create a cache for a specific type of 4 | object (e.g. task_struct), allocate an object from the cache and release object from cache. 5 | 6 | Because memory are allocated in pages and size of objects are usually smaller than the page size, this layer 7 | allows faster allocation of objects, eliminates memory waste and reduces memory fragmentation. 8 | 9 | The implementation give kernel sufficient information for free memory by shrinking size of cache when needed. 10 | 11 | References 12 | 13 | 1. http://linux.die.net/man/1/slabtop and cat /proc/slabinfo 14 | 2. http://www.puschitz.com/TuningLinuxForOracle.shtml 15 | 3. http://www.redhat.com/magazine/001nov04/features/vm/ 16 | -------------------------------------------------------------------------------- /socket-file-api-implementation.md: -------------------------------------------------------------------------------- 1 | # Socket file api implementation 2 | 3 | Because socket is kind of file, Linux implements special filesystem, inode for socket. 4 | 5 | **Socket file system** 6 | 7 | [root@localhost ~]# cat /proc/filesystems 8 | .. 9 | nodev sockfs 10 | ... 11 | 12 | 13 | static int __init sock_init(void) 14 | { 15 | ... 16 | register_filesystem(&sock_fs_type); 17 | sock_mnt = kern_mount(&sock_fs_type); 18 | … 19 | } 20 | 21 | 22 | struct vfsmount *kern_mount(struct file_system_type *type) 23 | { 24 | return vfs_kern_mount(type, 0, type->name, NULL); 25 | } 26 | 27 | 28 | struct vfsmount * 29 | vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data) 30 | { 31 | .. 32 | 33 | 34 | error = type->get_sb(type, flags, name, data, mnt); 35 | .. 
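        /* for sockfs this resolves to sockfs_get_sb(), defined just below */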
36 | } 37 | 38 | static struct file_system_type sock_fs_type = { 39 | .name = "sockfs", 40 | .get_sb = sockfs_get_sb, /* return super block*/ 41 | .kill_sb = kill_anon_super, 42 | }; 43 | 44 | static struct super_operations sockfs_ops = { 45 | .alloc_inode = sock_alloc_inode, 46 | .destroy_inode =sock_destroy_inode, 47 | .statfs = simple_statfs, 48 | }; 49 | 50 | static int sockfs_get_sb(struct file_system_type *fs_type, 51 | int flags, const char *dev_name, void *data, struct vfsmount *mnt) 52 | { 53 | return get_sb_pseudo(fs_type, "socket:", &sockfs_ops, SOCKFS_MAGIC, 54 | mnt); 55 | } 56 | 57 | **Socket inode** 58 | 59 | struct socket_alloc { 60 | struct socket socket; 61 | struct inode vfs_inode; 62 | }; 63 | 64 | static struct inode *sock_alloc_inode(struct super_block *sb) 65 | { 66 | struct socket_alloc *ei; 67 | ei = (struct socket_alloc *)kmem_cache_alloc(sock_inode_cachep, SLAB_KERNEL); 68 | if (!ei) 69 | return NULL; 70 | init_waitqueue_head(&ei->socket.wait); 71 | 72 | 73 | ei->socket.fasync_list = NULL; 74 | ei->socket.state = SS_UNCONNECTED; 75 | ei->socket.flags = 0; 76 | ei->socket.ops = NULL; 77 | ei->socket.sk = NULL; 78 | ei->socket.file = NULL; 79 | ei->socket.flags = 0; 80 | 81 | 82 | return &ei->vfs_inode; 83 | } 84 | 85 | 86 | static struct dentry_operations sockfs_dentry_operations = { 87 | .d_delete = sockfs_delete_dentry, 88 | }; 89 | 90 | 91 | struct file_operations socket_file_ops = { 92 | .owner = THIS_MODULE, 93 | .llseek = no_llseek, 94 | .aio_read = sock_aio_read, 95 | .aio_write = sock_aio_write, 96 | .poll = sock_poll, 97 | .unlocked_ioctl = sock_ioctl, 98 | #ifdef CONFIG_COMPAT 99 | .compat_ioctl = compat_sock_ioctl, 100 | #endif 101 | .mmap = sock_mmap, 102 | .open = sock_no_open, /* special open code to disallow open via /proc */ 103 | .release = sock_close, 104 | .fasync = sock_fasync, 105 | .readv = sock_readv, 106 | .writev = sock_writev, 107 | .sendpage = sock_sendpage, 108 | .splice_write = generic_splice_sendpage, 109 | }; 110 | 111 | **Socket file handler** 112 | 113 | Socket is attached to a file handle so we can operate with socket using file operation e.g. read, write 114 | 115 | static int sock_attach_fd(struct socket *sock, struct file *file) 116 | { 117 | struct qstr this; 118 | char name[32]; 119 | 120 | 121 | this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino); 122 | this.name = name; 123 | this.hash = SOCK_INODE(sock)->i_ino; 124 | 125 | 126 | file->f_dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this); 127 | if (unlikely(!file->f_dentry)) 128 | return -ENOMEM; 129 | 130 | 131 | file->f_dentry->d_op = &sockfs_dentry_operations; 132 | d_add(file->f_dentry, SOCK_INODE(sock)); 133 | file->f_vfsmnt = mntget(sock_mnt); 134 | file->f_mapping = file->f_dentry->d_inode->i_mapping; 135 | 136 | 137 | sock->file = file; 138 | file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops; 139 | file->f_mode = FMODE_READ | FMODE_WRITE; 140 | file->f_flags = O_RDWR; 141 | file->f_pos = 0; 142 | file->private_data = sock; 143 | 144 | 145 | return 0; 146 | } 147 | -------------------------------------------------------------------------------- /spin-lock.md: -------------------------------------------------------------------------------- 1 | ## Spin lock 2 | 3 | When process is not given an access to the critical section, with normal lock, process will be paused 4 | and moved to wait queue giving CPU to other process. 
5 | 6 | Switching one process to other takes time because certain amount of machine’s instructions has to be performed 7 | to save running context from machine’s registers of one process to the memory and restore the context of other 8 | process from memory back to machine’s registers. 9 | 10 | In multi processors machine, spin lock was invented to address this problem. With spin lock, the process try to 11 | enter a loop of nop (no operation - doing nothing) machine’s instruction for a while (the length is much shorter 12 | compare to context switch) then try to acquire the lock again hoping that the process holding the lock runs on 13 | other processor will release it so there is no need to do expensive context switching. 14 | 15 | 16 | There is no meaning of doing spin lock on uni processor machine as if the process spins, other process that 17 | currently holds the lock can not get CPU to run in order to release the lock. 18 | 19 | Because normally the same code base is used to run in both uni and multi processors machine, the implementation 20 | of spin lock in case of uni processor will instead 21 | 22 | 1. when running in user mode, in case of not able to acquire the lock: relinquish the processor and go to wait queue 23 | 2. when running in kernel mode, in case of successful acquisition of the lock, disable kernel preemption, 24 | which does not allow other process to get the CPU 25 | 26 | Spin lock is implemented and used in both user mode and kernel mode. The POSIX Thread API for spin lock is 27 | 28 | 1. pthread_spinlock_t spinlock 29 | 2. pthread_spin_lock(&spinlock) 30 | 31 | References 32 | 33 | 1. http://www.moserware.com/2008/09/how-do-locks-lock.html#lockfn5 34 | 1. http://www.spinics.net/lists/newbies/msg40369.html 35 | -------------------------------------------------------------------------------- /strace.md: -------------------------------------------------------------------------------- 1 | ## strace quick start 2 | 3 | **Identify running process** 4 | 5 | # ps -ef | grep Gerrit 6 | gerrit 17549 1 0 May06 pts/1 00:02:45 GerritCodeReview -server 7 | 8 | 9 | **Run strace** 10 | 11 | strace -f -p 17549 -s 128 -o /tmp/strace.lo 12 | 13 | * the `-f` means trace all childrens/threads of the given process. 14 | * the `-s 128` means buffer for store syscall's parameter is 128 byte. The default is 80, which is usually too small 15 | * the `-o` redirect tracing info into a file for filtering later as amount of tracing info is usually overwhelming 16 | 17 | With the knowledge if syscall we can ask strace to trace only specified syscalls or group of syscalls 18 | 19 | strace -f -p 17549 -s 128 -e trace=lstat,open 20 | 21 | strace predefine a few groups of syscalls such as `file`, `process`, `network`, `rpc` 22 | 23 | strace -f -p 17549 -s 128 -e trace=file 24 | 25 | However, sometimes knowning sequence of syscalls is more important than a particular one when troubleshooting. 26 | 27 | **References** 28 | 29 | * http://linux.die.net/man/1/strace 30 | -------------------------------------------------------------------------------- /synchronization.md: -------------------------------------------------------------------------------- 1 | ## Synchronization 2 | 3 | 4 | Semaphore is for synchronization between processes while mutex between threads of the same processes. 5 | 6 | 7 | Memory barrier is mechanism to ensure that the machine executes instructions accessing memory(read/write) in the same order as the program has been written. 
Usually the processor/compiler change the order of the program execution due to the optimization. This is important because when multi processes concurrently access memory the semantic depends on order of execution. 8 | 9 | 10 | Per CPU data is a technique, in which data is dedicated into one processor and thus no synchronization between CPUs is required and disable kernel preemption is sufficient. -------------------------------------------------------------------------------- /syscall.md: -------------------------------------------------------------------------------- 1 | ## Syscall 2 | 3 | Syscall is the way user process requests for a service from OS. 4 | 5 | Reference 6 | 7 | * http://articles.manugarg.com/systemcallinlinux2_6.html 8 | -------------------------------------------------------------------------------- /sysfs-procfs.md: -------------------------------------------------------------------------------- 1 | ## Overview 2 | 3 | In general `procfs` export kernel information of processess while `sysfs` export information about devices and other kernel objects. 4 | However `procfs` also export kernel subsystem information not related to proceses, which make `procfs` procfs cluttered with lots of non-process information. These should be replaced by `sysfs`. 5 | 6 | e.g. 7 | 8 | root@gerrit01:/home/vagrant# ls /proc/ | egrep -v -E "[0-9]+" | more 9 | acpi 10 | buddyinfo 11 | bus 12 | cgroups 13 | cmdline 14 | consoles 15 | cpuinfo 16 | crypto 17 | devices 18 | diskstats 19 | ... 20 | 21 | 22 | References 23 | 24 | * http://people.ee.ethz.ch/~arkeller/linux/multi/kernel_user_space_howto-2.html 25 | * https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt 26 | * https://lwn.net/Articles/51437/ 27 | * https://www.kernel.org/doc/Documentation/filesystems/proc.txt 28 | 29 | -------------------------------------------------------------------------------- /top-half.md: -------------------------------------------------------------------------------- 1 | # The top half - hardware interrupt and the bottom half - software interrupt 2 | 3 | During execution of most of hardware interrupt handler, interrupts are disable on local cpu to reduce 4 | likelihood of race condition. That is reason that to divide the work of interrupt handler to 5 | 6 | 1. top half, which runs very fast with interrupt disable 7 | 2. bottom half, which do the remain work and can be preempt-able. 8 | 9 | During several routine operations, the kernel checks whether any bottom halves are scheduled for execution. 10 | If any are waiting, the kernel runs the function do_softirq/invoke_softirq, to execute them. 11 | 12 | The checks are performed during 13 | 14 | 1. exit_irq in do_IRQ: because the top half - hardware interrupt handler usually delegate some work to 15 | bottom half - software interrupt handler. So by calling software interrupt handler at exit of hardware 16 | interrupt handler, the latency is minimum. 17 | 18 | fastcall unsigned int do_IRQ(struct pt_regs * regs) { 19 | irq_enter( ); 20 | … 21 | irq_exit( ); 22 | return 1; 23 | } 24 | 25 | void irq_exit(void) { 26 | ... 27 | sub_preempt_count(IRQ_EXIT_OFFSET); 28 | if (!in_interrupt( ) && local_softirq_pending( )) 29 | invoke_softirq( ); 30 | preempt_enable_no_resched( ); 31 | } 32 | 33 | 2. return from interrupt and exception including system call 34 | 3. 
schedule 35 | -------------------------------------------------------------------------------- /tss.md: -------------------------------------------------------------------------------- 1 | ## TSS 2 | 3 | Task State Segment -`TSS` identifier is stored in `tr` register. Because the processor uses different stack 4 | for user mode (privilege level 3) and protected mode (privilege level 0,1,2), stack switch occurs. 5 | The `TSS` is used to stored segment selector and offset of stack that will be used by a procedure being 6 | called when processor make a call from one privilege level to other higher privilege level. 7 | 8 | In linux `TSS` is per CPU (not per process as originally intended by Intel design). `TSS` gets used when 9 | OS switches from user to kernel mode (Requester Privilege Level - RPL and DPL are different). 10 | 11 | When CPU switches from lower privilege (user mode) to higher privilege (kernel mode), it fetches address 12 | (segment selector and pointer) of kernel stack from TSS. TSS also contains permission bitmap for I/O access 13 | (address used in in/out instruction) from user mode. 14 | 15 | `TSS` segment is 236 bytes long, processes in User Mode are not allowed to access `TSS` segments. `TSS` occupies 16 | part of the kernel data segment. A `TSS` is different for each processor in the system. They are sequentially 17 | stored in the `init_tss` array, each element corresponds to one segment dedicated for one CPU. 18 | 19 | References 20 | 21 | 1. http://en.wikipedia.org/wiki/Task_State_Segment 22 | -------------------------------------------------------------------------------- /udev.md: -------------------------------------------------------------------------------- 1 | # udev 2 | 3 | `udev` dynamically create inode by default in `/dev/` for hotplug devices. udev runs as userspace deamon listening on netlink socket for hardware hotplug event and create/remove a device's inode according to predefined rules as well as load kernel module device's driver if it is required. 4 | 5 | `udev` also create meaningful symlink to easy identification devices. e.g. 6 | 7 | # ls -l /dev/disk/by-uuid/ 8 | total 0 9 | lrwxrwxrwx 1 root root 10 Apr 30 16:08 30051059-44e1-4425-9bc8-9b9ade27822c -> ../../dm-1 10 | lrwxrwxrwx 1 root root 10 Apr 30 16:08 a975dd09-15f7-4945-a2f8-f59de9af125a -> ../../dm-0 11 | lrwxrwxrwx 1 root root 10 Apr 30 16:08 f012a222-cb8d-436f-9291-d559e99ce304 -> ../../sda1 12 | 13 | 14 | **references** 15 | 16 | * https://www.linux.com/news/hardware/peripherals/180950-udev 17 | * https://www.kernel.org/doc/pending/hotplug.txt 18 | --------------------------------------------------------------------------------