├── .gitignore ├── assembly-and-processor.md ├── communication-between-kernel-module-and-process.md ├── container.md ├── data-structure.md ├── from-protocol-handler-to-socket-interface.md ├── ftrace.md ├── get-kernel-state-using-sysrq.md ├── how-debugger-works.md ├── interrupt-exception.md ├── io-cache.md ├── io-consumption-controller.md ├── io-scheduler.md ├── jvm-trouble-shooting.md ├── kernel-mode-stack.md ├── kernel-panic-oops-call-trace.md ├── memory-management-unit.md ├── memory-statistics.md ├── network-congestion-management.md ├── network-data-structure.md ├── nfs-client-cache-coherence.md ├── observe-kernel-state-system-tap.md ├── observe-kernel-state-using-crash.md ├── page-coloring.md ├── paging-unit.md ├── preemption.md ├── process-execution.md ├── process-scheduler.md ├── process-session-process-group-thread-group.md ├── process-state.md ├── processing-input-ip-packet.md ├── reading-source-code.md ├── receive-data-from-socket.md ├── receive-frame-from-network.md ├── samples └── timeslice.c ├── segmentation.md ├── send-frame-to-network.md ├── signals.md ├── slab-allocator.md ├── socket-file-api-implementation.md ├── spin-lock.md ├── strace.md ├── synchronization.md ├── syscall.md ├── sysfs-procfs.md ├── top-half.md ├── tss.md └── udev.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.out 2 | *.o 3 | -------------------------------------------------------------------------------- /assembly-and-processor.md: -------------------------------------------------------------------------------- 1 | # Assembly language and how processor work 2 | 3 | Assuming that segment descriptor for code, data, stack are stored in cs, ds and respectively ss registers. 4 | 5 | **Instruction pointer** 6 | 7 | 1. value is in EIP register (RIP for 64 bits) 8 | 2. processor will execute an instruction stored in the address specified by the EIP 9 | 3. jump loc1 will change EIP to loc1 10 | 11 | **Stack pointer ESP** 12 | 13 | Point to top address of the stack 14 | 15 | 1. value is in ESP register (RSP for 64 bits) 16 | 2. push/pop instruction will store/get a value from an address specified by ESP and decrease/increase ESP by a corresponding size 17 | 3. stack grows from higher to lower memory address 18 | 4. Kernel can directly modify ESP during context switch 19 | 20 | call loc1 will push memory location after the instruction (EIP content) into current stack (memory location specified by ESP register) 21 | and jump to loc1. 22 | 23 | ret will change EIP to a value in an address specified by ESP and increase ESP by a corresponding size 24 | 25 | **Stack base pointer EBP** 26 | 27 | holds address of current stack frame, a range of stack holding passed parameters, return address and local parameters 28 | 29 | 1. value is in EBP register (RBP for 64 bits) 30 | 2. EBP contains address in stack where return address is stored. 31 | 32 | **x86 Function Call** 33 | 34 | ESP and EBP: The Stack Pointer and the Base Pointer 35 | 36 | When a block of code calls a function, it pushes the parameters and the return address on the stack. 37 | Once inside, function sets the base pointer equal to the stack pointer and then places its own internal variables on the stack. 38 | From that point on, the function refers to its parameters and variables relative to the base pointer rather than the stack pointer. 39 | 40 | Why not the stack pointer? For some reason, the stack pointer lousy addressing modes. In 16-bit mode, it cannot be a square-bracket memory offset at all. 
In 32-bit mode, it can be appear in square brackets only by adding an expensive SIB byte to the opcode. 41 | 42 | In your code, there is never a reason to use the stack pointer for anything other than the stack. 43 | 44 | The base pointer, however, is up for grabs. If your routines pass parameters by register instead of by stack (they should), 45 | there is no reason to copy the stack pointer into the base pointer. The base pointer becomes a free register for whatever you need. 46 | 47 | **AT&T assembly syntax** 48 | 49 | example assembly code generated by hotspot jvm 50 | 51 | # {method} 'setA' '(I)V' in 'TestVolatile' 52 | # this: ecx = 'TestVolatile' 53 | # parm0: edx = int 54 | # [sp+0x10] (sp of caller) 55 | 0xb4f48980: cmp 0x4(%ecx),%eax ;...3b4104 56 | 0xb4f48983: jne 0xb4f2cea0 ;...0f851745 feff 57 | ; {runtime_call} 58 | 0xb4f48989: xchg %ax,%ax ;...666690 59 | [Verified Entry Point] 60 | 0xb4f4898c: push %ebp ;...55 61 | 0xb4f4898d: sub $0x8,%esp ;...81ec0800 0000 62 | 0xb4f48993: mov %edx,0x8(%ecx) ;...895108 63 | 0xb4f48996: lock addl $0x0,(%esp) ;...f0830424 00 64 | ;*synchronization entry 65 | ; - TestVolatile::setA@-1 (line 14) 66 | 0xb4f4899b: add $0x8,%esp ;...83c408 67 | 0xb4f4899e: pop %ebp ;...5d 68 | 0xb4f4899f: test %eax,0xb7f62000 ;...85050020 f6b7 69 | ; {poll_return} 70 | 0xb4f489a5: ret ;...c3 71 | 72 | 73 | * register: %ax, %eax 74 | * constant: $0x18, $0x4 75 | * access memory location, its address is in register: (%ecx) 76 | * access memory location, its address is in register + offset: 0x4(%ecx) 77 | * move data from source (left operand) to destination (right operand): mov %edx,0x8(%ecx) 78 | 79 | **References** 80 | 81 | 1. http://www.intel.com/Assets/PDF/manual/253665.pdf 82 | 2. http://sig9.com/articles/att-syntax 83 | 3. http://asm.sourceforge.net/articles/linasm.html#Syntax 84 | 4. http://www.swansontec.com/sregisters.html 85 | -------------------------------------------------------------------------------- /communication-between-kernel-module-and-process.md: -------------------------------------------------------------------------------- 1 | ## Communication between kernel module and user process 2 | 3 | **Driver** 4 | 5 | Driver is installed in kernel. In order for process in userspace to access it, we need to create special device file with major 6 | number in file system. The driver module in turn register file system functions with the major number. 7 | 8 | result = register_chrdev(memory_major, "memory", &memory_fops); 9 | 10 | The result is pseudo file `/proc/devices/memory` . `udev` then create corresponding device file traditional `/dev/memory` 11 | 12 | **procfs file** 13 | 14 | Other way is to register using proc filesystem 15 | 16 | create_proc_read_entry("hello_world", 0, NULL, hello_read_proc, NULL) 17 | 18 | So when user process read, write to the special file, kernel invokes relevant registered file system functions. 19 | 20 | See example in 21 | 22 | * http://www.tldp.org/LDP/lkmpg/2.6/html/lkmpg.html 23 | * http://www.opensourceforu.com/2012/06/some-nifty-udev-rules-and-examples/ 24 | -------------------------------------------------------------------------------- /container.md: -------------------------------------------------------------------------------- 1 | ## Process container 2 | 3 | The concept process container in linux refers to isolated light weight OS environment for running processes (also called OS level virtualization). 
A process container appears as a complete OS (with its own root filesystem, network and hostname) dedicated to a process.

Compared to a virtual machine, a process container (lxc, docker, rocket) provides reasonably good isolation at much lower cost (in terms of time and resources).

The Linux kernel provides the features that container tools use to create and manage containers.

These are

* [namespace related syscalls](http://man7.org/linux/man-pages/man7/namespaces.7.html): prevent collisions when using named resources, e.g. process ids, ipc, network ports, filesystems, hostnames.
* [resource management cgroups](http://en.wikipedia.org/wiki/Cgroups): allocate physical resources (cpu, memory, io) to each container.

**Docker**

Docker is a set of tools and infrastructure enabling the creation and management of containers. Docker is written in golang and uses libcontainer (the previous version used lxc) to access the Linux kernel namespace syscalls and cgroups.

**References**

* http://blog.dotcloud.com/under-the-hood-linux-kernels-on-dotcloud-part
* http://blog.dotcloud.com/kernel-secrets-from-the-paas-garage-part-24-c
* http://www.zdnet.com/article/docker-libcontainer-unifies-linux-container-powers/
--------------------------------------------------------------------------------
/data-structure.md:
--------------------------------------------------------------------------------
## Kernel data structure

References
1. http://isis.poly.edu/kulesh/stuff/src/klist/
--------------------------------------------------------------------------------
/from-protocol-handler-to-socket-interface.md:
--------------------------------------------------------------------------------
# From protocol handler to Socket Interface

The interface between a user space program and the kernel network stack is the socket. We access a socket using

* the socket api, e.g. `connect`, `accept`, `bind`, `listen`, `send`, `receive`, or
* the file api, e.g. `read`, `write`, `close`

The socket api is complete, while the file api offers only a few operations and alone is not enough to make a socket work (there is no `open` for sockets), so applications end up using the socket api.

It is important to know that the syscalls related to sockets use a file descriptor associated with the kernel socket data structure.
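To make that concrete, here is a minimal, hedged user-space sketch (error handling mostly omitted; the target address and port are made up) showing that the descriptor returned by `socket()` works with both the socket api and the file api:

    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* the returned descriptor refers to a struct file wired to the kernel struct socket */
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8080);                 /* hypothetical local service */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {   /* socket api */
            const char *msg = "hello\n";
            send(fd, msg, strlen(msg), 0);           /* socket api */
            write(fd, msg, strlen(msg));             /* file api on the same descriptor */
        }
        close(fd);                                   /* file api */
        return 0;
    }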
13 | 14 | **Socket data structures** 15 | 16 | Socket 17 | 18 | struct socket { 19 | socket_state state; 20 | unsigned long flags; 21 | const struct proto_ops *ops; 22 | struct fasync_struct *fasync_list; 23 | struct file *file; 24 | struct sock *sk; 25 | wait_queue_head_t wait; 26 | short type; 27 | }; 28 | 29 | Sock structure encapsulate internal implementation of socket 30 | 31 | struct sock { 32 | /* 33 | * Now struct inet_timewait_sock also uses sock_common, so please just 34 | * don't add nothing before this first member (__sk_common) --acme 35 | */ 36 | struct sock_common __sk_common; 37 | #define sk_family __sk_common.skc_family 38 | #define sk_state __sk_common.skc_state 39 | #define sk_reuse __sk_common.skc_reuse 40 | #define sk_bound_dev_if __sk_common.skc_bound_dev_if 41 | #define sk_node __sk_common.skc_node 42 | #define sk_bind_node __sk_common.skc_bind_node 43 | #define sk_refcnt __sk_common.skc_refcnt 44 | #define sk_hash __sk_common.skc_hash 45 | #define sk_prot __sk_common.skc_prot 46 | unsigned char sk_shutdown : 2, 47 | sk_no_check : 2, 48 | sk_userlocks : 4; 49 | unsigned char sk_protocol; 50 | unsigned short sk_type; 51 | int sk_rcvbuf; 52 | socket_lock_t sk_lock; 53 | wait_queue_head_t *sk_sleep; 54 | struct dst_entry *sk_dst_cache; 55 | struct xfrm_policy *sk_policy[2]; 56 | rwlock_t sk_dst_lock; 57 | atomic_t sk_rmem_alloc; 58 | atomic_t sk_wmem_alloc; 59 | atomic_t sk_omem_alloc; 60 | struct sk_buff_head sk_receive_queue; 61 | struct sk_buff_head sk_write_queue; 62 | struct sk_buff_head sk_async_wait_queue; 63 | int sk_wmem_queued; 64 | int sk_forward_alloc; 65 | gfp_t sk_allocation; 66 | int sk_sndbuf; 67 | int sk_route_caps; 68 | int sk_gso_type; 69 | int sk_rcvlowat; 70 | unsigned long sk_flags; 71 | unsigned long sk_lingertime; 72 | /* 73 | * The backlog queue is special, it is always used with 74 | * the per-socket spinlock held and requires low latency 75 | * access. Therefore we special case it's implementation. 76 | */ 77 | struct { 78 | struct sk_buff *head; 79 | struct sk_buff *tail; 80 | } sk_backlog; 81 | struct sk_buff_head sk_error_queue; 82 | struct proto *sk_prot_creator; 83 | rwlock_t sk_callback_lock; 84 | int sk_err, 85 | sk_err_soft; 86 | unsigned short sk_ack_backlog; 87 | unsigned short sk_max_ack_backlog; 88 | __u32 sk_priority; 89 | struct ucred sk_peercred; 90 | long sk_rcvtimeo; 91 | long sk_sndtimeo; 92 | struct sk_filter *sk_filter; 93 | void *sk_protinfo; 94 | struct timer_list sk_timer; 95 | struct timeval sk_stamp; 96 | struct socket *sk_socket; 97 | void *sk_user_data; 98 | struct page *sk_sndmsg_page; 99 | struct sk_buff *sk_send_head; 100 | __u32 sk_sndmsg_off; 101 | int sk_write_pending; 102 | void *sk_security; 103 | void (*sk_state_change)(struct sock *sk); 104 | void (*sk_data_ready)(struct sock *sk, int bytes); 105 | void (*sk_write_space)(struct sock *sk); 106 | void (*sk_error_report)(struct sock *sk); 107 | int (*sk_backlog_rcv)(struct sock *sk, 108 | struct sk_buff *skb); 109 | void (*sk_create_child)(struct sock *sk, struct sock *newsk); 110 | void (*sk_destruct)(struct sock *sk); 111 | }; 112 | 113 | 114 | **Create a socket** 115 | 116 | asmlinkage long sys_socket(int family, int type, int protocol) 117 | { 118 | int retval; 119 | struct socket *sock; 120 | 121 | retval = sock_create(family, type, protocol, &sock); 122 | if (retval < 0) 123 | goto out; 124 | 125 | retval = sock_map_fd(sock); /* assign function pointers for file interface*/ 126 | ... 
127 | return retval; 128 | } 129 | 130 | The syscall return a file descriptor associated to the newly created socket 131 | 132 | static int __sock_create(int family, int type, int protocol, struct socket **res, int kern) 133 | { 134 | … 135 | struct socket *sock; 136 | ... 137 | if (!(sock = sock_alloc())) { 138 | ... 139 | if ((err = net_families[family]->create(sock, protocol)) < 0) { 140 | ... 141 | *res = sock; 142 | … 143 | 144 | **initialization of socket function pointers** 145 | 146 | There is a function responsible for initializing socket for each socket family/domain 147 | 148 | struct net_proto_family { 149 | int family; 150 | int (*create)(struct socket *sock, int protocol); 151 | ... 152 | }; 153 | 154 | 155 | int sock_register(struct net_proto_family *ops) 156 | { 157 | ... 158 | if (net_families[ops->family] == NULL) { 159 | net_families[ops->family]=ops; 160 | err = 0; 161 | } 162 | … 163 | 164 | The socket family holding initialization function get registered at the system boot 165 | 166 | static int __init inet_init(void) 167 | { 168 | … 169 | (void)sock_register(&inet_family_ops); 170 | … 171 | 172 | 173 | E.g. of structure of socket family for IPv4 Internet protocols 174 | 175 | static struct net_proto_family inet_family_ops = { 176 | .family = PF_INET, 177 | .create = inet_create, 178 | .owner = THIS_MODULE, 179 | }; 180 | 181 | 182 | static int inet_create(struct socket *sock, int protocol) 183 | { 184 | struct sock *sk; 185 | 186 | 187 | ... 188 | lookup_protocol: 189 | err = -ESOCKTNOSUPPORT; 190 | rcu_read_lock(); 191 | list_for_each_rcu(p, &inetsw[sock->type]) { 192 | answer = list_entry(p, struct inet_protosw, list); 193 | 194 | 195 | /* Check the non-wild match. */ 196 | if (protocol == answer->protocol) { 197 | if (protocol != IPPROTO_IP) 198 | break; 199 | ... 200 | sock->ops = answer->ops; /* the socket operation is initialized from one registered in global variable inetsw */ 201 | answer_prot = answer->prot; /* this will be used internally when ops actually delegate its work to a functions specified here*/ 202 | ... 203 | sk = sk_alloc(PF_INET, GFP_KERNEL, answer_prot, 1); 204 | … 205 | sock_init_data(sock, sk) 206 | …. 207 | if (sk->sk_prot->init) { 208 | err = sk->sk_prot->init(sk); 209 | 210 | 211 | The operation functions are specific to protocol (normally 0 as there is only one protocol) of the family (local, IPv4) 212 | and type of the socket (datagram, stream or raw) 213 | 214 | static int __init inet_init(void) 215 | { 216 | … 217 | for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q) 218 | inet_register_protosw(q); 219 | ... 220 | 221 | static struct inet_protosw inetsw_array[] = 222 | { 223 | { 224 | .type = SOCK_STREAM, 225 | .protocol = IPPROTO_TCP, 226 | .prot = &tcp_prot, 227 | .ops = &inet_stream_ops, 228 | .capability = -1, 229 | .no_check = 0, 230 | .flags = INET_PROTOSW_PERMANENT | 231 | INET_PROTOSW_ICSK, 232 | }, 233 | 234 | 235 | { 236 | .type = SOCK_DGRAM, 237 | .protocol = IPPROTO_UDP, 238 | .prot = &udp_prot, 239 | .ops = &inet_dgram_ops, /*functions here will call those specified in .prot via internal socket*/ 240 | .capability = -1, 241 | .no_check = UDP_CSUM_DEFAULT, 242 | .flags = INET_PROTOSW_PERMANENT, 243 | }, 244 | ... 245 | 246 | 247 | const struct proto_ops inet_dgram_ops = { 248 | .family = PF_INET, 249 | .owner = THIS_MODULE, 250 | .release = inet_release, 251 | .bind = inet_bind, 252 | .connect = inet_dgram_connect, 253 | ... 254 | .recvmsg = sock_common_recvmsg, 255 | ... 
256 | 257 | 258 | void inet_register_protosw(struct inet_protosw *p) 259 | { 260 | ... 261 | answer = NULL; 262 | last_perm = &inetsw[p->type]; 263 | list_for_each(lh, &inetsw[p->type]) { 264 | answer = list_entry(lh, struct inet_protosw, list); 265 | /* Check only the non-wild match. */ 266 | if (INET_PROTOSW_PERMANENT & answer->flags) { 267 | if (protocol == answer->protocol) 268 | break; 269 | last_perm = lh; 270 | } 271 | … 272 | list_add_rcu(&p->list, last_perm); 273 | 274 | 275 | void sock_init_data(struct socket *sock, struct sock *sk) 276 | { 277 | ... 278 | if(sock) 279 | { 280 | sk->sk_type = sock->type; 281 | sk->sk_sleep = &sock->wait; 282 | sock->sk = sk; /* wiring internal socket with socket*/ 283 | } else 284 | sk->sk_sleep = NULL; 285 | .. 286 | sk->sk_state_change = sock_def_wakeup; 287 | sk->sk_data_ready = sock_def_readable; 288 | 289 | 290 | struct sock *sk_alloc(int family, gfp_t priority, 291 | struct proto *prot, int zero_it) 292 | { 293 | struct sock *sk = NULL; 294 | ... 295 | sk->sk_prot = sk->sk_prot_creator = prot; 296 | 297 | struct proto udp_prot = { 298 | .name = "UDP", 299 | .owner = THIS_MODULE, 300 | .close = udp_close, 301 | .connect = ip4_datagram_connect, 302 | .disconnect = udp_disconnect, 303 | .ioctl = udp_ioctl, 304 | .destroy = udp_destroy_sock, 305 | .setsockopt = udp_setsockopt, 306 | .getsockopt = udp_getsockopt, 307 | .sendmsg = udp_sendmsg, 308 | .recvmsg = udp_recvmsg, 309 | .sendpage = udp_sendpage, 310 | .backlog_rcv = __udp_queue_rcv_skb, 311 | .hash = udp_v4_hash, 312 | .unhash = udp_v4_unhash, 313 | .get_port = udp_v4_get_port, 314 | .memory_allocated = &udp_memory_allocated, 315 | .sysctl_mem = sysctl_udp_mem, 316 | .sysctl_wmem = &sysctl_udp_wmem_min, 317 | .sysctl_rmem = &sysctl_udp_rmem_min, 318 | .obj_size = sizeof(struct udp_sock), 319 | #ifdef CONFIG_COMPAT 320 | .compat_setsockopt = compat_udp_setsockopt, 321 | .compat_getsockopt = compat_udp_getsockopt, 322 | #endif 323 | }; 324 | 325 | 326 | References 327 | 328 | 1. http://www.ibm.com/developerworks/linux/library/l-hisock/index.html 329 | 2. http://www.haifux.org/lectures/217/netLec5.pdf 330 | -------------------------------------------------------------------------------- /ftrace.md: -------------------------------------------------------------------------------- 1 | ## FTrace 2 | 3 | Ftrace is kernel built in feature that enables tracing kernel function calls. 4 | 5 | **Quick start** 6 | 7 | change directory to `debugfs` 8 | 9 | # cd /sys/kernel/debug/tracing 10 | 11 | check `current_tracer` 12 | 13 | # cat current_tracer 14 | nop 15 | 16 | enable function trace 17 | 18 | # echo function > current_tracer 19 | 20 | view trace 21 | 22 | # cat trace | less 23 | ... 24 | bash-8329 [000] d... 80604.787465: account_system_time <-__vtime_account_system 25 | bash-8329 [000] d... 80604.787466: cpuacct_account_field <-account_system_time 26 | 27 | disable trace 28 | 29 | # echo nop > current_tracer 30 | 31 | **References** 32 | 33 | * http://lwn.net/Articles/365835/ 34 | * http://lwn.net/Articles/366796/ 35 | -------------------------------------------------------------------------------- /get-kernel-state-using-sysrq.md: -------------------------------------------------------------------------------- 1 | # Get kernel state using sysrq 2 | 3 | 4 | Enable 5 | 6 | 7 | # echo 1 > /proc/sys/kernel/sysrq 8 | 9 | 10 | Generate dump 11 | 12 | 13 | # echo 'm' > /proc/sysrq-trigger 14 | 15 | 16 | Trigger 17 | 18 | 1. m - dump information about memory allocation 19 | 2. 
t - dump thread state information
3. p - dump current CPU registers and flags
4. c - intentionally crash the system (useful for forcing a disk or netdump)
5. s - immediately sync all mounted filesystems
6. u - immediately remount all filesystems read-only
7. b - immediately reboot the machine
8. o - immediately power off the machine (if configured and supported)

View dump

    # vi /var/log/messages

References

1. https://access.redhat.com/kb/docs/DOC-2024
--------------------------------------------------------------------------------
/how-debugger-works.md:
--------------------------------------------------------------------------------
## How debugger works

Debuggers are implemented on top of the system call `ptrace`.

The debugger (in this case the tracer) establishes a parent/child relationship with the process being traced (for the tracee, real_parent and parent then differ) and calls `ptrace` with the pid of the tracee. The tracer can then receive (by calling the system call `wait`/`waitpid`) any signal (except KILL) sent to the tracee.

The debugger asks the kernel to single-step (`PTRACE_SINGLESTEP`) or continue execution of (`PTRACE_CONT`) the traced process.

**breakpoint**

The debugger sets a breakpoint by overwriting the instruction at the desired location (`PTRACE_POKETEXT`) with INT 3 (which raises `SIGTRAP`). It can then examine and modify the contents of memory and registers using `PTRACE_PEEKTEXT`, `PTRACE_PEEKDATA`, `PTRACE_GETREGS`. Finally it rewinds the instruction pointer (`regs.eip -= 1`, `PTRACE_SETREGS`), restores the original instruction and continues execution (`PTRACE_CONT`).

**References**

1. http://eli.thegreenplace.net/2011/01/23/how-debuggers-work-part-1/
2. http://www.linuxjournal.com/article/6100?page=0,2
3. http://linux.die.net/man/2/ptrace
--------------------------------------------------------------------------------
/interrupt-exception.md:
--------------------------------------------------------------------------------
## Interrupt and exception

Exceptions, aka "synchronous interrupts", are produced by the CPU control unit while executing an instruction; the control unit issues them after the instruction terminates. Examples are divide by zero and traps. Programmed exceptions are used to implement syscalls and debugging.

Interrupts, aka "asynchronous interrupts", are generated by other hardware devices at arbitrary times with respect to the CPU clock signals and are routed through the APIC - Advanced Programmable Interrupt Controller.

There is a table of 256 interrupt descriptors, 8 bytes each. Each one holds the address of an interrupt or exception handler. The Interrupt Descriptor Table (IDT) can be located anywhere in memory; its address and length are stored in the idtr register, which is loaded with the lidt assembly instruction.

There are 3 types of descriptors

1. task gate
2. interrupt gate
3. trap gate

**References**

* http://linux.linti.unlp.edu.ar/images/0/0c/ULK3-CAPITULO4-UNNOBA.pdf
--------------------------------------------------------------------------------
/io-cache.md:
--------------------------------------------------------------------------------
## IO cache

Reads and writes to block io (disk) go through the page cache, which is implemented using an LRU/2 scheme. Two lists of pages are maintained: active and inactive.
The Page Frame Reclaiming Algorithm (PFRA) manages cache eviction, which happens only on the inactive list. A page is moved from the inactive to the active list when it is accessed.

Various parameters and the pdflush daemon ensure that cache is evicted to free memory when needed (`dirty_background_ratio`, `dirty_ratio`) and that unsynchronized data does not stay in memory too long (`dirty_expire_centisecs`).

Command to drop cache

    # free && sync && echo 3 > /proc/sys/vm/drop_caches && free

References

1. http://www.westnet.com/~gsmith/content/linux-pdflush.htm
--------------------------------------------------------------------------------
/io-consumption-controller.md:
--------------------------------------------------------------------------------
## IO consumption controller

This is the mechanism used to control the upper limit of IO resources consumed by each process.

References

1. http://stuff.mit.edu:8001/afs/sipb.mit.edu/contrib/linux/Documentation/cgroups/blkio-controller.txt
--------------------------------------------------------------------------------
/io-scheduler.md:
--------------------------------------------------------------------------------
## IO Scheduler

The responsibility of the IO Scheduler is to schedule pending IO requests (reads or writes) against block devices (usually hard disks) "efficiently" with respect to the physical characteristics of the devices. The term schedule in this context means ordering and merging requests and deciding which one goes to the device first. The IO Scheduler has to balance latency against throughput.

Typical IO schedulers are

* Elevator : default in 2.4, no longer used.
* `deadline` : replaces Elevator, guarantees a start service time for a request.
* `noop` : simplest scheduler, using a single queue.
* Anticipatory : idles after a read request to anticipate the next close-by read request. Replaced by CFQ.
* `cfq` - Completely Fair Queuing : default scheduler.

The active IO scheduler is marked with `[]` for each block device

    /home/vagrant# cat /sys/block/sda/queue/scheduler
    noop [deadline] cfq
    /home/vagrant# echo cfq > /sys/block/sda/queue/scheduler
    /home/vagrant# cat /sys/block/sda/queue/scheduler
    noop deadline [cfq]

**Elevator**

It sorts requests by sector number, maintaining a queue of sorted requests and serving them like an elevator to maximize throughput. When new requests with lower sector numbers are constantly inserted at the head of the queue, requests at the end of the queue starve.

**Deadline**

In addition to a queue of requests sorted by block number, it maintains 2 deadline FIFO queues, one for read requests and one for write requests. The deadline for the read queue is 500 ms and for the write queue 5 secs. The IO scheduler looks at the oldest requests; if a deadline has expired, it serves requests from the deadline FIFO queue instead of the sorted queue. The deadline I/O scheduler can outperform the CFQ I/O scheduler for certain multithreaded, database workloads.

**noop**

Noop uses a simple FIFO queue to merge and serve requests; no reordering based on sector number is performed. It assumes that the OS has no productive way to optimize request order due to lack of information about the physical devices.
Examples are SSD disks, where seek time does not depend on the sector number, and Network Attached Storage, RAID and Tagged Command Queuing (TCQ) devices, where the device manages the request queue by itself.

**Anticipatory**

The anticipatory scheduler tries to idle for a short period (e.g. a few ms) after a read operation in anticipation of another close-by read request. It was removed because the same goal can be achieved by tuning `slice_idle` of CFQ.

**CFQ**

Maintains a queue of synchronous requests per process. It allocates a timeslice (based on the IO priority of a process, which is set by the syscall `ioprio_set` or the `ionice` command) for each queue to access the disk. The I/O scheduler visits each queue in a round-robin fashion, servicing requests from the queue until the queue's timeslice is exhausted, or until no more requests remain. In the latter case, the CFQ I/O scheduler will sit idle for a brief period specified by `slice_idle` (by default 10 ms), waiting for a new request on the queue. If the anticipation pays off, the I/O scheduler avoids seeking. If not, the waiting was in vain, and the scheduler moves on to the next process' queue.

Asynchronous requests for all processes are batched together in fewer queues, one per priority.

The CFQ IO scheduler works in a similar way to the CPU scheduler. The details are described in [http://en.wikipedia.org/wiki/Completely_Fair_Scheduler]; it prevents both starvation and hogging.

**References**

* http://en.wikipedia.org/wiki/Noop_scheduler
* http://en.wikipedia.org/wiki/Deadline_scheduler
* http://en.wikipedia.org/wiki/Anticipatory_scheduling
* http://en.wikipedia.org/wiki/CFQ
* http://man7.org/linux/man-pages/man2/ioprio_set.2.html
* http://dom.as/2014/03/31/mongo-io/
* http://www.electricmonk.nl/log/2012/07/30/setting-io-priorities-on-linux/
--------------------------------------------------------------------------------
/jvm-trouble-shooting.md:
--------------------------------------------------------------------------------
# Example of troubleshooting JVM

Identify the JVM process and thread (LWP) that causes a problem, e.g. high CPU usage

    ps -eLo pid,lwp,pcpu,vsz,comm

Run gdb, attach to the process and display the stacktrace of the problematic thread (remember its lwp)

    gdb -p <pid>
    gdb> info threads
    # identify the thread number whose lwp matches the problematic lwp from the ps command
    gdb> thread <thread number> # switch to that thread
    gdb> bt # get stacktrace
    …
--------------------------------------------------------------------------------
/kernel-mode-stack.md:
--------------------------------------------------------------------------------
# Kernel mode stack

Every process (LWP) has its own stack when running in kernel mode. The size of the kernel mode stack is usually 4 or 8 KB (configured at the kernel's compile time). The thread_info structure (usually 56 bytes) is allocated at the bottom of the stack. Note that the stack grows from top to bottom, which means pushing data onto the stack decreases the address in the stack pointer, while popping data from the stack increases it.

One way to get the kernel mode stack size is to use crash

    crash> print sizeof(struct thread_info)
    $5 = 56
    crash> print sizeof(union thread_union)
    $6 = 4096
    crash>

thread_info is stored on the kernel mode stack for reasons of efficiency.
There are only few assembly instructions 21 | required to get address of process which is first field in thread_info struct 22 | 23 | References 24 | 25 | 1. http://notoveryet.wordpress.com/2009/07/09/linux-kernel-bits-which-i-feel-excited-about/ -------------------------------------------------------------------------------- /kernel-panic-oops-call-trace.md: -------------------------------------------------------------------------------- 1 | ## Kernel panic, oops, call trace 2 | 3 | 4 | Unlike kernel panic, which can not be recovered and reboot the system, when the kernel detects a oops’s problem, it 5 | prints kernel state including call trace and kills any offending process. 6 | 7 | 8 | E.g of call trace 9 | 10 | 11 | [] __alloc_pages+0x29f/0x2b1 12 | [] find_or_create_page+0x39/0x72 13 | [] grow_dev_page+0x2a/0x1eb 14 | [] __getblk_slow+0x12e/0x159 15 | [] __getblk+0x3f/0x49 16 | 17 | 18 | Description of each field 19 | 20 | 1. address of function in System.map, kernel use this address to figure out the function name 21 | 2. function name 22 | 3. the offset from start of the function in bytes, we use this to figure out which line was currently running 23 | 4. the size of the function in bytes 24 | 25 | To find the line of function, we need to install crash 26 | 27 | crash> bt 28 | PID: 2489 TASK: d3f86aa0 CPU: 0 COMMAND: "top" 29 | #0 [cd264af4] schedule at c061f412 30 | #1 [cd264b6c] schedule_timeout at c061fb54 31 | #2 [cd264b90] do_select at c0488ac3 32 | #3 [cd264e34] core_sys_select at c0488dc6 33 | #4 [cd264f74] sys_select at c048938d 34 | #5 [cd264fb8] system_call at c0404f44 35 | EAX: ffffffda EBX: 00000001 ECX: bfde8ddc EDX: 00000000 36 | DS: 007b ESI: 00000000 ES: 007b EDI: 08056a00 37 | SS: 007b ESP: bfde8b3c EBP: bfde9488 38 | CS: 0073 EIP: 002ee402 ERR: 0000008e EFLAGS: 00000246 39 | crash> sym c048938d 40 | c048938d (T) sys_select+149 ../debug/kernel-2.6.18/linux-2.6.18.i686/fs/select.c: 407 41 | crash> 42 | 43 | Description of each field 44 | 45 | 1. address of stack at the time of invoking next upper function, by looking at this we know the parameters passing to 46 | 2. name of kernel function 47 | 3. address of an instruction of the function 48 | 49 | system_call & sys_select means the task make call to select the user stack trace would be 50 | 51 | sh-3.2# pstack 2489 52 | #0 0x00d51402 in __kernel_vsyscall () 53 | #1 0x007d9c3d in ___newselect_nocancel () from /lib/libc.so.6 54 | #2 0x08051372 in signal_name_to_number () 55 | #3 0x00724e9c in __libc_start_main () from /lib/libc.so.6 56 | #4 0x08049741 in signal_name_to_number () 57 | 58 | 59 | References 60 | 61 | 1. http://en.wikipedia.org/wiki/System.map 62 | 2. http://madwifi-project.org/wiki/DevDocs/KernelOops 63 | 3. http://magazine.redhat.com/2007/08/15/a-quick-overview-of-linux-kernel-crash-dump-analysis/ 64 | 4. https://access.redhat.com/kb/docs/DOC-2024 -------------------------------------------------------------------------------- /memory-management-unit.md: -------------------------------------------------------------------------------- 1 | ## Memory Management Unit 2 | 3 | 4 | The MMU is hardware that translate logical memory address (used in machine instruction) to the physical one. There are 5 | two step from logical address to linear address by segmentation unit and from linear address to physical address by 6 | paging unit. 7 | 8 | 9 | OS Memory Management Subsystem manages memory by 10 | 11 | 1. 
setup and maintains a metadata about physical memory’ pages (global page descriptor table mem_map) so it will know if 12 | a memory’ page is in used or free, is modified i.e. dirty or clean. 13 | 2. set up/manipulate relevant virtual to physical memory translation tables (per process page directory and table) 14 | 15 | 16 | Linux does not make use of segmentation so the translation table for segmentation is in fact common for all processes. 17 | However the translation table of paging unit is per process so changes each time at context switch. 18 | 19 | 20 | 32 bit linux kernel can use more than 4GB RAM (one process still has limitation of 4 GB, 1GB for kernel mode, 3GB for 21 | user mode) if PAE (Physical Address Extension) is enabled in BIOS 22 | 23 | grep -i PAE /proc/cpuinfo 24 | 25 | and kernel 26 | 27 | grep -i HIGHMEM /boot/config-’uname -r’ 28 | 29 | 30 | Kernel maintain a meta information about memory pages. Due limitation of some hardware architectures (e.g. x86), pages 31 | are grouped in 3 memory zones DMA, NORMAL, HIGH. To transfer data with devices kernel must use DMA zone. 32 | 33 | 34 | Due to limitation of permanent mapping, unlike NORMAL zone, the memory in HIGH zone is not permanently mapped (put into 35 | page global directory) into kernel address space, i.e. is not possible to access directly but it requires first to 36 | allocate then map to kernel using kmap, which returns address of the allocated page. After usage this should be un map 37 | using kunmap. 38 | The common pattern for using high memory is e.g. from reading file 39 | 40 | 41 | int file_read_actor() 42 | ... 43 | kaddr = kmap(page); 44 | left = __copy_to_user(desc->arg.buf, kaddr + offset, size); 45 | kunmap(page); 46 | ... 47 | 48 | or from reading socket 49 | 50 | int skb_copy_datagram_iovec() 51 | ... 52 | vaddr = kmap(page); 53 | err = memcpy_toiovec(to, vaddr + frag->page_offset + 54 | offset - start, copy); 55 | kunmap(page); 56 | … 57 | 58 | 59 | The 64 bit OS version does not have this issue, all memory pages are permanently mapped into NORMAL zone and kmap just return already mapped linear address while kunmap do nothing 60 | 61 | 62 | void *kmap(struct page *page) 63 | { 64 | might_sleep(); 65 | if (!PageHighMem(page)) 66 | return page_address(page); 67 | return kmap_high(page); 68 | } 69 | 70 | 71 | void kunmap(struct page *page) 72 | { 73 | if (in_interrupt()) 74 | BUG(); 75 | if (!PageHighMem(page)) 76 | return; 77 | kunmap_high(page); 78 | } 79 | 80 | 81 | References 82 | 83 | 1. http://www.spack.org/wiki/LinuxRamLimits 84 | 2. http://www.redhat.com/magazine/001nov04/features/vm/ 85 | 3. http://blogs.oracle.com/gverma/2008/03/redhat_linux_kernels_and_proce_1.html 86 | 4. http://lists.us.dell.com/pipermail/linux-poweredge/2005-August/022327.html 87 | 5. 
http://en.wikipedia.org/wiki/Physical_Address_Extension -------------------------------------------------------------------------------- /memory-statistics.md: -------------------------------------------------------------------------------- 1 | # Memory statistics 2 | 3 | cat /proc/meminfo 4 | 5 | MemTotal: 514916 kB 6 | MemFree: 142676 kB 7 | Buffers: 57068 kB 8 | Cached: 214008 kB # Page cache 9 | SwapCached: 4 kB # Cache of swap file 10 | Active: 217180 kB # Active/inactive pages used by PFCA 11 | Inactive: 99832 kB # 12 | HighTotal: 0 kB 13 | HighFree: 0 kB 14 | LowTotal: 514916 kB 15 | LowFree: 142676 kB 16 | SwapTotal: 1048568 kB 17 | SwapFree: 1048552 kB 18 | Dirty: 0 kB # Cached file is modified and need to write to disk 19 | Writeback: 0 kB # Being written to disk 20 | AnonPages: 45932 kB 21 | Mapped: 13576 kB 22 | Slab: 47588 kB 23 | PageTables: 1676 kB 24 | NFS_Unstable: 0 kB 25 | Bounce: 0 kB 26 | CommitLimit: 1306024 kB 27 | Committed_AS: 170724 kB 28 | VmallocTotal: 507896 kB 29 | VmallocUsed: 5440 kB 30 | VmallocChunk: 502304 kB 31 | HugePages_Total: 0 32 | HugePages_Free: 0 33 | HugePages_Rsvd: 0 34 | Hugepagesize: 4096 kB 35 | 36 | 37 | **Cache** 38 | 39 | physical memory used to cache files usually result from calling mmap 40 | 41 | fs/proc/proc_misc.c 42 | ... 43 | cached = global_page_state(NR_FILE_PAGES) - 44 | total_swapcache_pages - i.bufferram; 45 | ... 46 | 47 | mm/filemap.c 48 | ... 49 | int add_to_page_cache(struct page *page, struct address_space *mapping, 50 | pgoff_t offset, gfp_t gfp_mask) 51 | { 52 | ... 53 | __inc_zone_page_state(page, NR_FILE_PAGES); 54 | 55 | 56 | **SwapCached** 57 | 58 | When a page is swapped out, it goes through the same process as when writing a block of file to disk. 59 | So first go to cache and then the pdflush wil write it to block device. The benefit is possible merging 60 | many pages into single write (less IO operation). 61 | 62 | **Active/Inactive** 63 | 64 | Page Frame Reclaiming Algorithm maintains LRU lists of inactive and active page frames (physical memory page) 65 | that can be either assigned to processes or used as cache (excluding free pages). 66 | 67 | fs/proc/proc_misc.c 68 | ... 69 | get_zone_counts(&active, &inactive, &free); 70 | 71 | **Dirty** 72 | 73 | Amount of memory modified by processes and need to be written to files at some point of time. After calling sync, 74 | this should be zero 75 | 76 | fs/proc/proc_misc.c 77 | ... 78 | K(global_page_state(NR_FILE_DIRTY)), 79 | 80 | 81 | **Writeback** 82 | 83 | Amount of memory being written to files. Should be zero, if the disk is fast enough. 84 | 85 | fs/proc/proc_misc.c 86 | ... 87 | K(global_page_state(NR_WRITEBACK)), 88 | 89 | 90 | fs/buffer.c 91 | ... 92 | static int __block_write_full_page(struct inode *inode, struct page *page, 93 | get_block_t *get_block, struct writeback_control *wbc) 94 | ... 95 | set_page_writeback(page); 96 | 97 | **References** 98 | 99 | 1. http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/4/html/Reference_Guie/s2-proc-meminfo.html 100 | 2. fs/proc/proc_misc.c 101 | 3. mm/filemap.c 102 | -------------------------------------------------------------------------------- /network-congestion-management.md: -------------------------------------------------------------------------------- 1 | ## Network congestion management for incoming frames 2 | 3 | There are two technique applied to network congestion management mainly to reduce CPU load. 4 | 5 | 1. 
reduce the number of interrupts: this is done in the driver's interrupt handler by processing multiple frames in one interrupt activation and by polling

2. discard incoming frames as soon as possible: if the driver uses NAPI, which means it manages its own queue of incoming frames, it is up to the driver to handle this. Otherwise the kernel handles it in the `netif_rx` function by watching the average length of the per-CPU queue.
--------------------------------------------------------------------------------
/network-data-structure.md:
--------------------------------------------------------------------------------
# Network data structure

The network implementation uses two fundamental data structures: the socket buffer `sk_buff` and `net_device`.

`sk_buff` is where packets are stored and processed, while `net_device` keeps information about the hardware device and pointers to the driver's functions that put data into an sk_buff as well as get data out of it.

Another important structure is `softnet_data`, which is per CPU and contains the information needed for communication between the top half and the bottom half.

`softnet_data` has the following noticeable fields

1. poll_list is the list of devices that are polled because they have a nonempty receive queue
2. output_queue is the list of devices that have something to transmit
--------------------------------------------------------------------------------
/nfs-client-cache-coherence.md:
--------------------------------------------------------------------------------

## NFS client cache

The NFS client caches both data (the contents of a file) and metadata (file names and directories). When mounted with options that disable the data and metadata caches, every operation on the NFS file system sends a request to the NFS server and blocks until the server notifies completion, so operations basically behave as they would on a local file system.

**close-to-open cache consistency**

Enabling the cache makes things faster, so by default an NFS file system is mounted with caching enabled. Unless we store database files on NFS or allow many NFS clients to read/write the same file concurrently, we usually leave caching enabled. The question is: does the NFS client implementation guarantee cache coherence in any form? It turns out that the NFS v3 client introduced the concept of close-to-open consistency. It works in a scenario like this:

A client creates a new file, closes it and tells another client the file name (by some other means). The other client may or may not see the file, but if it sees the file and reads it, it gets a complete file. That is pretty good behavior.

The NFS v3 client achieves this by flushing the data and metadata caches when closing the file. The reader fetches the file's metadata from the NFS server to get the most up-to-date version and invalidates any cached data when it sees that the metadata has changed (perhaps by looking at the modification time).
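For experimentation, the caching behaviour can be adjusted from the client side with mount options; a hedged sketch (option names are from nfs(5), the export path is hypothetical):

    # default mount: attribute caching and close-to-open consistency enabled
    mount -t nfs server:/export /mnt/nfs
    # disable attribute caching entirely; every access revalidates with the server (slow but strict)
    mount -t nfs -o noac server:/export /mnt/nfs
    # keep caching but shorten the attribute cache timeout to 3 seconds
    mount -t nfs -o actimeo=3 server:/export /mnt/nfs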
13 | 14 | **References** 15 | 16 | * https://www.avidandrew.com/understanding-nfs-caching.html 17 | * https://www.sebastien-han.fr/blog/2012/12/18/noac-performance-impact-on-web-applications/ 18 | * https://serverfault.com/questions/611044/linux-read-disk-cache-and-nfs 19 | * http://nfs.sourceforge.net/ 20 | -------------------------------------------------------------------------------- /observe-kernel-state-system-tap.md: -------------------------------------------------------------------------------- 1 | ## Observe Kernel state - systemtap 2 | 3 | stap - System Tap is tracing tool for both user and kernel. System Tap's goal is to provide full system 4 | observability on production systems. On redhat System Tap requires kernel development, debug info and gcc compiler. 5 | 6 | Example of tracing syscall open 7 | 8 | [root@localhost tap]# cat open.stap 9 | probe syscall.open # whenever any process make syscall open then do at the beginning 10 | { 11 | printf("%s(%d) open(%s)\n", execname(),pid(),argstr) # print name, pid and arguments passed to open 12 | } 13 | probe timer.ms(4000) # watch only 4 seconds 14 | { 15 | exit() 16 | } 17 | 18 | [root@localhost tap]# stap open.stp 19 | pcscd(2487) open("/proc/bus/usb", O_RDONLY|O_DIRECTORY|O_LARGEFILE|O_NONBLOCK) 20 | pcscd(2487) open("/proc/bus/usb/002", O_RDONLY|O_DIRECTORY|O_LARGEFILE|O_NONBLOCK) 21 | pcscd(2487) open("/proc/bus/usb/002/001", O_RDWR) 22 | pcscd(2487) open("/proc/bus/usb/002/001", O_RDWR) 23 | 24 | 25 | The different between stap and strace is the strace perform for one program while stap for all programs. 26 | 27 | References 28 | 29 | 1. http://sourceware.org/systemtap/tutorial/Introduction.html 30 | 2. http://www.redbooks.ibm.com/redpapers/pdfs/redp4469.pdf 31 | 3. http://lwn.net/Articles/315022/ 32 | 33 | -------------------------------------------------------------------------------- /observe-kernel-state-using-crash.md: -------------------------------------------------------------------------------- 1 | ## Observe Kernel state using crash 2 | 3 | The most valuable tool for observing kernel state is crash, which can be used display kernel call trace, 4 | kernel memory structure, signal, irq, net socket, files 5 | 6 | Example of examine kernel stack trace 7 | 8 | crash>ps 9 | PID PPID CPU TASK ST %MEM VSZ RSS COMM 10 | 0 0 0 c068e3c0 RU 0.0 0 0 [swapper] 11 | 1 0 0 dfc01aa0 IN 0.1 2160 688 init 12 | 2 1 0 dfc01550 IN 0.0 0 0 [migration/0] 13 | 3 1 0 dfc01000 IN 0.0 0 0 [ksoftirqd/0] 14 | 15 | 16 | crash> set 1 17 | PID: 1 18 | COMMAND: "init" 19 | TASK: dfc01aa0 [THREAD_INFO: dfc02000] 20 | CPU: 0 21 | STATE: TASK_INTERRUPTIBLE 22 | crash> bt 23 | PID: 1 TASK: dfc01aa0 CPU: 0 COMMAND: "init" 24 | #0 [dfc02af4] schedule at c061f412 25 | #1 [dfc02b6c] schedule_timeout at c061fb54 26 | #2 [dfc02b90] do_select at c0488ac3 27 | #3 [dfc02e34] core_sys_select at c0488dc6 28 | #4 [dfc02f74] sys_select at c048938d 29 | #5 [dfc02fb8] system_call at c0404f44 30 | EAX: 0000008e EBX: 0000000b ECX: bfc8e240 EDX: 00000000 31 | DS: 007b ESI: 00000000 ES: 007b EDI: bfc8e370 32 | SS: 007b ESP: bfc8e20c EBP: bfc8e508 33 | CS: 0073 EIP: 00909402 ERR: 0000008e EFLAGS: 00000246 34 | 35 | 36 | crash>task 37 | PID: 1 TASK: dfc01aa0 CPU: 0 COMMAND: "init" 38 | struct task_struct { 39 | state = 1, 40 | thread_info = 0xdfc02000, 41 | usage = { 42 | counter = 2 43 | }, 44 | flags = 4194560, 45 | lock_depth = -1, 46 | load_weight = 128, 47 | prio = 115, 48 | static_prio = 120, 49 | normal_prio = 115, 50 | run_list = { 51 | next = 0x100100, 52 | prev = 
0x200200 53 | }, 54 | 55 | 56 | crash> struct thread_info 0xdfc02000 57 | struct thread_info { 58 | task = 0xdfc01aa0, 59 | exec_domain = 0xc0693660, 60 | flags = 0, 61 | status = 0, 62 | cpu = 0, 63 | preempt_count = 0, 64 | addr_limit = { 65 | seg = 3221225472 66 | }, 67 | sysenter_return = 0x909410, 68 | 69 | 70 | References 71 | 72 | 1. http://people.redhat.com/anderson/crash_whitepaper/ 73 | 2. http://codeascraft.etsy.com/2012/03/30/kernel-debugging-101/ - very practical useful using of crash 74 | -------------------------------------------------------------------------------- /page-coloring.md: -------------------------------------------------------------------------------- 1 | # Page coloring 2 | 3 | Page coloring is a performance optimization designed to ensure that accesses to contiguous pages 4 | in virtual memory make the best use of the processor cache. 5 | 6 | Memory allocation unit of OS makes effort to avoid page aliasing (same page frame is placed more than one 7 | in cache because different virtual pages are mapped to the same page frame). 8 | 9 | References 10 | 11 | 1. http://en.wikipedia.org/wiki/Cache_coloring 12 | 2. http://en.wikipedia.org/wiki/CPU_cache 13 | 3. http://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/page-coloring-optimizations.html 14 | 4. http://lwn.net/Articles/252125/ 15 | -------------------------------------------------------------------------------- /paging-unit.md: -------------------------------------------------------------------------------- 1 | ## Paging unit 2 | 3 | A memory page can be stored on disk or in a page frame (i.e. physical memory page) of physical memory. 4 | Paging unit of the processor works with fixed size page frame. Paging unit does translation linear address 5 | into physical address using Page table that must be initialize by the OS before enabling Paging unit. 6 | 7 | The 32 bit linear address consists of 8 | 9 | 1. Directory: The most significant 10 bits 10 | 2. Table: The intermediate 10 bits 11 | 3. Offset: The least significant 12 bits - imply that pagesize is 4 KB 12 | 13 | Paging unit it is enabled by setting the PG flag of a control register named cr0. When PG = 0, linear 14 | addresses are interpreted as physical addresses. 15 | 16 | The physical address of Page directory is stored in cr3 register. As each OS process see memory as contiguous 17 | region dedicated to it, the process needs its own Page directory, which is in mm field of task struct. 18 | 19 | crash> task | grep mm 20 | mm = 0xcd742e40, 21 | active_mm = 0xcd742e40, 22 | … 23 | struct mm_struct 0xcd742e40 24 | struct mm_struct { 25 | mmap = 0xce29d128, 26 | mm_rb = { 27 | rb_node = 0xcd64323c 28 | }, 29 | ... 30 | free_area_cache = 3086761984, 31 | pgd = 0xcd743000, 32 | ... 33 | 34 | The Page directory of a process map linear address in both user mode and kernel mode. It is divided into two parts 35 | 36 | 1. user part that maps logical address from 0x0-0xBFFFFFFF is different to each process 37 | 2. kernel part that maps logical address from 0xC0000000 to 0xFFFFFFFF is the same for all processes and equals 38 | master kernel Page Global Directory. 39 | 40 | When a process is created the OS make sure that master kernel Page Global Directory is propagated in the Page 41 | directory of the new created process. 42 | The code running in user mode can only access the first part while running in kernel mode can access both parts. 
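To make the 10/10/12 address split above concrete, here is a minimal, hedged user-space sketch (plain 32-bit, non-PAE layout assumed; illustration code, not kernel code) that extracts the page directory index, page table index and offset from a linear address:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t linear = 0xC0123456;              /* arbitrary example address  */
        uint32_t dir    = (linear >> 22) & 0x3FF;  /* most significant 10 bits   */
        uint32_t table  = (linear >> 12) & 0x3FF;  /* intermediate 10 bits       */
        uint32_t offset =  linear        & 0xFFF;  /* least significant 12 bits  */

        printf("pgd index=%u pte index=%u offset=0x%x\n", dir, table, offset);
        return 0;
    }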
43 | 44 | Because kernel processes does not need to access to the user part, it can use Page table of any processes so 45 | switch to kernel thread does not need to change Page directory table. This is a reason why task struct has 46 | both mm and active_mm pointers. An kernel thread always has mm=NULL while in user process mm equals active_mm. 47 | 48 | crash> task 49 | PID: 0 TASK: c068e3c0 CPU: 0 COMMAND: "swapper" 50 | struct task_struct { 51 | state = 0, 52 | thread_info = 0xc0708000, 53 | usage = { 54 | counter = 2 55 | }, 56 | ... 57 | mm = 0x0, 58 | active_mm = 0x0, 59 | ... 60 | 61 | Beside Address Translation Table, kernel must know the status of physical page frame whether a page is free, 62 | used by any process or has been modified. It keep these information in page descriptor table mem_map. 63 | Each page frame is represented by a 32 bytes descriptor, which mean around 1% of physical memory is waste 64 | (Linux use 4KB size page frame). 65 | 66 | Linear memory accessible from each process can be viewed by looking at its mapping in proc file system. 67 | Each segment has start and end and permission. 68 | 69 | [root@localhost kernel-internal-stuffs]# cat /proc/self/maps 70 | 008e8000-00903000 r-xp 00000000 fd:00 1737604 /lib/ld-2.5.so 71 | 00903000-00904000 r-xp 0001a000 fd:00 1737604 /lib/ld-2.5.so 72 | 00904000-00905000 rwxp 0001b000 fd:00 1737604 /lib/ld-2.5.so 73 | 00907000-00a5a000 r-xp 00000000 fd:00 1737605 /lib/libc-2.5.so 74 | 00a5a000-00a5c000 r-xp 00153000 fd:00 1737605 /lib/libc-2.5.so 75 | 00a5c000-00a5d000 rwxp 00155000 fd:00 1737605 /lib/libc-2.5.so 76 | 00a5d000-00a60000 rwxp 00a5d000 00:00 0 77 | 00a82000-00a83000 r-xp 00a82000 00:00 0 [vdso] 78 | 08048000-0804d000 r-xp 00000000 fd:00 1678405 /bin/cat 79 | 0804d000-0804e000 rw-p 00004000 fd:00 1678405 /bin/cat 80 | 093b7000-093d8000 rw-p 093b7000 00:00 0 [heap] 81 | b7d3c000-b7f3c000 r--p 00000000 fd:00 714328 /usr/lib/locale/locale-archive 82 | b7f3c000-b7f3e000 rw-p b7f3c000 00:00 0 83 | bf820000-bf835000 rw-p bffe9000 00:00 0 [stack] 84 | 85 | or using pmap pid 86 | 87 | [root@localhost kernel-internal-stuffs]# pmap -x 2982 88 | 2982: cscope 89 | Address Kbytes RSS Dirty Mode Mapping 90 | 004af000 4 4 0 r-x-- [ anon ] 91 | 008e8000 108 84 0 r-x-- ld-2.5.so 92 | 00903000 4 4 4 r-x-- ld-2.5.so 93 | 00904000 4 4 4 rwx-- ld-2.5.so 94 | 00907000 1356 440 0 r-x-- libc-2.5.so 95 | 00a5a000 8 8 4 r-x-- libc-2.5.so 96 | 00a5c000 4 4 4 rwx-- libc-2.5.so 97 | 00a5d000 12 12 12 rwx-- [ anon ] 98 | 00a8d000 12 8 0 r-x-- libdl-2.5.so 99 | 00a90000 4 4 0 r-x-- libdl-2.5.so 100 | 00a91000 4 4 4 rwx-- libdl-2.5.so 101 | 05d01000 256 152 0 r-x-- libncurses.so.5.5 102 | 05d41000 32 12 4 rwx-- libncurses.so.5.5 103 | 05d49000 4 4 4 rwx-- [ anon ] 104 | 08048000 300 92 0 r-x-- cscope 105 | 08093000 4 4 4 rw--- cscope 106 | 08094000 104 44 44 rw--- [ anon ] 107 | 09d55000 1208 1208 1208 rw--- [ anon ] 108 | b7fc7000 8 8 8 rw--- [ anon ] 109 | b7fd5000 12 12 12 rw--- [ anon ] 110 | bfc78000 84 32 32 rw--- [ stack ] 111 | -------- ------- ------- ------- ------- 112 | total kB 3532 - - - 113 | 114 | References 115 | 116 | 1. https://patchwork.kernel.org/patch/17020/ 117 | -------------------------------------------------------------------------------- /preemption.md: -------------------------------------------------------------------------------- 1 | ## Preemption 2 | 3 | **User preemption** 4 | 5 | User preemption can occur 6 | 7 | 1. When returning to user-space from a system call 8 | 2. 
When returning to user-space from an interrupt handler 9 | 10 | the kernel provides the `need_resched` flag (one bit in thread_info.flags) to signify whether a reschedule 11 | should be performed. This flag is set by `scheduler_tick()` when a process should be preempted, and by 12 | `try_to_wake_up()` when a process that has a higher priority than the currently running process is awakened. 13 | 14 | static inline int need_resched(void) 15 | { 16 | return unlikely(test_thread_flag(TIF_NEED_RESCHED)); 17 | } 18 | 19 | Upon returning to user-space or returning from an interrupt, the `need_resched` flag is checked. If it is set, 20 | the kernel invokes the `schedule()` which may result in switching context to other task (i.e. other task runs on CPU). 21 | 22 | The flag is per-process, and not simply global, because it is faster to access a value in the process 23 | descriptor (because of the speed of current and high probability of it being cache hot) than a global variable. 24 | 25 | Historically, the flag was global before the 2.2 kernel. In 2.2 and 2.4, the flag was an int inside the 26 | task_struct. In 2.6, it was moved into a single bit of a special flag variable inside the thread_info structure. 27 | 28 | crash> struct task_struct { 29 | state = 0, 30 | thread_info = 0xc2546000, 31 | ... 32 | crash> struct thread_info 0xc2546000 33 | struct thread_info { 34 | task = 0xd1447550, 35 | exec_domain = 0xc0693660, 36 | flags = 128, 37 | ... 38 | 39 | **Kernel Preemption** 40 | 41 | 42 | Non preemptive kernel does not switch a task when it is in kernel mode. Task context switch only happens when the task 43 | voluntarily call `schedule()` (i.e. cooperative kernel) or upon return from kernel mode to user mode (from system call or 44 | interrupt handler) 45 | 46 | Preemptive kernel however can preempt a task at kernel mode if it is safe to reschedule, which usually means the task 47 | holding no lock. The task `preempt_count` increases by 1 when a lock is acquired by a task and decrements by 1 when a 48 | lock is released. 49 | 50 | crash> struct task_struct { 51 | state = 0, 52 | thread_info = 0xc2546000, 53 | ... 54 | 55 | crash> struct thread_info 0xc2546000 56 | struct thread_info { 57 | task = 0xd1447550, 58 | exec_domain = 0xc0693660, 59 | flags = 128, 60 | status = 0, 61 | cpu = 0, 62 | preempt_count = 0, 63 | ... 64 | 65 | 66 | Kernel preemption can occur 67 | 68 | 1. When an interrupt handler exits, before returning to kernel-space. This is a case of an interrupt arises during a syscall 69 | 2. When kernel code becomes preemptible again, which means all the locks that the current task is holding are released, preempt_count returns to zero. The macro preempt_enable() which is called to check whether need_resched is set. If so, the `schedule()` is invoked. 70 | 3. If a task in the kernel explicitly calls `schedule()` 71 | 4. If a task in the kernel blocks which results in a call to `schedule()` 72 | -------------------------------------------------------------------------------- /process-execution.md: -------------------------------------------------------------------------------- 1 | ## Process execution 2 | 3 | The processor can execute instruction of a process in either user mode (Current Privilege Level CPL=3 in 2 bits of cs) 4 | or kernel mode (CPL=0). The task struct holds information about process context in both user mode and kernel mode. 5 | 6 | The process context switch can occur only in kernel mode. 
The contents of all registers used by a process in User Mode 7 | have already been saved on the Kernel Mode stack before performing process switching. This includes the contents of the 8 | ss and esp pair that specifies the User Mode stack pointer address 9 | 10 | The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called 11 | the hardware context . The hardware context is a subset of the process execution context, which includes all information 12 | needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process 13 | descriptor, while the remaining part is saved in the Kernel Mode stack 14 | 15 | When the hardware interrupt raises, processor save hardware context in current stack (can be either in user mode or 16 | kernel mode of any processes) and processor jump to interrupt handler upon return it restore hardware context from current 17 | stack and continue its work. 18 | 19 | Interrupt can be nested, which mean during an interrupt handler other interrupt can occur. Because of that an interrupt 20 | handler must never block, that is, no process switch can take place while an interrupt handler is running. 21 | 22 | In fact, all the data needed to resume a nested kernel control path is stored in the Kernel Mode stack, which is 23 | tightly bound to the current process. So if the kernel switch to other process , Kernel mode stack will be changed 24 | and the code of the interrupt handler messes up. 25 | 26 | References 27 | 28 | 1. http://stackoverflow.com/questions/4732409/context-switch-in-interrupt-handlers 29 | 2. http://stackoverflow.com/questions/1053572/why-kernel-code-thread-executing-in-interrupt-context-cannot-sleep 30 | -------------------------------------------------------------------------------- /process-scheduler.md: -------------------------------------------------------------------------------- 1 | ## Process's scheduler 2 | 3 | Process scheduler is responsible for finding the next eligible task and switching to the context of that task. It has to balance between responsiveness and efficency. The cost of context switch is not negligible. 4 | 5 | The option for tuning in case of `CFS` is `/proc/sys/kernel/sched_min_granularity_ns` and `/proc/sys/kernel/sched_latency_ns`. 6 | 7 | # cat /proc/sys/kernel/sched_min_granularity_ns 8 | 750000 9 | # cat /proc/sys/kernel/sched_latency_ns 10 | 6000000 11 | 12 | Scheduler maintains per CPU run queue and picks next task for particular CPU from its run queue. When a task calls `scheduler()` function, the scheduler code processes run queue associated with the CPU of the calling task. 13 | 14 | **Scheduluer and policy** 15 | 16 | There are mutiple schedulers, each is responsible for a type of tasks. 17 | 18 | stop_sched_class → rt_sched_class → fair_sched_class → idle_sched_class 19 | 20 | * `stop_sched_class` is for kernel tasks that add/remove CPUs dynamically. 21 | * `rt_sched_class` is for real time tasks. 22 | * `fair_sched_class` is for ordinary tasks. Most of tasks are of this type 23 | * `ide_sched_class` is for lowest priority tasks that only run when there are no other runable tasks. 24 | 25 | Linux provide syscall for assign a process to a specific scheduler policy `sched_setscheduler`. 
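For example, a process can move itself into the batch class with that syscall (a minimal sketch; for non-realtime policies `sched_priority` must be 0, and error handling is kept short):

    #define _GNU_SOURCE         /* SCHED_BATCH is Linux specific */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param param = { .sched_priority = 0 };

        if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {   /* 0 = calling process */
            perror("sched_setscheduler");
            return 1;
        }
        printf("policy is now %d\n", sched_getscheduler(0));
        return 0;
    }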
The following are policies applicable for non realtime scheduling 26 | 27 | * `SCHED_OTHER`: the standard round-robin time-sharing policy, this is served by `fair_sched_class` 28 | * `SCHED_BATCH`: for "batch" style execution of processes, this is served by `fair_sched_class` 29 | * `SCHED_IDLE`: for running very low priority background jobs, this is served by `idle_sched_class` 30 | 31 | The realtime scheduling has following policies 32 | 33 | * `SCHED_FIFO`: fifo scheduling 34 | * `SCHED_RR`: round robind scheduling 35 | 36 | Scheduler class assigned to a process can be viewed using `class` field in `ps -o` command 37 | 38 | # ps -e -o pid,cmd,class 39 | PID CMD CLS 40 | 1 /sbin/init TS 41 | 2 [kthreadd] TS 42 | ... 43 | 7 [watchdog/0] FF 44 | 45 | where 46 | 47 | TS SCHED_OTHER B SCHED_BATCH 48 | FF SCHED_FIFO ISO SCHED_ISO 49 | RR SCHED_RR IDL SCHED_IDLE 50 | 51 | We can change scheduler class using command `chrt` 52 | 53 | # ps -o pid,cmd,class,priority -p 1003 54 | PID CMD CLS PRI 55 | 1003 cron TS 20 56 | # chrt -p -b 0 1003 57 | # ps -o pid,cmd,class,priority -p 1003 58 | PID CMD CLS PRI 59 | 1003 cron B 20 60 | 61 | `chrt` can'nt change priority for non realtime scheduler's class. We need to use `renice` 62 | 63 | # renice 5 -p 1003 64 | 1003 (process ID) old priority 10, new priority 5 65 | # ps -o pid,cmd,class,priority,nice -p 1003 66 | PID CMD CLS PRI NI 67 | 1003 cron B 25 5 68 | 69 | **CFS** 70 | 71 | Linux kernel from version 2.6.23 use Comppletly Faire Scheduler - CFS for process scheduling see [http://en.wikipedia.org/wiki/Completely_Fair_Scheduler]. 72 | 73 | `CFS` maintains per-task `vruntime`. As task is on CPU, this value get increased by time spending on CPU. The scheduler function check and pick task with minimum `vruntime` and assign CPU to it. 74 | 75 | `priority` (i.e `nice`) is used as weight when adjusting `vruntime`. 76 | 77 | **References** 78 | 79 | * http://www.linuxjournal.com/magazine/completely-fair-scheduler 80 | * https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt 81 | -------------------------------------------------------------------------------- /process-session-process-group-thread-group.md: -------------------------------------------------------------------------------- 1 | ## Relationship between processes, Session, process group and thread group 2 | 3 | Kernel use this information for delivering signals to a collection of processes/threads. 4 | 5 | When a process is created, it inherits group id of its parent. It can later change its group or create new 6 | group and becomes the group leader by calling setpgid. The group leader’s group id equals its pid. 7 | 8 | `kill(pid_t pid, int sig)` with negative `pid` will send signal to all members of group `-pid`. Similarly `waitpid` 9 | with negative `pid` wil wait for any member of group `-pid`. 10 | 11 | Each group belong to a unique session. First process of a session (also group leader) is the session leader, 12 | its `pid` equals session id. The session can be changed by calling `setsid` but is usually called in login process. 13 | Session has controlling tty (`/dev/tty`), signals generated by `CTRL-C/Z` are delivered to these processes of the 14 | foreground group of that session. 15 | 16 | When a process is created, its ordinary thread id equals process id. When additional thread is created with 17 | CLONE_THREAD it has different thread id but same process id, which is called thread group id. The `gettid` returns 18 | thread id while getpid returns thread group id. 
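A small illustration of the difference (a sketch assuming a glibc without the `gettid()` wrapper, hence the raw syscall):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        /* same getpid() (thread group id), different gettid() */
        printf("worker: pid=%d tid=%ld\n", getpid(), syscall(SYS_gettid));
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        printf("main:   pid=%d tid=%ld\n", getpid(), syscall(SYS_gettid));
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }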
`tkill` is used to send signal to a thread. 19 | 20 | References 21 | 22 | 1. http://www.win.tue.nl/~aeb/linux/lk/lk-10.html#ss10.2 23 | 2. http://linux.die.net/man/2/kill 24 | 3. http://linux.die.net/man/2/waitpid 25 | 4. http://linux.die.net/man/2/clone 26 | -------------------------------------------------------------------------------- /process-state.md: -------------------------------------------------------------------------------- 1 | ## Process state 2 | 3 | Process state can be observed by `ps` command 4 | 5 | root@gerrit01:/home/vagrant# ps auxf 6 | USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 7 | root 2 0.0 0.0 0 0 ? S Mar19 0:00 [kthreadd] 8 | root 3 0.0 0.0 0 0 ? S Mar19 0:01 \_ [ksoftirqd/0] 9 | root 5 0.0 0.0 0 0 ? S< Mar19 0:00 \_ [kworker/0:0H] 10 | 11 | Process state can be one of the following 12 | 13 | PROCESS STATE CODES 14 | D uninterruptible sleep (usually IO) 15 | R running or runnable (on run queue) 16 | S interruptible sleep (waiting for an event to complete) 17 | T stopped, either by a job control signal or because it is being traced 18 | W paging (not valid since the 2.6.xx kernel) 19 | X dead (should never be seen) 20 | Z defunct ("zombie") process, terminated but not reaped by its parent 21 | For BSD formats and when the stat keyword is used, additional characters may be displayed: 22 | < high-priority (not nice to other users) 23 | N low-priority (nice to other users) 24 | L has pages locked into memory (for real-time and custom IO) 25 | s is a session leader 26 | l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do) 27 | + is in the foreground process group 28 | 29 | The most interesting is uninterruptible sleep `D`, which can be watched using 30 | 31 | $while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 1; done 32 | 33 | Process in in uninterruptible sleep, means that it runs in kernel mode and can't be preempted (see [preemption](preemption.md)). 34 | The CPU is not used but can't be reused to run other tasks because kernel either is holding lock or 35 | its data structure is not protected. It is a sign of deficiency of certain parts of the kernel and 36 | should be avoided as much as possible. 37 | 38 | References 39 | 40 | * http://stackoverflow.com/questions/223644/what-is-an-uninterruptable-process 41 | -------------------------------------------------------------------------------- /processing-input-ip-packet.md: -------------------------------------------------------------------------------- 1 | # processing input IPv4 packet 2 | 3 | Frame carrying IP packet is handled by ip_rcv function, which is registered during kernel initialization 4 | 5 | static struct packet_type ip_packet_type = { 6 | .type = __constant_htons(ETH_P_IP), 7 | .func = ip_rcv, 8 | .gso_send_check = inet_gso_send_check, 9 | .gso_segment = inet_gso_segment, 10 | }; 11 | 12 | static int __init inet_init(void) 13 | { 14 | … 15 | dev_add_pack(&ip_packet_type); 16 | 17 | 18 | void dev_add_pack(struct packet_type *pt) 19 | { 20 | int hash; 21 | 22 | 23 | spin_lock_bh(&ptype_lock); 24 | if (pt->type == htons(ETH_P_ALL)) { 25 | netdev_nit++; 26 | list_add_rcu(&pt->list, &ptype_all); 27 | } else { 28 | hash = ntohs(pt->type) & 15; 29 | list_add_rcu(&pt->list, &ptype_base[hash]); 30 | } 31 | spin_unlock_bh(&ptype_lock); 32 | } 33 | 34 | 35 | `ip_packet_type` get registered in global variable `ptype_base`, which is then used in function 36 | 37 | int netif_receive_skb(struct sk_buff *skb) 38 | { 39 | ... 
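        /*
         * Handlers were registered above via dev_add_pack(): ETH_P_ALL taps go
         * to ptype_all, everything else is hashed by the low four bits of the
         * protocol type into one of the 16 ptype_base[] buckets scanned below.
         */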
40 | type = skb->protocol; 41 | /* normal protocol handlers are registered in ptype_base */ 42 | list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) { 43 | … 44 | ret = deliver_skb(skb, pt_prev, orig_dev); 45 | 46 | 47 | Due to presence of netfilter firewall component, the `ip_rcv` however just perform sanity check 48 | (e.g. ip version, checksum, drop frame to other host), then invoke netfilter hook if passed, `ip_rcv_finish` 49 | will perform the main of work (decide of local delivery vs. forward, parse ip options). 50 | 51 | int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 52 | { 53 | struct iphdr *iph; 54 | u32 len; 55 | 56 | 57 | /* When the interface is in promisc. mode, drop all the crap 58 | * that it receives, do not try to analyse it. 59 | */ 60 | if (skb->pkt_type == PACKET_OTHERHOST) 61 | goto drop; 62 | 63 | 64 | IP_INC_STATS_BH(IPSTATS_MIB_INRECEIVES); 65 | 66 | 67 | if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL) { 68 | IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS); 69 | goto out; 70 | } 71 | 72 | 73 | if (!pskb_may_pull(skb, sizeof(struct iphdr))) 74 | goto inhdr_error; 75 | 76 | 77 | iph = skb->nh.iph; 78 | 79 | 80 | /* 81 | * RFC1122: 3.1.2.2 MUST silently discard any IP frame that fails the checksum. 82 | * 83 | * Is the datagram acceptable? 84 | * 85 | * 1. Length at least the size of an ip header 86 | * 2. Version of 4 87 | * 3. Checksums correctly. [Speed optimisation for later, skip loopback checksums] 88 | * 4. Doesn't have a bogus length 89 | */ 90 | 91 | 92 | if (iph->ihl < 5 || iph->version != 4) 93 | goto inhdr_error; 94 | 95 | 96 | if (!pskb_may_pull(skb, iph->ihl*4)) 97 | goto inhdr_error; 98 | 99 | 100 | iph = skb->nh.iph; 101 | 102 | 103 | if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl))) 104 | goto inhdr_error; 105 | 106 | 107 | len = ntohs(iph->tot_len); 108 | if (skb->len < len) { 109 | IP_INC_STATS_BH(IPSTATS_MIB_INTRUNCATEDPKTS); 110 | goto drop; 111 | } else if (len < (iph->ihl*4)) 112 | goto inhdr_error; 113 | 114 | 115 | /* Our transport medium may have padded the buffer out. Now we know it 116 | * is IP we can trim to the true length of the frame. 117 | * Note this now means skb->len holds ntohs(iph->tot_len). 118 | */ 119 | if (pskb_trim_rcsum(skb, len)) { 120 | IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS); 121 | goto drop; 122 | } 123 | 124 | 125 | /* Remove any debris in the socket control block */ 126 | memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); 127 | 128 | 129 | return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, 130 | ip_rcv_finish); 131 | 132 | 133 | inhdr_error: 134 | IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS); 135 | drop: 136 | kfree_skb(skb); 137 | out: 138 | return NET_RX_DROP; 139 | } 140 | 141 | 142 | static inline int ip_rcv_finish(struct sk_buff *skb) 143 | { 144 | struct iphdr *iph = skb->nh.iph; 145 | struct rtable *rt; 146 | 147 | 148 | /* 149 | * Initialise the virtual path cache for the packet. It describes 150 | * how the packet travels inside Linux networking. 
151 | */ 152 | if (skb->dst == NULL) { 153 | int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, 154 | skb->dev); 155 | if (unlikely(err)) { 156 | if (err == -EHOSTUNREACH) 157 | IP_INC_STATS_BH(IPSTATS_MIB_INADDRERRORS); 158 | else if (err == -ENETUNREACH) 159 | IP_INC_STATS_BH(IPSTATS_MIB_INNOROUTES); 160 | goto drop; 161 | } 162 | } 163 | 164 | 165 | #ifdef CONFIG_NET_CLS_ROUTE 166 | if (unlikely(skb->dst->tclassid)) { 167 | struct ip_rt_acct *st = ip_rt_acct + 256*smp_processor_id(); 168 | u32 idx = skb->dst->tclassid; 169 | st[idx&0xFF].o_packets++; 170 | st[idx&0xFF].o_bytes+=skb->len; 171 | st[(idx>>16)&0xFF].i_packets++; 172 | st[(idx>>16)&0xFF].i_bytes+=skb->len; 173 | } 174 | #endif 175 | 176 | 177 | if (iph->ihl > 5 && ip_rcv_options(skb)) 178 | goto drop; 179 | 180 | 181 | rt = (struct rtable*)skb->dst; 182 | if (rt->rt_type == RTN_MULTICAST) 183 | IP_INC_STATS_BH(IPSTATS_MIB_INMCASTPKTS); 184 | else if (rt->rt_type == RTN_BROADCAST) 185 | IP_INC_STATS_BH(IPSTATS_MIB_INBCASTPKTS); 186 | 187 | 188 | return dst_input(skb); 189 | 190 | 191 | drop: 192 | kfree_skb(skb); 193 | return NET_RX_DROP; 194 | } 195 | 196 | The function ip_route_input intefaces with routing system to decide of whether local delivery or forward 197 | 198 | int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr, 199 | u8 tos, struct net_device *dev) 200 | { 201 | … 202 | return ip_route_input_slow(skb, daddr, saddr, tos, dev); 203 | } 204 | 205 | 206 | static int ip_route_input_slow(struct sk_buff *skb, u32 daddr, u32 saddr, 207 | u8 tos, struct net_device *dev) 208 | { 209 | struct fib_result res; 210 | struct in_device *in_dev = in_dev_get(dev); 211 | 212 | 213 | … 214 | err = ip_mkroute_input(skb, &res, &fl, in_dev, daddr, saddr, tos); /* this path leads to packet forward */ 215 | ... 216 | 217 | 218 | local_input: 219 | rth = dst_alloc(&ipv4_dst_ops); 220 | if (!rth) 221 | 222 | 223 | … 224 | rth->u.dst.input= ip_local_deliver; /* assign function for local deliver the ip packet */ 225 | 226 | 227 | 228 | 229 | static inline int ip_mkroute_input(struct sk_buff *skb, 230 | struct fib_result* res, 231 | const struct flowi *fl, 232 | struct in_device *in_dev, 233 | u32 daddr, u32 saddr, u32 tos) 234 | … 235 | err = __mkroute_input(skb, res, in_dev, daddr, saddr, tos, 236 | &rth); 237 | ... 238 | 239 | 240 | 241 | 242 | static inline int __mkroute_input(struct sk_buff *skb, 243 | struct fib_result* res, 244 | struct in_device *in_dev, 245 | u32 daddr, u32 saddr, u32 tos, 246 | struct rtable **result) 247 | … 248 | rth->u.dst.input = ip_forward; /* assign function for forward ip packet*/ 249 | ... 
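So routing stores either `ip_local_deliver` or `ip_forward` in `u.dst.input`, and the `dst_input(skb)` call at the end of `ip_rcv_finish` simply dispatches through that pointer. A simplified sketch of the idea (not the exact kernel source):

    static inline int dst_input(struct sk_buff *skb)
    {
        /* invoke whichever handler routing selected:
         * ip_local_deliver for local traffic, ip_forward otherwise */
        return skb->dst->input(skb);
    }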
250 | 251 | 252 | The `ip_local_delivery` delivers packet to L4 protocol handler, which is registered in global variable `inet_protos` 253 | 254 | struct net_protocol *inet_protos[MAX_INET_PROTOS]; 255 | 256 | 257 | int inet_add_protocol(struct net_protocol *prot, unsigned char protocol) 258 | { 259 | int hash, ret; 260 | 261 | 262 | hash = protocol & (MAX_INET_PROTOS - 1); 263 | 264 | 265 | spin_lock_bh(&inet_proto_lock); 266 | if (inet_protos[hash]) { 267 | ret = -1; 268 | } else { 269 | inet_protos[hash] = prot; 270 | ret = 0; 271 | } 272 | spin_unlock_bh(&inet_proto_lock); 273 | 274 | 275 | return ret; 276 | } 277 | 278 | 279 | static struct net_protocol icmp_protocol = { 280 | .handler = icmp_rcv, }; 281 | 282 | 283 | static int __init inet_init(void) 284 | { 285 | … 286 | if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0) 287 | printk(KERN_CRIT "inet_init: Cannot add ICMP protocol\n"); 288 | 289 | 290 | The `ip_local_deliver` use netfilter hook to check firewall rule before calling `ip_local_delivery_finish` 291 | 292 | 293 | int ip_local_deliver(struct sk_buff *skb) 294 | { 295 | /* 296 | * Reassemble IP fragments. 297 | */ 298 | 299 | 300 | if (skb->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) { 301 | skb = ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER); 302 | if (!skb) 303 | return 0; 304 | } 305 | 306 | 307 | return NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb, skb->dev, NULL, 308 | ip_local_deliver_finish); 309 | } 310 | 311 | 312 | `ip_local_deliver_finish` find L4 protocol in L3 protocol (ip) header 313 | 314 | static inline int ip_local_deliver_finish(struct sk_buff *skb) 315 | { 316 | … 317 | int protocol = skb->nh.iph->protocol; 318 | … 319 | hash = protocol & (MAX_INET_PROTOS - 1); 320 | … 321 | if ((ipprot = rcu_dereference(inet_protos[hash])) != NULL) { 322 | int ret; 323 | 324 | 325 | if (!ipprot->no_policy) { 326 | if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) { 327 | kfree_skb(skb); 328 | goto out; 329 | } 330 | nf_reset(skb); 331 | } 332 | ret = ipprot->handler(skb); /*invoke L4 protocol handler */ 333 | if (ret < 0) { 334 | protocol = -ret; 335 | goto resubmit; 336 | } 337 | IP_INC_STATS_BH(IPSTATS_MIB_INDELIVERS); 338 | 339 | 340 | static struct net_protocol udp_protocol = { 341 | .handler = udp_rcv, 342 | .err_handler = udp_err, 343 | .no_policy = 1, 344 | }; 345 | 346 | 347 | static int __init inet_init(void) 348 | { 349 | … 350 | if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0) 351 | printk(KERN_CRIT "inet_init: Cannot add UDP protocol\n"); 352 | ... 353 | 354 | 355 | int udp_rcv(struct sk_buff *skb) 356 | { 357 | … 358 | struct sock *sk; /* internal representation of socket*/ 359 | 360 | 361 | ... 362 | sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex); 363 | 364 | 365 | if (sk != NULL) { 366 | int ret = udp_queue_rcv_skb(sk, skb); 367 | sock_put(sk); 368 | 369 | 370 | static __inline__ struct sock *udp_v4_lookup(u32 saddr, u16 sport, 371 | u32 daddr, u16 dport, int dif) 372 | { 373 | struct sock *sk; 374 | 375 | 376 | read_lock(&udp_hash_lock); 377 | sk = udp_v4_lookup_longway(saddr, sport, daddr, dport, dif); 378 | if (sk) 379 | sock_hold(sk); 380 | read_unlock(&udp_hash_lock); 381 | return sk; 382 | } 383 | 384 | 385 | This method try to find sock that satisfies condition of the packet in term of source , destination address and port 386 | 387 | 388 | static struct sock *udp_v4_lookup_longway(u32 saddr, u16 sport, 389 | u32 daddr, u16 dport, int dif) 390 | { 391 | struct sock *sk, *result = NULL; 392 | ... 
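        /*
         * Walk the hash bucket chosen by the local (destination) port and score
         * every candidate sock on how many of saddr/sport/daddr/dport and the
         * device index it matches; the best-scoring sock is returned.
         */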
393 | 394 | 395 | sk_for_each(sk, node, &udp_hash[hnum & (UDP_HTABLE_SIZE - 1)]) { 396 | struct inet_sock *inet = inet_sk(sk); 397 | … 398 | 399 | The method that place packet to socket queue for user space program 400 | 401 | static int udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb) 402 | { 403 | … 404 | struct udp_sock *up = udp_sk(sk); 405 | … 406 | if (!sock_owned_by_user(sk)) 407 | rc = __udp_queue_rcv_skb(sk, skb); 408 | else 409 | sk_add_backlog(sk, skb); 410 | 411 | 412 | static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) 413 | { 414 | int rc; 415 | 416 | 417 | if ((rc = sock_queue_rcv_skb(sk, skb)) < 0) { 418 | /* Note that an ENOMEM error is charged twice */ 419 | if (rc == -ENOMEM) 420 | UDP_INC_STATS_BH(UDP_MIB_INERRORS); 421 | goto drop; 422 | } 423 | … 424 | 425 | int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) 426 | { 427 | … 428 | skb_queue_tail(&sk->sk_receive_queue, skb); 429 | if (!sock_flag(sk, SOCK_DEAD)) 430 | sk->sk_data_ready(sk, skb_len); 431 | 432 | The sk_data_ready is initialized when creating socket and its internal representation 433 | 434 | struct sock { 435 | ... 436 | void (*sk_data_ready)(struct sock *sk, int bytes); 437 | ... 438 | 439 | 440 | void sock_init_data(struct socket *sock, struct sock *sk) 441 | { 442 | ... 443 | sk->sk_state_change = sock_def_wakeup; 444 | sk->sk_data_ready = sock_def_readable; 445 | 446 | 447 | 448 | 449 | This function wake up processes sleeping in socket queue 450 | 451 | static void sock_def_readable(struct sock *sk, int len) 452 | { 453 | read_lock(&sk->sk_callback_lock); 454 | if (sk_has_sleeper(sk)) 455 | wake_up_interruptible(sk->sk_sleep); 456 | sk_wake_async(sk,1,POLL_IN); 457 | read_unlock(&sk->sk_callback_lock); 458 | } 459 | 460 | References 461 | 462 | 1. http://www.cyberciti.biz/files/linux-kernel/Documentation/networking/ip-sysctl.txt 463 | -------------------------------------------------------------------------------- /reading-source-code.md: -------------------------------------------------------------------------------- 1 | ## Reading source code 2 | 3 | Kernel use extensively c extension and macro e.g. `typeof/__typeof__` , `__attribute__` 4 | 5 | #define __get_cpu_var(var) per_cpu__##var 6 | 7 | static DEFINE_PER_CPU(struct socket *, __icmp_socket) = NULL; 8 | 9 | /* Separate out the type, so (int[3], foo) works. */ 10 | #define DEFINE_PER_CPU(type, name) \ 11 | __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name 12 | 13 | crash> per_cpu____icmp_socket 14 | PER-CPU DATA TYPE: 15 | struct socket *per_cpu____icmp_socket; 16 | PER-CPU ADDRESSES: 17 | [0]: c1406ac0 18 | 19 | crash> struct socket per_cpu____icmp_socket 20 | struct socket { 21 | state = 3518432990, 22 | flags = 2251442003, 23 | ops = 0xe9518e15, 24 | fasync_list = 0x671341b9, 25 | file = 0x41b13d1, 26 | sk = 0xf3e74312, 27 | wait = { 28 | lock = { 29 | raw_lock = { 30 | slock = 1482566028 31 | } 32 | }, 33 | task_list = { 34 | next = 0xa9ea9e0f, 35 | prev = 0x5b6959e4 36 | } 37 | }, 38 | type = -24496 39 | } 40 | 41 | `__attribute__(__section__)` is used to place global/static memory variable into non-default segment so the linker later will map them to appropriate desired memory location. An example is a variable that maps to DMA location of device so read/write to that variable mean get/send data to the device. 
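A hypothetical user-space example of the same construct (the section name `.mydata` is made up for illustration):

    /* keep this variable out of the default .data/.bss sections so a linker
     * script can map the .mydata section to a chosen address */
    static unsigned long frames_seen __attribute__((__section__(".mydata"))) = 0;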
42 | 43 | unsigned long long denotes 64 bit integer in 32 bit architecture 44 | 45 | fastcall indicates that compiler will try to use registers for passing parameters to the function instead of stack as normal. 46 | 47 | Some code of kernel is platform dependent so are written in assembly language 48 | 49 | References 50 | 51 | 1. http://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html#C-Extensions 52 | 2. http://en.wikipedia.org/wiki/X86_assembly_language 53 | 3. http://www.ibm.com/developerworks/library/l-gas-nasm.html 54 | 4. http://lxr.linux.no/ 55 | -------------------------------------------------------------------------------- /receive-data-from-socket.md: -------------------------------------------------------------------------------- 1 | # Receive data from socket 2 | 3 | asmlinkage long sys_recvfrom(int fd, void __user * ubuf, size_t size, unsigned flags, 4 | struct sockaddr __user *addr, int __user *addr_len) 5 | { 6 | … 7 | sock_file = fget_light(fd, &fput_needed); 8 | if (!sock_file) 9 | return -EBADF; 10 | ... 11 | sock = sock_from_file(sock_file, &err); 12 | if (!sock) 13 | goto out; 14 | … 15 | err=sock_recvmsg(sock, &msg, size, flags); 16 | … 17 | 18 | 19 | The syscall first look for a socket from file descriptor and call relevent function using the socket data structure 20 | 21 | int sock_recvmsg(struct socket *sock, struct msghdr *msg, 22 | size_t size, int flags) 23 | { 24 | ... 25 | ret = __sock_recvmsg(&iocb, sock, msg, size, flags); 26 | ... 27 | } 28 | 29 | 30 | static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock, 31 | struct msghdr *msg, size_t size, int flags) 32 | { 33 | ... 34 | err = sock->ops->recvmsg(iocb, sock, msg, size, flags); 35 | … 36 | 37 | 38 | ops is protocol operation struct holding pointers to functions operating on socket file descriptor. 39 | When creating socket this field is filled depending mostly on domain/family and type 40 | 41 | int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock, 42 | struct msghdr *msg, size_t size, int flags) 43 | { 44 | struct sock *sk = sock->sk; 45 | int addr_len = 0; 46 | int err; 47 | /* this will delegate its work to internal sock*/ 48 | err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT, 49 | flags & ~MSG_DONTWAIT, &addr_len); 50 | if (err >= 0) 51 | msg->msg_namelen = addr_len; 52 | return err; 53 | } 54 | 55 | static int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, 56 | size_t len, int noblock, int flags, int *addr_len) 57 | { 58 | struct inet_sock *inet = inet_sk(sk); 59 | struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name; 60 | struct sk_buff *skb; 61 | int copied, err; 62 | int peeked; 63 | 64 | 65 | /* 66 | * Check any passed addresses 67 | */ 68 | if (addr_len) 69 | *addr_len=sizeof(*sin); 70 | 71 | 72 | if (flags & MSG_ERRQUEUE) 73 | return ip_recv_error(sk, msg, len); 74 | 75 | 76 | try_again: 77 | skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0), 78 | &peeked, &err); 79 | ... 80 | 81 | 82 | struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags, 83 | int *peeked, int *err) 84 | { 85 | ... 86 | 87 | 88 | timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); 89 | … 90 | do { 91 | /* Again only user level code calls this function, so nothing 92 | * interrupt level will suddenly eat the receive_queue. 93 | * 94 | * Look at current nfs client by the way... 95 | * However, this function was corrent in any case. 
8) 96 | */ 97 | 98 | 99 | unsigned long cpu_flags; 100 | 101 | 102 | if (flags & MSG_PEEK) { 103 | spin_lock_irqsave(&sk->sk_receive_queue.lock, 104 | cpu_flags); 105 | skb = skb_peek(&sk->sk_receive_queue); 106 | if (skb) { 107 | *peeked = skb->peeked; 108 | skb->peeked = 1; 109 | atomic_inc(&skb->users); 110 | } 111 | spin_unlock_irqrestore(&sk->sk_receive_queue.lock, 112 | cpu_flags); 113 | } else { 114 | skb = skb_dequeue(&sk->sk_receive_queue); 115 | if (skb) 116 | *peeked = skb->peeked; 117 | } 118 | if (skb) 119 | return skb; 120 | 121 | 122 | /* User doesn't want to wait */ 123 | error = -EAGAIN; 124 | if (!timeo) 125 | goto no_packet; 126 | 127 | 128 | } while (!wait_for_packet(sk, err, &timeo)); 129 | … 130 | 131 | 132 | static int wait_for_packet(struct sock *sk, int *err, long *timeo_p) 133 | { 134 | int error; 135 | DEFINE_WAIT(wait); 136 | 137 | 138 | prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE); 139 | ... 140 | *timeo_p = schedule_timeout(*timeo_p); 141 | 142 | 143 | 144 | References 145 | 146 | 1. http://www.ibm.com/developerworks/linux/library/l-hisock/index.html 147 | 2. http://www.haifux.org/lectures/217/netLec5.pdf 148 | -------------------------------------------------------------------------------- /receive-frame-from-network.md: -------------------------------------------------------------------------------- 1 | # Receive frame from network 2 | 3 | The driver code need to be written in such way to ensure minimize frame’s lost at the same 4 | time giving CPU to other processes under high network load. 5 | 6 | There are two techniques used by kernel to know about incoming frame a) interrupt and b) polling. 7 | While polling (when timer expires) is non efficient under low traffic, interrupt is non efficient under high traffic. 8 | 9 | Linux kernel NAPI - New API combines both techniques to archive low latency and fairness. However there are still 10 | many drivers that use old API employing only interrupt technique. 11 | 12 | When a frame is received in NIC, it raises interrupt that is delivered to one of CPU. With old API, 13 | the interrupt handler (implemented by driver e.g. `vortex_rx`) then transfer the frame from NIC buffer placing it into 14 | host memory by calling `netif_rx`. 15 | 16 | static irqreturn_t 17 | vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs) 18 | { 19 | … 20 | if (status & RxComplete) 21 | vortex_rx(dev); 22 | ... 23 | 24 | 25 | static int vortex_rx(struct net_device *dev) 26 | { 27 | … 28 | skb = dev_alloc_skb(pkt_len + 5); 29 | … 30 | skb->protocol = eth_type_trans(skb, dev); /* figure out protocol handler */ 31 | netif_rx(skb); 32 | ... 33 | 34 | 35 | int netif_rx(struct sk_buff *skb) 36 | { 37 | ... 38 | queue = &__get_cpu_var(softnet_data); 39 | ... 40 | enqueue: 41 | dev_hold(skb->dev); 42 | __skb_queue_tail(&queue->input_pkt_queue, skb); 43 | ... 44 | netif_rx_schedule(&queue->backlog_dev); /* this is fake dev*/ 45 | 46 | 47 | The `netif_rx` adds frame to per CPU input queue `softnet_data->input_pkt_queue` and call `netif_rx_schedule` to put 48 | a signal for bottom haft `netif_rx_action` by adding a fake `backlog_dev` device in to `softnet_data->poll_list`. 
49 | 50 | void __netif_rx_schedule(struct net_device *dev) 51 | { 52 | unsigned long flags; 53 | 54 | 55 | local_irq_save(flags); 56 | dev_hold(dev); 57 | list_add_tail(&dev->poll_list, &__get_cpu_var(softnet_data).poll_list); 58 | if (dev->quota < 0) 59 | dev->quota += dev->weight; 60 | else 61 | dev->quota = dev->weight; 62 | __raise_softirq_irqoff(NET_RX_SOFTIRQ); /* trigger net_rx_action */ 63 | local_irq_restore(flags); 64 | } 65 | 66 | 67 | The list structure in kernel is kind of embedded container inside data structure, so adding `&dev->poll_list` 68 | into `softnet_data->poll_list` means add the dev into the list. 69 | 70 | Example of NAPI is tg3_rx of broadcom driver, which copy frame from NIC to the input queue managed by the driver itself. 71 | 72 | static irqreturn_t tg3_interrupt(int irq, void *dev_id, struct pt_regs *regs) 73 | { 74 | … 75 | netif_rx_schedule(tnapi->dummy_netdev); 76 | ... 77 | 78 | 79 | static int tg3_rx(struct tg3_napi *tnapi, int budget) 80 | { 81 | … 82 | netif_rx_schedule(tp->napi[1].dummy_netdev); 83 | 84 | 85 | static int tg3_poll(struct net_device *netdev, int *budget) 86 | { 87 | ... 88 | work_done = tg3_rx(tnapi, orig_budget); 89 | ... 90 | 91 | 92 | The `softnet_data->poll_list` seems does not contain a list of devices that has some incoming frames waiting 93 | for process but just a fake device with sufficient information for net_rx_action to know how to poll frame from input queue. 94 | 95 | In the old API, there is one input queue per CPU, its size is specified in parameter. The kernel drops incoming frame 96 | if the queue reaches its size. 97 | 98 | [root@localhost tap]# cat /proc/sys/net/core/netdev_max_backlog 99 | 1000 100 | 101 | In the NAPI input queue is per device and it is up to the driver to handle it. 102 | 103 | The top haft is a function of a specific driver while the bottom haft is always kernel function `net_rx_action`. 104 | The `net_rx_action` get the frames either from `softnet_data->input_pkt_queue` or by calling NAPI driver `net_device->poll`. 105 | 106 | The default net_device->poll is set to process_backlog 107 | static int __init net_dev_init(void) 108 | { 109 | … 110 | for_each_possible_cpu(i) { 111 | struct softnet_data *queue; 112 | ... 113 | queue->backlog_dev.poll = process_backlog; 114 | 115 | This generic `backlog_dev` was put into `softnet_data->poll_list` in `netif_rx` in previous stage. 116 | The `net_rx_action` check `poll_list` of devices and invoke poll of each device 117 | 118 | static void net_rx_action(struct softirq_action *h) 119 | { 120 | … 121 | while (!list_empty(&queue->poll_list)) { 122 | ... 123 | dev = list_entry(queue->poll_list.next, 124 | struct net_device, poll_list); 125 | … 126 | poll_result = dev->poll(dev, &budget); 127 | 128 | 129 | The process_backlog used to dequeue frame for old API driver and send it up to protocol handler by calling `netif_receive_skb` 130 | 131 | static int process_backlog(struct net_device *backlog_dev, int *budget) 132 | { 133 | ... 134 | for (;;) { 135 | ... 136 | skb = __skb_dequeue(&queue->input_pkt_queue); 137 | ... 138 | netif_receive_skb(skb); 139 | 140 | In the NAPI tg3 driver of broadcom NIC, the `net_device->poll` is set to `tg3_poll` 141 | 142 | static int __devinit tg3_init_one(struct pci_dev *pdev, 143 | const struct pci_device_id *ent) 144 | { 145 | … 146 | dev->poll = tg3_poll; 147 | 148 | 149 | static int tg3_poll(struct net_device *netdev, int *budget) 150 | { 151 | ... 
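        /* NAPI poll callback: drain at most *budget frames from the NIC ring
         * via tg3_rx(), which hands each frame to netif_receive_skb() */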
152 | work_done = tg3_rx(tnapi, orig_budget); 153 | 154 | 155 | static int tg3_rx(struct tg3_napi *tnapi, int budget) 156 | { 157 | … 158 | netif_receive_skb(skb); 159 | 160 | The `netif_receive_skb` then deliver the frame to tap and protocol handler 161 | 162 | int netif_receive_skb(struct sk_buff *skb) 163 | { 164 | … 165 | /* protocol sniffers - ETH_P_ALL are registered in ptype_all */ 166 | list_for_each_entry_rcu(ptype, &ptype_all, list) { 167 | ... 168 | ret = deliver_skb(skb, pt_prev, orig_dev); 169 | … 170 | 171 | 172 | type = skb->protocol; 173 | /* normal protocol handlers are registered in ptype_base */ 174 | list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) { 175 | … 176 | ret = deliver_skb(skb, pt_prev, orig_dev); 177 | ... 178 | 179 | 180 | 181 | 182 | static __inline__ int deliver_skb(struct sk_buff *skb, 183 | struct packet_type *pt_prev, 184 | struct net_device *orig_dev) 185 | { 186 | atomic_inc(&skb->users); 187 | return pt_prev->func(skb, skb->dev, pt_prev, orig_dev); 188 | } 189 | 190 | References 191 | 192 | 1. http://en.wikipedia.org/wiki/New_API 193 | 2. http://knol.google.com/k/napi-linux-new-api# 194 | 3. http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf 195 | 4. http://www.cs.clemson.edu/~westall/853/notes/devrecv.pdf 196 | 5. http://isis.poly.edu/kulesh/stuff/src/klist/ 197 | -------------------------------------------------------------------------------- /samples/timeslice.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | int main(int argc,char* argv[]){ 5 | struct timespec tp; 6 | int status; 7 | status = sched_rr_get_interval(0, &tp); 8 | if( status == 0 ) 9 | printf("Timeslice is %d nano secs\n",tp.tv_nsec); 10 | else 11 | printf("fail status is %d",status); 12 | } 13 | 14 | -------------------------------------------------------------------------------- /segmentation.md: -------------------------------------------------------------------------------- 1 | ## Segmentation 2 | 3 | 4 | Segmentation is the part of memory management unit that translate logical address to linear address. 5 | 6 | In Intel processor design 7 | 8 | 1. there is one Global Descriptor Table (GDT) of segment’s descriptors per CPU. The address and length 9 | of GDT is stored in register gdtr. 10 | 2. each OS process can maintain it own Local Descriptor Table (LDT). The address and length of LDT is 11 | stored in register ldtr. 12 | 13 | Each segment’s descriptor contains information about start address, size, type (whether it is user segment 14 | e.g. code, data or system segment e.g Task State Segment -TSS ) of the segment as well as access privilege 15 | (Descriptor Privilege Level - DPL). 16 | 17 | Logical memory address comprises of segment and offset within it. Register cs,ss,ds,gs,fs store segment 18 | identifier (also called segment selector), which has two parts 19 | 20 | 1. TI - Type Identification specify whether it is about segment descriptor in GDT or LDT 21 | 2. pointer to the corresponding segment descriptor in GDT or LDT 22 | 23 | Linux made limited usage of segmentation. Linux setups one GDT per CPU. Inside GDT Linux maintains only one 24 | pair of descriptors for code and data/stack segment in user mode (`__USER_CS`,` __USER_DS`) and other pair in 25 | kernel mode (`__KERNEL_CS`,`__KERNEL_DS`) crossing all processes. All these segment’s descriptors have same 26 | start address (0x0), limit(0xFFFFF). 
The only different between kernel and it’s user counter part are privileges. 27 | 28 | When OS switch from user mode to kernel mode, it load `__KERNEL_CS` into cs register, `__KERNEL_DS` into ds register. 29 | It loads `__USER_CS` into cs, `__USER_DS` into ds when doing reverse. `__USER_DS`/`__KERNEL_DS` descriptor is used for 30 | both data and stack segments. 31 | 32 | Contents of all these descriptors are specified in cpu_gdt_table and do not change once they are initialized 33 | in memory by cpu_init. 34 | 35 | References 36 | 37 | 1. http://web.cs.wpi.edu/~cs3013/c07/lectures/Section09.1-Intel.pdf 38 | -------------------------------------------------------------------------------- /send-frame-to-network.md: -------------------------------------------------------------------------------- 1 | # Send a frame to network 2 | 3 | Upper layer put output frame into driver buffer `struct Qdisc *q = dev->qdisc` using `dev_queue_xmit` 4 | 5 | int dev_queue_xmit(struct sk_buff *skb) 6 | { 7 | struct net_device *dev = skb->dev; 8 | 9 | 10 | … 11 | q = rcu_dereference(dev->qdisc); 12 | #ifdef CONFIG_NET_CLS_ACT 13 | skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); 14 | #endif 15 | if (q->enqueue) { 16 | /* Grab device queue */ 17 | spin_lock(&dev->queue_lock); 18 | q = dev->qdisc; 19 | if (q->enqueue) { 20 | rc = q->enqueue(skb, q); 21 | qdisc_run(dev); 22 | spin_unlock(&dev->queue_lock); 23 | 24 | 25 | rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc; 26 | goto out; 27 | } 28 | spin_unlock(&dev->queue_lock); 29 | } 30 | 31 | then call `qdisc_run` to send frames to the NIC. 32 | 33 | static inline void qdisc_run(struct net_device *dev) { 34 | while (!netif_queue_stopped(dev) && qdisc_restart(dev) < 0) 35 | /* NOTHING */; 36 | } 37 | 38 | Due to various reasons (queue is stopped because of not enough memory of NIC, someone else is transfering), 39 | the immediate transfer may not success and have to be postponed to later time. 40 | 41 | int qdisc_restart(struct net_device *dev) { 42 | 43 | 44 | if (!nolock) { 45 | if (!netif_tx_trylock(dev)) { 46 | collision: 47 | /* So, someone grabbed the driver. */ 48 | … 49 | goto requeue; 50 | } 51 | } 52 | ... 53 | requeue: 54 | if (skb->next) 55 | dev->gso_skb = skb; 56 | else 57 | q->ops->requeue(skb, q); 58 | netif_schedule(dev); /* schedule a transfer for later time */ 59 | return 1; 60 | } 61 | 62 | The soft interrupt `net_tx_action` and `netif_schedule` are used to facilitate that. 63 | 64 | static inline void _ _netif_schedule(struct net_device *dev) { 65 | if (!test_and_set_bit(_ _LINK_STATE_SCHED, &dev->state)) { 66 | unsigned long flags; 67 | struct softnet_data *sd; 68 | local_irq_save(flags); 69 | sd = &_ _get_cpu_var(softnet_data); 70 | dev->next_sched = sd->output_queue; 71 | sd->output_queue = dev; 72 | raise_softirq_irqoff(cpu, NET_TX_SOFTIRQ); /* trigger net_tx_action */ 73 | local_irq_restore(flags); 74 | } 75 | } 76 | 77 | 78 | A soft interrupt `net_tx_action` is responsible for sending frame from driver buffer (kernel memory) to NIC. 79 | 80 | static void net_tx_action(struct softirq_action *h) 81 | { 82 | ... 
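        /* NET_TX_SOFTIRQ handler: drain the per-CPU output_queue that
         * __netif_schedule() filled and retry qdisc_run() on each device */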
83 | 84 | if (sd->output_queue) { 85 | struct net_device *head; 86 | local_irq_disable( ); 87 | head = sd->output_queue; 88 | sd->output_queue = NULL; 89 | local_irq_enable( ); 90 | while (head) { 91 | struct net_device *dev = head; 92 | head = head->next_sched; 93 | smp_mb_ _before_clear_bit( ); 94 | clear_bit(_ _LINK_STATE_SCHED, &dev->state); 95 | if (spin_trylock(&dev->queue_lock)) { 96 | qdisc_run(dev); 97 | spin_unlock(&dev->queue_lock); 98 | } else { 99 | netif_schedule(dev); 100 | } 101 | } 102 | } 103 | 104 | int qdisc_restart(struct net_device *dev) { 105 | struct Qdisc1 *q = dev->qdisc; 106 | struct sk_buff *skb; 107 | if ((skb = q->dequeue(q)) != NULL) { 108 | … 109 | if (!spin_trylock(&dev->xmit_lock)) { 110 | … 111 | } 112 | { 113 | if (!netif_queue_stopped(dev)) { 114 | int ret; 115 | if (netdev_nit) { 116 | dev_queue_xmit_nit(skb, dev); /* send frame to protocol sniffer*/ 117 | ret = dev->hard_start_xmit(skb, dev); /* send frame to NIC */ 118 | if (ret == NETDEV_TX_OK) { 119 | if (!nolock) { 120 | dev->xmit_lock_owner = -1; 121 | spin_unlock(&dev->xmit_lock); 122 | } 123 | spin_lock(&dev->queue_lock); 124 | return -1; 125 | } 126 | if (ret == NETDEV_TX_LOCKED && nolock) { 127 | spin_lock(&dev->queue_lock); goto collision; 128 | } 129 | } 130 | } 131 | 132 | 133 | In `dev->hard_start_xmit`, when the device driver realizes that it does not have enough space to store a 134 | frame of maximum size (MTU), it stops the egress queue with netif_stop_queue to avoid wasting resources with future 135 | transmissions that the kernel already knows will fail. 136 | 137 | The following example of this throttling at work is taken from vortex_start_xmit (the hard_start_xmit method used 138 | by the drivers/net/3c59x.c driver): 139 | 140 | vortex_start_xmit(struct sk_buff *skb, struct net_device *dev) 141 | { 142 | … 143 | outsl(ioaddr + TX_FIFO, skb->data, (skb->len + 3) >> 2); 144 | dev_kfree_skb (skb); 145 | if (inw(ioaddr + TxFree) > 1536) { 146 | netif_start_queue (dev); /* AKPM: redundant? */ 147 | } else { 148 | /* stop the queue */ 149 | netif_stop_queue(dev); 150 | /* Interrupt us when the FIFO has room for max-sized packet. */ 151 | outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD); 152 | } 153 | 154 | 155 | static void vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs) { 156 | … 157 | if (status & TxAvailable) { 158 | if (vortex_debug > 5) 159 | printk(KERN_DEBUG " TX room bit was handled.\n"); 160 | /* There's room in the FIFO for a full-sized packet. */ 161 | outw(AckIntr | TxAvailable, ioaddr + EL3_CMD); 162 | /* wake queue up to start transfer again when there is space in NIC*/ 163 | netif_wake_queue (dev); 164 | } 165 | … 166 | } 167 | 168 | 169 | static inline void netif_wake_queue(struct net_device *dev) 170 | { 171 | #ifdef CONFIG_NETPOLL_TRAP 172 | if (netpoll_trap()) 173 | return; 174 | #endif 175 | if (test_and_clear_bit(__LINK_STATE_XOFF, &dev->state)) 176 | __netif_schedule(dev); 177 | } 178 | -------------------------------------------------------------------------------- /signals.md: -------------------------------------------------------------------------------- 1 | ## Signals 2 | 3 | Signal is used to notify a process about external events (page fault, interrupt from keyboard) . Except real time signals, normal signals are not queued, kernel keep track of pending signal for each process as bit mask. Also no other data is associated with signal. 4 | 5 | Except KILL and STOP, signals can be blocked, default signal handlers can be ignored and overwritten. 
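For example, the default action of SIGTERM can be replaced with `sigaction` (a minimal user-space sketch):

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void on_term(int sig)
    {
        /* only async-signal-safe calls belong in a handler */
        write(STDOUT_FILENO, "got SIGTERM\n", 12);
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_term;
        sigemptyset(&sa.sa_mask);
        sigaddset(&sa.sa_mask, SIGINT);   /* SIGINT is blocked while the handler runs */
        sigaction(SIGTERM, &sa, NULL);

        for (;;)
            pause();                      /* sleep until a signal arrives */
    }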
Signals can be generated in program using kill, tkill, send_sig. 6 | 7 | To suspend a process run e.g. 8 | 9 | kill -SIGSTOP 3878 10 | 11 | State of process can be observed by 12 | 13 | ps -eo pid,state,cmd 14 | 15 | To resume the suspended process run 16 | 17 | kill -SIGCONT 3878 18 | 19 | **Default signals** 20 | 21 | The default signal sent out by `kill` is `TERM` mean terminate the process. Each process reacts to each signal differently, Java process e.g. dump stacktrace of all threads to stdout unpon receiving `QUIT`. 22 | 23 | **Delivery of a signal** 24 | 25 | When sending a signal to a running process (on CPU), the signal is pending until the process is preempted, at that time the signal is delivered to the process and a signal handler is called. 26 | 27 | When sending a signal to process in an `interupptible` waiting state, the process is waked up, the signal is delivered to the process and signal a handler is called. 28 | 29 | When sending a signal to process in ready state, the signal is delivered to the process and signal a handler is called whenever the process get CPU. 30 | 31 | -------------------------------------------------------------------------------- /slab-allocator.md: -------------------------------------------------------------------------------- 1 | ## Slab allocator 2 | 3 | Kernel allocate objects using slap layer, which provide a interface for create a cache for a specific type of 4 | object (e.g. task_struct), allocate an object from the cache and release object from cache. 5 | 6 | Because memory are allocated in pages and size of objects are usually smaller than the page size, this layer 7 | allows faster allocation of objects, eliminates memory waste and reduces memory fragmentation. 8 | 9 | The implementation give kernel sufficient information for free memory by shrinking size of cache when needed. 10 | 11 | References 12 | 13 | 1. http://linux.die.net/man/1/slabtop and cat /proc/slabinfo 14 | 2. http://www.puschitz.com/TuningLinuxForOracle.shtml 15 | 3. http://www.redhat.com/magazine/001nov04/features/vm/ 16 | -------------------------------------------------------------------------------- /socket-file-api-implementation.md: -------------------------------------------------------------------------------- 1 | # Socket file api implementation 2 | 3 | Because socket is kind of file, Linux implements special filesystem, inode for socket. 4 | 5 | **Socket file system** 6 | 7 | [root@localhost ~]# cat /proc/filesystems 8 | .. 9 | nodev sockfs 10 | ... 11 | 12 | 13 | static int __init sock_init(void) 14 | { 15 | ... 16 | register_filesystem(&sock_fs_type); 17 | sock_mnt = kern_mount(&sock_fs_type); 18 | … 19 | } 20 | 21 | 22 | struct vfsmount *kern_mount(struct file_system_type *type) 23 | { 24 | return vfs_kern_mount(type, 0, type->name, NULL); 25 | } 26 | 27 | 28 | struct vfsmount * 29 | vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data) 30 | { 31 | .. 32 | 33 | 34 | error = type->get_sb(type, flags, name, data, mnt); 35 | .. 
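        /* for sockfs this resolves to sockfs_get_sb(), defined just below */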
36 | } 37 | 38 | static struct file_system_type sock_fs_type = { 39 | .name = "sockfs", 40 | .get_sb = sockfs_get_sb, /* return super block*/ 41 | .kill_sb = kill_anon_super, 42 | }; 43 | 44 | static struct super_operations sockfs_ops = { 45 | .alloc_inode = sock_alloc_inode, 46 | .destroy_inode =sock_destroy_inode, 47 | .statfs = simple_statfs, 48 | }; 49 | 50 | static int sockfs_get_sb(struct file_system_type *fs_type, 51 | int flags, const char *dev_name, void *data, struct vfsmount *mnt) 52 | { 53 | return get_sb_pseudo(fs_type, "socket:", &sockfs_ops, SOCKFS_MAGIC, 54 | mnt); 55 | } 56 | 57 | **Socket inode** 58 | 59 | struct socket_alloc { 60 | struct socket socket; 61 | struct inode vfs_inode; 62 | }; 63 | 64 | static struct inode *sock_alloc_inode(struct super_block *sb) 65 | { 66 | struct socket_alloc *ei; 67 | ei = (struct socket_alloc *)kmem_cache_alloc(sock_inode_cachep, SLAB_KERNEL); 68 | if (!ei) 69 | return NULL; 70 | init_waitqueue_head(&ei->socket.wait); 71 | 72 | 73 | ei->socket.fasync_list = NULL; 74 | ei->socket.state = SS_UNCONNECTED; 75 | ei->socket.flags = 0; 76 | ei->socket.ops = NULL; 77 | ei->socket.sk = NULL; 78 | ei->socket.file = NULL; 79 | ei->socket.flags = 0; 80 | 81 | 82 | return &ei->vfs_inode; 83 | } 84 | 85 | 86 | static struct dentry_operations sockfs_dentry_operations = { 87 | .d_delete = sockfs_delete_dentry, 88 | }; 89 | 90 | 91 | struct file_operations socket_file_ops = { 92 | .owner = THIS_MODULE, 93 | .llseek = no_llseek, 94 | .aio_read = sock_aio_read, 95 | .aio_write = sock_aio_write, 96 | .poll = sock_poll, 97 | .unlocked_ioctl = sock_ioctl, 98 | #ifdef CONFIG_COMPAT 99 | .compat_ioctl = compat_sock_ioctl, 100 | #endif 101 | .mmap = sock_mmap, 102 | .open = sock_no_open, /* special open code to disallow open via /proc */ 103 | .release = sock_close, 104 | .fasync = sock_fasync, 105 | .readv = sock_readv, 106 | .writev = sock_writev, 107 | .sendpage = sock_sendpage, 108 | .splice_write = generic_splice_sendpage, 109 | }; 110 | 111 | **Socket file handler** 112 | 113 | Socket is attached to a file handle so we can operate with socket using file operation e.g. read, write 114 | 115 | static int sock_attach_fd(struct socket *sock, struct file *file) 116 | { 117 | struct qstr this; 118 | char name[32]; 119 | 120 | 121 | this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino); 122 | this.name = name; 123 | this.hash = SOCK_INODE(sock)->i_ino; 124 | 125 | 126 | file->f_dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this); 127 | if (unlikely(!file->f_dentry)) 128 | return -ENOMEM; 129 | 130 | 131 | file->f_dentry->d_op = &sockfs_dentry_operations; 132 | d_add(file->f_dentry, SOCK_INODE(sock)); 133 | file->f_vfsmnt = mntget(sock_mnt); 134 | file->f_mapping = file->f_dentry->d_inode->i_mapping; 135 | 136 | 137 | sock->file = file; 138 | file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops; 139 | file->f_mode = FMODE_READ | FMODE_WRITE; 140 | file->f_flags = O_RDWR; 141 | file->f_pos = 0; 142 | file->private_data = sock; 143 | 144 | 145 | return 0; 146 | } 147 | -------------------------------------------------------------------------------- /spin-lock.md: -------------------------------------------------------------------------------- 1 | ## Spin lock 2 | 3 | When process is not given an access to the critical section, with normal lock, process will be paused 4 | and moved to wait queue giving CPU to other process. 
5 | 6 | Switching one process to other takes time because certain amount of machine’s instructions has to be performed 7 | to save running context from machine’s registers of one process to the memory and restore the context of other 8 | process from memory back to machine’s registers. 9 | 10 | In multi processors machine, spin lock was invented to address this problem. With spin lock, the process try to 11 | enter a loop of nop (no operation - doing nothing) machine’s instruction for a while (the length is much shorter 12 | compare to context switch) then try to acquire the lock again hoping that the process holding the lock runs on 13 | other processor will release it so there is no need to do expensive context switching. 14 | 15 | 16 | There is no meaning of doing spin lock on uni processor machine as if the process spins, other process that 17 | currently holds the lock can not get CPU to run in order to release the lock. 18 | 19 | Because normally the same code base is used to run in both uni and multi processors machine, the implementation 20 | of spin lock in case of uni processor will instead 21 | 22 | 1. when running in user mode, in case of not able to acquire the lock: relinquish the processor and go to wait queue 23 | 2. when running in kernel mode, in case of successful acquisition of the lock, disable kernel preemption, 24 | which does not allow other process to get the CPU 25 | 26 | Spin lock is implemented and used in both user mode and kernel mode. The POSIX Thread API for spin lock is 27 | 28 | 1. pthread_spinlock_t spinlock 29 | 2. pthread_spin_lock(&spinlock) 30 | 31 | References 32 | 33 | 1. http://www.moserware.com/2008/09/how-do-locks-lock.html#lockfn5 34 | 1. http://www.spinics.net/lists/newbies/msg40369.html 35 | -------------------------------------------------------------------------------- /strace.md: -------------------------------------------------------------------------------- 1 | ## strace quick start 2 | 3 | **Identify running process** 4 | 5 | # ps -ef | grep Gerrit 6 | gerrit 17549 1 0 May06 pts/1 00:02:45 GerritCodeReview -server 7 | 8 | 9 | **Run strace** 10 | 11 | strace -f -p 17549 -s 128 -o /tmp/strace.lo 12 | 13 | * the `-f` means trace all childrens/threads of the given process. 14 | * the `-s 128` means buffer for store syscall's parameter is 128 byte. The default is 80, which is usually too small 15 | * the `-o` redirect tracing info into a file for filtering later as amount of tracing info is usually overwhelming 16 | 17 | With the knowledge if syscall we can ask strace to trace only specified syscalls or group of syscalls 18 | 19 | strace -f -p 17549 -s 128 -e trace=lstat,open 20 | 21 | strace predefine a few groups of syscalls such as `file`, `process`, `network`, `rpc` 22 | 23 | strace -f -p 17549 -s 128 -e trace=file 24 | 25 | However, sometimes knowning sequence of syscalls is more important than a particular one when troubleshooting. 26 | 27 | **References** 28 | 29 | * http://linux.die.net/man/1/strace 30 | -------------------------------------------------------------------------------- /synchronization.md: -------------------------------------------------------------------------------- 1 | ## Synchronization 2 | 3 | 4 | Semaphore is for synchronization between processes while mutex between threads of the same processes. 5 | 6 | 7 | Memory barrier is mechanism to ensure that the machine executes instructions accessing memory(read/write) in the same order as the program has been written. 
Usually the processor/compiler change the order of the program execution due to the optimization. This is important because when multi processes concurrently access memory the semantic depends on order of execution. 8 | 9 | 10 | Per CPU data is a technique, in which data is dedicated into one processor and thus no synchronization between CPUs is required and disable kernel preemption is sufficient. -------------------------------------------------------------------------------- /syscall.md: -------------------------------------------------------------------------------- 1 | ## Syscall 2 | 3 | Syscall is the way user process requests for a service from OS. 4 | 5 | Reference 6 | 7 | * http://articles.manugarg.com/systemcallinlinux2_6.html 8 | -------------------------------------------------------------------------------- /sysfs-procfs.md: -------------------------------------------------------------------------------- 1 | ## Overview 2 | 3 | In general `procfs` export kernel information of processess while `sysfs` export information about devices and other kernel objects. 4 | However `procfs` also export kernel subsystem information not related to proceses, which make `procfs` procfs cluttered with lots of non-process information. These should be replaced by `sysfs`. 5 | 6 | e.g. 7 | 8 | root@gerrit01:/home/vagrant# ls /proc/ | egrep -v -E "[0-9]+" | more 9 | acpi 10 | buddyinfo 11 | bus 12 | cgroups 13 | cmdline 14 | consoles 15 | cpuinfo 16 | crypto 17 | devices 18 | diskstats 19 | ... 20 | 21 | 22 | References 23 | 24 | * http://people.ee.ethz.ch/~arkeller/linux/multi/kernel_user_space_howto-2.html 25 | * https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt 26 | * https://lwn.net/Articles/51437/ 27 | * https://www.kernel.org/doc/Documentation/filesystems/proc.txt 28 | 29 | -------------------------------------------------------------------------------- /top-half.md: -------------------------------------------------------------------------------- 1 | # The top half - hardware interrupt and the bottom half - software interrupt 2 | 3 | During execution of most of hardware interrupt handler, interrupts are disable on local cpu to reduce 4 | likelihood of race condition. That is reason that to divide the work of interrupt handler to 5 | 6 | 1. top half, which runs very fast with interrupt disable 7 | 2. bottom half, which do the remain work and can be preempt-able. 8 | 9 | During several routine operations, the kernel checks whether any bottom halves are scheduled for execution. 10 | If any are waiting, the kernel runs the function do_softirq/invoke_softirq, to execute them. 11 | 12 | The checks are performed during 13 | 14 | 1. exit_irq in do_IRQ: because the top half - hardware interrupt handler usually delegate some work to 15 | bottom half - software interrupt handler. So by calling software interrupt handler at exit of hardware 16 | interrupt handler, the latency is minimum. 17 | 18 | fastcall unsigned int do_IRQ(struct pt_regs * regs) { 19 | irq_enter( ); 20 | … 21 | irq_exit( ); 22 | return 1; 23 | } 24 | 25 | void irq_exit(void) { 26 | ... 27 | sub_preempt_count(IRQ_EXIT_OFFSET); 28 | if (!in_interrupt( ) && local_softirq_pending( )) 29 | invoke_softirq( ); 30 | preempt_enable_no_resched( ); 31 | } 32 | 33 | 2. return from interrupt and exception including system call 34 | 3. 
schedule 35 | -------------------------------------------------------------------------------- /tss.md: -------------------------------------------------------------------------------- 1 | ## TSS 2 | 3 | Task State Segment -`TSS` identifier is stored in `tr` register. Because the processor uses different stack 4 | for user mode (privilege level 3) and protected mode (privilege level 0,1,2), stack switch occurs. 5 | The `TSS` is used to stored segment selector and offset of stack that will be used by a procedure being 6 | called when processor make a call from one privilege level to other higher privilege level. 7 | 8 | In linux `TSS` is per CPU (not per process as originally intended by Intel design). `TSS` gets used when 9 | OS switches from user to kernel mode (Requester Privilege Level - RPL and DPL are different). 10 | 11 | When CPU switches from lower privilege (user mode) to higher privilege (kernel mode), it fetches address 12 | (segment selector and pointer) of kernel stack from TSS. TSS also contains permission bitmap for I/O access 13 | (address used in in/out instruction) from user mode. 14 | 15 | `TSS` segment is 236 bytes long, processes in User Mode are not allowed to access `TSS` segments. `TSS` occupies 16 | part of the kernel data segment. A `TSS` is different for each processor in the system. They are sequentially 17 | stored in the `init_tss` array, each element corresponds to one segment dedicated for one CPU. 18 | 19 | References 20 | 21 | 1. http://en.wikipedia.org/wiki/Task_State_Segment 22 | -------------------------------------------------------------------------------- /udev.md: -------------------------------------------------------------------------------- 1 | # udev 2 | 3 | `udev` dynamically create inode by default in `/dev/` for hotplug devices. udev runs as userspace deamon listening on netlink socket for hardware hotplug event and create/remove a device's inode according to predefined rules as well as load kernel module device's driver if it is required. 4 | 5 | `udev` also create meaningful symlink to easy identification devices. e.g. 6 | 7 | # ls -l /dev/disk/by-uuid/ 8 | total 0 9 | lrwxrwxrwx 1 root root 10 Apr 30 16:08 30051059-44e1-4425-9bc8-9b9ade27822c -> ../../dm-1 10 | lrwxrwxrwx 1 root root 10 Apr 30 16:08 a975dd09-15f7-4945-a2f8-f59de9af125a -> ../../dm-0 11 | lrwxrwxrwx 1 root root 10 Apr 30 16:08 f012a222-cb8d-436f-9291-d559e99ce304 -> ../../sda1 12 | 13 | 14 | **references** 15 | 16 | * https://www.linux.com/news/hardware/peripherals/180950-udev 17 | * https://www.kernel.org/doc/pending/hotplug.txt 18 | --------------------------------------------------------------------------------