├── .gitmodules ├── README.md ├── operating-system-concepts ├── chapter1-notes.md ├── chapter2-notes.md ├── chapter3-notes.md ├── chapter4-notes.md └── chapter6-notes.md └── pdfs ├── cheatsheets ├── C_CS.pdf ├── CommanLine_CS.pdf ├── Docker_CS.pdf └── Linux_Syscall_CS.pdf ├── cs6200.pdf └── papers ├── anderson-notes.md ├── anderson-paper.pdf ├── arpaci-dusseau-notes.md ├── arpaci-dusseau-paper.pdf ├── birrell-paper.pdf ├── birrell-rpc-paper.pdf ├── eykholt-notes.md ├── eykholt-paper.pdf ├── fedorova-notes.md ├── fedorova-paper.pdf ├── nelson-paper.pdf ├── pai-notes.md ├── pai-paper.pdf ├── popek-notes.md ├── popek-paper.pdf ├── protic-paper.pdf ├── rosenblum-paper.pdf ├── stein-notes.md └── stein-shah-paper.pdf /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "cs6200-pr1"] 2 | path = cs6200-pr1 3 | url = ../cs6200-pr1 4 | [submodule "cs6200-pr3"] 5 | path = cs6200-pr3 6 | url = ../cs6200-pr3 7 | [submodule "cs6200-pr4"] 8 | path = cs6200-pr4 9 | url = ../cs6200-pr4 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # cs6200 2 | 3 | This class is a graduate-level introductory course in operating systems. I 4 | already took an operating systems course during my undergraduate studies, 5 | however, I felt like this would be a good refresher and first class in GATech's 6 | OMSCS program. 7 | 8 | [Here](https://omscs.gatech.edu/cs-6200-introduction-operating-systems) is the 9 | official course webpage. 10 | 11 | ## Books 12 | 13 | In the folder `./operating-system-concepts`, you'll find some notes that I took 14 | while reading the textbook, "Operating System Concepts 10th edition", by 15 | Abraham Silberschatz et. al. 16 | 17 | This book is a great read and the course material closely follows some of the 18 | chapters in the book. I highly recommend reading this book if you want to gain 19 | a thorough understanding of how operating systems are designed and implemented. 20 | 21 | ## Pdfs 22 | 23 | In the folder `./pdfs/papers`, you'll find a majority of the papers covered in 24 | the course - I don't know if I kept all of them, though. This folder also 25 | contains some of my notes for each paper, essentially summarizing the most 26 | important points. 27 | 28 | `./pdfs/cheatsheets` contains a couple of useful cheatsheets for Docker 29 | commands, Linux system calls, and programming in C. 30 | 31 | `./pdfs/cs6200.pdf` contains my entire notebook for the course. 32 | 33 | ## Projects 34 | 35 | I worked on each project for this course in a separate repository. Each project 36 | is added to this repository as a submodule. The submodules in this repository 37 | are private to uphold the Georgia Institute of Technology 38 | [Academic Honor Code](https://osi.gatech.edu/content/honor-code). 39 | 40 | ### Project 1 41 | 42 | This project is an exercise in designing and implementing multithreaded 43 | applications. The student is tasked with creating a multithreaded web server 44 | that will serve static files based on a custom protocol. The student must also 45 | create a multithread client application that will use the same custom protocol 46 | to generate requests for files from the web server. 47 | 48 | The entire project is written in C. 
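Since the project repositories are private, here is a minimal, hypothetical boss/worker sketch in C - the fixed-size queue, the `worker` function, and constants such as `NUM_WORKERS` are illustrative assumptions, not the project's actual protocol or code - showing the pthreads pattern a multithreaded server like this is typically built on:

```c
/*
 * Minimal boss/worker sketch (hypothetical names, not project code):
 * the boss enqueues "request" numbers, workers dequeue and handle them.
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS  4
#define NUM_REQUESTS 16
#define QUEUE_SIZE   8

static int queue[QUEUE_SIZE];
static int head = 0, tail = 0, count = 0, done = 0;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done)
            pthread_cond_wait(&not_empty, &lock);   /* sleep until work arrives */
        if (count == 0 && done) {                   /* no more requests coming  */
            pthread_mutex_unlock(&lock);
            break;
        }
        int req = queue[head];
        head = (head + 1) % QUEUE_SIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);

        printf("worker %ld handling request %d\n", id, req);
    }
    return NULL;
}

int main(void) {
    pthread_t workers[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)i);

    for (int r = 0; r < NUM_REQUESTS; r++) {        /* boss thread enqueues work */
        pthread_mutex_lock(&lock);
        while (count == QUEUE_SIZE)
            pthread_cond_wait(&not_full, &lock);
        queue[tail] = r;
        tail = (tail + 1) % QUEUE_SIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }

    pthread_mutex_lock(&lock);
    done = 1;                                       /* tell workers to exit once drained */
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```

The condition variables let workers sleep instead of spinning, and the `done` flag gives them a clean way to exit once the boss stops producing requests.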
In this project I learned: 49 | 50 | * How to efficiently use the [pthreads](https://man7.org/linux/man-pages/man7/pthreads.7.html) 51 | library for multithreaded programming 52 | * How to avoid race conditions 53 | * Effective signal handling 54 | * Multithreaded and thread-safe file access conventions 55 | * Network programming 56 | * Process memory management 57 | * How to use stack and queue data structures 58 | * How to use Valgrind and [ASan](https://github.com/google/sanitizers/wiki/AddressSanitizer) to hunt 59 | down memory leaks and programming errors 60 | 61 | ### Project 2 62 | 63 | Project 2 was an extra credit project that was only graded if a student needed 64 | a couple more points to increase their letter grade. This project required 65 | students to evaluate the performance of their implemetation of Project 1, and 66 | how the project could be designed differently in order to increase 67 | performance. 68 | 69 | ### Project 3 70 | 71 | This project is an exercise in designing and implementing applications that 72 | leverage inter-process communication (IPC). This project builds upon Project 1. 73 | The multithreaded web server previously designed will now work in tandem with a 74 | web proxy. When the multithreaded client requests a file, the web server will 75 | check with the web proxy to see if the requested file exists locally within the 76 | cache. If not, the web proxy will use the [curl](https://curl.haxx.se/) library 77 | to download the file from the internet and then store the contents of the file 78 | locally. The web proxy will then transfer the file to the web server using the 79 | [POSIX shared memory API](https://www.man7.org/linux/man-pages/man7/shm_overview.7.html). 80 | If the file exists locally, the web proxy will not make an HTTP request, 81 | transferring the file to the web server via shared memory. The client and the 82 | web server still use a custom protocol to conduct communication and file 83 | exchanges. 84 | 85 | The entire project is written in C. In this project I learned: 86 | 87 | * How to use libcurl 88 | * How to create, manage, and use [POSIX message queues](https://www.man7.org/linux/man-pages/man7/mq_overview.7.html) 89 | for IPC 90 | * How to use the POSIX shared memory API for IPC. 91 | 92 | ### Project 4 93 | 94 | This project is an exercise in designing and implementing applications that 95 | leverage remote procedure calls and protocol buffers. In this project, students 96 | design a distributed file system using [gRPC](https://grpc.io/) and Google's 97 | [Protocol Buffers](https://developers.google.com/protocol-buffers). These tools 98 | will allow students to implement several file transfer protocols and a 99 | synchronization system so that multiple clients and one server have consistent 100 | files across different file caches. 101 | 102 | The entire project is written in C++. 
In this project I learned how to: 103 | 104 | * Use the protobuf library 105 | * Handle asynchronous gRPC calls 106 | * Use [inotify](https://man7.org/linux/man-pages/man7/inotify.7.html) to monitor 107 | filesystem events 108 | * Design and implement object-oriented applications written in C++ 109 | -------------------------------------------------------------------------------- /operating-system-concepts/chapter1-notes.md: -------------------------------------------------------------------------------- 1 | # Chapter 1 2 | 3 | ## Introduction 4 | 5 | An introduction to what operating systems are, what they do, and the various parts that go into creating a computing system and operating system. 6 | 7 | ## Definitions 8 | 9 | * **Operating system** - software that manages a computer's hardware, provides a basis for application programs, and acts as an intermediary between the computer user and the computer hardware. 10 | 11 | * **Resource allocator** - operating systems can be viewed as these, managing resources such as CPU time, memory space, storage space, I/O devices, etc. 12 | 13 | * **Control program** - manages the execution of user programs to prevent errors and improper use of the computer. It is especially concerned with the operation and control of I/O devices. 14 | 15 | * **Kernel** - the one program running at all times on the computer, the operating system itself. 16 | 17 | * **System programs** - programs associated with the operating system, but not part of the kernel. 18 | 19 | * **Device driver** - provides the operating system with a uniform interface to a device controller. 20 | 21 | * **Interrupt** - a way for a device driver to asynchronously notify the CPU of a change in hardware status. 22 | 23 | * **Interrupt vector** - a table (array) of pointers stored low in memory that holds the addresses of the interrupt service routines for various devices. 24 | 25 | * **Trap (or exception)** - a software-generated interrupt caused either by an error (e.g. division by zero or invalid memory access) or by a specific request from a user program that an operating-system service be performed. 26 | 27 | * **System call** - the interface through which a user-mode program requests an operation that must be performed in kernel mode (see the sketch after these definitions). 28 | 29 | * **Mode bit** - a bit added to the hardware of the computer to indicate the current privilege mode: kernel (0) or user (1). 
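To make the trap, system-call, and mode-bit definitions above concrete, here is a minimal Linux-specific sketch (it assumes glibc's `syscall(2)` wrapper and a Linux target); both calls perform the same write, one through the library wrapper and one by issuing the trap directly:

```c
/*
 * Minimal Linux sketch: both calls below trap from user mode into kernel
 * mode (mode bit 0) to perform the write, then return to user mode
 * (mode bit 1).  write() is the libc wrapper; syscall() issues the trap
 * with the system-call number explicitly.
 */
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "hello via system call\n";

    write(STDOUT_FILENO, msg, sizeof(msg) - 1);               /* libc wrapper */
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);  /* raw trap     */
    return 0;
}
```

Either way, the hardware switches the mode bit to kernel (0) for the duration of the system call and back to user (1) on return.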
30 | -------------------------------------------------------------------------------- /operating-system-concepts/chapter2-notes.md: -------------------------------------------------------------------------------- 1 | # Chapter 2 2 | 3 | ## Types of System Calls 4 | 5 | * **Process Control** 6 | 7 | * create process, terminate process 8 | * load, execute 9 | * get process attributes, set process attributes 10 | * wait event, signal event 11 | * allocate and free memory 12 | 13 | * **File Management** 14 | 15 | * create file, delete file 16 | * open, close 17 | * read, write, reposition 18 | * get file attributes, set file attributes 19 | 20 | * **Device Management** 21 | 22 | * request device, release device 23 | * read, write, reposition 24 | * get device attributes, set device attributes 25 | * locgically attach or detach devices 26 | 27 | * **Information Maintenance** 28 | 29 | * get time or date, set time or date 30 | * get system data, set system data 31 | * get process, file, or device attributes 32 | * set process, file, or device attributes 33 | 34 | * **Communications** 35 | 36 | * create, delete communication connection 37 | * send, receive messages 38 | * transfer status information 39 | * attach or detach remote devices 40 | 41 | * **Protection** 42 | 43 | * get file permissions 44 | * set file permissions 45 | 46 | ## Definitions 47 | 48 | * **Run-time environment** - the full suite of software needed to execute applications written in a given programming language, including its compilers or interpreters as well as other software, such as libraries and loaders. 49 | 50 | * **System-call interface** - serves as the link to system calls made available by the operating system. This interface intercepts function calls in the API and invokes the necessary system calls within the operating system. 51 | 52 | * **Relocatable object file** - source files are compiled into object files that are designed to be loaded into any physical memory location. 53 | 54 | * **Linker** - combines relocatable object files into a single binary executable 55 | 56 | ## Programming Projects 57 | 58 | **Lessons learned from the File Copy exercise.** 59 | 60 | * For portability, always use the C99 standard. To compile to the C99 standard, use this gcc option: `--std=c99`. 61 | * For full error checking, use this gcc option: `-Wall` 62 | * To compare a variable to a string literal, use this function: `strncmp()` 63 | * To find a library to use in your project that conforms to C99 standards, view the library's `man` page and **goto** the 'CONFORMING TO' section. 64 | 65 | I wrote the File Copy executable with `open()`, `read()`, and `write()` which aren't C99 standard functions - these functions are contained in the `` library. 66 | 67 | The C99 standard library functions for opening files, reading, and closing can be found in ``. 68 | 69 | **Lessons learned from the Kernel Modules exercise.** 70 | 71 | * `lsmod` lists all of your currently loaded kernel modules. 72 | * `sudo insmod kernel_mod.ko` and `sudo rmmod kernel_mod.ko` insert kernel modules into the Linux kernel, and remove kernel modules from the Linux kernel, respectively. 73 | * `dmesg` is `stdout` for the kernel, essentially. All kernel messages have a priority level when you use `printk` to print to dmesg. These priority levels are similar to SNMP traps. 74 | * `HZ`, located in the `` library, contains the **tick rate** that establishes the frequency of the timer interrupt of the operating system. 
75 | * `jiffies`, located in the `` library, keeps track of the number of timer interrupts that have occured since the system was booted. 76 | 77 | This exercise was pretty cool. Got to mess around with kernel modules for the first time, use the kernel to build custom kernel modules and created processes in the `/proc/` folder. 78 | -------------------------------------------------------------------------------- /operating-system-concepts/chapter3-notes.md: -------------------------------------------------------------------------------- 1 | # Chapter 3 2 | 3 | ## Process Concept 4 | 5 | Early computers executed what were then known as jobs, then time-shared systems emerged that ran user programs or tasks. The concept is that a computer must be able to run multiple different programs at one time, conduct memory management, etc. All of these different activities are called processes. 6 | 7 | The status of a process is represented by the program counter and the contents of the processor's registers. The memory layout of a process is typically divided into multiple sections: 8 | 9 | * **Text section** - the executable code 10 | * **Data section** - global variables 11 | * **Heap section** - memory that is dynamically allocated during program run time 12 | * **Stack section** - temporary data storage when invoking functions (such as function parameters, return addresses, and local variables) 13 | 14 | The text and data sections of a process are static and will remain the same throughout the execution of the process, however, the stack and heap can shrink and grow dynamically during program execution. Even though the stack and heap grow *toward* one another, the operating systems ensures they do not *overlap*. 15 | 16 | As a process executes, it changes states with each state being defined by the current activity of that process: 17 | 18 | * **New** - the process is being created 19 | * **Running** - instructions are being executed by the processor 20 | * **Waiting** - the process is waiting for some event to occur (I/O or a signal) 21 | * **Ready** - the process is waiting to be assigned to a processor 22 | * **Terminated** - the process has finished execution 23 | 24 | The states above are found on all operating systems, however, some operating systems also more finely delineate process states. 25 | 26 | It is important to understand that only **one** process may be in the **running** state on any processor core at any instant. Many processes may be **ready** and **waiting**. 27 | 28 | All processes within an operating system are represented by a process control block (PCB). PCBs contain these pieces of information about a process: 29 | 30 | * **Process state** 31 | * **Program counter** - indicates the address of the next instruction to be executed for the process 32 | * **CPU registers** - vary in number and type depending upon computer architecture; includes accumulators, index registers, stack pointers, general-purpose pointers, condition-code information; this information must be saved when an interrupt occurs to allow the process to be continued correctly when it is rescheduled to run 33 | * **CPU-scheduling information** - includes the process priority, pointers to scheduling queues, and other scheduling parameters 34 | * **Memory-management information** - includes value of the base and limit registers, page tables, segment tables 35 | * **Accounting information** - includes the amount of CPU and real time used, time limits, account numbers, job or process numbers, etc. 
36 | * **I/O status information** - includes the list of I/O devices allocated to the process, a list of open files, etc. 37 | 38 | ## Process Scheduling 39 | 40 | The purpose of multiprogramming is to have processes running at all time so as to maximize CPU utilization. Time-sharing switches a CPU core among processes frequently enough that users can interact with each programming while the program is running. Process schedulers meet these objectives by selecting available processes and scheduling them for execution on a core; each core can run one process at a time. 41 | 42 | In order to balance multiprogramming and time-sharing, we have to consider the general behavior of a process. Most processes can be defined as I/O-bound or CPU-bound. I/O bound processes spend more time conducting I/O, and CPU-bound processes spend more time doing computations. 43 | 44 | Processes are entered into the ready queue, a linked list that has a header which points to the first PCB in the list, and each PCB points to the next process in the queue. Processes that are waiting on some sort of I/O event or signal are placed in the wait queue, which is also a linked list with a header point to the first PCB and each PCB pointing to the process after it. 45 | 46 | A CPU scheduler conducts the selection of which processes get to utilize a CPU core for execution, and if memory is overcommitted some operating systems swapping. Swapping is a method of freeing up memory by saving a running process to disk, and vice-versa in order to place a process back in memory. 47 | 48 | Context switching happens when a process encounters an interrupt and changes its process state. The operating system conducts a context switch for this process by saving its context (the PCB) using a state save. To resume operations, an operating system conducts a state restore for the process. So, when suspending a process we use a state save and when the process is ready to execute again we use a state restore. 49 | 50 | Context switching is expensive and is completely overhead - no instructions are being executed during a context switch. The speed at which context switches occur are mainly based upon hardware support, special instructions required by the operating system, the number of registers that need to be copied, and the memory's speed. 51 | 52 | ## Operations on Processes 53 | 54 | When a parent process creates a child process, the child process will need resources allocated to it to conduct execution (CPU time, memory, files, I/O devices). A child process may be able to obtain these resources directly from the operating system, or it could be constrained to a set of resources defined by the parent. Restricting a child to a set of resources prevents any process from overloading the system by creating too many child processes. 55 | 56 | In addition to providing resources, parent processes can also pass along initialization input to the child process. This can encompass file names, file descriptors, device names, etc. 57 | 58 | When a parent process spawns another process, two possibilities for the parent's execution exist: 59 | 1. The parent continues to execute concurrently with the children. 60 | 2. The parent waits until some or all of the children have terminated execution. 61 | 62 | There are also two address-space possibilities for the new process: 63 | 1. The child process is a duplicate of the parent process (it has the same program data as the parent). 64 | 2. The child process has a new program loaded into it. 
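A minimal UNIX sketch ties these possibilities together: the parent waits (execution possibility 2) while the child replaces its duplicated address space with a new program via `exec()` (address-space possibility 2). The choice of `ls` as the exec'd program is arbitrary.

```c
/*
 * Minimal UNIX sketch: the child replaces its duplicated address space
 * with a new program via exec(), while the parent waits for the child
 * to terminate.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();                 /* duplicate the calling process */

    if (pid < 0) {                      /* fork failed */
        perror("fork");
        exit(EXIT_FAILURE);
    } else if (pid == 0) {              /* child: fork() returned 0 */
        execlp("ls", "ls", "-l", (char *)NULL);   /* load a new program */
        perror("execlp");               /* only reached if exec fails */
        exit(EXIT_FAILURE);
    } else {                            /* parent: fork() returned the child's pid */
        int status;
        waitpid(pid, &status, 0);       /* suspend until the child terminates */
        printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    }
    return 0;
}
```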
65 | 66 | In UNIX, we use the `fork()` system call to create a new child process. This child process will have a copy of the address space of the parent process allowing the parent to communicate easily with the child. Both the child and the parent will continue execution after the `fork()` call, however, the return code for `fork()` for the child is `0` whereas the return code for `fork()` for the parent is the `pid` of the child. 67 | 68 | Typically, one of the two processes uses the `exec()` system call to replace the process's memory space with a new program. The process using `exec()` will now execute an entirely different set of instructions. Now, the two processes are able to communicate and go their separate ways. The parent can continue execution, create more children, or issue the system call `wait()` to suspend its operation until the child terminates execution. 69 | 70 | Processes terminate execution and return all resources back to the operating system using the `exit()` system call. The process will then return some exit value, usually an integer, to the waiting parent process. A process can also cause the termination of other processes by invoking system calls (in Windows, the `TerminateProcess()` system call). The parent needs to know the `pid` of the child process to terminate it, thus the identity of the newly created process is passed to the parent. A parent terminates the execution of child processes for a variety of reasons - here are some examples: 71 | 72 | * The child has exceeded its usage of some of the resources it was allocated. 73 | * The task assigned to the child is no longer required. 74 | * The parent is exiting execution, and the operating system does not allow a child to continue if its parent terminates. 75 | 76 | When a process terminates, it resources are deallocated by the operating system, however, the process's entry in the process table will remain until the parent calls `wait()`. A process spawned by a parent that hasn't called `wait()` yet is known as a **zombie** process. If a parent process terminates without calling `wait()`, the child process will become and **orphan**. In traditional UNIX operating systems, **orphan** processes will be re-assigned `init` as their parent process, and `init` will call `wait()` to terminate all orphans. 77 | 78 | ## Definitions 79 | 80 | * **Process** - a program in execution. The unit of work in most systems. An active entity. 81 | * **System** - consist of a collection of processes: operating-system processes execute system code, and user processes execute user code. 82 | * **Threads** - different parts of a process that run in parallel. 83 | * **Program counter** - represents the status of the current activity of a process 84 | * **Activation record** - contains function parameters, local variables, and the return address; created when a function is called; placed onto the stack; popped from the stack when the function is complete 85 | * **Program** - a passive entity, a list of instructions stored on disk; an executable file 86 | * **State** - state of a process, defined by the current activity of that process. 
87 | * **Process control block** - represents processes in an operating system; also called a task control block; represents the context of a process 88 | * **Process scheduler** - selects an available process for program execution on a core 89 | * **Degree of multiprogramming** - the number of processes currently in memory 90 | * **Ready queue** - linked list containing PCBs of ready processes 91 | * **Wait queue** - linked list containing PCBs of processes waiting on some event 92 | * **Dispatch** - when a process is selected by the scheduler for execution 93 | * **Queueing diagram** - a common representation of process scheduling 94 | * **CPU scheduler** - selects from among the processes that are in the ready queue and allocates a CPU core to the selected process 95 | * **Swapping** - moving a process from memory to disk and vice-versa; only necessary when memory has been overcommitted and must be freed up 96 | * **Process identifier** - an integer that uniquely identifies a process on the operating system 97 | * **Zombie process** - a process that has completed execution but whose parent has not yet called `wait()` to reap it. 98 | * **Orphan process** - a process whose parent has terminated without calling `wait()` on the child. 99 | 100 | ## Practical Data 101 | 102 | * The GNU `size` command can be used to determine the size (in bytes) of each section of a program. The `size` command will return the size values for `text`, `data`, `bss`, `dec`, `hex`, and `filename`. `data` refers to initialized data and `bss` refers to uninitialized data. 103 | * The process control block in the Linux operating system is represented by the C structure `task_struct`, found in `<linux/sched.h>`. Within the Linux kernel, all active processes are represented using a doubly linked list of `task_struct`. The kernel maintains a pointer, `current`, to the process currently executing on the system. 104 | * In the early days of smartphones, particularly with Apple iOS, the limitations of the hardware led Apple to separate process execution and multitasking based upon the display. Applications running in the **foreground** were the applications on the display - these applications were considered running. Applications in the **background** remained in memory but did not occupy the screen - these applications had limited execution options. Eventually the technology advanced, and multiple **foreground** applications were supported on Apple iOS using **split-screen**. 105 | * Android has always supported multitasking and does not place constraints on what applications can run in the background. Background applications on Android use a **service**, which runs on behalf of the background process. Services do not have a user interface and have a small memory footprint, making them useful for multitasking on mobile devices. 106 | * On a typical Linux operating system, `systemd` serves as the root parent process for all user processes and is the first user process created when the system boots; it is assigned a `pid` of 1. 107 | * On traditional UNIX systems, `init` serves as the root parent process for all child processes; it is assigned a `pid` of 1. 108 | * `pstree` displays a tree of all processes in the system. 109 | * In UNIX, a new process is created by the `fork()` system call. 110 | * In UNIX, the `exec()` system call loads a binary file into memory (destroying the memory image of the program containing the `exec()` system call) and starts its execution. 
111 | * In Windows, the `CreateProcess()` system call executes similar to the UNIX `fork()`. In contrast to UNIX, Windows `CreateProcess()` requires loading a specified program into the address space of the child process at creation. `CreateProcess()` also expects no fewer than ten parameters, whereas `fork()` is passed no parameters. -------------------------------------------------------------------------------- /operating-system-concepts/chapter4-notes.md: -------------------------------------------------------------------------------- 1 | # Chapter 4 2 | 3 | ## Overview 4 | 5 | A traditional process has a single thread of control, however, if it has multiple threads it can perform more than one task at a time. An application is typically implemented as a process with several threads of control. A few examples of multithreaded applications are: 6 | 7 | * An application that creates photo thumbnails from a collection of images may use a separate thread to generate a thumbnail from each separate image. 8 | * A web browser might have one thread display images or text while another thread retrieves data from the network. 9 | * A word processor may have a thread for displaying graphics, another thread for responding to keystrokes from the user, and a third thread for performing spelling and grammar checking in the background. 10 | 11 | Before multithreading was invented, applications would create separate processes to conduct the same steps of execution in order to conduct more computation in parallel. At scale, however, create an entirely new process is more time consuming and resource intensive than utilizing threads. 12 | 13 | The benefits of multithreading can be broken down into four categories: 14 | 15 | 1. Responsiveness - allows a program counter to continue running even if part of the program is blocked or performing a lengthy operation. This allows the user to perceive more responsiveness from the application, rather than the application freezing as it conducts all operations sequentially. 16 | 2. Resource sharing - processes are only able to share resources through shared memory and message passing, requiring those mechanisms to be explicitly defined by the programmer. Threads share the memory and resources of the process they belong to by default. 17 | 3. Economy - because threads share the resources of the process to which they belong, it is more economical to create and context-switch threads. 18 | 4. Scalability - threads can run in parallel on different processing cores, whereas a single-threaded process can run on only one processor, regardless of how many are available. 19 | 20 | ## Multicore Programming 21 | 22 | Multithreaded programming provides a mechanism for more efficient use of multiple computing cores and improved concurrency. On a system with a single computing core, an application with four threads will only be able to execute one at a time, however, each thread will progress in execution. In contrast, on a multicore processor, these four threads will be able to run in parallel, as multiple threads will be able to execute simultaneously. This is the distinction between concurrency and parallelism. 23 | 24 | As multicore processing became more popular, designers of operating systems had to write scheduling algorithms that supported parallel execution. Five areas present challenges in programming for multicore systems: 25 | 26 | 1. 
Identifying tasks - examining applications and finding areas of execution that can be divided into separate, concurrent tasks; ideally independent of one another and can run parallel on individual cores. 27 | 2. Balance - ensuring that tasks perform equal work of equal value. 28 | 3. Data splitting - ensuring the data accessed and manipulated by the tasks is divided to run on seperate cores. 29 | 4. Data dependency - ensuring dependencies for data between tasks is resolved and operations on data are synchronized. 30 | 5. Testing and debugging - ensuring concurrent programs and their paths of execution are bug free. 31 | 32 | Data parallelism and task parallelism contrast because data parallelism is concerned with distributing data across multiple cores, conducting the same operation on each datum, while task parallelism separates tasks across multiple cores in threads - data agnostic. 33 | 34 | ## Multithreading Models 35 | 36 | Support for threads can be provided at the user level, i.e. user threads, or by the kernel, i.e. kernel threads. There are three common ways of establishing relationships between threads: many-to-one model, one-to-one model, and the many-to-many model. 37 | 38 | In the many-to-one model, many user threads are mapped to one kernel thread. Thread management is conducted by a thread library in user space, increasing efficiency, however, the entire process will block if a thread makes a blocking system call for some I/O or signal. Because only one thread can access the kernel at a time, multiple threads of a process are unable to run in parallel on multicore systems. 39 | 40 | In the one-to-one model, each user thread is mapped to a kernel thread providing more concurrency than the many-to-one by allowing other threads to run if the thread makes a blocking system call. This also allows other threads to run in parallel on multiprocessors. The biggest down-side is that for every user thread created, a kernel thread has to be created as well - this large number of kernel threads may burden the performance of a system. 41 | 42 | The many-to-many model multiplexes many user level threads to a smaller or equal number of kernel threads. The many-to-many model suffers none of the shortcomings of the previous two threading models: the programmer can create as many threads as necessary and the kernel threads can run in parallel on a multiprocessor. When the user threads execute a system call that will block, the kernel can schedule another thread for execution. 43 | 44 | One variation of the many-to-many model still multiplexes many user level threads to a smaller or equal number of kernel threads but also allows a user level thread to be bound to a kernel thread. This variation is referred to as the **two-level model**. 45 | 46 | ## Thread Libraries 47 | 48 | Thread libraries can be implemented in two ways: 49 | 50 | 1. The first approach is to provide a library entirely in user space with no kernel support. All code and data structures for the library exist in user space - invoking a function in the library does not require system calls. 51 | 2. The second approach is to implement the library in kernel space - the library is directly supported by the operating system. Code and datastructures for the library exist in kernel space - invoking a function in the API for the library typically results in a system call to the kernel. 52 | 53 | Two strategies exist for creating multiple threads: 54 | 55 | 1. 
Asynchronous threading - once the parent thread creates a child thread, the parent resumes execution. 56 | 2. Synchronous threading - the parent thread creates one or more children threads and then waits for all children to terminate before resuming execution. 57 | 58 | ## Implicit Threading 59 | 60 | Soon enough, most applications are going to be using hundreds to thousands of threads, and application developers will have to learn how to handle these new requirements. One way to address this issue is by using implicit threading, moving the design implementation of multithreading away from the developers and having the compiler and run-time library figure out how to implement the application in a multithreaded fashion. How this would be achieved is by the developer identifying *tasks*, not threads, that can run in parallel. Then the run-time library would map each task to a separate thread in the many-to-many model. 61 | 62 | Creating threads as requirements increase is bad practice. It takes work to create the thread, and if we're just going to destroy the thread after we're done, as well, this could become computationally expensive. We also aren't placing bounds on the number of threads that a process can create, making this possibly resource intensive. To solve this issue, we utilize **thread pools**. 63 | 64 | Thread pools offer these benefits: 65 | 1. Servicing a request with an existing thread is often faster than waiting to create a thread. 66 | 2. Thread pools limit the number of threads in existence. 67 | 3. Separating the the task to be performed from the mechanics of creating the task allows us to use different strategies for running the task. 68 | 69 | More sophisticated methods of thread pooling exist for adjusting the number of threads in a pool. Dynamically increasing or decreasing the number of threads based upon usage patterns and requests, etc. 70 | 71 | ## Threading Issues 72 | 73 | So what happens if we call `fork()` or `exec()` inside of a thread? Some UNIX systems have chosen to support two versions of `fork()`: one that duplicates all threads and one that duplicates only the thread invoking the `fork()` system call. `exec()` works the same way as usual for threads - if a thread invokes `exec()` the program specified in the parameter passed to `exec()` will replace the entire process and all its threads. 74 | 75 | Best practice is, if `exec()` is called immediately after `fork()`, it's best to just duplicate that one process. However, if the separate process never calls `exec()`, it's best to duplicate the entire process, including its threads. 76 | 77 | Signals, whether received synchronously or asynchronously, follow the same pattern: 78 | 1. A signal is generated by the occurrence of a particular event. 79 | 2. The signal is delivered to a process. 80 | 3. Once delivered, the signal must be handled. 81 | 82 | Synchronous signals are delivered to the same process the performed the operation causing the signal. Example: illegal memory access and division by 0. Asynchronous signals are signals generated by an event external to a running process. Examples would be terminating a process with CTRL + C or having a timer expire. 83 | 84 | A signal may be *handled* by one of two possible handlers: 85 | 1. A default signal handler 86 | 2. A user-defined signal handler 87 | 88 | Handling signals in sequential, single-thread programs is pretty simple. How do we handle signals in multithreaded processes though? where should the signals be delivered? 
A couple of options are available: 89 | 1. Deliver the signal to the thread to which the signal applies. 90 | 2. Deliver the signal to every thread in the process. 91 | 3. Deliver the signal to certain threads in the process. 92 | 4. Assign a specific thread to receive all signals for the process. 93 | 94 | The standard method in UNIX for delivering a signal is: 95 | 96 | `kill(pid_t pid, int signal)` 97 | 98 | Most multithreaded versions of UNIX allow threads to determine which threads they will handled and which they will block. Because signals need to be handled only once, usually a signal is handled by the first thread found that isn't blocking it. POSIX Pthreads implementation provides an API call that allows a signal to be delivered to a specified thread: 99 | 100 | `pthread_kill(pthread_t tid, int signal)` 101 | 102 | If multiple threads are searching through a database and one thread returns the result before the rest of the threads, the remaining threads need to be cancelled (*thread cancellation*). Cancellation of target threads may occur as such: 103 | 1. Asynchronous cancellation - one thread immediately terminates the target thread 104 | 2. Deferred cancellation - the target thread periodically checks whether it should terminate, allowing it an opportunity to terminate itself in an orderly fashion. 105 | 106 | Asynchronous cancellation is dangerous because the thread could be killed before freeing resources. Pthreads thread cancellation is initiated using the `pthread_cancel()` function. A target thread will then cancel in a deferred state once it reaches a blocking call like `recv()`. `pthread_testcancel()` is also available for threads to use to see if they have been signalled to cancel. Pthreads also provides a `cleanup handler` function that allows a thread to cleanup resources it has allocated before cancelling. 107 | 108 | Thread-local storage allows threads the ability to save data across function invocations. It's different from local storage as it will remain static as the thread continues to execute during it's lifetime. 109 | 110 | To the user-thread, the lightweight process is a virtual processor on which the application can schedule a user thread to be run. Each LWP is attached to a kernel thread, and the kernel threads are what get scheduled by the operating system to run on the processor. If a kernel thread blocks on some I/O or event, the LWP blocks and then the user-level thread attached to the LWP blocks. 111 | 112 | One scheme for communication between user and kernel threads is *scheduler activation*. The kernel provides an application with a set of virtual processors (LWPs) and the application can schedule user threads onto an available virtual processor. The kernel will notify the application of certain events by using upcalls. Upcalls are handled by the thread library and must run on a LWP. 113 | 114 | ## Definitions 115 | * **Thread** - a basic unit of CPU utilization; includes a thread ID, program counter, a register set, and a stack. It shares with other threads of the same process its code section, data section, and other resources such as open files and signals. 116 | * **Multicore** - systems in which multiple computing cores are placed on a single processing chip and each one appears as a separate CPU to the operating system. 117 | * * **Concurrency** - the ability to support more than one task by allowing all the tasks to make progress. 118 | * **Parallelism** - the ability to perform more than one task simultaneously. 
119 | * **Data parallelism** - focuses on distributing subsets of data across multiple cores and performing the same operation on each core. 120 | * **Task parallelism** - focuses on distributing tasks across multiple computing cores, with each thread performing a unique operation either on the same data or different data 121 | * **User threads** - threads supported above the kernel and are managed with kernel support 122 | * **Kernel threads** - threads supported and managed directly by the operating system 123 | * **Two-level model** - a variation of the many-to-many model that also allows user level threads to be bound to a specific kernel thread. 124 | * **Thread library** - provides the programmer with an API for creating and managing threads. 125 | * **Implicit threading** - the concept of transferring the creation and management of threading from application developers to compilers and run-time libraries. 126 | * **Thread pool** - a pool of *N* number of threads created at the beginning of a process. 127 | * **Signal** - used in UNIX systems to notify a process that a particular event has occurred. 128 | * **Default signal handler** - kernel defined signal handler 129 | * **User-defined signal handler** - signal handler defined by the user that overrides kernel defined handlers 130 | * **Asynchronous procedure calls** - Windows method for emulating UNIX signals 131 | * **Thread cancellation** - involves terminating a thread before it has completed 132 | * **Target thread** - a thread that is to be canceled 133 | * **Thread-local storage** - how threads keep their own copy of certain data 134 | * **Lightweight process** - intermediate data structure between the user and the kernel threads usually implemented for the many-to-many or the two-level model 135 | * **Upcall** - the method in which the kernel can notify a process of an event with using LWPs 136 | 137 | ## Practical Data 138 | * Most operating system kernels are also typically multithreaded. During system boot time on Linux, several kernel threads are created with each performing a specific task such as: managing devices, memory management, or interrupt handling. `kthreadd` (with `pid == 2`) serves as the parent of all the kernel threads. 139 | * * **Amdahl's Law** - a formula that identifies potential performance gains from adding additional computing cores to an application that has both serial (nonparallel) and parallel components. 140 | * **Green threads** - a thread library available for Solaris systems and adopted in early versions of Java - used the many-to-one model, however, very few systems continue to use the model because of its inability to take advantage of multicore processing. 141 | * Linux and Windows implement the one-to-one threading model. 142 | * The three main thread libraries in use today are `POSIX` Pthreads, Windows, and Java. Pthreads, the threads extension of the `POSIX` standard, may be provided as either a user level or kernel level library. The Windows thread library is a kernel level library provided on Windows systems. The Java thread API allows threads to be created and managed directly in Java programs. Because Java runs in the Java Runtime Environment `JVM`, Java threads utilize the thread library available on the host operating system. This means that if Java is running on Windows it's using kernel level Windows threads and if it's running on anything UNIX, Linux, macOS, etc., it's using Pthreads. For POSIX and Windows threading, anything declared globally is shared across all threads. 
Java has no equivalent notion of global data, access to shared data must be explicitly arranged between threads. 143 | * Windows and Java thread programming APIs both have builtin thread pool functions, allowing a programmer to create a pool of threads easily and allowing a programmer to queue work of a pool of threads. 144 | * **OpenMP** is a set of compiler directives as well as an API for programs written in C, C++, or FORTRAN that provide support for parallel programming in shared-memory environments. **OpenMP** identifies **parallel regions** as blocks of code that may run in parallel. 145 | * **Grand Central Dispatch** - a technology developed by Apple for its macOS operating system, it is a combination of run-time library, an API, and language extensions that allow developers to identify sections of code to run in parallel. 146 | * **Intel Thread Building Blocks** - Intel provides a template library that supports designing parallel applications in C++. Developers specify tasks that can run in parallel, and the TBB scheduler maps these tasks onto underlying threads. -------------------------------------------------------------------------------- /operating-system-concepts/chapter6-notes.md: -------------------------------------------------------------------------------- 1 | # Chapter 6 2 | 3 | ## Overview 4 | 5 | To avoid race conditions, processes must request permission to enter their critical section to work on some set of the code that accesses shared memory. This request for permission location is known as the entry section. Once the process enters the critical section and then completes all of their instructions, they reach an exit section and a remainder section. 6 | 7 | Solutions to the *critical-section problem* must satisfy the following three requirements: 8 | 1. Mutual exclusion - if a process is in the critical section, no other processes can be executing in their critical sections. 9 | 2. Progress - if no process is executing in its critical section and some process wishes to enter its critical section, then only those processes that are not executing their remainder section can participate in deciding which will enter its critical section next, and this selection cannot be postponed indefinitely. 10 | 3. Bounded waiting - there exists a bound, or limit, on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted. 11 | 12 | General approaches to handling critical sections in operating systems include *preemptive kernels* and *nonpreemptive kernels*. Nonpreemptive kernels are essentially safe from race conditions, are less responsive as they will run for an arbitrarily long period of time before releasing the CPU. Preempitve kernels are harder to implement because interrupts will disrupt the flow of execution, requiring the programming to solve the critical section problem, however, preemptive kernels are more suitable for real-time programming. 13 | 14 | Compare and swap are how mutex locks are designed in software. Lock are either contended or uncontended. A lock is considered contended if a thread blocks while trying to acquire the lock. Vice-versa for uncontended locks. 15 | 16 | ## Definitions 17 | * **Cooperating process** - a process that can affect or be affected by other processes executing in the system. 
18 | * **Race condition** - several processes access and manipulate the same data concurrently and the outcome of the execution depends on the particular order in which the access takes place 19 | * **Critical section** - a segment of code in which the process may be accessing - and updating - data that is shared with other processes 20 | * **Entry section** - a segment of code preceding a critical section in which the process requests permission to enter said critical section 21 | * **Exit section** - a segment of code following a critical section 22 | * **Remainder section** - a segment of code following an exit section 23 | * **Preemptive kernels** - allows a process to be preempted while it is running in kernel mode 24 | * **Nonpreemptive kernels** - does not allow a process running in kernel mode to be preempted 25 | * **Memory barrier** - instructions to ensure the system conducts all loads and stores before and subsequent load or store operation is performed; making all changes to registers visible to other processors. 26 | * **Atomic instructions** - instructions as one uninterruptible unit 27 | * **Mutual exclusion lock** - software based protection tool to solve the critical-section problem 28 | * **Spin lock** - a mutex lock in which the process "spins" on the processor, waiting for the mutex to be available 29 | * **Semaphore** - like a mutex lock but has more sophisticated ways to synchronize activities 30 | * **Counting semaphore** - semaphore value can range over an unrestricted domain 31 | * **Binary semaphore** - can range only between 0 and 1 -------------------------------------------------------------------------------- /pdfs/cheatsheets/C_CS.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/cheatsheets/C_CS.pdf -------------------------------------------------------------------------------- /pdfs/cheatsheets/CommanLine_CS.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/cheatsheets/CommanLine_CS.pdf -------------------------------------------------------------------------------- /pdfs/cheatsheets/Docker_CS.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/cheatsheets/Docker_CS.pdf -------------------------------------------------------------------------------- /pdfs/cheatsheets/Linux_Syscall_CS.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/cheatsheets/Linux_Syscall_CS.pdf -------------------------------------------------------------------------------- /pdfs/cs6200.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/cs6200.pdf -------------------------------------------------------------------------------- /pdfs/papers/anderson-notes.md: -------------------------------------------------------------------------------- 1 | # The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors 2 | 3 | ## Introduction 4 | 5 | This paper examines the question: are there efficient algorithms for software 
spin-waiting for busy locks given hardware support for atomic instructions, or are more complex kinds of hardware support needed for performance? 6 | 7 | Spinning processors slow other processors doing useful work, including the ones holding the lock, because the processors are continuously using up all communication bandwidth in an attempt to check and acquire the lock. 8 | 9 | This paper proposes establishing a queue in which processors requesting a spin lock receive a unique ID and when the lock is available, the next processor in the queue with the correct ID receives the lock. This paper also proposes an addition to snoopy cache protocols that exploits the semantics of spin lock requests to obtain better performance. Performance problems of spin-waiting, software alternatives, and hardware solutions are all discussed in this paper. 10 | 11 | ## Section II 12 | 13 | Cache coherence can effect the performance of spin locks and the atomic instructions used to set the locks. Multistage networks connect multiple processors to multiple memory modules, with each memory modules having its own cache. When a value is read during an atomic instruction, all cached copies of that value across all memory modules must be invalidated. Subsequent accesses to that module will be delayed as the new value is calculated. 14 | 15 | ## Section III: The Performance of Simple Approaches to Spin-Waiting 16 | 17 | **Spin on Test-and-Set** 18 | 19 | Simplest method of acquiring lock, worst performance as number of processors spinning on the lock increases. In order to release the lock, the lock holder must contend with all the processors spinning on the lock. All the processors attempting to access the same lock can cause slow accesses to other memory locations if using the same bus, hot-spots delaying memory access to memory modules in a networked architecture, and the consuming of bus transactions, saturating the bus. 20 | 21 | **Spin on Read (Test-and-Test-and-Set)** 22 | 23 | This algorithm is a proposal to read the lock and if it's not BUSY, attempt to set the lock. As soon as the lock is released, all the of the caches are invalidated or written to with a new value. A waiting processor sees the change in state and performs a test-and-set - whoever gets to the lock first wins. When the critical section is small, however, things can get hairy as the remaining processors are still waiting on their caches to update. The lock is acquired by a processor, and all of the other processors still updating their caches will now attempt to test-and-set (because the lock used to be TRUE) and they will fail, thus causing the test-and-set instruction to hit the communication channel and cause congestion again. 24 | 25 | **Test Results** 26 | 27 | Test results show that, for both spin on test-and-set and spin on read then test-and-set, performance dramatically decreases when more processors are added and contend for a lock. As the number of spinning processors increase, the time for the caches to all quiesce also increases. 28 | 29 | ## Section IV: New Software Alternatives 30 | 31 | They tested adding a delay where the delay starts at 0 and goes to P (number of processors). Like CSMA over ethernet, the processor attempts to acquire the lock. If contention occurs, it conducts a backoff and increases its delay until it reaches P. Eventually, each processor will be delaying at a different time, decreasing lock contention and allowing for the caches to quiesce better as well. 
Unfortunately, if there aren't enough delay spots, processors within the same delay will contend. 32 | 33 | They added a delay after every read as well (for test and test-and-set), however, if a critical section is crazy long, the delayed processor could delay for a ridiculous amount of time because of all the reads. When the lock is eventually released, the processor will still have to complete the delay before acquiring the lock. 34 | 35 | -------------------------------------------------------------------------------- /pdfs/papers/anderson-paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/anderson-paper.pdf -------------------------------------------------------------------------------- /pdfs/papers/arpaci-dusseau-notes.md: -------------------------------------------------------------------------------- 1 | # I/O Devices 2 | 3 | ## Canonical Device 4 | 5 | A canonical device is made of these components: 6 | 7 | * Interface 8 | * Registers 9 | * Status 10 | * Command 11 | * Data 12 | 13 | * Internals 14 | * Micro-controller (CPU) 15 | * Memory (DRAM or SRAM or both) 16 | * Other Hardware-specific chips 17 | 18 | ## Canonical Protocol 19 | 20 | Programmed I/O (PIO) is wasteful of CPU cycles because the CPU spends so much time polling and waiting on the device instead of switching to another ready process. 21 | 22 | Using interrupts can be costly, however, due to the number of context switches that could take place. If a device is fast, it's better just to poll on its status rather than interrupt because context switching so often due to interrupts will cause significant overhead. 23 | 24 | ## More Efficient Data Movement with DMA 25 | 26 | Direct Memory Access (DMA) engines allow a CPU to write instructions for the data transfer to the device. The operating system provides a DMA engine the location of the data, how much data to copy, and the device the data is destined for. 27 | 28 | The DMA engine will complete the operation and then raise an interrupt to notify the operating system that the data transfer was completed. 29 | 30 | The DMA provides the CPU the ability to continue doing other meaningful work rather than conducting the entire data transfer for a device. The setup is slower than PIO, however, so is best used for large data transfers. PIO should still be used for small data transfers. 31 | 32 | ## Definitions 33 | 34 | * **hardware interface** - interface for hardware that allows the system to control its operation. All devices have a hardware interface that has some specific protocol for typical interaction. 35 | * **internal structure** - responsible for implementing the abstraction the device presents to the system. 
36 | * **firmware** - software within a hardware device 37 | * **status register** - a register that a controller can read to see the status of the device 38 | * **command register** - a register that can be written to to instruct the device to perform a certain task 39 | * **data register** - a register to pass or retrieve data from the device 40 | * **polling** - the operating system continually checks the status register of a device until it is ready to receive a command 41 | * **programmed I/O (PIO)** - when the main CPU is involved with the interaction and data movement of a device 42 | * **interrupt** - signals that can be sent to the CPU when an I/O operation is complete 43 | * **interrupt service routine** - also called an interrupt handler, this is a piece of operating system code that will handle an interrupt received from an I/O device 44 | * **livelock** - when an operating system finds itself only processing interrupts 45 | * **coalescing** - a device waits a bit before delivering an interrupt to the CPU - this causes latency 46 | * **direct memory access (DMA)** - a device within a system that can orchestrate transfers between devices and main memory without CPU intervention 47 | * **I/O instructions** - used by IBM mainframes, these instructions specify a way for the operating system to send data to specific device registers 48 | * **memory mapped I/O** - the hardware makes device registers available as if they were memory locations - the operating system reads or writes to the address in main memory and the hardware routes those operations to the device 49 | * **device driver** - a piece of software in the operating system that knows, in detail, how a device works. 50 | * **raw interface** - allows devices to directly read and write blocks to a device without file abstraction -------------------------------------------------------------------------------- /pdfs/papers/arpaci-dusseau-paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/arpaci-dusseau-paper.pdf -------------------------------------------------------------------------------- /pdfs/papers/birrell-paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/birrell-paper.pdf -------------------------------------------------------------------------------- /pdfs/papers/birrell-rpc-paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/birrell-rpc-paper.pdf -------------------------------------------------------------------------------- /pdfs/papers/eykholt-notes.md: -------------------------------------------------------------------------------- 1 | # Beyond Multiprocessing Notes 2 | 3 | ## Introduction 4 | 5 | Written in 1992, SunOS was looking for a way to have their kernel support multiprocessing utilizing threads. They wanted to achieve a high degree of concurrency and support more than one thread of control for a user process - each thread able to work independently. 6 | 7 | They also wanted the kernel to be able to provide real-time response, providing support for preemption of almost any point in the kernel. The kernel SunOS built is a complex multi-threaded program. 
Threads can be leveraged by user applications to manage asynchronous activities - the kernel benefits from a similar thread facility.
8 |
9 | Solaris 2.0 is fully preemptible, has real-time scheduling, symmetrically supports multiprocessors, and supports user-level multithreading.
10 |
11 | ## Overview of the Kernel Architecture
12 |
13 | Kernel threads are the units that are scheduled and dispatched onto one of the CPUs of the system. Lightweight, having only a small data structure and stack, kernel threads do not require a change of virtual memory address space information - this makes context switching of kernel threads inexpensive.
14 |
15 | Kernel threads are fully preemptible and may be scheduled by any of the scheduling classes in the system, including the real-time class. All execution entities are built using kernel threads, representing a fully preemptible, real-time nucleus in the kernel. Kernel threads support synchronization primitives with protocols for preventing priority inversion, so that a thread's priority is determined by which activities it is impeding by holding locks, as well as by the service it is performing.
16 |
17 | User processes in Solaris 2.0 use lightweight processes (LWPs) that share the address space of the process and other resources such as open files. The kernel supports these LWPs by assigning a kernel thread to each LWP. A user-level thread library uses LWPs to implement user-level threads - the user-level threads are scheduled by the library, removing the need to enter the kernel to switch the currently executing thread.
18 |
19 | ## Kernel Thread Scheduling
20 |
21 | Solaris 2.0 uses priority-based scheduling and, if multiple threads have the same priority, the scheduler runs them in round-robin order. The kernel is preemptive: a runnable thread runs as soon as is practical after its priority becomes high enough.
22 |
23 | ## System Threads
24 |
25 | System threads can be created for short- or long-term activities and belong to the system scheduling class. These threads have no need for an LWP structure; their thread structure and stack are allocated in a non-swappable area.
26 |
27 | ## Synchronization Architecture
28 |
29 | The kernel implements synchronization objects similar to those provided by the user-level libraries: mutex locks, condition variables, semaphores, etc. Upon creating these objects, it is possible to set options that enable statistics gathering, giving the programmer a way to collect usage data and wait times.
30 |
31 | ## Mutual Exclusion Lock Implementation
32 |
33 | If a lock is contended on Solaris 2.0, the blocking action taken depends on the mutex type. The default blocking policy has the thread or process spin while the owner of the lock remains running on a processor; this is done by polling the owner's status in a wait loop. If the owner is not running, the waiter blocks instead (these are the adaptive locks mentioned in the summary).
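The adaptive policy described above can be sketched roughly in C. This is a minimal illustration, not the actual SunOS/Solaris code: the struct fields, `block_on_turnstile()`, and the use of a GCC atomic builtin are all assumptions.

```c
/* Hedged sketch of an adaptive mutex acquire (hypothetical types and names).
 * Policy: spin while the lock owner is running on a CPU, block otherwise. */
typedef struct adaptive_mutex {
    volatile int locked;        /* 1 while held */
    volatile int owner_on_cpu;  /* 1 while the owning thread is on a processor */
} adaptive_mutex_t;

/* Hypothetical sleep primitive: block the caller until the lock is released. */
extern void block_on_turnstile(adaptive_mutex_t *m);

void adaptive_lock(adaptive_mutex_t *m)
{
    for (;;) {
        /* Try to grab the lock atomically (GCC builtin). */
        if (__sync_lock_test_and_set(&m->locked, 1) == 0)
            return;

        if (m->owner_on_cpu) {
            /* Owner should release soon: poll its status in a wait loop. */
            while (m->locked && m->owner_on_cpu)
                ;   /* busy-wait */
        } else {
            /* Owner is off-CPU: sleeping is cheaper than spinning. */
            block_on_turnstile(m);
        }
    }
}
```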
34 |
35 | ## Summary
36 |
37 | SunOS 5.0 is a multithreaded and symmetric multiprocessor kernel featuring:
38 | * a fully preemptible, real-time kernel
39 | * a high degree of concurrency on symmetric multiprocessors
40 | * support for user threads
41 | * interrupts handled as independent threads
42 | * adaptive mutual-exclusion locks
--------------------------------------------------------------------------------
/pdfs/papers/eykholt-paper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/eykholt-paper.pdf
--------------------------------------------------------------------------------
/pdfs/papers/fedorova-notes.md:
--------------------------------------------------------------------------------
1 | # Chip Multithreading Systems Need a New Operating System Scheduler
2 |
3 | ## Summary
4 |
5 | When this paper was written, computer scientists were having difficulty keeping the processor pipeline utilized by modern workloads, due to the frequent branches and control transfers these workloads perform. Chip multithreading (CMT) is a processor architecture combining chip multiprocessing and hardware multithreading to address the issue of low processor pipeline utilization. This paper proposes a new operating system scheduler designed to make use of the technology provided by CMT.
6 |
7 | ## Introduction
8 |
9 | Transaction-processing-like workloads - those generated by application servers, web services, on-line transaction processing systems, etc. - usually consist of multiple threads of control that execute short stretches of operations with frequent dynamic branches. These workloads hurt cache locality and branch prediction and frequently cause the processor to stall.
10 |
11 | Chip multiprocessing (CMP) and hardware multithreading (MT) techniques are designed to improve processor utilization for these types of workloads. CMP and MT offer better support for thread-level parallelism (TLP) than for the instruction-level parallelism (ILP) exploited by the superscalar processors used for more scientific workloads.
12 |
13 | A CMP processor includes multiple processor cores on a single chip, allowing more than one thread to be active at a time.
14 |
15 | An MT processor has multiple sets of registers and other thread state, interleaving execution of instructions from different threads, either by switching between threads or by executing instructions from multiple threads simultaneously.
16 |
17 | In this paper, the authors propose that a specialized scheduler for CMT systems can provide a performance gain as large as a factor of two. They state that schedulers designed for single-processor multithreaded systems do not scale to the proposed architecture of CMT processors; CMT systems will require a fundamentally new design for the operating system scheduler.
18 |
19 | Ideal schedulers maximize system throughput and minimize resource contention. Because resource contention ultimately determines performance, a scheduler designed for CMT systems must understand how its scheduling decisions will affect resource contention.
20 |
21 | ## Experimental Platform
22 |
23 | The authors created a model multithreaded processor core for their experiments. The MT core has multiple hardware contexts.
These hardware contexts allow the processor to interleave the execution of multiple threads, switching between their contexts on each clock cycle. When a thread blocks, the processor switches to another thread's context if one is available. For multithreaded workloads, this hides the latency of long operations.
24 |
25 | ## Studying Resource Contention
26 |
27 | On a CMT system, each hardware context appears as a logical processor to the operating system. Software threads are assigned to a hardware context for the duration of their scheduling time slice.
28 |
29 | When assigning threads to hardware contexts, the scheduler has to decide which threads should be run on the same processor and which threads should be run on separate processors. The scheduler should aim for the optimal thread assignment, in which the workload produces high utilization of the processor.
30 |
31 | Pipeline utilization and contention depend upon the latencies of the instructions that the workload executes. Threads running workloads dominated by instructions with long delay latencies cause little resource contention, but leave a lot of the resources unused. Threads running workloads with short delay latencies will be consistently using resources and the pipeline, possibly starving longer-latency threads.
32 |
33 | The experiments in this paper indicate that the instruction mix, namely the average instruction delay latency, can be used as a heuristic for approximating the processor pipeline requirements of a workload. A scheduler can use instruction delay latency to inform scheduling decisions for a specific workload. It makes sense that threads with long instruction delay latencies should be co-scheduled with threads that have short delay latencies, so that each core runs a mix of both. This allows higher resource usage, since threads waiting on memory-bound operations can be switched out for threads that are immediately ready to execute.
34 |
35 | This doesn't take cache contention into account. If a thread requires a large portion of the cache, we lose performance for memory-bound instruction mixes because, more often than not, the cache will have to be refreshed upon a context switch.
36 |
37 | ## Summary
38 |
39 | The experiments in this paper show that CPI works well for making scheduling decisions for simple workloads. Balancing CPI across cores allows the hardware threads on those cores to utilize the available resources for computation. Unfortunately, CPI does not provide information about a workload's requirements for resources such as the ALU, cache, etc. - all of these results are based on usage of the processor pipeline.
40 |
41 | ## Definitions
42 |
43 | * **hardware context** - consists of a set of registers and other thread state
44 | * **instruction delay latency** - when a thread performs a long-latency operation and becomes "blocked", subsequent instructions to be issued by that thread are _delayed_ until the long-latency operation completes
45 | * **cycles-per-instruction (CPI)** - captures average instruction delay and serves as an approximation for making scheduling decisions
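As a rough illustration of the CPI-based pairing idea discussed above (not the authors' scheduler; the `thread_info` record, its `cpi` field, and the core-assignment array are hypothetical), a scheduler could sort runnable threads by measured CPI and pair low-CPI threads with high-CPI threads on the same core:

```c
#include <stdlib.h>

/* Hypothetical per-thread record with a measured cycles-per-instruction value.
 * High CPI = long instruction delay latencies (memory-bound); low CPI = CPU-bound. */
struct thread_info {
    int id;        /* index into core_of[] below */
    double cpi;
};

static int by_cpi(const void *a, const void *b)
{
    double x = ((const struct thread_info *)a)->cpi;
    double y = ((const struct thread_info *)b)->cpi;
    return (x > y) - (x < y);
}

/* Pair the lowest-CPI thread with the highest-CPI thread on the same core,
 * the next-lowest with the next-highest, and so on, so each core sees a mix
 * of short and long instruction delay latencies. */
void assign_balanced(struct thread_info *t, int nthreads, int *core_of, int ncores)
{
    qsort(t, nthreads, sizeof t[0], by_cpi);
    for (int i = 0; i < nthreads; i++) {
        /* Take alternately from the low-CPI and high-CPI ends of the array. */
        int idx = (i % 2 == 0) ? i / 2 : nthreads - 1 - i / 2;
        core_of[t[idx].id] = (i / 2) % ncores;
    }
}
```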
--------------------------------------------------------------------------------
/pdfs/papers/fedorova-paper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/fedorova-paper.pdf
--------------------------------------------------------------------------------
/pdfs/papers/nelson-paper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/nelson-paper.pdf
--------------------------------------------------------------------------------
/pdfs/papers/pai-notes.md:
--------------------------------------------------------------------------------
1 | # Flash: An Efficient and Portable Web Server
2 |
3 | ## Introduction
4 |
5 | This paper is about Flash, an asymmetric multi-process event-driven (AMPED) Web server architecture. Web servers take different approaches to achieving concurrency, such as:
6 | * the single-process event-driven (SPED) architecture
7 | * the multi-process (MP) or multi-threaded (MT) architectures
8 |
9 | SPED is good for caching requests and quickly serving those cached requests, whereas MP or MT is better for retrieving content from disk. AMPED attempts to match the performance of SPED on cached workloads while also matching or exceeding the performance of MP or MT on disk-intensive work.
10 |
11 | ## Background
12 |
13 | Basic processing steps performed by a Web server:
14 | 1. Accept connection
15 | 2. Read request
16 | 3. Find file
17 | 4. Send header
18 | 5. Read file
19 | 6. Send data
20 | 7. End connection
21 |
22 | **Static content** is stored on web servers in the form of disk files. **Dynamic content** is generated upon request by auxiliary application programs running on the server.
23 |
24 | A high-performance Web server must interleave the sequential steps associated with serving multiple requests in order to overlap CPU processing with disk accesses and network communication. The server's *architecture* determines what strategy is used to achieve this interleaving.
25 |
26 | ## Server Architectures
27 |
28 | ### Multi-process
29 |
30 | In this architecture, a process is assigned to execute the steps of serving content sequentially, performing all steps before accepting a new request. Because typically 20-200 processes are running, many HTTP requests can be served concurrently. Since each process has its own private address space, no synchronization is necessary to handle the processing of different HTTP requests. However, this makes it more difficult to perform optimizations that rely on global information, like a shared cache or a list of valid URLs.
31 |
32 | ### Multi-threaded
33 |
34 | This architecture uses multiple threads of control operating within a single shared address space, with each thread performing all the steps associated with one HTTP request before accepting a new request. The difference from MP is that all threads share global variables. This allows more room for optimization; however, the threads have to synchronize and control their access to the shared data.
35 |
36 | The MT model requires operating system support for kernel threads - when one thread blocks on an I/O operation, other runnable threads should be scheduled for execution.
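A minimal sketch of the MT model's thread-per-connection structure (an illustration, not Flash's code; the port number and the canned `handle_request()` stub are placeholders):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Placeholder request handler: a real server would read the request, find
 * the file, send the header, and stream the file back. */
static void handle_request(int connfd)
{
    const char *reply = "HTTP/1.0 200 OK\r\nContent-Length: 3\r\n\r\nok\n";
    write(connfd, reply, strlen(reply));
}

/* Each thread performs all the steps for one request before exiting. */
static void *worker(void *arg)
{
    int connfd = (int)(long)arg;
    handle_request(connfd);
    close(connfd);
    return NULL;
}

int main(void)
{
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);          /* arbitrary port */
    bind(listenfd, (struct sockaddr *)&addr, sizeof addr);
    listen(listenfd, 128);

    for (;;) {
        int connfd = accept(listenfd, NULL, NULL);
        pthread_t tid;
        /* One kernel-supported thread per request: if it blocks on disk or
         * network I/O, the OS schedules another runnable thread. */
        pthread_create(&tid, NULL, worker, (void *)(long)connfd);
        pthread_detach(tid);
    }
}
```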
37 |
38 | ### Single-process event driven
39 |
40 | This architecture uses a single event-driven server process to perform concurrent processing of multiple HTTP requests. The server uses non-blocking system calls to perform asynchronous I/O operations.
41 |
42 | The SPED architecture can be thought of as a state machine that performs one basic step of serving an HTTP request at a time, thus interleaving the processing steps associated with many HTTP requests. In each iteration, the server performs a `select` or `poll` to see if there is something to do for that step, and if so it completes the operation; otherwise, it continues to iterate.
43 |
44 | This architecture avoids all of the context switching and thread synchronization of the MP and MT architectures.
45 |
46 | ### Asymmetric Multi-Process Event-Driven
47 |
48 | This architecture combines the event-driven approach of the SPED architecture with *helper* processes (or threads) that handle the blocking I/O operations. The main event-driven process handles all processing steps involved with HTTP requests; when a disk operation is required, the main thread instructs a helper process via inter-process communication (IPC) to perform the blocking operation. The helper notifies the main thread that the operation is complete via IPC; the main thread learns of this by using `select`.
49 |
50 | ## Design comparison
51 |
52 | ### Performance characteristics
53 |
54 | In MP or MT, entire processes or threads block on I/O, which stalls completion of the current HTTP request. In SPED, the entire server is stopped because the single thread of execution is blocking on I/O. The main cost for AMPED is just the IPC cost, but the main thread does not block on I/O.
55 |
56 | ### Memory effects
57 |
58 | Memory usage of the architectures, ranked from lowest to highest:
59 | 1. SPED
60 | 2. AMPED
61 | 3. MT
62 | 4. MP
63 |
64 | ### Disk utilization
65 |
66 | The MP/MT models can generate one disk request per process/thread, while the AMPED model can generate one request per helper. The SPED model can only generate one disk request at a time, so it cannot benefit from multiple disks or disk head scheduling.
67 |
68 | ## Cost/Benefits of optimizations & features
69 |
70 | ### Information gathering
71 |
72 | Web servers gather information like usage statistics, requests, etc. For the MP model, a lot of synchronization and IPC is required to gather this information across multiple processes. MT requires synchronized access to global variables or storing the information per thread. The SPED and AMPED models simplify information gathering since all requests are processed in a centralized fashion.
73 |
74 | ### Application-level Caching
75 |
76 | In MP, each process may have its own cache to avoid IPC and synchronization; however, these multiple caches increase the number of misses and lead to less efficient use of memory. MT can use a single cache, but access to it must be coordinated. AMPED and SPED can use a single cache without synchronization.
77 |
78 | ### Long-lived connections
79 |
80 | If a client holds a connection open for a long time, for example over a slow physical link, AMPED and SPED handle it cheaply because the cost is only a file descriptor. For MT and MP, the cost imposed is the overhead of an extra thread or process.
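A minimal sketch of the SPED-style event loop described above (again an illustration, not Flash itself; `accept_new_client()` and `advance_request()` are hypothetical helpers, each performing one non-blocking step for a connection):

```c
#include <sys/select.h>

/* Hypothetical helpers: each completes one non-blocking step for a connection
 * (accept, read request, find file, send header, send data, ...). */
extern void accept_new_client(int listenfd);
extern void advance_request(int connfd);

/* One iteration services every descriptor that is ready, then loops: this
 * interleaves the processing steps of many HTTP requests in a single process
 * with no thread synchronization or context switching. */
void sped_event_loop(int listenfd, const int *connfds, int nconns)
{
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(listenfd, &readable);
        int maxfd = listenfd;

        for (int i = 0; i < nconns; i++) {
            FD_SET(connfds[i], &readable);
            if (connfds[i] > maxfd)
                maxfd = connfds[i];
        }

        if (select(maxfd + 1, &readable, NULL, NULL, NULL) <= 0)
            continue;   /* interrupted or nothing ready; try again */

        if (FD_ISSET(listenfd, &readable))
            accept_new_client(listenfd);

        for (int i = 0; i < nconns; i++)
            if (FD_ISSET(connfds[i], &readable))
                advance_request(connfds[i]);
    }
}
```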
81 |
--------------------------------------------------------------------------------
/pdfs/papers/pai-paper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/pai-paper.pdf
--------------------------------------------------------------------------------
/pdfs/papers/popek-notes.md:
--------------------------------------------------------------------------------
1 | # Formal Requirements for Virtualizable Third Generation Architectures
--------------------------------------------------------------------------------
/pdfs/papers/popek-paper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/popek-paper.pdf
--------------------------------------------------------------------------------
/pdfs/papers/protic-paper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/protic-paper.pdf
--------------------------------------------------------------------------------
/pdfs/papers/rosenblum-paper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/rosenblum-paper.pdf
--------------------------------------------------------------------------------
/pdfs/papers/stein-notes.md:
--------------------------------------------------------------------------------
1 | # Implementing Lightweight Threads
2 |
3 | ## Introduction
4 |
5 | This paper by SunSoft describes the implementation of a lightweight threads library and the lightweight process (LWP) concept used to multiplex user-level threads onto kernel threads, preventing user-level processes from consuming too many kernel resources.
6 |
7 | ## Threads Library Architecture
8 |
9 | Threads are the programmer's interface for multi-threading, implemented by a library that utilizes underlying kernel-supported threads of control called lightweight processes (LWPs). Threads are scheduled on a pool of LWPs, and the kernel schedules LWPs on a pool of processors. The library can permanently bind a thread to an LWP if that thread needs constant visibility to the CPU (for example, a mouse interface thread). LWPs free programmers using threads from having to worry about how kernel resources are being utilized.
10 |
11 | ## LWP interfaces
12 |
13 | LWPs are like threads, sharing most of the process's resources. Each LWP has its own set of private registers and a signal mask. LWPs also have attributes that are unavailable to threads, like kernel-supported scheduling, virtual time timers, an alternate signal stack, and a profiling buffer.
14 |
15 | ## Threads Library Implementation
16 |
17 | The implementation is fairly standard: threads have a structure with a thread ID, an area to save the execution context, the thread signal mask, the thread priority, and a pointer to the thread stack.
18 |
19 | ## Thread-local Storage
20 |
21 | Threads have private storage in addition to their stack, called thread-local storage (TLS). TLS is unshared, statically allocated data - used for thread-private data that must be accessed quickly.
After program startup, the size of TLS is fixed and can no longer grow, which restricts dynamic linking at runtime to libraries that don't use TLS.
22 |
23 | ## Thread Scheduling
24 |
25 | The threads library implements a thread scheduler that multiplexes thread execution across a pool of LWPs. The LWPs are nearly identical, allowing any thread to execute on any of the LWPs in the pool. When a thread executes, it attaches to an LWP and has all the attributes of a kernel-supported thread.
26 |
27 | ### Thread States and the Two Level Model
28 |
29 | An unbound thread can be in one of five different states:
30 | 1. RUNNABLE
31 | 2. ACTIVE - the LWP serving the thread is running, sleeping in the kernel, stopped, or in the kernel dispatch queue waiting on a processor
32 | 3. SLEEPING
33 | 4. STOPPED
34 | 5. ZOMBIE
35 |
36 | ### Idling and Parking
37 |
38 | When an unbound thread exits and there are no more RUNNABLE threads, the LWP that was running the ACTIVE thread switches to the *idle* stack associated with that LWP and waits for a global LWP condition variable to signal the availability of new RUNNABLE threads.
39 |
40 | When a bound thread blocks on a process-local synchronization variable, the underlying LWP stops running and waits on a semaphore associated with the thread. This LWP state is called *parked*. When the bound thread unblocks, the semaphore is signaled and the LWP continues executing the thread. The same goes for an unbound thread attached to an LWP if there are no more RUNNABLE threads available.
41 |
42 | ### Preemption
43 |
44 | Threads compete for LWPs just as kernel threads compete for CPUs. If there is a possibility that a higher priority RUNNABLE thread exists, the ACTIVE queue is searched to find a lower priority thread. The lower priority ACTIVE thread is preempted, and the LWP schedules the higher priority RUNNABLE thread.
45 |
46 | There are two cases when the need to preempt arises:
47 | 1. When a RUNNABLE thread with a higher priority than the lowest priority ACTIVE thread is created
48 | 2. When the priority of an ACTIVE thread is lowered below that of the highest priority RUNNABLE thread
49 |
50 | One side effect of preemption is that if the target thread is blocked in the kernel executing a system call, it will be interrupted by SIGLWP, and the system call will be restarted when the thread resumes execution after the preemption. This is only a problem if the system call should not be restarted.
51 |
52 | ### The Size of the LWP Pool
53 |
54 | The threads library automatically adjusts the number of LWPs in the pool used to run unbound threads. There are two requirements in setting the size of the LWP pool:
55 | 1. The pool must not allow the program to deadlock due to a lack of LWPs
56 | 2. The library should make efficient use of LWPs
57 |
58 | The current threads library for SunOS starts by guaranteeing that the application does not deadlock. It does so by using a SIGWAITING signal that is sent by the kernel when all the LWPs in the process block in indefinite waits. Another implementation might attempt to compute a weighted time average of the application's concurrency requirements and adjust the pool of LWPs more aggressively. The library also provides an interface for programmers to set the concurrency the application should expect. If the LWP pool grows larger than the number of threads in the process, unused LWPs will age off.
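The Solaris threads library exposed this concurrency hint as `thr_setconcurrency()`; the POSIX analogue is `pthread_setconcurrency()`. A small sketch of supplying the hint is below (the value 4 is arbitrary, and modern implementations may treat the call as advisory or ignore it entirely):

```c
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    /* Hint that the application expects roughly four concurrently runnable
     * threads, so the library can size its pool of kernel entities. */
    int err = pthread_setconcurrency(4);
    if (err != 0)
        fprintf(stderr, "pthread_setconcurrency: error %d\n", err);

    printf("current concurrency hint: %d\n", pthread_getconcurrency());
    return 0;
}
```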
59 |
60 | ## Mixed Scope Scheduling
61 |
62 | Bound and unbound threads coexist: unbound threads continue to be scheduled in a multiplexed fashion, while bound threads simply keep their priority and their connection to their assigned LWP.
63 |
64 | ### Reaping Bound/Unbound Threads
65 |
66 | When a *detached* thread exits, it is placed on a single queue called `deathrow` and its state is set to `ZOMBIE`. A thread's stack is not freed at exit time, since doing that work on the exit path is unnecessary and expensive. The threads library has a special thread called the reaper that frees the threads' memory periodically. The reaper runs when there are idle LWPs or when `deathrow` gets full.
67 |
68 | When an *undetached* thread exits, it is added to the zombie list of its parent thread. The parent thread reaps the zombie threads by executing `thr_join()`.
69 |
70 | The reaper runs at high priority and should run neither too frequently nor too rarely, as reaping affects how quickly new threads can be created. The reaper puts the freed stacks on a cache of available stacks, speeding up stack allocation for new threads.
71 |
72 | ## Thread Synchronization
73 |
74 | The threads library implements two types of synchronization variables: process-local and process-shared.
75 |
76 | ### Process-local Synchronization Variables
77 |
78 | The default blocking behavior is to put the thread to sleep and place it on the sleep queue of the synchronization variable. The blocked thread is then surrendered to the scheduler, and the scheduler dispatches another thread to the LWP. Bound threads stay permanently attached to their LWP, and the LWP becomes *parked*, not *idle*.
79 |
80 | Blocked threads wake up when the synchronization variable they were waiting on becomes available. The synchronization primitives check whether threads are waiting on the synchronization variable; if so, a blocked thread is moved from the variable's sleep queue and dispatched by the scheduler to an LWP. For bound threads, the scheduler unparks the thread and the LWP is dispatched by the kernel.
81 |
82 | ### Process-shared Synchronization Variables
83 |
84 | Process-shared synchronization objects can also be used to synchronize threads across different processes. These objects have to be initialized when they are created because their blocking behavior is different from the default. The initialization function must be called to mark the object as process-shared - the primitives will then recognize that the synchronization variables are shared and provide the correct blocking behavior. The primitives rely on LWP synchronization primitives to put the blocking threads to sleep in the kernel, still attached to their LWPs, and to correctly synchronize between processes.
85 |
86 | ## Signals
87 |
88 | Signals present a challenge: the kernel delivers signals, but thread signal masks are invisible to the kernel, even though correct delivery depends on the thread signal mask. This makes it hard to elicit the correct program behavior.
89 |
90 | The thread implementation also has to support async-signal-safe synchronization primitives. For example, if a thread calls `mutex_lock()` and is then interrupted, what happens if the signal handler also calls `mutex_lock()`? The result is a deadlock. One way to make this safe is to mask signals while a thread is in a critical section.
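A sketch of that signal-masking approach using POSIX calls rather than the Solaris-internal ones (the shared counter is just a stand-in for data that both normal code and a signal handler touch):

```c
#include <pthread.h>
#include <signal.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;   /* stand-in for data a signal handler also uses */

void update_shared(void)
{
    sigset_t all, old;
    sigfillset(&all);

    /* Block asynchronous signals in this thread so a handler cannot run and
     * try to take 'lock' while we already hold it, which would deadlock. */
    pthread_sigmask(SIG_BLOCK, &all, &old);

    pthread_mutex_lock(&lock);
    shared_counter++;                     /* the critical section */
    pthread_mutex_unlock(&lock);

    /* Restore the previous mask; any signal that arrived is delivered now. */
    pthread_sigmask(SIG_SETMASK, &old, NULL);
}
```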
91 |
92 | ### Signal Model Implementation
93 |
94 | One possibility is to have the LWP replicate the thread's signal mask to make it visible to the kernel. The signals that a process can receive then change based upon which threads in the application cycle through the ACTIVE state.
95 |
96 | A problem arises when threads that are rarely ACTIVE are the only threads in the application with specific signals enabled - these threads sleep waiting on signals they may never receive. In addition, any time the LWP switches between threads with different masks, or a thread adjusts its mask, the LWP has to make a system call to notify the kernel.
97 |
98 | To solve this problem, we define the set of signals a process can receive to be equal to the intersection of all the thread signal masks. The LWP signal mask is either less restrictive than or equal to the thread's signal mask. Occasionally, signals will be sent by the kernel to ACTIVE threads that have the signal disabled.
99 |
100 | When this occurs, the threads library prevents the thread from being interrupted by interposing its own signal handler below the application's signal handler. When a signal is received, the global handler checks the current thread's signal mask to determine whether the thread can receive the signal.
101 |
102 | If the signal is masked, the global handler sets the LWP's signal mask to the thread's signal mask and resends the signal, either to the process (for undirected signals) or to the LWP (for directed signals). If the signal is not masked, the global handler calls the signal's application handler. If the signal is not applicable to any of the ACTIVE threads, the global handler will wake up one of the inactive threads that has the signal unmasked.
103 |
104 | This all applies to asynchronous signals. Synchronously generated signals are simply delivered by the kernel to the ACTIVE thread that caused them.
105 |
106 | ### Sending a Directed Signal
107 |
108 | Threads can send signals to other threads using `thr_kill()`; however, if the target thread is not currently running on an LWP, the signal remains pending on the thread in a pending-signals mask. When the target thread resumes execution, it receives the pending signals. If the thread is ACTIVE, the signal is sent to the target thread's LWP. If the LWP is blocking the signal, the signal stays pending on the LWP until the thread unblocks it. While the signal is pending on the LWP, the thread is temporarily bound to the LWP until the signal is delivered to the thread.
109 |
110 | ### Signal Safe Critical Sections
111 |
112 | To prevent deadlock in the presence of signals, critical sections that may be reentered from a signal handler - in both multi-threaded applications and the threads library itself - should be safe with respect to signals. All async signals should be masked during critical sections.
113 |
114 | To make critical sections safe as efficiently as possible, the threads library implements `thr_sigsetmask()`. If no signals occur, `thr_sigsetmask()` makes no system calls, making it just as fast as modifying the user-level thread signal mask.
115 |
116 | The threads library sets and clears a special flag in the thread's structure whenever it enters or exits an internal critical section. Effectively, this flag serves as a signal mask that masks out all signals.
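A hedged sketch of the flag-based technique from the last paragraph (every name here is hypothetical, not the library's actual internals): the library marks critical sections with a per-thread flag, and its interposed handler defers any signal that arrives while the flag is set, delivering it when the critical section ends. This keeps the common, signal-free path free of system calls, matching the `thr_sigsetmask()` fast path described above.

```c
#include <signal.h>

/* Hypothetical per-thread state; a real library would keep these fields in
 * the thread structure rather than in thread-local globals. */
static __thread volatile sig_atomic_t in_critical;   /* "mask everything" flag */
static __thread volatile sig_atomic_t deferred_sig;  /* signal noted while deferred */

/* Hypothetical dispatch to the application's own handler. */
extern void deliver_to_application(int sig);

/* Interposed (global) handler: defer delivery if the thread is currently
 * inside a library-internal critical section. */
static void global_handler(int sig)
{
    if (in_critical) {
        deferred_sig = sig;
        return;
    }
    deliver_to_application(sig);
}

static void enter_critical(void)
{
    in_critical = 1;      /* no system call on the common path */
}

static void exit_critical(void)
{
    in_critical = 0;
    if (deferred_sig) {   /* deliver anything that arrived while deferred */
        int sig = deferred_sig;
        deferred_sig = 0;
        deliver_to_application(sig);
    }
}
```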
-------------------------------------------------------------------------------- /pdfs/papers/stein-shah-paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/one2blame/cs6200/baf313c1759dcca9b50f24e77b775aac399452df/pdfs/papers/stein-shah-paper.pdf --------------------------------------------------------------------------------