├── Lec1 ├── Lec1.pdf ├── images │ ├── mbr.png │ ├── memtrans.png │ ├── bootMemReg.png │ └── masterBootRecord.png ├── Lec1.bib └── Lec1.tex ├── Lec2 ├── Lec2.pdf ├── images │ ├── tss.png │ ├── sysds.png │ ├── logadd.png │ ├── segdesc.png │ ├── segproc.png │ ├── segsel.png │ ├── taskstut.png │ ├── intdescprv.png │ └── privlevel.png ├── Lec2.bib └── Lec2.tex ├── Lec3 ├── Lec3.pdf ├── images │ ├── gpt.png │ ├── icr.png │ ├── pae.png │ ├── 4MB32.png │ ├── basicpag.png │ ├── i386pag.png │ ├── lini386.png │ ├── segpag.png │ └── x64pag.png ├── Lec3.bib └── Lec3.tex ├── Lec4 ├── Lec4.pdf ├── images │ ├── kernmem.png │ └── memboot.png ├── Lec4.bib └── Lec4.tex ├── Lec5 ├── Lec5.pdf ├── images │ ├── bits1.png │ ├── bits2.png │ ├── numa.png │ └── linuxvm.png ├── Lec5.bib └── Lec5.tex ├── Lec6 ├── Lec6.pdf ├── images │ ├── evol.png │ ├── allrel.png │ ├── freearea.png │ └── structrel.png ├── Lec6.bib └── Lec6.tex ├── Lec7 ├── Lec7.pdf ├── Lec7.bib └── Lec7.tex ├── Lec8 ├── Lec8.pdf ├── Lec8.bib └── Lec8.tex ├── Lec9 ├── Lec9.pdf ├── Lec9.bib └── Lec9.tex ├── README.md └── .gitignore /Lec1/Lec1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec1/Lec1.pdf -------------------------------------------------------------------------------- /Lec2/Lec2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/Lec2.pdf -------------------------------------------------------------------------------- /Lec3/Lec3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/Lec3.pdf -------------------------------------------------------------------------------- /Lec4/Lec4.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec4/Lec4.pdf 
-------------------------------------------------------------------------------- /Lec5/Lec5.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec5/Lec5.pdf -------------------------------------------------------------------------------- /Lec6/Lec6.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec6/Lec6.pdf -------------------------------------------------------------------------------- /Lec7/Lec7.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec7/Lec7.pdf -------------------------------------------------------------------------------- /Lec8/Lec8.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec8/Lec8.pdf -------------------------------------------------------------------------------- /Lec9/Lec9.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec9/Lec9.pdf -------------------------------------------------------------------------------- /Lec1/images/mbr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec1/images/mbr.png -------------------------------------------------------------------------------- /Lec2/images/tss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/tss.png -------------------------------------------------------------------------------- /Lec3/images/gpt.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/gpt.png -------------------------------------------------------------------------------- /Lec3/images/icr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/icr.png -------------------------------------------------------------------------------- /Lec3/images/pae.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/pae.png -------------------------------------------------------------------------------- /Lec2/images/sysds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/sysds.png -------------------------------------------------------------------------------- /Lec3/images/4MB32.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/4MB32.png -------------------------------------------------------------------------------- /Lec5/images/bits1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec5/images/bits1.png -------------------------------------------------------------------------------- /Lec5/images/bits2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec5/images/bits2.png -------------------------------------------------------------------------------- /Lec5/images/numa.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec5/images/numa.png -------------------------------------------------------------------------------- 
/Lec6/images/evol.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec6/images/evol.png -------------------------------------------------------------------------------- /Lec1/images/memtrans.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec1/images/memtrans.png -------------------------------------------------------------------------------- /Lec2/images/logadd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/logadd.png -------------------------------------------------------------------------------- /Lec2/images/segdesc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/segdesc.png -------------------------------------------------------------------------------- /Lec2/images/segproc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/segproc.png -------------------------------------------------------------------------------- /Lec2/images/segsel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/segsel.png -------------------------------------------------------------------------------- /Lec2/images/taskstut.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/taskstut.png -------------------------------------------------------------------------------- /Lec3/images/basicpag.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/basicpag.png -------------------------------------------------------------------------------- /Lec3/images/i386pag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/i386pag.png -------------------------------------------------------------------------------- /Lec3/images/lini386.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/lini386.png -------------------------------------------------------------------------------- /Lec3/images/segpag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/segpag.png -------------------------------------------------------------------------------- /Lec3/images/x64pag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec3/images/x64pag.png -------------------------------------------------------------------------------- /Lec4/images/kernmem.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec4/images/kernmem.png -------------------------------------------------------------------------------- /Lec4/images/memboot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec4/images/memboot.png -------------------------------------------------------------------------------- /Lec5/images/linuxvm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec5/images/linuxvm.png 
-------------------------------------------------------------------------------- /Lec6/images/allrel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec6/images/allrel.png -------------------------------------------------------------------------------- /Lec6/images/freearea.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec6/images/freearea.png -------------------------------------------------------------------------------- /Lec1/images/bootMemReg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec1/images/bootMemReg.png -------------------------------------------------------------------------------- /Lec2/images/intdescprv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/intdescprv.png -------------------------------------------------------------------------------- /Lec2/images/privlevel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec2/images/privlevel.png -------------------------------------------------------------------------------- /Lec6/images/structrel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec6/images/structrel.png -------------------------------------------------------------------------------- /Lec1/images/masterBootRecord.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Angelogeb/AOSV/HEAD/Lec1/images/masterBootRecord.png -------------------------------------------------------------------------------- /Lec6/Lec6.bib: 
-------------------------------------------------------------------------------- 1 | @book{bovet_cesati_2006, place={Sebastopol, CA}, title={Understanding the Linux kernel}, publisher={O'Reilly}, author={Bovet, Daniel P. and Cesati, Marco}, year={2006}} 2 | 3 | @book{mauerer_2010, place={Somerset}, title={Professional Linux Kernel Architecture}, publisher={Wiley}, author={Mauerer, Wolfgang}, year={2010}} 4 | 5 | @book{gorman_2004, place={Upper Saddle River, NJ}, title={Understanding the Linux Virtual Memory Manager}, publisher={Prentice Hall}, author={Gorman, Mel}, year={2004}} 6 | 7 | -------------------------------------------------------------------------------- /Lec5/Lec5.bib: -------------------------------------------------------------------------------- 1 | @book{bovet_cesati_2006, place={Sebastopol, CA}, title={Understanding the Linux kernel}, publisher={O'Reilly}, author={Bovet, Daniel P. and Cesati, Marco}, year={2006}} 2 | @misc{linboot, title={Linux i386 Boot Code HOWTO}, url={https://www.tldp.org/HOWTO/Linux-i386-Boot-Code-HOWTO/}, journal={Linux i386 Boot Code HOWTO}, year={2004}, month={Jan}} 3 | @book{gorman_2004, place={Upper Saddle River, NJ}, title={Understanding the Linux Virtual Memory Manager}, publisher={Prentice Hall}, author={Gorman, Mel}, year={2004}} 4 | 5 | -------------------------------------------------------------------------------- /Lec4/Lec4.bib: -------------------------------------------------------------------------------- 1 | @book{intel, title={Intel 64 and IA-32 Architecture Software Developer’s Manual}, volume={3A}, author={Intel}} 2 | @book{bovet_cesati_2006, place={Sebastopol, CA}, title={Understanding the Linux kernel}, publisher={O'Reilly}, author={Bovet, Daniel P. 
and Cesati, Marco}, year={2006}} 3 | @misc{lbp, title={Linux Boot Protocol}, url={https://www.kernel.org/doc/Documentation/x86/boot.txt}, publisher={Linux Kernel}} 4 | @misc{duarte_kern_boot, title={The Kernel Boot Process}, url={https://manybutfinite.com/post/kernel-boot-process/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Jun}} 5 | @misc{linin, title={Linux Inside}, url={https://0xax.gitbooks.io/linux-insides/content/index.html}, journal={Linux Inside}, author={Alex}} 6 | 7 | -------------------------------------------------------------------------------- /Lec3/Lec3.bib: -------------------------------------------------------------------------------- 1 | @book{intel, title={Intel 64 and IA-32 Architecture Software Developer’s Manual}, volume={3A}, author={Intel}, ordernum={253668-060US}, year={2016}, month={Sep}} 2 | @book{bovet_cesati_2006, place={Sebastopol, CA}, title={Understanding the Linux kernel}, publisher={O'Reilly}, author={Bovet, Daniel P. and Cesati, Marco}, year={2006}} 3 | @misc{secb, title={UEFI Secure Boot in Modern Computer Security Solutions}, url={http://www.uefi.org/sites/default/files/resources/UEFI_Secure_Boot_in_Modern_Computer_Security_Solutions_2013.pdf}, journal={UEFI}, year={2013}} 4 | @misc{gpt, title={GUID Partition Table}, url={https://en.wikipedia.org/wiki/GUID_Partition_Table}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Mar}} 5 | @misc{PK, title={Take Control of Your PC with UEFI Secure Boot}, url={http://www.linuxjournal.com/content/take-control-your-pc-uefi-secure-boot}, journal={Take Control of Your PC with UEFI Secure Boot | Linux Journal}} 6 | @book{uefispec, edition={2.6}, title={Unified Extensible Firmware Interface Specification}, author={Org, UEFI}, year={2016}} 7 | @misc{uefsecboot, title={UEFI and "secure boot"}, url={https://lwn.net/Articles/447381/}, journal={[LWN.net]}, author={Jake Edge}, year={2015}, month={Jun}} 8 | @misc{kumar_kumar_2007, 
title={Vbootkit: Compromising Windows Vista Security}, url={https://www.blackhat.com/presentations/bh-europe-07/Kumar/Whitepaper/bh-eu-07-Kumar-WP-apr19.pdf}, journal={Blackhat}, author={Kumar, Nitin and Kumar, Vipin}, year={2007}} -------------------------------------------------------------------------------- /Lec1/Lec1.bib: -------------------------------------------------------------------------------- 1 | @misc{duarte_boot, title={How Computers Boot Up}, url={https://manybutfinite.com/post/how-computers-boot-up/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Jun}} 2 | @misc{duarte_mmap, title={Motherboard Chipsets and the Memory Map}, url={https://manybutfinite.com/post/motherboard-chipsets-memory-map/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Jun}} 3 | @book{iapx_manual_1985, place={Santa Clara, CA}, title={IAPX 286: programmers reference manual}, publisher={Intel Corp.}, year={1985}} 4 | @misc{ibm, title={Inside the Linux boot process}, url={https://www.ibm.com/developerworks/library/l-linuxboot/index.html}, journal={IBM - United States}, year={2006}, month={May}} 5 | @misc{wikipedia_mbr, title={Master boot record}, url={https://en.wikipedia.org/wiki/Master_boot_record}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 6 | @misc{wikipedia_rv, title={Reset vector}, url={https://en.wikipedia.org/wiki/Reset_vector}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 7 | @misc{ATX, title={ATX}, url={https://en.wikipedia.org/wiki/ATX}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 8 | @misc{PG, title={Power Good Signal}, url={http://www.tomshardware.co.uk/power-supply-specifications-atx-reference,review-32338-2.html}, journal={Tom's Hardware}, publisher={Don Woligroski}, year={2011}, month={Dec}} 9 | @book{linins, title={Linux Inside}, publisher={0xAX.}, year={2017}} 10 | @misc{wikipedia_ebr, title={Extended 
boot record}, url={https://en.wikipedia.org/wiki/Extended_boot_record}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 11 | @misc{wikipedia_bs, title={Boot sector}, url={https://en.wikipedia.org/wiki/Boot_sector}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 12 | @misc{wind, title={Troubleshooting Disks and File Systems}, url={https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-xp/bb457122(v=technet.10)}, journal={Windows Doc.}, publisher={Windows}, year={2009}} 13 | @misc{boot, title={The Hardware Boot Process: Operating System Independent}, url={http://www.tomshardware.com/reviews/pc-repair-upgrade-maintenance-testing,3629-7.html}, journal={Tom's Hardware}, publisher={TH}, year={2014}} 14 | @misc{bios, title={BIOS Startup}, url={https://www.redhat.com/archives/rhl-list/2007-August/msg03384.html}, journal={Red Hat Mailing List}, publisher={RH}, year={2007}} 15 | @misc{biosbasics, title={BIOS Basics}, url={http://www.bioscentral.com/misc/biosbasics.htm}, journal={BIOS Central}, publisher={BC}, year={2017}} -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AOSV 2 | 3 | #### :warning: This repository is (sadly) not active anymore. 4 | 5 | Feel free to fork and extend it. 6 | 7 | A useful (if somewhat outdated) resource on Linux kernel concepts can be 8 | found in 9 | [A Heavily-Commented Linux Kernel Source Code](http://www.oldlinux.org/download/ECLK-5.0-WithCover.pdf) 10 | 11 | 12 | ### :boom: [Related Project](https://github.com/beabevi/LKM-Fibers) 13 | 14 | The **project** developed for the course can be found [here](https://github.com/beabevi/LKM-Fibers). 15 | It covers several kernel programming topics and is relatively succinct 16 | in its implementation. 
17 | 18 | ------------ 19 | 20 | The repository holds the lecture notes for the Spring 2018 edition of the Advanced Operating 21 | Systems and Virtualization course taught by Alessandro Pellegrini. There are two 22 | branches, `online` and `master`. The former holds the notes as 23 | taken in class, while the latter holds the notes rewritten with the help of 24 | supplementary material, classmates and the professor. 25 | 26 | The repo is a bit messy in its current state, and the `online` version of 27 | lectures <= 5 does not really exist. 28 | 29 | The reference sections of some lectures contain material unrelated to the 30 | lecture, simply because new folders were created by copying and pasting older ones. 31 | 32 | 33 | ## Browsing the Linux Kernel 34 | 35 | * [Linux Cross Reference (LXR)](https://elixir.bootlin.com/) 36 | * [livegrep](https://livegrep.com/search/linux) 37 | * [ripgrep](https://github.com/BurntSushi/ripgrep/) 38 | * [Code Browsing](https://kernelnewbies.org/FAQ/CodeBrowsing) 39 | 40 | ## Tagged Content 41 | 42 | 43 | * **Lec1**: course information, boot sequence introduction, master boot record, 44 | BIOS 45 | * **Lec2**: A20 line, protected mode, LDT, GDT, protection, IDT, privilege 46 | level switch, Task State Segment (TSS) 47 | * **Lec3**: protected mode paging, i386 paging, PDE/PTE fields, PAE, addressing 48 | in long mode, Translation Lookaside Buffer (TLB), long mode enable, 49 | Linux Boot i386 (< v.2.6), UEFI, GUID Partitioning Scheme, Secure 50 | Boot, Bootkits, Platform Key (PK), Key Exchange Key (KEK), Signature 51 | Database, Multicore Booting (SMP), APIC, INIT-SIPI 52 | * **Lec4**: Linux Boot Protocol, Kernel Initialization, `header.S`, `main`, 53 | `go_to_protected_mode`, GDT/IDT dummy setup, `protected_mode_jump`, 54 | `head_{32,64}.S`, `startup_{32,64}`, `start_kernel`, Inline Assembly, 55 | `volatile`, `asmlinkage`, regparm, `__visible`, `__init` 56 | * **Lec5**: Early paging setup (32-bit), bootmem allocator, paging in Linux, kernel 57 | page 
table initialization, TLB APIs, NUMA 58 | * **Lec6**: Memory Management, Zones, `mem_map`, Buddy System, Buddy 59 | Allocation/Deallocation APIs, High Memory (`HIGHMEM`), `vmap`, `kmap`, 60 | `kmap_atomic`, NUMA Allocation Policies 61 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ## Core latex/pdflatex auxiliary files: 2 | *.aux 3 | *.lof 4 | *.log 5 | *.lot 6 | *.fls 7 | *.out 8 | *.toc 9 | *.fmt 10 | *.fot 11 | *.cb 12 | *.cb2 13 | .*.lb 14 | 15 | ## Intermediate documents: 16 | *.dvi 17 | *.xdv 18 | *-converted-to.* 19 | # these rules might exclude image files for figures etc. 20 | # *.ps 21 | # *.eps 22 | # *.pdf 23 | 24 | ## Generated if empty string is given at "Please type another file name for output:" 25 | .pdf 26 | 27 | ## Bibliography auxiliary files (bibtex/biblatex/biber): 28 | *.bbl 29 | *.bcf 30 | *.blg 31 | *-blx.aux 32 | *-blx.bib 33 | *.run.xml 34 | 35 | ## Build tool auxiliary files: 36 | *.fdb_latexmk 37 | *.synctex 38 | *.synctex(busy) 39 | *.synctex.gz 40 | *.synctex.gz(busy) 41 | *.pdfsync 42 | 43 | ## Auxiliary and intermediate files from other packages: 44 | # algorithms 45 | *.alg 46 | *.loa 47 | 48 | # achemso 49 | acs-*.bib 50 | 51 | # amsthm 52 | *.thm 53 | 54 | # beamer 55 | *.nav 56 | *.pre 57 | *.snm 58 | *.vrb 59 | 60 | # changes 61 | *.soc 62 | 63 | # cprotect 64 | *.cpt 65 | 66 | # elsarticle (documentclass of Elsevier journals) 67 | *.spl 68 | 69 | # endnotes 70 | *.ent 71 | 72 | # fixme 73 | *.lox 74 | 75 | # feynmf/feynmp 76 | *.mf 77 | *.mp 78 | *.t[1-9] 79 | *.t[1-9][0-9] 80 | *.tfm 81 | 82 | #(r)(e)ledmac/(r)(e)ledpar 83 | *.end 84 | *.?end 85 | *.[1-9] 86 | *.[1-9][0-9] 87 | *.[1-9][0-9][0-9] 88 | *.[1-9]R 89 | *.[1-9][0-9]R 90 | *.[1-9][0-9][0-9]R 91 | *.eledsec[1-9] 92 | *.eledsec[1-9]R 93 | *.eledsec[1-9][0-9] 94 | *.eledsec[1-9][0-9]R 95 | *.eledsec[1-9][0-9][0-9] 96 | 
*.eledsec[1-9][0-9][0-9]R 97 | 98 | # glossaries 99 | *.acn 100 | *.acr 101 | *.glg 102 | *.glo 103 | *.gls 104 | *.glsdefs 105 | 106 | # gnuplottex 107 | *-gnuplottex-* 108 | 109 | # gregoriotex 110 | *.gaux 111 | *.gtex 112 | 113 | # htlatex 114 | *.4ct 115 | *.4tc 116 | *.idv 117 | *.lg 118 | *.trc 119 | *.xref 120 | 121 | # hyperref 122 | *.brf 123 | 124 | # knitr 125 | *-concordance.tex 126 | # TODO Comment the next line if you want to keep your tikz graphics files 127 | *.tikz 128 | *-tikzDictionary 129 | 130 | # listings 131 | *.lol 132 | 133 | # makeidx 134 | *.idx 135 | *.ilg 136 | *.ind 137 | *.ist 138 | 139 | # minitoc 140 | *.maf 141 | *.mlf 142 | *.mlt 143 | *.mtc[0-9]* 144 | *.slf[0-9]* 145 | *.slt[0-9]* 146 | *.stc[0-9]* 147 | 148 | # minted 149 | _minted* 150 | *.pyg 151 | 152 | # morewrites 153 | *.mw 154 | 155 | # nomencl 156 | *.nlo 157 | 158 | # pax 159 | *.pax 160 | 161 | # pdfpcnotes 162 | *.pdfpc 163 | 164 | # sagetex 165 | *.sagetex.sage 166 | *.sagetex.py 167 | *.sagetex.scmd 168 | 169 | # scrwfile 170 | *.wrt 171 | 172 | # sympy 173 | *.sout 174 | *.sympy 175 | sympy-plots-for-*.tex/ 176 | 177 | # pdfcomment 178 | *.upa 179 | *.upb 180 | 181 | # pythontex 182 | *.pytxcode 183 | pythontex-files-*/ 184 | 185 | # thmtools 186 | *.loe 187 | 188 | # TikZ & PGF 189 | *.dpth 190 | *.md5 191 | *.auxlock 192 | 193 | # todonotes 194 | *.tdo 195 | 196 | # easy-todo 197 | *.lod 198 | 199 | # xindy 200 | *.xdy 201 | 202 | # xypic precompiled matrices 203 | *.xyc 204 | 205 | # endfloat 206 | *.ttt 207 | *.fff 208 | 209 | # Latexian 210 | TSWLatexianTemp* 211 | 212 | ## Editors: 213 | # WinEdt 214 | *.bak 215 | *.sav 216 | 217 | # Texpad 218 | .texpadtmp 219 | 220 | # Kile 221 | *.backup 222 | 223 | # KBibTeX 224 | *~[0-9]* 225 | 226 | # auto folder when using emacs and auctex 227 | ./auto/* 228 | *.el 229 | 230 | # expex forward references with \gathertags 231 | *-tags.tex 232 | 233 | # standalone packages 234 | *.sta 235 | Slides/ 236 | Resources/ 237 
| to-do 238 | -------------------------------------------------------------------------------- /Lec2/Lec2.bib: -------------------------------------------------------------------------------- 1 | @misc{osdev, title={"8042" PS/2 Controller}, url={https://wiki.osdev.org/"8042"_PS/2_Controller}, journal={OSDev Wiki}} 2 | @misc{a20, title={A20 Line}, url={https://wiki.osdev.org/A20_Line}, journal={A20 Line - OSDev Wiki}} 3 | @misc{wikipedia_2018, title={A20 line}, url={https://en.wikipedia.org/wiki/A20_line}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 4 | @misc{brouwer, title={A20 - a pain from the past}, url={https://www.win.tue.nl/~aeb/linux/kbd/A20.html}, journal={A20 - a pain from the past}, author={Brouwer, Andries E.}} 5 | @misc{collins, title={A20 - Reset Anomalies}, url={http://www.rcollins.org/Productivity/A20Reset.html}, journal={A20/RESET ANOMALIES}, author={Collins, Robert}} 6 | @misc{duarte_2008, title={Memory Translation and Segmentation}, url={https://manybutfinite.com/post/memory-translation-and-segmentation/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Aug}} 7 | @misc{duarte_2017, title={CPU Rings, Privilege, and Protection}, url={https://manybutfinite.com/post/cpu-rings-privilege-and-protection/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Nov}} 8 | @misc{rmaddr, title={The Workings of: x86-16/32 RealMode Addressing}, url={https://web.archive.org/web/20130609073242/http://www.osdever.net/tutorials/rm_addressing.php?the_id=50}, journal={The Workings of: x86-16/32 RealMode Addressing}, publisher={Bona Fide OS Development -}} 9 | @misc{wrapar, title={Who needs the address wraparound, anyway?}, url={http://www.os2museum.com/wp/who-needs-the-address-wraparound-anyway/}, journal={OS2 Museum}} 10 | @misc{carn, title={Writing a Bootloader from Scratch}, url={https://www.cs.cmu.edu/~410-s07/p4/p4-boot.pdf}, publisher={Carnegie Mellon, Computer Science Department}, 
pages={9}} 11 | @book{intel, title={Intel 64 and IA-32 Architecture Software Developer’s Manual}, volume={3A}, author={Intel}} 12 | @misc{descriptor, title={Interrupt Descriptor Table}, url={https://wiki.osdev.org/Interrupt_Descriptor_Table}, journal={Interrupt Descriptor Table - OSDev Wiki}} 13 | @misc{service, title={Interrupt Service Routines}, url={https://wiki.osdev.org/Interrupt_Service_Routines}, journal={Interrupt Service Routines - OSDev Wiki}} 14 | @misc{vector, title={Interrupt Vector Table}, url={https://wiki.osdev.org/Interrupt_Vector_Table}, journal={Interrupt Vector Table - OSDev Wiki}} 15 | @misc{oostenrijk_2016, title={Writing your own toy operating system: Jumping to protected mode}, url={http://www.independent-software.com/writing-your-own-toy-operating-system-jumping-to-protected-mode/}, journal={Independent Software}, author={Oostenrijk, Alexander van}, year={2016}, month={Sep}} 16 | @misc{collins2, title={The Segment Descriptor Cache}, url={http://www.rcollins.org/ddj/Aug98/Aug98.html}, journal={The Segment Descriptor Cache}, author={Collins, Robert}} 17 | @misc{descache, title={Descriptor Cache}, url={https://wiki.osdev.org/Descriptor_Cache}, journal={Descriptor Cache - OSDev Wiki}} 18 | @misc{wik, title={x86 memory segmentation}, url={https://en.wikipedia.org/wiki/X86_memory_segmentation#cite_note-Arch-1}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Mar}} 19 | @book{bovet_cesati_2006, place={Sebastopol, CA}, title={Understanding the Linux kernel}, publisher={O'Reilly}, author={Bovet, Daniel P. 
and Cesati, Marco}, year={2006}} 20 | @misc{context, title={Context Switching}, url={https://wiki.osdev.org/Context_Switching}, journal={Context Switching - OSDev Wiki}} 21 | @misc{taskss, title={Task State Segment}, url={https://wiki.osdev.org/Task_State_Segment}, journal={Task State Segment - OSDev Wiki}} 22 | @misc{wikipedia_2017, title={Task state segment}, url={https://en.wikipedia.org/wiki/Task_state_segment}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2017}, month={Jun}} 23 | @misc{x86_linux, title={Why doesn't Linux use the hardware context switch via the TSS?}, url={https://stackoverflow.com/questions/2711044/why-doesnt-linux-use-the-hardware-context-switch-via-the-tss}, journal={x86 - Why doesn't Linux use the hardware context switch via the TSS? - Stack Overflow}} 24 | -------------------------------------------------------------------------------- /Lec7/Lec7.bib: -------------------------------------------------------------------------------- 1 | @misc{osdev, title={"8042" PS/2 Controller}, url={https://wiki.osdev.org/"8042"_PS/2_Controller}, journal={OSDev Wiki}} 2 | @misc{a20, title={A20 Line}, url={https://wiki.osdev.org/A20_Line}, journal={A20 Line - OSDev Wiki}} 3 | @misc{wikipedia_2018, title={A20 line}, url={https://en.wikipedia.org/wiki/A20_line}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 4 | @misc{brouwer, title={A20 - a pain from the past}, url={https://www.win.tue.nl/~aeb/linux/kbd/A20.html}, journal={A20 - a pain from the past}, author={Brouwer, Andries E.}} 5 | @misc{collins, title={A20 - Reset Anomalies}, url={http://www.rcollins.org/Productivity/A20Reset.html}, journal={A20/RESET ANOMALIES}, author={Collins, Robert}} 6 | @misc{duarte_2008, title={Memory Translation and Segmentation}, url={https://manybutfinite.com/post/memory-translation-and-segmentation/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Aug}} 7 | @misc{duarte_2017, title={CPU Rings, 
Privilege, and Protection}, url={https://manybutfinite.com/post/cpu-rings-privilege-and-protection/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Nov}} 8 | @misc{rmaddr, title={The Workings of: x86-16/32 RealMode Addressing}, url={https://web.archive.org/web/20130609073242/http://www.osdever.net/tutorials/rm_addressing.php?the_id=50}, journal={The Workings of: x86-16/32 RealMode Addressing}, publisher={Bona Fide OS Development -}} 9 | @misc{wrapar, title={Who needs the address wraparound, anyway?}, url={http://www.os2museum.com/wp/who-needs-the-address-wraparound-anyway/}, journal={OS2 Museum}} 10 | @misc{carn, title={Writing a Bootloader from Scratch}, url={https://www.cs.cmu.edu/~410-s07/p4/p4-boot.pdf}, publisher={Carnegie Mellon, Computer Science Department}, pages={9}} 11 | @book{intel, title={Intel 64 and IA-32 Architecture Software Developer’s Manual}, volume={3A}, author={Intel}} 12 | @misc{descriptor, title={Interrupt Descriptor Table}, url={https://wiki.osdev.org/Interrupt_Descriptor_Table}, journal={Interrupt Descriptor Table - OSDev Wiki}} 13 | @misc{service, title={Interrupt Service Routines}, url={https://wiki.osdev.org/Interrupt_Service_Routines}, journal={Interrupt Service Routines - OSDev Wiki}} 14 | @misc{vector, title={Interrupt Vector Table}, url={https://wiki.osdev.org/Interrupt_Vector_Table}, journal={Interrupt Vector Table - OSDev Wiki}} 15 | @misc{oostenrijk_2016, title={Writing your own toy operating system: Jumping to protected mode}, url={http://www.independent-software.com/writing-your-own-toy-operating-system-jumping-to-protected-mode/}, journal={Independent Software}, author={Oostenrijk, Alexander van}, year={2016}, month={Sep}} 16 | @misc{collins2, title={The Segment Descriptor Cache}, url={http://www.rcollins.org/ddj/Aug98/Aug98.html}, journal={The Segment Descriptor Cache}, author={Collins, Robert}} 17 | @misc{descache, title={Descriptor Cache}, url={https://wiki.osdev.org/Descriptor_Cache}, 
journal={Descriptor Cache - OSDev Wiki}} 18 | @misc{wik, title={x86 memory segmentation}, url={https://en.wikipedia.org/wiki/X86_memory_segmentation#cite_note-Arch-1}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Mar}} 19 | @book{bovet_cesati_2006, place={Sebastopol, CA}, title={Understanding the Linux kernel}, publisher={OReilly}, author={Bovet, Daniel P. and Cesati, Marco}, year={2006}} 20 | @misc{context, title={Context Switching}, url={https://wiki.osdev.org/Context_Switching}, journal={Context Switching - OSDev Wiki}} 21 | @misc{taskss, title={Task State Segment}, url={https://wiki.osdev.org/Task_State_Segment}, journal={Task State Segment - OSDev Wiki}} 22 | @misc{wikipedia_2017, title={Task state segment}, url={https://en.wikipedia.org/wiki/Task_state_segment}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2017}, month={Jun}} 23 | @misc{x86_linux, title={Why doesn't Linux use the hardware context switch via the TSS?}, url={https://stackoverflow.com/questions/2711044/why-doesnt-linux-use-the-hardware-context-switch-via-the-tss}, journal={x86 - Why doesn't Linux use the hardware context switch via the TSS? 
- Stack Overflow}} 24 | -------------------------------------------------------------------------------- /Lec8/Lec8.bib: -------------------------------------------------------------------------------- 1 | @misc{osdev, title={"8042" PS/2 Controller}, url={https://wiki.osdev.org/"8042"_PS/2_Controller}, journal={OSDev Wiki}} 2 | @misc{a20, title={A20 Line}, url={https://wiki.osdev.org/A20_Line}, journal={A20 Line - OSDev Wiki}} 3 | @misc{wikipedia_2018, title={A20 line}, url={https://en.wikipedia.org/wiki/A20_line}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 4 | @misc{brouwer, title={A20 - a pain from the past}, url={https://www.win.tue.nl/~aeb/linux/kbd/A20.html}, journal={A20 - a pain from the past}, author={Brouwer, Andries E.}} 5 | @misc{collins, title={A20 - Reset Anomalies}, url={http://www.rcollins.org/Productivity/A20Reset.html}, journal={A20/RESET ANOMALIES}, author={Collins, Robert}} 6 | @misc{duarte_2008, title={Memory Translation and Segmentation}, url={https://manybutfinite.com/post/memory-translation-and-segmentation/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Aug}} 7 | @misc{duarte_2017, title={CPU Rings, Privilege, and Protection}, url={https://manybutfinite.com/post/cpu-rings-privilege-and-protection/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Nov}} 8 | @misc{rmaddr, title={The Workings of: x86-16/32 RealMode Addressing}, url={https://web.archive.org/web/20130609073242/http://www.osdever.net/tutorials/rm_addressing.php?the_id=50}, journal={The Workings of: x86-16/32 RealMode Addressing}, publisher={Bona Fide OS Development -}} 9 | @misc{wrapar, title={Who needs the address wraparound, anyway?}, url={http://www.os2museum.com/wp/who-needs-the-address-wraparound-anyway/}, journal={OS2 Museum}} 10 | @misc{carn, title={Writing a Bootloader from Scratch}, url={https://www.cs.cmu.edu/~410-s07/p4/p4-boot.pdf}, publisher={Carnegie Mellon, Computer Science 
Department}, pages={9}} 11 | @book{intel, title={Intel 64 and IA-32 Architecture Software Developer’s Manual}, volume={3A}, author={Intel}} 12 | @misc{descriptor, title={Interrupt Descriptor Table}, url={https://wiki.osdev.org/Interrupt_Descriptor_Table}, journal={Interrupt Descriptor Table - OSDev Wiki}} 13 | @misc{service, title={Interrupt Service Routines}, url={https://wiki.osdev.org/Interrupt_Service_Routines}, journal={Interrupt Service Routines - OSDev Wiki}} 14 | @misc{vector, title={Interrupt Vector Table}, url={https://wiki.osdev.org/Interrupt_Vector_Table}, journal={Interrupt Vector Table - OSDev Wiki}} 15 | @misc{oostenrijk_2016, title={Writing your own toy operating system: Jumping to protected mode}, url={http://www.independent-software.com/writing-your-own-toy-operating-system-jumping-to-protected-mode/}, journal={Independent Software}, author={Oostenrijk, Alexander van}, year={2016}, month={Sep}} 16 | @misc{collins2, title={The Segment Descriptor Cache}, url={http://www.rcollins.org/ddj/Aug98/Aug98.html}, journal={The Segment Descriptor Cache}, author={Collins, Robert}} 17 | @misc{descache, title={Descriptor Cache}, url={https://wiki.osdev.org/Descriptor_Cache}, journal={Descriptor Cache - OSDev Wiki}} 18 | @misc{wik, title={x86 memory segmentation}, url={https://en.wikipedia.org/wiki/X86_memory_segmentation#cite_note-Arch-1}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Mar}} 19 | @book{bovet_cesati_2006, place={Sebastopol, CA}, title={Understanding the Linux kernel}, publisher={OReilly}, author={Bovet, Daniel P. 
and Cesati, Marco}, year={2006}} 20 | @misc{context, title={Context Switching}, url={https://wiki.osdev.org/Context_Switching}, journal={Context Switching - OSDev Wiki}} 21 | @misc{taskss, title={Task State Segment}, url={https://wiki.osdev.org/Task_State_Segment}, journal={Task State Segment - OSDev Wiki}} 22 | @misc{wikipedia_2017, title={Task state segment}, url={https://en.wikipedia.org/wiki/Task_state_segment}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2017}, month={Jun}} 23 | @misc{x86_linux, title={Why doesn't Linux use the hardware context switch via the TSS?}, url={https://stackoverflow.com/questions/2711044/why-doesnt-linux-use-the-hardware-context-switch-via-the-tss}, journal={x86 - Why doesn't Linux use the hardware context switch via the TSS? - Stack Overflow}} 24 | -------------------------------------------------------------------------------- /Lec9/Lec9.bib: -------------------------------------------------------------------------------- 1 | @misc{osdev, title={"8042" PS/2 Controller}, url={https://wiki.osdev.org/"8042"_PS/2_Controller}, journal={OSDev Wiki}} 2 | @misc{a20, title={A20 Line}, url={https://wiki.osdev.org/A20_Line}, journal={A20 Line - OSDev Wiki}} 3 | @misc{wikipedia_2018, title={A20 line}, url={https://en.wikipedia.org/wiki/A20_line}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Feb}} 4 | @misc{brouwer, title={A20 - a pain from the past}, url={https://www.win.tue.nl/~aeb/linux/kbd/A20.html}, journal={A20 - a pain from the past}, author={Brouwer, Andries E.}} 5 | @misc{collins, title={A20 - Reset Anomalies}, url={http://www.rcollins.org/Productivity/A20Reset.html}, journal={A20/RESET ANOMALIES}, author={Collins, Robert}} 6 | @misc{duarte_2008, title={Memory Translation and Segmentation}, url={https://manybutfinite.com/post/memory-translation-and-segmentation/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Aug}} 7 | @misc{duarte_2017, title={CPU Rings, 
Privilege, and Protection}, url={https://manybutfinite.com/post/cpu-rings-privilege-and-protection/}, journal={Many But Finite}, author={Duarte, Gustavo}, year={2008}, month={Nov}} 8 | @misc{rmaddr, title={The Workings of: x86-16/32 RealMode Addressing}, url={https://web.archive.org/web/20130609073242/http://www.osdever.net/tutorials/rm_addressing.php?the_id=50}, journal={The Workings of: x86-16/32 RealMode Addressing}, publisher={Bona Fide OS Development -}} 9 | @misc{wrapar, title={Who needs the address wraparound, anyway?}, url={http://www.os2museum.com/wp/who-needs-the-address-wraparound-anyway/}, journal={OS2 Museum}} 10 | @misc{carn, title={Writing a Bootloader from Scratch}, url={https://www.cs.cmu.edu/~410-s07/p4/p4-boot.pdf}, publisher={Carnegie Mellon, Computer Science Department}, pages={9}} 11 | @book{intel, title={Intel 64 and IA-32 Architecture Software Developer’s Manual}, volume={3A}, author={Intel}} 12 | @misc{descriptor, title={Interrupt Descriptor Table}, url={https://wiki.osdev.org/Interrupt_Descriptor_Table}, journal={Interrupt Descriptor Table - OSDev Wiki}} 13 | @misc{service, title={Interrupt Service Routines}, url={https://wiki.osdev.org/Interrupt_Service_Routines}, journal={Interrupt Service Routines - OSDev Wiki}} 14 | @misc{vector, title={Interrupt Vector Table}, url={https://wiki.osdev.org/Interrupt_Vector_Table}, journal={Interrupt Vector Table - OSDev Wiki}} 15 | @misc{oostenrijk_2016, title={Writing your own toy operating system: Jumping to protected mode}, url={http://www.independent-software.com/writing-your-own-toy-operating-system-jumping-to-protected-mode/}, journal={Independent Software}, author={Oostenrijk, Alexander van}, year={2016}, month={Sep}} 16 | @misc{collins2, title={The Segment Descriptor Cache}, url={http://www.rcollins.org/ddj/Aug98/Aug98.html}, journal={The Segment Descriptor Cache}, author={Collins, Robert}} 17 | @misc{descache, title={Descriptor Cache}, url={https://wiki.osdev.org/Descriptor_Cache}, 
journal={Descriptor Cache - OSDev Wiki}} 18 | @misc{wik, title={x86 memory segmentation}, url={https://en.wikipedia.org/wiki/X86_memory_segmentation#cite_note-Arch-1}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2018}, month={Mar}} 19 | @book{bovet_cesati_2006, place={Sebastopol, CA}, title={Understanding the Linux kernel}, publisher={OReilly}, author={Bovet, Daniel P. and Cesati, Marco}, year={2006}} 20 | @misc{context, title={Context Switching}, url={https://wiki.osdev.org/Context_Switching}, journal={Context Switching - OSDev Wiki}} 21 | @misc{taskss, title={Task State Segment}, url={https://wiki.osdev.org/Task_State_Segment}, journal={Task State Segment - OSDev Wiki}} 22 | @misc{wikipedia_2017, title={Task state segment}, url={https://en.wikipedia.org/wiki/Task_state_segment}, journal={Wikipedia}, publisher={Wikimedia Foundation}, year={2017}, month={Jun}} 23 | @misc{x86_linux, title={Why doesn't Linux use the hardware context switch via the TSS?}, url={https://stackoverflow.com/questions/2711044/why-doesnt-linux-use-the-hardware-context-switch-via-the-tss}, journal={x86 - Why doesn't Linux use the hardware context switch via the TSS? 
- Stack Overflow}} 24 | -------------------------------------------------------------------------------- /Lec8/Lec8.tex: -------------------------------------------------------------------------------- 1 | \documentclass[twoside]{article} 2 | \setlength{\oddsidemargin}{-0.5 in} 3 | \setlength{\evensidemargin}{1.5 in} 4 | \setlength{\topmargin}{-0.6 in} 5 | \setlength{\textwidth}{5.5 in} 6 | \setlength{\textheight}{8.5 in} 7 | \setlength{\headsep}{0.5 in} 8 | \setlength{\parindent}{0 in} 9 | \setlength{\parskip}{0.07 in} 10 | \setlength{\marginparwidth}{145pt} 11 | 12 | % 13 | % ADD PACKAGES here: 14 | % 12 15 | 16 | \usepackage{amsmath, 17 | amsfonts, 18 | amssymb, 19 | graphicx, 20 | mathtools, 21 | flexisym, 22 | marginnote, 23 | hyperref, 24 | titlesec} 25 | 26 | \usepackage[english]{babel} 27 | \usepackage[utf8]{inputenc} 28 | \usepackage[shortlabels]{enumitem} 29 | 30 | \graphicspath{ {images/} } 31 | 32 | \hypersetup{ 33 | colorlinks=true, 34 | linkcolor=blue, 35 | filecolor=magenta, 36 | urlcolor=blue, 37 | } 38 | 39 | \titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 40 | \titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 41 | 42 | % 43 | % The following commands set up the lecnum (lecture number) 44 | % counter and make various numbering schemes work relative 45 | % to the lecture number. 
46 | % 47 | \newcounter{lecnum} 48 | \renewcommand{\thepage}{\thelecnum-\arabic{page}} 49 | \renewcommand{\thesection}{\thelecnum.\arabic{section}} 50 | \renewcommand{\theequation}{\thelecnum.\arabic{equation}} 51 | \renewcommand{\thefigure}{\thelecnum.\arabic{figure}} 52 | \renewcommand{\thetable}{\thelecnum.\arabic{table}} 53 | 54 | \newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization} 55 | \newcommand{\wir}{1038137: Web Information Retrieval} 56 | \newcommand{\va}{1052057: Visual Analytics} 57 | \newcommand{\advprog}{1044416: Advanced Programming} 58 | \newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing} 59 | 60 | \newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}} 61 | 62 | 63 | % 64 | % The following macro is used to generate the header. 65 | % 66 | \newcommand{\lecture}[4]{ 67 | \pagestyle{myheadings} 68 | \thispagestyle{plain} 69 | \newpage 70 | \setcounter{lecnum}{#4} 71 | \setcounter{page}{1} 72 | \noindent 73 | \begin{center} 74 | \framebox{ 75 | \vbox{\vspace{2mm} 76 | \hbox to 7.4in { {\bf #1 77 | \hfill Spring 2018} } 78 | \vspace{4mm} 79 | \hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} } 80 | \vspace{2mm} 81 | \hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj} } 82 | \vspace{2mm}} 83 | } 84 | \end{center} 85 | \markboth{Lecture #4: #2}{Lecture #4: #2} 86 | 87 | \iffalse 88 | {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.} 89 | 90 | {\bf Disclaimer}: {\it These notes have not been subjected to the 91 | usual scrutiny reserved for formal publications. They may be distributed 92 | outside this class only with the permission of the Instructor.} 93 | \vspace*{4mm} 94 | \fi 95 | } 96 | % 97 | % Convention for citations is authors' initials followed by the year. 98 | % For example, to cite a paper by Leighton and Maggs you would type 99 | % \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}. 
100 | % (To avoid bibliography problems, for now we redefine the \cite command.) 101 | % Also commands that create a suitable format for the reference list. 102 | \iffalse 103 | \renewcommand{\cite}[1]{[#1]} 104 | \def\beginrefs{\begin{list}% 105 | {[\arabic{equation}]}{\usecounter{equation} 106 | \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}% 107 | \setlength{\labelwidth}{1.6truecm}}} 108 | \def\endrefs{\end{list}} 109 | \def\bibentry#1{\item[\hbox{[#1]}]} 110 | \fi 111 | 112 | %Use this command for a figure; it puts a figure in wherever you want it. 113 | %usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION} 114 | \newcommand{\fig}[3]{ 115 | \vspace{#2} 116 | \begin{center} 117 | Figure \thelecnum.#1:~#3 118 | \end{center} 119 | } 120 | % Use these for theorems, lemmas, proofs, etc. 121 | \newtheorem{theorem}{Theorem}[lecnum] 122 | \newtheorem{lemma}[theorem]{Lemma} 123 | \newtheorem{proposition}[theorem]{Proposition} 124 | \newtheorem{claim}[theorem]{Claim} 125 | \newtheorem{corollary}[theorem]{Corollary} 126 | \newtheorem{definition}[theorem]{Definition} 127 | \newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}} 128 | 129 | % **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE: 130 | 131 | \newcommand\E{\mathbb{E}} 132 | 133 | \begin{document} 134 | 135 | \nocite{*} 136 | 137 | %FILL IN THE RIGHT INFO. 138 | %\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**} 139 | 140 | \lecture{\aosv}{March 27}{Alessandro Pellegrini}{8} 141 | 142 | %\footnotetext{These notes are partially based on those of Nigel Mansell.} 143 | 144 | % **** YOUR NOTES GO HERE: 145 | 146 | Why is \texttt{\_\_syscall\_return} needed? A syscall can either fail or succeed, and the macro lets the user know which: on failure it sets \texttt{errno}. The kernel reports a failure as a small negative value, so the macro casts the result to unsigned before comparing it. The error range then maps to the top of the unsigned range, and a single unsigned comparison replaces the two comparisons that a signed range check would need, making the check faster.
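A minimal sketch of the kind of check such a macro performs, written as a plain C function rather than the kernel's actual macro. The function name and the bound of 125 error codes are assumptions made for illustration; the real bound depends on the kernel version.

```c
#include <errno.h>

/* Hypothetical bound: the largest error code the kernel is assumed to return. */
#define MAX_ERRNO_SKETCH 125UL

/* Convert a raw kernel return value into the libc convention:
 * on failure store the error code in errno and return -1,
 * otherwise return the value unchanged. */
static long syscall_return_sketch(long res)
{
    /* Values in [-MAX_ERRNO_SKETCH, -1] wrap to the very top of the
     * unsigned range, so one unsigned comparison covers the whole
     * error interval. */
    if ((unsigned long)res >= (unsigned long)(-(long)MAX_ERRNO_SKETCH)) {
        errno = (int)-res;
        return -1;
    }
    return res;
}
```

With this sketch a raw result of \texttt{-2} becomes a return value of \texttt{-1} with \texttt{errno} set to 2, while non-negative results and large negative values outside the error range (which may be legitimate addresses) pass through untouched.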
147 | 148 | Only 8 registers are available: ESP cannot be used, EAX holds the code of the syscall, and EBP must be saved before it is used. 149 | 150 | The dispatcher takes a complete snapshot of the CPU registers for two reasons: the dispatcher itself will clobber the content of the registers, so it has to save them; and, when the trap is issued, the calling convention says nothing about which registers the kernel may modify. 151 | 152 | The registers are pushed on the stack and then the syscall is called. If the snapshot is modified, a different context is given back to the user process. 153 | 154 | Resizing the syscall table (a statically allocated vector) requires redoing the whole compilation process, since, for example, Bootmem must be given a new bitmap. 155 | 156 | \texttt{sysenter/sysexit} provide a fast system call path, since \texttt{int \$0x80} carries too much overhead (going through the IDT, GDT, TR etc). 157 | 158 | The standard library first introduced the \texttt{syscall()} interface; support for the instructions above was added to the kernel later. 159 | 160 | Dispatcher 161 | 162 | Model-specific registers speed up entry into kernel mode, since they store the information needed to start executing kernel code. No check is performed on entry. 163 | 164 | Virtual Dynamic Shared Object: implements assembly code to activate/issue the 165 | 166 | \section{Multi-core synchronisation issues} 167 | 168 | When some data structures in the system change, we must ensure that all cores share the same view of data structures and resources. Some issues are addressed by firmware, others are not. 169 | 170 | \marginnote{\textsc{Inter Processor Interrupts (IPI)}} 171 | An IPI requests other cores to perform some action. Firmware generates them but software processes them.
172 | 173 | \newpage 174 | \bibliography{Lec8} 175 | \bibliographystyle{plainnat} 176 | \end{document} -------------------------------------------------------------------------------- /Lec7/Lec7.tex: -------------------------------------------------------------------------------- 1 | \documentclass[twoside]{article} 2 | \setlength{\oddsidemargin}{-0.5 in} 3 | \setlength{\evensidemargin}{1.5 in} 4 | \setlength{\topmargin}{-0.6 in} 5 | \setlength{\textwidth}{5.5 in} 6 | \setlength{\textheight}{8.5 in} 7 | \setlength{\headsep}{0.5 in} 8 | \setlength{\parindent}{0 in} 9 | \setlength{\parskip}{0.07 in} 10 | \setlength{\marginparwidth}{145pt} 11 | 12 | % 13 | % ADD PACKAGES here: 14 | % 12 15 | 16 | \usepackage{amsmath, 17 | amsfonts, 18 | amssymb, 19 | graphicx, 20 | mathtools, 21 | flexisym, 22 | marginnote, 23 | hyperref, 24 | titlesec} 25 | 26 | \usepackage[english]{babel} 27 | \usepackage[utf8]{inputenc} 28 | \usepackage[shortlabels]{enumitem} 29 | 30 | \graphicspath{ {images/} } 31 | 32 | \hypersetup{ 33 | colorlinks=true, 34 | linkcolor=blue, 35 | filecolor=magenta, 36 | urlcolor=blue, 37 | } 38 | 39 | \titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 40 | \titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 41 | 42 | % 43 | % The following commands set up the lecnum (lecture number) 44 | % counter and make various numbering schemes work relative 45 | % to the lecture number. 
46 | % 47 | \newcounter{lecnum} 48 | \renewcommand{\thepage}{\thelecnum-\arabic{page}} 49 | \renewcommand{\thesection}{\thelecnum.\arabic{section}} 50 | \renewcommand{\theequation}{\thelecnum.\arabic{equation}} 51 | \renewcommand{\thefigure}{\thelecnum.\arabic{figure}} 52 | \renewcommand{\thetable}{\thelecnum.\arabic{table}} 53 | 54 | \newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization} 55 | \newcommand{\wir}{1038137: Web Information Retrieval} 56 | \newcommand{\va}{1052057: Visual Analytics} 57 | \newcommand{\advprog}{1044416: Advanced Programming} 58 | \newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing} 59 | 60 | \newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}} 61 | 62 | 63 | % 64 | % The following macro is used to generate the header. 65 | % 66 | \newcommand{\lecture}[4]{ 67 | \pagestyle{myheadings} 68 | \thispagestyle{plain} 69 | \newpage 70 | \setcounter{lecnum}{#4} 71 | \setcounter{page}{1} 72 | \noindent 73 | \begin{center} 74 | \framebox{ 75 | \vbox{\vspace{2mm} 76 | \hbox to 7.4in { {\bf #1 77 | \hfill Spring 2018} } 78 | \vspace{4mm} 79 | \hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} } 80 | \vspace{2mm} 81 | \hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj} } 82 | \vspace{2mm}} 83 | } 84 | \end{center} 85 | \markboth{Lecture #4: #2}{Lecture #4: #2} 86 | 87 | \iffalse 88 | {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.} 89 | 90 | {\bf Disclaimer}: {\it These notes have not been subjected to the 91 | usual scrutiny reserved for formal publications. They may be distributed 92 | outside this class only with the permission of the Instructor.} 93 | \vspace*{4mm} 94 | \fi 95 | } 96 | % 97 | % Convention for citations is authors' initials followed by the year. 98 | % For example, to cite a paper by Leighton and Maggs you would type 99 | % \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}. 
100 | % (To avoid bibliography problems, for now we redefine the \cite command.) 101 | % Also commands that create a suitable format for the reference list. 102 | \iffalse 103 | \renewcommand{\cite}[1]{[#1]} 104 | \def\beginrefs{\begin{list}% 105 | {[\arabic{equation}]}{\usecounter{equation} 106 | \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}% 107 | \setlength{\labelwidth}{1.6truecm}}} 108 | \def\endrefs{\end{list}} 109 | \def\bibentry#1{\item[\hbox{[#1]}]} 110 | \fi 111 | 112 | %Use this command for a figure; it puts a figure in wherever you want it. 113 | %usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION} 114 | \newcommand{\fig}[3]{ 115 | \vspace{#2} 116 | \begin{center} 117 | Figure \thelecnum.#1:~#3 118 | \end{center} 119 | } 120 | % Use these for theorems, lemmas, proofs, etc. 121 | \newtheorem{theorem}{Theorem}[lecnum] 122 | \newtheorem{lemma}[theorem]{Lemma} 123 | \newtheorem{proposition}[theorem]{Proposition} 124 | \newtheorem{claim}[theorem]{Claim} 125 | \newtheorem{corollary}[theorem]{Corollary} 126 | \newtheorem{definition}[theorem]{Definition} 127 | \newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}} 128 | 129 | % **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE: 130 | 131 | \newcommand\E{\mathbb{E}} 132 | 133 | \begin{document} 134 | 135 | \nocite{*} 136 | 137 | %FILL IN THE RIGHT INFO. 138 | %\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**} 139 | 140 | \lecture{\aosv}{March 23}{Alessandro Pellegrini}{7} 141 | 142 | %\footnotetext{These notes are partially based on those of Nigel Mansell.} 143 | 144 | % **** YOUR NOTES GO HERE: 145 | 146 | \iffalse 147 | The bootmem allocator is used for allocating memory in the 8MB addressable by the kernel right after decompression. 148 | \fi 149 | 150 | Spawning many short-lived processes is not handled efficiently by the buddy system. 151 | 152 | Quicklists avoid contention when allocating pages by pre-assigning pages to cores.
The SLAB allocator is used for buffers. 153 | 154 | In quicklists no synchronization is needed. If no entries are available in the list, memory is requested from the buddy system through \texttt{\_\_get\_free\_page}. 155 | 156 | \section{Quicklist Allocation} 157 | 158 | Kernel threads can move across cores: \texttt{get\_cpu\_var} is a wrapper around a set of calls that disables preemption on that specific core until it is re-enabled by \texttt{put\_cpu\_var}. 159 | 160 | \section{SLAB Allocator} 161 | 162 | There is a list of caches where each cache keeps slabs of a specific size. A cache is organised in three types of slabs: full (no buffers left to hand out), partial (some buffers used and some free), free (deallocated or never allocated). Objects are abstractions over pages. 163 | 164 | \subsection{SLAB Interfaces} 165 | 166 | Found in \texttt{linux/malloc.h}. \texttt{kmalloc} asks for a given size and returns the virtual address of a buffer of that size. \texttt{kfree} frees memory allocated via \texttt{kmalloc}. \texttt{kmalloc\_node} is an API to the slab allocator that in the end asks the buddy system of a specific NUMA node to allocate some pages. 167 | 168 | \texttt{kmalloc} should be used for frequent allocations and deallocations of the same size. 169 | 170 | Up to kernel v3.9.11 there was \texttt{struct cache\_sizes}, where \texttt{cs\_cachep} is a pointer to the cache for a given size. There is a table of fixed-size caches, one per supported size. 171 | 172 | In kernel 3.10 we move from the fixed-size table back to a list protected by a spinlock. You can either have one allocator shared across cores or one allocator for each core. Having one allocator for each core requires space. What is the difference between using the buddy system and the slab allocator? The buddy system has one spinlock for each NUMA node, while slab has one spinlock for each cache size. \marginnote{SLAB and Buddy both are for the kernel}[-36 pt] 173 | 174 | Per-node cache coloring: each object is followed by padding, and so on. Why is coloring used? To align objects to \texttt{L1\_CACHE\_BYTES}. Two objects of the same size will not fall in the same cache line.
An object larger than one cache line is padded to a multiple of the cache line size. 175 | 176 | This ensures that two slab objects do not fall in the same cache line and therefore do not contend for it. 177 | 178 | Members that are frequently accessed together (Common Members) are placed close together to optimise cache hits, for example the spinlock and \texttt{slabs\_partial} in \texttt{struct kmem\_cache\_node}. 179 | 180 | (Loosely related fields): because of the false sharing problem, a cache controller, in order to stay coherent, tells the other cache controllers that it is about to write a line, in effect asking for ``mutual exclusion'' on it. With coloring we ensure that different buffers do not fall in the same cache line. 181 | 182 | \subsection{Cache flush operations} 183 | 184 | Similarly to the TLB, cache flushing relies on hardware-specific operations for the granularity and coherency of the flush. There are also problems when the hardware cache is addressed through virtual addresses: two processes using the same virtual address might get a cache hit on a region of memory that is not the physical one they wanted to access. After flushing the cache we must also flush the TLB. 185 | 186 | \begin{description} 187 | \item \texttt{flush\_cache\_all:} Flushes the entire CPU cache system. It is used when global data structures, for example kernel page tables, are changed, to ensure cache coherency. 188 | \item others... 189 | \end{description} 190 | 191 | What is the best way to design a cache, physical or virtual? Intel architectures use virtual addresses to tag the L1 cache. If there is a miss in L1, the TLB is consulted to get the physical address and check L2, which is addressed through physical addresses. There is a protocol between L1 and the TLB to know whether a virtual address is consistent with the current paging scheme. Therefore on Intel we do not need to care about cache consistency.
192 | 193 | Virtual aliasing is the problem described above where we tag the L1 cache through virtual addresses but if the cache is not coherent ... 194 | 195 | The other APIs are low-level APIs used by the operations described above. 196 | 197 | \texttt{copy\_from\_user} and \texttt{copy\_to\_user} ensure that the copy of memory is done correctly, since there might be a process switch and the write might otherwise end up in the memory of another process. 198 | 199 | \texttt{access\_ok} checks whether the memory area passed is correctly mapped for that process. 200 | 201 | \texttt{vmalloc} is used to map some memory in the kernel in a stable way, for memory that will be used for a long time. It gives no guarantee about memory contiguity and no information about the organisation of the physical frames. It is typically used when loading a kernel module, for the code, data etc of the module. It does not rely on either the buddy system or the slab allocator. 202 | 203 | \texttt{virt\_to\_phys} and its inverse are used with \texttt{kmalloc} or \texttt{get\_free\_page} to compute the mapping between physical and virtual addresses. This keeps a kernel module hardware independent, since it does not have to rely on offsets etc. 204 | 205 | The allocation size of \texttt{kmalloc} is limited to 8KB in Linux, that of \texttt{vmalloc} to between 64 and 128MB. Memory from \texttt{kmalloc} is physically contiguous while memory from \texttt{vmalloc} is not. \texttt{vmalloc} transparently invalidates the TLB etc while \texttt{kmalloc} does not. This is done because, for example, when loading kernel modules you want all threads to see the change. 206 | 207 | After setting up all the memory, \texttt{trap\_init()} initialises the IDT and GDT. 208 | The only entry point to kernel land from user space is through the interrupt \texttt{0x80}. The same is done in Windows. 209 | 210 | That interrupt executes the system call dispatcher, which finds out what the user code wants to do. Every system call has a code assigned to it. 211 | 212 | Interrupt gates automatically clear the interrupt flag while trap gates do not. The dispatcher explicitly disables interrupts through the \texttt{cli} instruction.
In multi-core systems this is not enough: we must ensure the correctness of all the data structures of the kernel. Spinlocks (implemented through the \texttt{cmpxchg} instruction) are used to access data structures. 213 | 214 | Since syscalls have preassigned numbers, changing even one of them breaks backward compatibility. 215 | 216 | Macros are used for generating the \texttt{asm volatile} blocks that define a syscall. There are multiple macros, one for each number of syscall parameters (usually at most 6 parameters). 217 | 218 | \newpage 219 | \bibliography{Lec7} 220 | \bibliographystyle{plainnat} 221 | \end{document} -------------------------------------------------------------------------------- /Lec9/Lec9.tex: -------------------------------------------------------------------------------- 1 | \documentclass[twoside]{article} 2 | \setlength{\oddsidemargin}{-0.5 in} 3 | \setlength{\evensidemargin}{1.5 in} 4 | \setlength{\topmargin}{-0.6 in} 5 | \setlength{\textwidth}{5.5 in} 6 | \setlength{\textheight}{8.5 in} 7 | \setlength{\headsep}{0.5 in} 8 | \setlength{\parindent}{0 in} 9 | \setlength{\parskip}{0.07 in} 10 | \setlength{\marginparwidth}{145pt} 11 | 12 | % 13 | % ADD PACKAGES here: 14 | % 12 15 | 16 | \usepackage{amsmath, 17 | amsfonts, 18 | amssymb, 19 | graphicx, 20 | mathtools, 21 | flexisym, 22 | marginnote, 23 | hyperref, 24 | titlesec} 25 | 26 | \usepackage[english]{babel} 27 | \usepackage[utf8]{inputenc} 28 | \usepackage[shortlabels]{enumitem} 29 | 30 | \graphicspath{ {images/} } 31 | 32 | \hypersetup{ 33 | colorlinks=true, 34 | linkcolor=blue, 35 | filecolor=magenta, 36 | urlcolor=blue, 37 | } 38 | 39 | \titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 40 | \titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 41 | 42 | % 43 | % The following commands set up the lecnum (lecture number) 44 | % counter and make various numbering schemes work relative 45 | % to the lecture number.
46 | % 47 | \newcounter{lecnum} 48 | \renewcommand{\thepage}{\thelecnum-\arabic{page}} 49 | \renewcommand{\thesection}{\thelecnum.\arabic{section}} 50 | \renewcommand{\theequation}{\thelecnum.\arabic{equation}} 51 | \renewcommand{\thefigure}{\thelecnum.\arabic{figure}} 52 | \renewcommand{\thetable}{\thelecnum.\arabic{table}} 53 | 54 | \newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization} 55 | \newcommand{\wir}{1038137: Web Information Retrieval} 56 | \newcommand{\va}{1052057: Visual Analytics} 57 | \newcommand{\advprog}{1044416: Advanced Programming} 58 | \newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing} 59 | 60 | \newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}} 61 | 62 | 63 | % 64 | % The following macro is used to generate the header. 65 | % 66 | \newcommand{\lecture}[4]{ 67 | \pagestyle{myheadings} 68 | \thispagestyle{plain} 69 | \newpage 70 | \setcounter{lecnum}{#4} 71 | \setcounter{page}{1} 72 | \noindent 73 | \begin{center} 74 | \framebox{ 75 | \vbox{\vspace{2mm} 76 | \hbox to 7.4in { {\bf #1 77 | \hfill Spring 2018} } 78 | \vspace{4mm} 79 | \hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} } 80 | \vspace{2mm} 81 | \hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj} } 82 | \vspace{2mm}} 83 | } 84 | \end{center} 85 | \markboth{Lecture #4: #2}{Lecture #4: #2} 86 | 87 | \iffalse 88 | {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.} 89 | 90 | {\bf Disclaimer}: {\it These notes have not been subjected to the 91 | usual scrutiny reserved for formal publications. They may be distributed 92 | outside this class only with the permission of the Instructor.} 93 | \vspace*{4mm} 94 | \fi 95 | } 96 | % 97 | % Convention for citations is authors' initials followed by the year. 98 | % For example, to cite a paper by Leighton and Maggs you would type 99 | % \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}. 
100 | % (To avoid bibliography problems, for now we redefine the \cite command.) 101 | % Also commands that create a suitable format for the reference list. 102 | \iffalse 103 | \renewcommand{\cite}[1]{[#1]} 104 | \def\beginrefs{\begin{list}% 105 | {[\arabic{equation}]}{\usecounter{equation} 106 | \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}% 107 | \setlength{\labelwidth}{1.6truecm}}} 108 | \def\endrefs{\end{list}} 109 | \def\bibentry#1{\item[\hbox{[#1]}]} 110 | \fi 111 | 112 | %Use this command for a figure; it puts a figure in wherever you want it. 113 | %usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION} 114 | \newcommand{\fig}[3]{ 115 | \vspace{#2} 116 | \begin{center} 117 | Figure \thelecnum.#1:~#3 118 | \end{center} 119 | } 120 | % Use these for theorems, lemmas, proofs, etc. 121 | \newtheorem{theorem}{Theorem}[lecnum] 122 | \newtheorem{lemma}[theorem]{Lemma} 123 | \newtheorem{proposition}[theorem]{Proposition} 124 | \newtheorem{claim}[theorem]{Claim} 125 | \newtheorem{corollary}[theorem]{Corollary} 126 | \newtheorem{definition}[theorem]{Definition} 127 | \newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}} 128 | 129 | % **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE: 130 | 131 | \newcommand\E{\mathbb{E}} 132 | 133 | \begin{document} 134 | 135 | \nocite{*} 136 | 137 | %FILL IN THE RIGHT INFO. 138 | %\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**} 139 | 140 | \lecture{\aosv}{April 6}{Alessandro Pellegrini}{9} 141 | 142 | %\footnotetext{These notes are partially based on those of Nigel Mansell.} 143 | 144 | % **** YOUR NOTES GO HERE: 145 | \section{Interprocessor Interrupts} 146 | 147 | Interprocessor interrupts (IPIs) are generated at hardware level but processed at software level. Sending is a synchronous operation while receiving is asynchronous: a portion of kernel code running on one core sends a message to another core. 148 | 149 | There are 2 priorities: high and low.
High-priority IPIs are processed immediately (but still asynchronously). Low-priority IPIs are serialized, in a sort of FIFO order. IPIs are sent through the ICR register, whose Destination field selects the core to which the interrupt is sent. The Linux kernel associates entries 251-255 of the IDT with IPIs. 150 | 151 | An interrupt request (IRQ) is raised because some hardware component is asking for the attention of the CPU, whereas a trap deals with an event generated by software. \texttt{init\_IRQ} sets up the interrupts related to the devices installed on the machine. To do so it uses ACPI to discover the devices in the system: it reads the ACPI tables from memory, establishes the IRQs and finalizes the initialization of the IDT. ACPI is a standard used for describing the 152 | hardware. It is composed of a set of vectors and tables, and from their flags and codes the kernel can understand which IRQs need to be installed in the IDT. 153 | 154 | Three main functions manipulate some bits to set up an entry in the IDT. \texttt{set\_trap\_gate} initializes one IDT entry with privilege level 0. \texttt{set\_system\_gate} uses privilege level 3 and is the function that sets up the entry for interrupt \texttt{0x80}. \texttt{set\_intr\_gate} ensures interrupts are disabled before entering the handler. 155 | 156 | \section{Interrupt Management} 157 | \label{sec:intman} 158 | 159 | Entries 0 to 19 are about interrupts raised by the processor itself (exceptions), such as division by 0. These low-level 160 | interrupts are managed by a dispatcher which, depending on the parameters, calls 161 | the right handler and passes it the parameters on the stack, similarly to the way system 162 | calls are handled. When an interrupt is received, a small entry handler is invoked first, and every such handler in turn calls the same single dispatcher. Why is the entry handler needed? The dispatcher expects to find on the stack some information related to the actual handler. Depending on the type of interrupt some parameters may be missing.
The entry handler identifies the actual handler that should be activated and pushes a pointer to it on the stack. The dispatcher saves the context of execution; having a single dispatcher is only 163 | a way to keep the code smaller. 164 | 165 | 166 | The page fault handler doesn't have a \texttt{push 0} because the CPU already pushes some 167 | parameters on the stack. Some interrupts already push some parameters on 168 | the stack and some don't, therefore the entry handler pushes them on their behalf. 169 | 170 | \texttt{error_code} takes a snapshot of the CPU state, similarly to what happens for software traps. The 171 | difference is that the information is laid out on the stack in the same way as the 172 | struct \texttt{pt_regs}. It then calls the actual handler with a pointer 173 | to the struct. 174 | 175 | The \texttt{do_page_fault} handler takes as first parameter the pointer to \texttt{pt_regs} and 176 | then an error code which is pushed by the CPU (a bitmask in which 177 | only the three least significant bits are meaningful) and describes the cause of the 178 | fault. Description at pg 167. In the CPU context you can also inspect the 179 | instruction pointer, which tells which instruction was trying to access 180 | memory but not which address it was trying to access. For that there is one specific 181 | control register, \texttt{cr2}, that is written by the CPU with the 182 | address that triggered the page fault. What we described so far concerns 183 | user space applications. 184 | 185 | \section{Kernel Exception Handling} 186 | \label{sec:Kernel Exception Handling} 187 | 188 | What happens when kernel-mode code tries to access a pointer passed by 189 | user space code? The kernel can use \texttt{verify\_area} to check that area, but there are some 190 | performance issues: much of the user space code uses system calls and these 191 | checks are expensive. At the same time, accessing that memory region directly 192 | may trigger a page fault in kernel mode, which by itself would crash the kernel. When such a fault occurs the kernel activates a fixup that 193 | tries to restore the kernel to a working point.
The kernel code has a label called \texttt{bad_area} that tries to find an instruction from which to restore 194 | the kernel. \texttt{bad_area} reads the saved \texttt{eip} and checks the error code to find out 195 | whether the exception was triggered in kernel mode. It uses this address to 196 | find an instruction that fixes this state. How is the fixup address found? 197 | 198 | Example: \texttt{get\_user} copies some data from user space to kernel space. 199 | 200 | Two non-standard sections are defined in the executable: \texttt{.fixup} and \texttt{\_\_ex\_table} (the exception table); 201 | the former is executable while the latter is read-only. The 202 | compiler will put the fixup code at the end of the executable. Once the exception 203 | is generated, the faulting instruction pointer is looked up in the exception table; the matching entry gives 204 | the fixup address, which is written into the instruction pointer saved in the struct before returning 205 | to the dispatcher. The fixup code just moves a negative value (an error code) 206 | into \texttt{eax}, zeroes \texttt{dl} with a \texttt{xor} (the register meant to receive the data loaded from user space) and 207 | finally jumps to the instruction following the faulting one. 208 | 209 | What we saw was for 32 bit; on 64 bit either the table stores offsets or 210 | its entries are enlarged to 64 bits. 211 | 212 | \section{IPIs} 213 | \label{sec:IPIs} 214 | 215 | There is no way to attach to an IPI a payload telling what has to be done by the 216 | CPU to which the interrupt is delivered. 217 | 218 | Kernel panics use IPIs to freeze all the processors; this IPI doesn't need any 219 | payload. \texttt{INVALIDATE\_TLB\_VECTOR} is a vector in the IDT used to synchronize all cores 220 | when TLB entries must be flushed; it is sent to all cores except the sender. \texttt{CALL\_FUNCTION\_VECTOR} is 221 | used to run some specific function that is passed by the sender (SMP call 222 | function). 223 | How are IPI messages generated? Through the IPI APIs. The kernel uses fixed 224 | memory locations in which the sender can put some data and where receivers look in 225 | order to understand what they have to do. Access to these locations is managed 226 | through a spinlock.
227 | 228 | \texttt{smp_call_function()} takes three parameters: a function pointer, 229 | a buffer and a flag. The flag tells whether the call should be blocking or not 230 | (high or low priority). The function is the specific routine which the receiver 231 | of the interrupt should execute when receiving the IPI. Two global variables 232 | are used to pass the function and its parameters. The first thing done is to check 233 | whether interrupts are disabled. Then the spinlock that protects 234 | the data structure is taken and the global variables are set, together with two atomic variables that 235 | tell how many cores have started to execute the routine and how many have 236 | finished. 237 | 238 | \texttt{syncronize_all} is used to synchronize all cores such that only one core 239 | is executing some code; atomic counters are used here as well. It sets synch enter to 240 | the number of CPUs that should execute the IPI and synch leave to the number 241 | of cores that received the message. Why is preemption disabled? Linux 242 | is a preemptible kernel, and preemption is driven by a timer interrupt. 243 | Preemption can be disabled by masking interrupts (but this is too heavyweight) or 244 | through \texttt{preempt\_disable}, which just increments a counter. The scheduler checks 245 | whether the counter is greater than 0 and if so it just returns instead of 246 | preempting the kernel. This is done since the code that is going to be executed 247 | runs in this thread, and if it were preempted the system would 248 | stop. This approach is mainly used to load kernel modules. 249 | 250 | IPI messages are very heavyweight so they must be used in critical cases only. 251 | 252 | \section{Last steps in kernel initialization} 253 | \label{sec:Last steps in kernel initialization} 254 | 255 | \texttt{rest\_init} finally jumps to \texttt{cpu\_idle}. The \texttt{hlt} instruction brings the processor to a low-power 256 | execution mode in which it does nothing.
When an interrupt is delivered, if a 257 | process needs to be scheduled it is scheduled, otherwise \texttt{hlt} is performed 258 | again. 259 | 260 | \section{Kernel tree addendum} 261 | \label{sec:Kernel tree addendum} 262 | 263 | The first step is configuring the build of the kernel. After compiling the 264 | kernel we can also compile the modules. Then the modules are installed through 265 | \texttt{make modules\_install}, and \texttt{make install} compresses and generates the kernel image. 266 | If wanted, headers can be installed. Finally the initramfs image must be created. 267 | 268 | \texttt{mkinitcpio} is used for this: it builds the initial filesystem. (Memory mapped means 269 | what? ICR?) 270 | 271 | Kbuild files are a variant of Makefiles used by Linux to determine which 272 | files should be compiled into which object for the final kernel image. 273 | 274 | Kbuild: \texttt{obj-y} tells the build to compile an object into the image, while \texttt{obj-m} compiles it as a 275 | module. 276 | 277 | \newpage 278 | \bibliography{Lec9} 279 | \bibliographystyle{plainnat} 280 | \end{document} 281 | -------------------------------------------------------------------------------- /Lec1/Lec1.tex: -------------------------------------------------------------------------------- 1 | \documentclass[twoside]{article} 2 | \setlength{\oddsidemargin}{-0.5 in} 3 | \setlength{\evensidemargin}{1.5 in} 4 | \setlength{\topmargin}{-0.6 in} 5 | \setlength{\textwidth}{5.5 in} 6 | \setlength{\textheight}{8.5 in} 7 | \setlength{\headsep}{0.5 in} 8 | \setlength{\parindent}{0 in} 9 | \setlength{\parskip}{0.07 in} 10 | \setlength{\marginparwidth}{145pt} 11 | 12 | % 13 | % ADD PACKAGES here: 14 | % 12 15 | 16 | \usepackage{amsmath, 17 | amsfonts, 18 | amssymb, 19 | graphicx, 20 | mathtools, 21 | flexisym, 22 | marginnote, 23 | hyperref, 24 | titlesec} 25 | 26 | \usepackage[english]{babel} 27 | \usepackage[utf8]{inputenc} 28 | \usepackage[shortlabels]{enumitem} 29 | 30 | \graphicspath{ {images/} } 31 | 32 | \hypersetup{ 33 | colorlinks=true, 34 | linkcolor=blue, 35
| filecolor=magenta, 36 | urlcolor=blue, 37 | } 38 | 39 | \titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 40 | \titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 41 | 42 | % 43 | % The following commands set up the lecnum (lecture number) 44 | % counter and make various numbering schemes work relative 45 | % to the lecture number. 46 | % 47 | \newcounter{lecnum} 48 | \renewcommand{\thepage}{\thelecnum-\arabic{page}} 49 | \renewcommand{\thesection}{\thelecnum.\arabic{section}} 50 | \renewcommand{\theequation}{\thelecnum.\arabic{equation}} 51 | \renewcommand{\thefigure}{\thelecnum.\arabic{figure}} 52 | \renewcommand{\thetable}{\thelecnum.\arabic{table}} 53 | 54 | \newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization} 55 | \newcommand{\wir}{1038137: Web Information Retrieval} 56 | \newcommand{\va}{1052057: Visual Analytics} 57 | \newcommand{\advprog}{1044416: Advanced Programming} 58 | \newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing} 59 | 60 | \newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}} 61 | 62 | 63 | % 64 | % The following macro is used to generate the header. 65 | % 66 | \newcommand{\lecture}[4]{ 67 | \pagestyle{myheadings} 68 | \thispagestyle{plain} 69 | \newpage 70 | \setcounter{lecnum}{#4} 71 | \setcounter{page}{1} 72 | \noindent 73 | \begin{center} 74 | \framebox{ 75 | \vbox{\vspace{2mm} 76 | \hbox to 7.4in { {\bf #1 77 | \hfill Spring 2018} } 78 | \vspace{4mm} 79 | \hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} } 80 | \vspace{2mm} 81 | \hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj} } 82 | \vspace{2mm}} 83 | } 84 | \end{center} 85 | \markboth{Lecture #4: #2}{Lecture #4: #2} 86 | 87 | \iffalse 88 | {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.} 89 | 90 | {\bf Disclaimer}: {\it These notes have not been subjected to the 91 | usual scrutiny reserved for formal publications. 
They may be distributed 92 | outside this class only with the permission of the Instructor.} 93 | \vspace*{4mm} 94 | \fi 95 | } 96 | % 97 | % Convention for citations is authors' initials followed by the year. 98 | % For example, to cite a paper by Leighton and Maggs you would type 99 | % \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}. 100 | % (To avoid bibliography problems, for now we redefine the \cite command.) 101 | % Also commands that create a suitable format for the reference list. 102 | \iffalse 103 | \renewcommand{\cite}[1]{[#1]} 104 | \def\beginrefs{\begin{list}% 105 | {[\arabic{equation}]}{\usecounter{equation} 106 | \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}% 107 | \setlength{\labelwidth}{1.6truecm}}} 108 | \def\endrefs{\end{list}} 109 | \def\bibentry#1{\item[\hbox{[#1]}]} 110 | \fi 111 | 112 | %Use this command for a figure; it puts a figure in wherever you want it. 113 | %usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION} 114 | \newcommand{\fig}[3]{ 115 | \vspace{#2} 116 | \begin{center} 117 | Figure \thelecnum.#1:~#3 118 | \end{center} 119 | } 120 | % Use these for theorems, lemmas, proofs, etc. 121 | \newtheorem{theorem}{Theorem}[lecnum] 122 | \newtheorem{lemma}[theorem]{Lemma} 123 | \newtheorem{proposition}[theorem]{Proposition} 124 | \newtheorem{claim}[theorem]{Claim} 125 | \newtheorem{corollary}[theorem]{Corollary} 126 | \newtheorem{definition}[theorem]{Definition} 127 | \newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}} 128 | 129 | % **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE: 130 | 131 | \newcommand\E{\mathbb{E}} 132 | 133 | \begin{document} 134 | 135 | \nocite{*} 136 | 137 | %FILL IN THE RIGHT INFO. 
138 | %\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**} 139 | 140 | \lecture{\aosv}{March 2}{Alessandro Pellegrini}{1} 141 | 142 | %\footnotetext{These notes are partially based on those of Nigel Mansell.} 143 | 144 | % **** YOUR NOTES GO HERE: 145 | \section{Course Information} 146 | \subsection{Exam} 147 | The exam is composed of two parts: 148 | \marginnote{\href{https://www.diag.uniroma1.it/~pellegrini/didattica/2017/aosv/0.Introduction.pdf}{Slides 0.}} 149 | \begin{itemize} 150 | \item $[2/5]$ \textbf{Written Part}: 3 theoretical or practical questions 151 | \item $[3/5]$ \textbf{Practical Project}: same project for everyone, to be done individually. Specifications will be given in the middle of the course 152 | \end{itemize} 153 | The two parts must be completed within one year of each other. 154 | In the course we will see various versions of the Linux kernel (2.4, 3.0, 4.0). Any version can be used for the project. The more compatible the project is with the various versions, the better. 155 | \subsection{Course outline} 156 | In this course we will try to understand the internals of an operating system. \marginnote{Slides 0. - pp. 4} We will use the Linux kernel as reference since it is open source and allows us to gain good hands-on experience with the basic principles that apply to operating systems in general. 157 | 158 | As for the hardware, during the course we will use the Intel architecture as reference. 159 | 160 | We will see how to develop kernel modules and how to perform kernel debugging and hot patching. 161 | 162 | \section{Boot Sequence} 163 | After hitting the power button the Boot Sequence starts. It is composed of 6 levels: 164 | \begin{description} 165 | \itemsep0em 166 | \item[BIOS] (Basic Input Output System) code stored in flash ROM on the motherboard that checks which hardware devices are connected to the system etc. It calculates how much RAM is available and performs some consistency checks.
It creates a memory map and a map of all devices installed in the system. \marginnote{ACPI Table: describes what peripherals are installed in the system}[- 36pt] 167 | Finally the BIOS loads the Bootloader. 168 | 169 | Since the BIOS became very convoluted over the years, a new specification was developed: \href{https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface}{\textbf{UEFI}} (Unified Extensible Firmware Interface). UEFI is an attempt to replace the BIOS firmware interface in order to give more programming versatility and other features (also security-wise). 170 | \item[Bootloader Stage 1] It searches among the various devices to load the Bootloader Stage 2. This is done because BS1 does not have enough space to load the system (it is less than 512 Bytes in length). 171 | \item[Bootloader Stage 2] loads the kernel image from storage and executes it. 172 | 173 | \newpage 174 | 175 | \item[Kernel Startup] is performed at two levels: hardware interaction through assembly code and initialization of internal data structures. It spawns the first process: Init. 176 | 177 | \item[Init] Its goal is to start up and configure the environment known as Runlevels/Targets: Desktop Environment etc. 178 | 179 | \item[Runlevels/Targets] subset of services to be executed on startup. 180 | \end{description} 181 | 182 | \section{x86 Initial Booting Sequence} 183 | Intel processors work on different voltage levels and, in order to work, 3 reference voltages (3.3V, 5V, 12V) called \textbf{rails} must be supplied by the Power Supply Unit (PSU). In addition, voltage regulators on the motherboard or in other components convert these standard voltages to others as necessary (for example DDR2 and DDR3 dual inline memory modules (DIMM) require 1.8 V and 1.5 V). After completing internal tests and determining that the power is ready for use, the PSU triggers the Power Good signal to inform the motherboard \cite{ATX,PG}.
184 | 185 | Finally, clocks are derived from a small number of input clocks and oscillator sources and the reset signal is triggered, which gives control to the BIOS. 186 | 187 | \subsection{Real Mode} 188 | One \marginnote{\href{https://www.diag.uniroma1.it/~pellegrini/didattica/2017/aosv/1.Initial-Boot-Sequence.pdf}{Slides 1.}} CPU is dynamically chosen to be the bootstrap processor (BSP) that runs all of the BIOS and kernel initialization code. The remaining processors, called application processors (AP), remain halted at this point until later. In this primitive power-up state the processor is in \textbf{real mode} with memory paging disabled (Paging unit disabled), behaving like the original 1978 Intel 8086. In this mode memory is accessed through \textit{Segmentation-based addressing}. The translation of addresses is managed by the \textbf{Segmentation Unit}, which transforms \textbf{Logical Addresses} into \textbf{Linear Addresses} (which coincide with Physical Addresses in real mode). 189 | \begin{center} 190 | \includegraphics[width=0.8\textwidth]{memtrans.png} 191 | \fig{1}{0 pt}{Address Translation} 192 | \end{center} 193 | 194 | The \marginpar{\textsc{Segment Registers}} original 8086 processor had 16-bit words and this allowed code to work only with $2^{16}$ bytes (64K). In order to increase addressable memory, segment registers were introduced besides general registers (\texttt{AX}, \texttt{BX} etc.) to inform the CPU which chunk of memory a program's instructions were going to work on. There were four segment registers: one for the stack (\texttt{SS}), one for program code (\texttt{CS}), and two for data (\texttt{DS}, \texttt{ES}). Nowadays \textbf{segmentation} is \textbf{still present} and is \textit{always} enabled in x86 processors for backward compatibility. For example, a jump instruction uses the code segment register (\texttt{CS}) whereas a stack push instruction uses the stack segment register (\texttt{SS}).
195 | 196 | Segment registers store 16-bit \textbf{segment selectors}, which in real mode are numbers specifying where a segment starts in physical memory (in protected mode things are different). In this scenario, when a physical address needs to be accessed, say in one of the data segments, we ask for the word in segment \\ \texttt{DS} $:=$ \texttt{0x1000} with offset \texttt{AX} $:=$ \texttt{0x0012} and the address that is accessed is denoted as \texttt{DS:AX}. 197 | 198 | Since physical address pins are costly, and at the time 1 MB of memory was thought to be more than sufficient, Intel decided to limit the addressable space to $2^{20}$ bytes by accessing physical memory through the scheme \texttt{DS} $\times \ 2^4 \ + $ \texttt{AX} instead of using the concatenation of the two registers as the address (20 bits, $2^{20} = 1$ MB, instead of 32 bits, $2^{32} = $ 4GB). 199 | 200 | Real mode segment starts range from 0 all the way to \texttt{0xFFFF0} (16 bytes short of 1 MB) in 16-byte increments. To these values a 16-bit offset (the logical address) between \texttt{0x0} and \texttt{0xFFFF} is added. It \marginpar{\href{https://web.archive.org/web/20130609073242/http://www.osdever.net/tutorials/rm_addressing.php?the_id=50}{A20 Line}} follows that there are multiple segment/offset combinations pointing to the same memory location, and that physical addresses can fall above 1MB if the segment is high enough (Memory Wrap-Around). 201 | 202 | \subsection{BIOS operations} 203 | 204 | \marginnote{ 205 | \includegraphics[scale=0.4]{bootMemReg.png} 206 | \fig{2}{0 pt}{Relevant Physical Memory Regions of Later x86 processors.} 207 | The brown regions are mapped \textbf{away} from RAM. When the processor writes/reads such regions the northbridge routes it to the right device. 208 | } 209 | 210 | The first instruction that is fetched is processor dependent and the address at which it is found is called \textbf{Reset Vector}.
In the case of the 8086 processor this address is \texttt{F000:FFF0} (\texttt{CS:IP}), which corresponds to the physical address $ \texttt{0xF000} \times 16 + \texttt{0xFFF0} = \texttt{0xFFFF0}$, 16 bytes below the top of the 1 MB address space. The 80386 CPU and later Intel processors have predefined data in some CPU registers after a reset: the Instruction Pointer (\texttt{IP}) is set to \texttt{0xFFF0}, \texttt{CS} to \texttt{0xF000} and the \texttt{Base address} in the ``hidden part'' of the Code Segment Register to \texttt{0xFFFF0000}. These values are then used to compute the address of the first instruction that is fetched, \texttt{CS Base address + EIP}, giving the physical address \texttt{0xFFFFFFF0}, which is again 16 bytes short of the top of the 32-bit address space. The motherboard (Northbridge component) then ensures that the instruction at the Reset Vector is a jump to the memory location mapped to the BIOS entry point. This jump clears the hidden base address present at power up. All of these memory locations have the right contents needed by the CPU thanks to the memory map kept by the chipset. They are all mapped to flash memory containing the BIOS. 211 | 212 | The CPU then starts executing BIOS code that initializes some of the hardware in the machine and executes \textbf{POST} code (Power-On Self Test) that tests and initializes various components in the computer. Lack of a working video card fails the POST and causes the process to halt and beep. A portion of the BIOS is dedicated to communication with legacy video cards. 213 | 214 | After the POST the BIOS loads its configuration and then performs \textbf{Shadow RAM Initialization}: it copies itself into RAM for faster access.
215 | The last operation of the BIOS is seeking a boot device, from which it loads the first 512-byte sector (sector zero/boot sector), called \textbf{Master Boot Record}, at address \texttt{0000:7C00}; it then jumps to that address with \texttt{ljmp \$0x0000, \$0x7C00}. 216 | 217 | \subsection{Master Boot Record} 218 | The MBR holds the code of the Bootloader Stage 1 that will load the Bootloader Stage 2. 219 | \begin{center} 220 | \includegraphics[width= 0.7 \textwidth]{masterBootRecord.png} 221 | \end{center} 222 | The partition table entries contain offsets marking the beginning of each of the 4 partitions. At the beginning of each partition there can be one Boot sector. One partition entry can be designated as an \textit{extended partition}, which can be subdivided into a number of logical partitions. Each of the logical partitions within the extended partition is preceded by an Extended Boot Record (EBR) and each EBR has a pointer to the next EBR, forming a linked list. 223 | 224 | The MBR Signature \textit{must be} \texttt{0x55AA} (the last two bytes of the sector, \texttt{0x55} followed by \texttt{0xAA}). 225 | 226 | The initial bytes of the MBR can contain the \textbf{BIOS Parameter Block} (BPB), a data structure describing the physical layout of a data storage volume so that the BIOS knows how to read it. Therefore, after the load, another \texttt{jmp} is performed to skip the BPB. 227 | 228 | Finally interrupts are disabled (\texttt{cli} instruction), since the stack segment is not initialized yet, and all the segment selectors are set to zero to have linear access to physical addresses.
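
As a worked instance of the real-mode translation $\texttt{segment} \times 2^4 + \texttt{offset}$ used throughout this lecture, the rows below apply it to the \texttt{DS:AX} pair of the Real Mode subsection, to the boot sector address \texttt{0000:7C00}, and to two further pairs (\texttt{07C0:0000} and \texttt{FFFF:FFFF}) that are chosen here only for illustration:

\begin{align*}
\texttt{1000:0012} &\;\rightarrow\; \texttt{0x1000} \times 16 + \texttt{0x0012} = \texttt{0x10012} \\
\texttt{0000:7C00} &\;\rightarrow\; \texttt{0x0000} \times 16 + \texttt{0x7C00} = \texttt{0x07C00} \\
\texttt{07C0:0000} &\;\rightarrow\; \texttt{0x07C0} \times 16 + \texttt{0x0000} = \texttt{0x07C00} \\
\texttt{FFFF:FFFF} &\;\rightarrow\; \texttt{0xFFFF} \times 16 + \texttt{0xFFFF} = \texttt{0x10FFEF}
\end{align*}

The second and third rows show two different segment/offset pairs naming the same physical byte, the one at which the boot sector is loaded. The last row is the highest address expressible in real mode: it exceeds $2^{20} - 1 = \texttt{0xFFFFF}$ by \texttt{0xFFF0}, so with only 20 address lines it wraps around to \texttt{0x0FFEF}.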
229 | 230 | \bibliography{Lec1} 231 | \bibliographystyle{plainnat} 232 | \end{document} -------------------------------------------------------------------------------- /Lec5/Lec5.tex: -------------------------------------------------------------------------------- 1 | \documentclass[twoside]{article} 2 | \setlength{\oddsidemargin}{-0.5 in} 3 | \setlength{\evensidemargin}{1.5 in} 4 | \setlength{\topmargin}{-0.6 in} 5 | \setlength{\textwidth}{5.5 in} 6 | \setlength{\textheight}{8.5 in} 7 | \setlength{\headsep}{0.5 in} 8 | \setlength{\parindent}{0 in} 9 | \setlength{\parskip}{0.07 in} 10 | \setlength{\marginparwidth}{145pt} 11 | 12 | % 13 | % ADD PACKAGES here: 14 | % 12 15 | 16 | \usepackage{amsmath, 17 | amsfonts, 18 | amssymb, 19 | graphicx, 20 | mathtools, 21 | flexisym, 22 | marginnote, 23 | hyperref, 24 | titlesec} 25 | 26 | \usepackage[english]{babel} 27 | \usepackage[utf8]{inputenc} 28 | \usepackage[shortlabels]{enumitem} 29 | 30 | \graphicspath{ {images/} } 31 | 32 | \hypersetup{ 33 | colorlinks=true, 34 | linkcolor=blue, 35 | filecolor=magenta, 36 | urlcolor=blue, 37 | } 38 | 39 | \titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 40 | \titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 41 | 42 | % 43 | % The following commands set up the lecnum (lecture number) 44 | % counter and make various numbering schemes work relative 45 | % to the lecture number. 
46 | % 47 | \newcounter{lecnum} 48 | \renewcommand{\thepage}{\thelecnum-\arabic{page}} 49 | \renewcommand{\thesection}{\thelecnum.\arabic{section}} 50 | \renewcommand{\theequation}{\thelecnum.\arabic{equation}} 51 | \renewcommand{\thefigure}{\thelecnum.\arabic{figure}} 52 | \renewcommand{\thetable}{\thelecnum.\arabic{table}} 53 | 54 | \newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization} 55 | \newcommand{\wir}{1038137: Web Information Retrieval} 56 | \newcommand{\va}{1052057: Visual Analytics} 57 | \newcommand{\advprog}{1044416: Advanced Programming} 58 | \newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing} 59 | 60 | \newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}} 61 | 62 | 63 | % 64 | % The following macro is used to generate the header. 65 | % 66 | \newcommand{\lecture}[4]{ 67 | \pagestyle{myheadings} 68 | \thispagestyle{plain} 69 | \newpage 70 | \setcounter{lecnum}{#4} 71 | \setcounter{page}{1} 72 | \noindent 73 | \begin{center} 74 | \framebox{ 75 | \vbox{\vspace{2mm} 76 | \hbox to 7.4in { {\bf #1 77 | \hfill Spring 2018} } 78 | \vspace{4mm} 79 | \hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} } 80 | \vspace{2mm} 81 | \hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj} } 82 | \vspace{2mm}} 83 | } 84 | \end{center} 85 | \markboth{Lecture #4: #2}{Lecture #4: #2} 86 | 87 | \iffalse 88 | {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.} 89 | 90 | {\bf Disclaimer}: {\it These notes have not been subjected to the 91 | usual scrutiny reserved for formal publications. They may be distributed 92 | outside this class only with the permission of the Instructor.} 93 | \vspace*{4mm} 94 | \fi 95 | } 96 | % 97 | % Convention for citations is authors' initials followed by the year. 98 | % For example, to cite a paper by Leighton and Maggs you would type 99 | % \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}. 
100 | % (To avoid bibliography problems, for now we redefine the \cite command.) 101 | % Also commands that create a suitable format for the reference list. 102 | \iffalse 103 | \renewcommand{\cite}[1]{[#1]} 104 | \def\beginrefs{\begin{list}% 105 | {[\arabic{equation}]}{\usecounter{equation} 106 | \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}% 107 | \setlength{\labelwidth}{1.6truecm}}} 108 | \def\endrefs{\end{list}} 109 | \def\bibentry#1{\item[\hbox{[#1]}]} 110 | \fi 111 | 112 | %Use this command for a figure; it puts a figure in wherever you want it. 113 | %usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION} 114 | \newcommand{\fig}[3]{ 115 | \vspace{#2} 116 | \begin{center} 117 | Figure \thelecnum.#1:~#3 118 | \end{center} 119 | } 120 | % Use these for theorems, lemmas, proofs, etc. 121 | \newtheorem{theorem}{Theorem}[lecnum] 122 | \newtheorem{lemma}[theorem]{Lemma} 123 | \newtheorem{proposition}[theorem]{Proposition} 124 | \newtheorem{claim}[theorem]{Claim} 125 | \newtheorem{corollary}[theorem]{Corollary} 126 | \newtheorem{definition}[theorem]{Definition} 127 | \newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}} 128 | 129 | % **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE: 130 | 131 | \newcommand\E{\mathbb{E}} 132 | 133 | \begin{document} 134 | 135 | \nocite{*} 136 | 137 | %FILL IN THE RIGHT INFO. 138 | %\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**} 139 | 140 | \lecture{\aosv}{March 16}{Alessandro Pellegrini}{5} 141 | 142 | %\footnotetext{These notes are partially based on those of Nigel Mansell.} 143 | 144 | % **** YOUR NOTES GO HERE: 145 | 146 | In this lecture we will look into the organization and initialization of memory of the linux i386 kernel v. = 2.4.22. 
147 | 148 | \section{Early Paging} 149 | 150 | As we anticipated in Lecture 3 (Sec 3.4), paging is enabled in \texttt{startup_32} through the following instructions \marginnote{\texttt{ENTRY(swapper_pg_dir) \\ 151 | .long 0x00102007 // pg0 \\ 152 | .long 0x00103007 // pg1 \\ 153 | .fill BOOT_USER_PGD_PTRS-2,4,0 \\ 154 | /* default: 766 entries */ \\ 155 | .long 0x00102007 \\ 156 | .long 0x00103007 \\ 157 | /* default: 254 entries */ \\ 158 | .fill BOOT_KERNEL_PGD_PTRS-2,4,0}}[1cm] 159 | 160 | \begin{verbatim} 161 | > arch/i386/kernel/head.S 162 | movl $swapper_pg_dir-__PAGE_OFFSET, %eax 163 | movl %eax, %cr3 164 | movl %cr0, %eax 165 | orl $0x80000000, %eax 166 | movl %eax, %cr0 167 | \end{verbatim} 168 | 169 | where \texttt{swapper_pg_dir} is a label corresponding to the virtual address of the first level page table and \texttt{__PAGE_OFFSET} is \texttt{0xc0000000} (3GB, the virtual address at which the kernel is mapped in i386). \texttt{\%cr3} is not directly set to \texttt{swapper_pg_dir}, but to the value computed above, which is the physical address of \texttt{swapper_pg_dir} (the difference between virtual and physical addresses of the kernel is just an offset). In \texttt{startup_32} the kernel initializes its page tables 170 | to span only the first 8MB of memory. This is done by initializing two Page Tables (last level) found at labels \texttt{pg0} and \texttt{pg1} and creating 4 entries in the first level page table (pointed to by \texttt{swapper_pg_dir}): entries 0 and \texttt{0x300} (768) contain in the address field the physical address of \texttt{pg0} while entries 1 and \texttt{0x301} (769) contain the 171 | physical address of \texttt{pg1}. When paging is enabled, given this configuration of the page table, both the virtual addresses \texttt{0x0} to \texttt{0x007fffff} (from 0 to 8MB) and \texttt{0xc0000000} to \texttt{0xc07fffff} (from 3GB to 3GB + 8MB) map to the physical addresses \texttt{0x0} to \texttt{0x007fffff}.
The former is called \textit{identity map} since it maps the first 8MB of virtual addresses to the first 8MB of physical addresses, while the latter is
called \textit{kernel map} since it maps the virtual addresses of the kernel to its physical addresses.

\section{Bootmem Allocator}

The transition from 8MB to 896MB of virtual memory is performed within the \texttt{start_kernel()} function in \texttt{init/main.c}. This function calls \texttt{setup_arch()}, defined in \\
\texttt{arch/i386/kernel/setup.c} (architecture-dependent code), which initializes various data structures, among them the Bootmem allocator. The Bootmem allocator is a data structure and a set of functions used by the kernel to allocate memory (at page granularity, 4KB) before the kernel's main memory subsystem is set up. This set of APIs is available only during the early setup of memory, therefore its functions are marked \texttt{__init}.

The bootmem allocator initialization is performed in \texttt{setup_memory()},
implemented in \texttt{arch/i386/kernel/setup.c}, through the
\texttt{init_bootmem()} function (\texttt{mm/bootmem.c}), which initializes a bitmap
where each bit
refers to a page frame in the range of physical addresses from \texttt{_end}
(after the last section of the decompressed kernel image) up to 896MB.
First all the bits are set to 1, meaning that no page can be used for
allocation. After that it is up to the function
\texttt{register_bootmem_low_pages()} to query the E820 table and set the bits
of free page frames to 0.

\iffalse
\section{Kernel-Level MM Data Structures}

\begin{itemize}
\itemsep-2pt
\item First is a memory mapping for kernel code and data.
\item Core map: set of data structures to keep information about every frame of physical memory.
(NUMA, different cores see only a subportion of ram) If a region is not available to a core then it has to the "owner" core for that address.
\item Free list of physical memory frames
\end{itemize}
\fi

\section{Paging in Linux}

\begin{center}
\includegraphics[width=0.8\textwidth]{linuxvm.png}
\fig{1}{0 pt}{Linux Page Tables ($<$ v.2.6.11)}
\end{center}

Linux adopts a common paging model that fits both 32-bit and 64-bit architectures, consisting of three-level paging up to kernel v.2.6.10 and four-level paging from kernel v.2.6.11, which introduced the Page Upper Directory as second level before the Page Middle Directory (refer to Figure 5.1). Such a scheme makes the kernel highly architecture-independent, reducing the amount of code that has to be written for each specific architecture. Various macros are then used to map the paging scheme that Linux uses onto the
hardware-specific paging scheme. \marginnote{\texttt{x_SIZE =} $2^{\texttt{x_SHIFT}}$ \\ Also \texttt{PTRS_PER_x} are defined to determine the number of entries in each level of the page table}[5cm]


\begin{center}
\includegraphics[width=0.75\textwidth]{bits1.png}
\end{center}


\begin{center}
\includegraphics[width=0.75\textwidth]{bits2.png}
\fig{2}{0 pt}{Macros for paging mapping}
\end{center}

All the page table entry types are defined through structs like \texttt{typedef struct \{unsigned long pte_low; \} pte_t} to ensure type checking when manipulating table entries.
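
As a rough illustration of how the \texttt{x_SHIFT}, \texttt{x_SIZE} and \texttt{PTRS_PER_x} macros cooperate, the following self-contained sketch splits a 32-bit virtual address into its table indices. The constants are assumed to be those of the two-level i386 scheme without PAE (where the Page Middle Directory is folded away); the helper names \texttt{pgd_index_of}, \texttt{pte_index_of} and \texttt{page_offset_of} are hypothetical and are not the kernel's own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: i386 without PAE, 4KB pages, two hardware levels. */
#define PAGE_SHIFT    12
#define PGDIR_SHIFT   22
#define PAGE_SIZE     (1UL << PAGE_SHIFT)    /* x_SIZE = 2^x_SHIFT */
#define PGDIR_SIZE    (1UL << PGDIR_SHIFT)
#define PTRS_PER_PGD  1024
#define PTRS_PER_PTE  1024

/* Wrapping the raw word in a struct lets the compiler reject
 * accidental mixing of entries belonging to different levels. */
typedef struct { unsigned long pte_low; } pte_t;
typedef struct { unsigned long pgd; } pgd_t;

/* Index of the first-level (page directory) entry covering vaddr. */
static inline unsigned pgd_index_of(uint32_t vaddr)
{
    return (vaddr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

/* Index of the last-level (page table) entry covering vaddr. */
static inline unsigned pte_index_of(uint32_t vaddr)
{
    return (vaddr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}

/* Byte offset of vaddr inside its 4KB page. */
static inline uint32_t page_offset_of(uint32_t vaddr)
{
    return (uint32_t)(vaddr & (PAGE_SIZE - 1));
}
```

With these constants, \texttt{0xc0000000} falls in page directory entry 768 (\texttt{0x300}), the entry at which the kernel map begins.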

Various masks are defined in order to perform easy checks on page table entries, such as \texttt{_PAGE_PRESENT}, to be used as follows:

\begin{verbatim}
pte_t x;

x = ...;

if ((x.pte_low) & _PAGE_PRESENT) {
    /* executed if the page is present */
}
\end{verbatim}

Also, macros are defined for the most common combinations of page entry flags, such as \texttt{PAGE_SHARED, PAGE_KERNEL, PAGE_READONLY} etc.


\section{Kernel Page Table Initialization}

While setting up the architecture-specific data structures in \texttt{setup_arch()}, the transition from 8MB to 896MB is also performed, in \texttt{paging_init()} found in \\ \texttt{arch/i386/mm/init.c}, whose main subroutine is the following.

\begin{verbatim}
> arch/i386/mm/init.c
static void __init pagetable_init (void) {
    end = (unsigned long)__va(max_low_pfn*PAGE_SIZE);

    pgd_base = swapper_pg_dir;
    i = __pgd_offset(PAGE_OFFSET);
    pgd = pgd_base + i;

    for (; i < PTRS_PER_PGD; pgd++, i++) {
        vaddr = i*PGDIR_SIZE;
        if (end && (vaddr >= end)) break;
        pmd = (pmd_t *)pgd;
        ...
        for (j = 0; j < PTRS_PER_PMD; pmd++, j++) {
            vaddr = i*PGDIR_SIZE + j*PMD_SIZE;
            if (end && (vaddr >= end))
                break;
            ...
            pte_base = pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
            for (k = 0; k < PTRS_PER_PTE; pte++, k++) {
                vaddr = i*PGDIR_SIZE + j*PMD_SIZE + k*PAGE_SIZE;
                if (end && (vaddr >= end)) break;
                *pte = mk_pte_phys(__pa(vaddr), PAGE_KERNEL);
            }
            set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte_base)));
            ...
        }
    }
    ...
}
\end{verbatim}

In the snippet shown above \texttt{end} is \texttt{0xf8000000} (128MB short of 4GB) and \texttt{i} starts at \texttt{0x300} (768), the page directory index of \texttt{0xc0000000}, meaning that those are the virtual addresses used by the kernel.
The routine uses the bootmem allocator to allocate the last-level Page Tables, therefore the page tables might not be stored contiguously in memory. The function linearly maps the virtual addresses \texttt{0xc0000000} to \texttt{0xf8000000} onto the first 896MB of page frames of physical memory.

Once paging is set up we must ensure that no old entry is cached in the
TLB. This is done in \texttt{paging_init()} by calling
\texttt{load_cr3(swapper_pg_dir)} and \texttt{__flush_tlb_all()}.

\begin{verbatim}
> include/asm-i386/processor.h
#define load_cr3(pgdir) \
    asm volatile( "movl %0, %%cr3": : "r" (__pa(pgdir)) );
\end{verbatim}

As shown above, the physical address of \texttt{swapper_pg_dir} is written into \texttt{\%cr3}.
This is done because the page tables used up to now might not be the ones
in \texttt{swapper_pg_dir} (following the execution path from
\texttt{arch/i386/kernel/head.S}) but the ones set up by the boot process,
which might have launched the kernel from a different entry point.
On \texttt{i386} processors writing the \texttt{\%cr3} register flushes
the non-global TLB entries, but
some processors need further instructions to ensure that
the TLB is entirely flushed (for example, entries of pages marked as global survive a write to \texttt{\%cr3}); in that case this is done through
\texttt{__flush_tlb_all()}.

In newer versions of the kernel (v. 4.16) \texttt{load_cr3()} is
reimplemented as follows:

\begin{verbatim}
> arch/x86/include/asm/special_insns.h
...
static inline void native_write_cr3(unsigned long val) {
    asm volatile( "mov %0, %%cr3": : "r" (val), "m" (__force_order) );
}
...
static inline void load_cr3(pgd_t *pgdir) {
    native_write_cr3(__pa(pgdir));
}
\end{verbatim}

The purpose of \texttt{__force_order} is nicely explained in a comment within
the file in which the functions are implemented.

\begin{verbatim}
> arch/x86/include/asm/special_insns.h
/*
 * Volatile isn't enough to prevent the compiler from reordering the
 * read/write functions for the control registers and messing everything up.
 * A memory clobber would solve the problem, but would prevent reordering of
 * all loads stores around it, which can hurt performance. Solution is to
 * use a variable and mimic reads and writes to it to enforce serialization
 */
\end{verbatim}

\subsection{TLB APIs}
\label{subsec:TLB APIs}

TLB events can be classified along two main dimensions:
scale and typology. When an event affects virtual addresses accessible
by every CPU/core in real-time concurrency its scale is said to be
\textit{global}; if instead it affects only virtual addresses accessible
in time-sharing concurrency it is said to be \textit{local}.
The typology describes whether the event is a virtual-to-physical
address remapping or a modification of the access rules of a virtual address.

Another aspect to consider when dealing with TLB flushing is
its cost in terms of performance. Costs can be split into two sets: direct
costs and indirect costs. The former are the latency of the firmware to
perform the invalidation of the entries in the TLB plus the cost of
cross-CPU coordination. The latter are the latency paid by the MMU
firmware to refill the TLB upon misses in the translation process.

The Linux kernel provides various APIs for dealing with flushes of the
TLB, which are then mapped to architecture-dependent instructions. While
the APIs make it possible to perform selective flushing\footnote{
Flushing just a subset of the entries in the TLB.}, the real effect
on the TLB depends on the instructions provided by the
firmware. Nevertheless, it is highly recommended to use the most specific
API that is effective for the task and does not make the software
too complex. The interfaces provided by the Linux kernel, nicely
described in \texttt{Documentation/cachetlb.txt}, are the following.

\begin{description}
\itemsep0pt
\item \texttt{flush_tlb_all(void)} Flushes the entire TLB on
\textit{all} processors running in the system. It is usually
invoked when the kernel page tables change, since they are global
by nature. Its implementation is based on a function that executes
a portion of code on all processors through IPIs.
Such portion of code is \texttt{__flush_tlb_all}.
\item \texttt{flush_tlb_mm(struct mm_struct *mm)} Flushes all the TLB
entries related to a user address space. It is usually invoked
when all the entries associated with a process must be invalidated,
for example when performing a \texttt{fork()} to make the address
space not writable for COW (Copy on Write). On some architectures
(MIPS) this is required for all cores instead of affecting only
the local processor.
\item \texttt{flush_tlb_range(struct mm_struct *mm, unsigned long
start, \\ unsigned long end)} Similarly to the function above, it is
used to flush the translations of a range of (user) virtual addresses from
the TLB. Primarily, this is used for \texttt{munmap()/mremap()} or
\texttt{mprotect()}.
The interface is provided in hopes that the
port can find a suitably efficient method for removing multiple page
sized translations from the TLB instead of having the kernel call
\texttt{flush_tlb_page} for each entry which may be modified.
\item \texttt{flush_tlb_page(struct vm_area_struct *vma, unsigned long
page)} Flushes a single page from the TLB. Mainly used for flushing
the TLB entry of a page after it has been paged out or faulted in,
meaning that the access to that page caused a fault (for example COW).
\item \texttt{flush_tlb_pgtables(struct mm_struct *mm, unsigned long
start, \\ unsigned long end)} Used when the software page tables of an
address space are being torn down. Some platforms cache the lowest level of the
page tables in a linear virtually mapped array, to make TLB miss
processing more efficient. In these cases the TLB needs to be flushed
when parts of the page table tree are unlinked/freed.
\item \texttt{update_mmu_cache(struct vm_area_struct *vma, unsigned long
address, \\ pte_t pte)} Used to inform the CPU that there exists
a translation for the virtual address \texttt{address} corresponding
to the entry \texttt{pte}. Such information can be used in many ways
by the CPU, such as deciding whether to flush its data cache or
preload TLB translations.
\end{description}

\section{NUMA}

\begin{center}
\includegraphics[width=0.6\textwidth]{numa.png}
\fig{3}{0 pt}{NUMA systems}
\end{center}

With the ever-growing disparity between the performance of processors and
memory, memory accesses started to become a bottleneck in multiprocessor
systems. This issue is addressed by arranging memory into banks and assigning
each bank to one core or a set of cores. We denote by the term \textit{node} the
set of CPUs and banks coupled together.
All cores can access all the memory but
depending on their distance from the banks the cost to access memory is
different. Such architectures are defined as Non-Uniform Memory Access
(NUMA).

\newpage
\bibliography{Lec5}
\bibliographystyle{plainnat}
\end{document}

--------------------------------------------------------------------------------
/Lec4/Lec4.tex:
--------------------------------------------------------------------------------
\documentclass[twoside]{article}
\setlength{\oddsidemargin}{-0.5 in}
\setlength{\evensidemargin}{1.5 in}
\setlength{\topmargin}{-0.6 in}
\setlength{\textwidth}{5.5 in}
\setlength{\textheight}{8.5 in}
\setlength{\headsep}{0.5 in}
\setlength{\parindent}{0 in}
\setlength{\parskip}{0.07 in}
\setlength{\marginparwidth}{145pt}

%
% ADD PACKAGES here:
% 12

\usepackage{amsmath,
amsfonts,
amssymb,
graphicx,
mathtools,
flexisym,
marginnote,
hyperref,
titlesec}

\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage[shortlabels]{enumitem}

\graphicspath{ {images/} }

\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=magenta,
urlcolor=blue,
}

\titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt}
\titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt}

%
% The following commands set up the lecnum (lecture number)
% counter and make various numbering schemes work relative
% to the lecture number.
%
\newcounter{lecnum}
\renewcommand{\thepage}{\thelecnum-\arabic{page}}
\renewcommand{\thesection}{\thelecnum.\arabic{section}}
\renewcommand{\theequation}{\thelecnum.\arabic{equation}}
\renewcommand{\thefigure}{\thelecnum.\arabic{figure}}
\renewcommand{\thetable}{\thelecnum.\arabic{table}}

\newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization}
\newcommand{\wir}{1038137: Web Information Retrieval}
\newcommand{\va}{1052057: Visual Analytics}
\newcommand{\advprog}{1044416: Advanced Programming}
\newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing}

\newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}}



%
% The following macro is used to generate the header.
%
\newcommand{\lecture}[4]{
\pagestyle{myheadings}
\thispagestyle{plain}
\newpage
\setcounter{lecnum}{#4}
\setcounter{page}{1}
\noindent
\begin{center}
\framebox{
\vbox{\vspace{2mm}
\hbox to 7.4in { {\bf #1
\hfill Spring 2018} }
\vspace{4mm}
\hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} }
\vspace{2mm}
\hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj} }
\vspace{2mm}}
}
\end{center}
\markboth{Lecture #4: #2}{Lecture #4: #2}

\iffalse
{\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.}

{\bf Disclaimer}: {\it These notes have not been subjected to the
usual scrutiny reserved for formal publications. They may be distributed
outside this class only with the permission of the Instructor.}
\vspace*{4mm}
\fi
}
%
% Convention for citations is authors' initials followed by the year.
% For example, to cite a paper by Leighton and Maggs you would type
% \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}.
% (To avoid bibliography problems, for now we redefine the \cite command.)
% Also commands that create a suitable format for the reference list.
\iffalse
\renewcommand{\cite}[1]{[#1]}
\def\beginrefs{\begin{list}%
{[\arabic{equation}]}{\usecounter{equation}
\setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}%
\setlength{\labelwidth}{1.6truecm}}}
\def\endrefs{\end{list}}
\def\bibentry#1{\item[\hbox{[#1]}]}
\fi

%Use this command for a figure; it puts a figure in wherever you want it.
%usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION}
\newcommand{\fig}[3]{
\vspace{#2}
\begin{center}
Figure \thelecnum.#1:~#3
\end{center}
}
% Use these for theorems, lemmas, proofs, etc.
\newtheorem{theorem}{Theorem}[lecnum]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}}

% **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE:

\newcommand\E{\mathbb{E}}

\begin{document}

\nocite{*}

%FILL IN THE RIGHT INFO.
%\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**}

\lecture{\aosv}{March 13}{Alessandro Pellegrini}{4}

%\footnotetext{These notes are partially based on those of Nigel Mansell.}

% **** YOUR NOTES GO HERE:

In the previous lecture (Lec 3, Sec 3.4) we described the boot process for the early Linux kernel implementation with the built-in bootloader. With the widespread adoption of Linux, various bootloaders were implemented and, to regulate how the kernel has to be set up in memory by bootloaders, the Linux kernel defined the Linux Boot Protocol \cite{lbp}.
The main component of the protocol is the Real-mode kernel header, which is a set of variables, some statically set at compile time and some that must
be written by the bootloader, that allow the kernel to know the initial memory map of the system.

The kernel initialization can be divided into two parts. The first part is the Real Mode kernel and the second part is vmlinux, which is the full binary object of the Linux kernel and starts executing with \texttt{startup_32} at \texttt{arch/x86/boot/compressed/head_\{32,64\}.S}. With the introduction of (U)EFI and the advanced interface it provides, the kernel setup code became unnecessary over time; therefore the EFI Handover Protocol
was introduced alongside the Linux Boot Protocol, allowing the kernel to be loaded directly in Protected Mode instead of Real Mode, skipping the kernel setup code.

\marginnote{
\includegraphics[width=0.5\textwidth]{memboot.png}
\fig{1}{0 pt}{Memory after bootloader \cite{duarte_kern_boot}}
}

\iffalse
You can startup the kernel in various modes: either protected, real etc. Kernel Boot Protocol: tells the OS how is it going to startup it. Assume that the bootloader stage 1 and stage 2 startup the kernel in real mode.
\fi

\section{Kernel Initialization}

The bootloader ensures that the kernel is loaded as shown in Figure 4.1, with the Real-mode kernel loaded in the first megabyte of memory and the compressed Protected-mode image right after the first megabyte of memory\footnote{The loading above the first megabyte is performed either through BIOS facilities or Unreal Mode. Remember that the bootloader also runs in Real mode at first, therefore it cannot access that region of memory directly.}. After loading the kernel, it jumps to the
\texttt{_start} label placed at an offset of 512 bytes from the address where the first Real-mode sector was loaded.
From now on it is the kernel that executes in Real mode.

The first instruction performed is a hardcoded 16-bit short jump opcode\footnote{
The jump is hardcoded as raw bytes because otherwise the assembler may emit a 3-byte jump instead of the 2-byte short jump, which would push everything that follows to the wrong offset. The instruction takes an immediate as input and sets the instruction pointer to the address of the instruction following the jump plus the immediate. The immediate is computed at assembly time by the GNU assembler, where \texttt{start_of_setup} is the label of the address the kernel has to jump to and \texttt{1f}
is the label \texttt{1} forward (\texttt{f}), which corresponds to the value of the instruction pointer when the jump instruction is executed.}
to skip the second part of the header and execute the code at \texttt{start_of_setup}.

\begin{verbatim}
> arch/x86/boot/header.S
_start:
    .byte 0xeb
    .byte start_of_setup-1f
1:
\end{verbatim}

The \texttt{start_of_setup} routine \marginnote{\textsc{Start Of Setup}} first ensures \texttt{\%es = \%ds}, so that the two segment registers can be used interchangeably, and sets \marginpar{\texttt{\hspace*{1cm} pushw \%ds} \\ \texttt{\hspace*{1cm} pushw \$6f} \\\texttt{\hspace*{1cm} lret} \\\texttt{6:}} the \texttt{\%cs} register to the same value as the other segment registers through the code shown in the margin: \texttt{lret} pops the return address (label \texttt{6}) and then loads \texttt{\%cs} with the value of \texttt{\%ds} pushed above it on the stack, so execution continues at the label that follows.

After that \marginnote{\cite{linin}} it sets up a preliminary stack and zeroes the \texttt{.bss} section that holds the uninitialized variables, since a malevolent or buggy bootloader might have changed some of them.
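
The displacement arithmetic behind the hardcoded short jump can be sketched in C as follows. This is only an illustration: the function names and the addresses used to exercise them are hypothetical, while the opcode \texttt{0xeb} and the rule ``target = address of the next instruction + signed 8-bit immediate'' are those of the x86 short jump.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "jmp short target" for a jump placed at jmp_addr.
 * Returns 0 on success, -1 if the target does not fit in a
 * signed 8-bit displacement. */
static inline int encode_short_jmp(uint32_t jmp_addr, uint32_t target,
                                   uint8_t out[2])
{
    int64_t next = (int64_t)jmp_addr + 2;   /* the jump is 2 bytes long: label 1 */
    int64_t disp = (int64_t)target - next;  /* start_of_setup - 1f */

    if (disp < -128 || disp > 127)
        return -1;
    out[0] = 0xeb;                          /* short jump opcode */
    out[1] = (uint8_t)(int8_t)disp;         /* two's complement immediate */
    return 0;
}

/* Address reached when the CPU executes the two bytes found at jmp_addr. */
static inline uint32_t short_jmp_target(uint32_t jmp_addr, const uint8_t insn[2])
{
    return jmp_addr + 2 + (uint32_t)(int32_t)(int8_t)insn[1];
}
```

The second byte plays the role of \texttt{start_of_setup-1f} in \texttt{header.S}: the assembler evaluates that difference once, and the CPU adds it back to the address of label \texttt{1} at run time.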

Finally it jumps to the \texttt{main()} function found in \texttt{arch/x86/boot/main.c}.

In the \texttt{main} function \marginnote{\textsc{Boot Main}} the following initializations are performed.


\begin{verbatim}
> arch/x86/boot/main.c

main(void)
    copy_boot_params(void)
        copies the parameters of the Kernel header into a struct

    console_init(void)
        initializes console for printing on screen or serial
        communication

    init_heap(void)
        initializes a heap

    check_cpu(void)
        discovers cpu features

    detect_memory(void)
        invokes the E820 BIOS facility for generating the physical
        memory map

    // other initialization ...

    go_to_protected_mode(void)
        switches to Protected Mode
\end{verbatim}

The \texttt{go_to_protected_mode()} function \marginnote{\textsc{Go To Protected Mode}} performs the last preparations before entering Protected Mode and finally executes Protected-mode code.

\begin{verbatim}
> arch/x86/boot/pm.c

go_to_protected_mode(void)

    realmode_switch_hook(void)
        disable Non Maskable Interrupts (NMI)

    enable_a20(void)
        enable A20 Line as shown in previous lectures

    mask_all_interrupts(void)

    setup_idt(void) {
        static const struct gdt_ptr null_idt = {0, 0};
        asm volatile("lidtl %0" : : "m" (null_idt));
    }

        sets up a dummy IDT
\end{verbatim}

\marginnote{ The IDT and GDT are initialized at this stage because otherwise the processor won't enter Protected mode, as explained in the
\href{https://github.com/torvalds/linux/commit/88089519f302f1296b4739be45699f06f728ec31}{commit message}
}[-2.7cm]

\begin{verbatim}
    setup_gdt(void) {
        static const u64 boot_gdt[] __attribute__((aligned(16))) = {
            [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
            [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
            [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
        };

        static struct gdt_ptr gdt;
        gdt.len = sizeof(boot_gdt)-1;
        gdt.ptr = (u32)&boot_gdt + (ds() << 4);

        asm volatile("lgdtl %0" : : "m" (gdt));
    }

    protected_mode_jump(boot_params.hdr.code32_start,
                        (u32)&boot_params + (ds() << 4));
\end{verbatim}

\marginnote{\texttt{GDT_ENTRY} takes as input the flags, base and limit of a descriptor and generates the entry. The limit is 4KB $\times (2^{20} - 1) \approx 4$GB (G bit set to 1 through the flags)}[-6cm]

The GDT entry indices \texttt{GDT_ENTRY_BOOT_CS}, \texttt{GDT_ENTRY_BOOT_DS} and \texttt{GDT_ENTRY_BOOT_TSS} are respectively 2, 3 and 4. Entry 0 is reserved, as shown in the Intel documentation \cite{intel}, while entry 1 is not used by Linux.
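
To make the descriptor layout concrete, the sketch below packs flags, base and limit into a 64-bit descriptor in the way the margin note describes, and decodes the segment size back. The bit positions follow the Intel segment descriptor format; the macro is written from memory of the kernel's \texttt{GDT_ENTRY} and should be taken as an approximation of it, while \texttt{gdt_segment_size} is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* flags: bits 0-7 are the access byte, bits 12-15 are G/D/L/AVL;
 * base is 32 bits wide, limit is 20 bits wide. */
#define GDT_ENTRY(flags, base, limit)                       \
    ((((uint64_t)(base)  & 0xff000000ULL) << (56 - 24)) |   \
     (((uint64_t)(flags) & 0x0000f0ffULL) << 40) |          \
     (((uint64_t)(limit) & 0x000f0000ULL) << (48 - 16)) |   \
     (((uint64_t)(base)  & 0x00ffffffULL) << 16) |          \
     (((uint64_t)(limit) & 0x0000ffffULL)))

/* Size in bytes of the segment described by desc: the 20-bit limit
 * counts 4KB pages when the G bit (bit 55) is set, bytes otherwise. */
static inline uint64_t gdt_segment_size(uint64_t desc)
{
    uint64_t limit = (desc & 0xffffULL) | ((desc >> 32) & 0xf0000ULL);
    int g = (int)((desc >> 55) & 1);

    return g ? (limit + 1) << 12 : limit + 1;
}
```

Under these assumptions the boot code and data segments both span the whole 4GB address space starting at base 0, while the boot TSS entry describes 104 bytes starting at address 4096.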



The GDT \texttt{len} field must be an unsigned 16-bit integer which, added to \texttt{ptr}, gives the address of the last valid byte of the table (that is why 1 is subtracted from the size), while \texttt{ptr} must be the 32-bit physical address of the 16-byte-aligned memory containing the GDT (that is why \texttt{gdt.ptr = \&boot_gdt + (\%ds} $\times$ \texttt{16)}).


Both the GDT and IDT setup functions use inline assembly to perform hardware-specific operations, which are respectively loading the GDT register and the IDT register with the content of the memory referenced by \texttt{gdt/null_idt}. The syntax is described in more depth later in this lecture.

Finally the routine \texttt{protected_mode_jump(u32 entrypoint, u32 bootparams)} defined in \\ \texttt{arch/x86/boot/pmjump.S} performs the jump to Protected Mode code in \texttt{startup_32}. \marginnote{\textsc{Protected Mode Jump}}[-12pt]

\begin{verbatim}
> arch/x86/boot/pmjump.S

protected_mode_jump(u32 entrypoint, u32 bootparams)
    .code16                     ; tell the assembler to emit 16-bit code
    ; store bootparams in %esi
    ; calculate pm32 physical address and keep it in 2f

    movw $__BOOT_DS, %cx
    movw $__BOOT_TSS, %di

    movl %cr0, %edx             ; enable protected mode
    orb $X86_CR0_PE, %dl        ; through CR0
    movl %edx, %cr0

    .byte 0x66, 0xea            ; 32-bit ljmp opcode
2:
    .long in_pm32
    .word __BOOT_CS

    .code32                     ; tell the assembler to emit 32-bit code

in_pm32:

    ; initialize segment registers
    ; setup stack for debugging
    ; setup Task Register otherwise cannot enable Protected-mode
    ; clear general purpose registers
    ; load Local Descriptor Table otherwise cannot enable Protected-mode
    ; jump to 32-bit entry point startup_32 in
    ; arch/x86/boot/compressed/head_64.S
\end{verbatim}

As the reader may have noticed, the code to which the previous snippet jumps belongs to the \texttt{compressed/} folder under the \texttt{arch/x86/boot/} directory. This is the entry point of the kernel in case of booting through EFI.


Also notice that the code we are going to explore is for the \texttt{x86_64} architecture. The flow is similar for \texttt{x86_32} and the code can be found in the same folder in \texttt{head_32.S}.

The first instruction \marginnote{\textsc{Startup 32 | Compressed} \\ Note that the initialization shown in class is about the \texttt{x86_32} kernel following the code in \texttt{head_32.S} and described in the previous lecture} performed is \texttt{cld}, which clears DF (the direction flag) used by the \texttt{stos} and \texttt{scas} instructions. Then the \texttt{KEEP_SEGMENTS} flag in the header set by the bootloader is tested to check whether the segment registers have to be reinitialized.

In order to know the physical address at which the kernel was actually loaded, the following instructions are performed:

\begin{verbatim}
> arch/x86/boot/compressed/head_64.S

    leal (BP_scratch+4)(%esi), %esp
    call 1f
1:  popl %ebp
    subl $1b, %ebp

\end{verbatim}

The stack pointer is set to a physical address known to be ``scratchable'' in order to perform the instructions. A \texttt{call} to the following instruction is issued, which pushes onto the stack the run-time address of label \texttt{1} (the return address), which is then \texttt{pop}ped into the \texttt{\%ebp} register. From this value the link-time address of label \texttt{1} (relative to the start of the binary object) is subtracted, yielding the physical address of \texttt{startup_32}.

Next, knowing the physical address, a stack is set up, a CPU verification is performed to check whether the processor supports long mode, and the relocation address (physical address where the kernel will be loaded) is calculated.
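
The arithmetic performed by the \texttt{call}/\texttt{popl}/\texttt{subl} sequence can be summarized by the following sketch. It only models the computation: the function names and the addresses used to exercise them are hypothetical, and it assumes (as in the text above) that the image is linked to start at address 0, so that the link-time address of a label equals its offset inside the binary.

```c
#include <assert.h>
#include <stdint.h>

/* runtime_label:  value popped into %ebp (address of label 1 as executed)
 * linktime_label: value of $1b as known to the assembler/linker
 * result:         physical address at which the image was actually loaded */
static inline uint32_t load_address(uint32_t runtime_label, uint32_t linktime_label)
{
    return runtime_label - linktime_label;
}

/* Once the load address is known, the run-time address of any other
 * symbol is obtained by adding its link-time offset back. */
static inline uint32_t runtime_address(uint32_t load_addr, uint32_t linktime_offset)
{
    return load_addr + linktime_offset;
}
```

This is why the code that follows can keep addressing its own data relative to \texttt{\%ebp} wherever the bootloader happened to place the image.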

The last operations are preparations for entering long mode. A new GDT is set up since the previous one was meant only for 32-bit mode (its L and D flags differ from the ones needed here). The PAE bit is set in \texttt{\%cr4}, an early page table is initialized and its address is written into \texttt{\%cr3}.

Finally long mode is enabled through the EFER model-specific register as shown in the previous lectures, the address of the next routine to be executed is pushed onto the stack, paging is enabled by setting the PG and PE bits in \texttt{\%cr0}, and an \texttt{lret} instruction is issued to start executing \texttt{startup_64}, still in \texttt{head_64.S}. In \texttt{startup_64} the bss section is cleared to prepare for the jump to the C code of \texttt{extract_kernel()}
(\texttt{arch/x86/boot/compressed/misc.c}), which decompresses
the kernel and returns in \texttt{\%rax} the address at which the kernel was decompressed. \texttt{head_64.S} then jumps to it with \texttt{jmp *\%rax}.

The kernel \marginnote{\textsc{Startup 64 | Kernel}} entry point is \texttt{startup_64}, whose code is in \texttt{arch/x86/kernel/head_64.S} (similarly there are two files for the \texttt{x86_32} kernel); it slightly modifies the page table and jumps to \texttt{x86_64_start_kernel()} defined in \texttt{arch/x86/kernel/head64.c}, which finally calls \texttt{start_kernel()} defined in \texttt{init/main.c}. \texttt{start_kernel()} performs many kinds of
system initialization and wakes up the other cores through the INIT-SIPI signals.

\iffalse
Steady state is the final state of the system. The kernel must know which core it is running on. A function in the kernel is implemented to know it. SMP processor id: symmetric multicore processing.
Two main ways to know it are through the usage of APIC since the LAPIC holds info telling which LAPIC is answering, or on older version through specific instructions for example cpuid to know cache levels, amd or intel etc. Throws out every kind of information. Paging etc.
\fi

\section{Inline Assembly}

Inline Assembly is a feature provided by the GNU compiler, and adopted by other compilers, to embed assembly instructions inside C/C++ code. Such a construct is useful when some code needs to be so efficient that, even writing it as a separate asm function (\texttt{fun.S}), the \texttt{call} overhead would be too much, or when implementing highly hardware-specific code. Both cases explain why a lot of kernel code contains inline assembly.

The syntax for using inline assembly is:
\begin{verbatim}
asm [volatile] ( AssemblerTemplate
               [ : OutputOperands ]
               [ : InputOperands ]
               [ : Clobbers ] )
\end{verbatim}

Both the \texttt{asm} and \texttt{volatile} keywords can be used with or without underscores (e.g. \texttt{__asm__}). The \texttt{volatile} keyword prevents the compiler from optimizing this portion of code away, for example by removing it or hoisting it out of a loop. It does \textit{not} ensure the ordering between the assembly instructions and the surrounding C code.

An example of inline assembly is the following: \texttt{asm ("movl \%\%eax, \%\%ebx" : : : "ebx");}.

When operands or clobbers are specified, referring to registers requires two \% symbols, since a single \% symbol is reserved for referring to operands inside the Assembler Template. To refer to C variables in the inline assembly code we must declare them among the Input or Output operands. The Clobbers part informs the compiler of which registers or memory areas are going to be modified, since it does not parse the assembler template.

Let's look at some examples to get a better idea.
364 | \begin{verbatim} 365 | void cpuid(int code, uint32_t *a, uint32_t *d) { 366 | asm volatile("cpuid" 367 | :"=a"(*a),"=d"(*d) 368 | :"a"(code) 369 | :"ecx","ebx"); 370 | } 371 | \end{verbatim} 372 | 373 | The code above performs the \texttt{cpuid} instruction. The \texttt{code} variable is loaded into the \texttt{\%eax} register through the input operand before the \texttt{cpuid} instruction is issued; after the instruction, the contents of the \texttt{\%eax} (\texttt{a}) and \texttt{\%edx} (\texttt{d}) registers are written (\texttt{=}) to the memory areas pointed to by the variables \texttt{a} (\texttt{(*a)}) and \texttt{d} (\texttt{(*d)}) respectively. 374 | Indeed if we compile the code above we get: 375 | 376 | \begin{verbatim} 377 | cpuid(int, int*, int*): 378 | pushl %ebx 379 | movl 8(%esp), %eax ; a (code) 380 | cpuid 381 | movl 12(%esp), %ecx ; 382 | movl %eax, (%ecx) ; =a (*a) 383 | movl 16(%esp), %eax ; 384 | movl %edx, (%eax) ; =d (*d) 385 | popl %ebx 386 | ret 387 | \end{verbatim} 388 | 389 | Another example is the \texttt{swap} function below. 390 | 391 | \begin{verbatim} 392 | void swap(long *a, long *b){ 393 | __asm__ __volatile__ ( 394 | "push (%%rax) \n" 395 | "push (%%rbx) \n" 396 | "pop (%%rax) \n" 397 | "pop (%%rbx)" 398 | : "=a" (a) , "=b" (b) 399 | : "0" (a) , "1" (b) 400 | : "memory" 401 | ); 402 | } 403 | \end{verbatim} 404 | 405 | The two pointers passed as input are placed in the \texttt{rax} and \texttt{rbx} registers; the values they point to are pushed on the stack and then popped in reverse order back into the pointed memory, which swaps them. In the input operands the strings \texttt{"0"} and \texttt{"1"} are matching constraints: they refer to operands by their position in the operand list, which GCC numbers starting from 0. String \texttt{"0"} means the same location as operand 0, i.e. \texttt{\%rax} in this case, while string \texttt{"1"} 406 | refers to operand 1, which is \texttt{\%rbx}.
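As a point of comparison for the \texttt{swap} example, the same effect can be obtained without touching the stack, which matters in user-space x86-64 code where \texttt{push}/\texttt{pop} inside inline assembly may overwrite the red zone of a leaf function. The following is only a sketch under stated assumptions: GCC or Clang on an LP64 x86-64 target, with a plain C fallback elsewhere; \texttt{swap_xchg} is a hypothetical name that does not appear in the lecture. It uses the \texttt{+} modifier, which declares an operand as both read and written, instead of the matching constraints \texttt{"0"} and \texttt{"1"}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the lecture): swap two longs with xchg. */
static void swap_xchg(long *a, long *b)
{
    assert(a != NULL && b != NULL);
    long tmp = *a;
#if (defined(__GNUC__) || defined(__clang__)) && \
    defined(__x86_64__) && defined(__LP64__)
    /* "+r": tmp lives in a register and is both read and written.
     * "+m": *b is a memory operand that is both read and written.
     * After xchg, tmp holds the old *b and *b holds the old *a. */
    __asm__ __volatile__("xchgq %0, %1" : "+r"(tmp), "+m"(*b));
#else
    /* Portable fallback so the sketch still builds on other targets. */
    long old_b = *b;
    *b = tmp;
    tmp = old_b;
#endif
    *a = tmp;
}
```

Because \texttt{*b} is listed as a \texttt{"+m"} operand, the compiler already knows it is modified, so no \texttt{"memory"} clobber is needed in this variant.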
407 | 408 | Putting the string \texttt{"memory"} into the clobbers informs the compiler that the inline assembly reads or writes memory other than its listed operands. The compiler therefore must not keep memory values cached in registers across the inline assembly nor reorder memory accesses around it. 409 | 410 | A last example is a compare-and-swap: 411 | 412 | \begin{verbatim} 413 | bool CAS(volatile unsigned long long *ptr, 414 | unsigned long long oldVal, 415 | unsigned long long newVal) { 416 | unsigned long res = 0; 417 | 418 | __asm__ __volatile__( 419 | "lock cmpxchgq %1, %2;"//ZF = 1 if succeeded 420 | "lahf;" // to get the correct result even if oldVal == 0 421 | "bt $14, %%ax;" // is ZF set? (ZF is the 6'th bit in %ah, 422 | // so it's the 14'th in ax) 423 | "adc %0, %0" // get the result 424 | : "=r"(res) 425 | : "r"(newVal), "m"(*ptr), "a"(oldVal), "0"(res) 426 | : "memory" 427 | ); 428 | 429 | return (bool) res; 430 | } 431 | \end{verbatim} 432 | 433 | The code produced by the compiler is the following 434 | 435 | \begin{verbatim} 436 | CAS(unsigned long long volatile*, unsigned long long, unsigned long long): 437 | xorl %ecx, %ecx 438 | movq %rsi, %rax 439 | lock cmpxchgq %rdx, (%rdi); 440 | lahf; 441 | bt $14, %ax 442 | adc %rcx, %rcx 443 | testq %rcx, %rcx 444 | setne %al 445 | ret 446 | \end{verbatim} 447 | 448 | According to the \texttt{x64} calling convention the first three parameters are passed in registers \texttt{\%rdi, \%rsi, \%rdx}. \texttt{oldVal}, which is in \texttt{\%rsi}, is moved into \texttt{\%rax} as required by the \texttt{"a"} input operand. In the inline assembly \texttt{cmpxchgq} takes operands \texttt{\%1} and \texttt{\%2}, i.e. \texttt{newVal} and \texttt{*ptr}, which are found in \texttt{\%rdx} and at the address held in \texttt{\%rdi}.
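For comparison with the hand-written \texttt{CAS} above, GCC and Clang also expose compare-and-swap as the builtin \texttt{__atomic_compare_exchange_n}, which leaves flag handling and register allocation to the compiler. The sketch below assumes one of those two compilers; \texttt{cas_builtin} is a hypothetical wrapper name chosen here to mirror the signature of \texttt{CAS}, it is not part of the lecture material.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wrapper (not from the lecture) with the same shape as CAS(). */
static bool cas_builtin(volatile unsigned long long *ptr,
                        unsigned long long old_val,
                        unsigned long long new_val)
{
    assert(ptr != NULL);
    /* If *ptr == old_val, store new_val and return true.
     * Otherwise the builtin writes the value it found into old_val
     * (discarded here, since old_val is a local copy) and returns false. */
    return __atomic_compare_exchange_n(ptr, &old_val, new_val,
                                       false,            /* strong variant */
                                       __ATOMIC_SEQ_CST, /* order on success */
                                       __ATOMIC_SEQ_CST  /* order on failure */);
}
```

On x86-64 this is typically compiled to a \texttt{lock cmpxchg} as well, so the inline assembly version is mainly of interest for seeing how operands and clobbers are wired together.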
449 | 450 | \subsection{Kernel Initialization Signature} 451 | 452 | Another important fact to be analyzed is the \texttt{start_kernel()} signature which has the following macros prepended to it: 453 | 454 | \begin{description} 455 | \item \texttt{asmlinkage}: in order to be more performant in i386 architecture the kernel is compiled with the option \texttt{-mregparm=3} which forces GCC to optimize function calls by putting the parameters into the registers instead of putting them on stack as defined by the ABI. This macro ensures that the parameters passed to the function will be on stack. 456 | \item \texttt{__visible}: informs the compiler to not remove the function during the link-time optimization even if it is not called by any code 457 | \item \texttt{__init}: sets the function binary into a specific region of memory called \texttt{.init.text} allowing the kernel to safely reclaim that space after initialization is concluded 458 | \end{description} 459 | 460 | \newpage 461 | \bibliography{Lec4} 462 | \bibliographystyle{plainnat} 463 | \end{document} 464 | -------------------------------------------------------------------------------- /Lec2/Lec2.tex: -------------------------------------------------------------------------------- 1 | \documentclass[twoside]{article} 2 | \setlength{\oddsidemargin}{-0.5 in} 3 | \setlength{\evensidemargin}{1.5 in} 4 | \setlength{\topmargin}{-0.6 in} 5 | \setlength{\textwidth}{5.5 in} 6 | \setlength{\textheight}{8.5 in} 7 | \setlength{\headsep}{0.5 in} 8 | \setlength{\parindent}{0 in} 9 | \setlength{\parskip}{0.07 in} 10 | \setlength{\marginparwidth}{145pt} 11 | 12 | % 13 | % ADD PACKAGES here: 14 | % 12 15 | 16 | \usepackage{amsmath, 17 | amsfonts, 18 | amssymb, 19 | graphicx, 20 | mathtools, 21 | flexisym, 22 | marginnote, 23 | hyperref, 24 | titlesec} 25 | 26 | \usepackage[english]{babel} 27 | \usepackage[utf8]{inputenc} 28 | \usepackage[shortlabels]{enumitem} 29 | 30 | \graphicspath{ {images/} } 31 | 32 | \hypersetup{ 33 | 
colorlinks=true, 34 | linkcolor=blue, 35 | filecolor=magenta, 36 | urlcolor=blue, 37 | } 38 | 39 | \titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 40 | \titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 41 | 42 | % 43 | % The following commands set up the lecnum (lecture number) 44 | % counter and make various numbering schemes work relative 45 | % to the lecture number. 46 | % 47 | \newcounter{lecnum} 48 | \renewcommand{\thepage}{\thelecnum-\arabic{page}} 49 | \renewcommand{\thesection}{\thelecnum.\arabic{section}} 50 | \renewcommand{\theequation}{\thelecnum.\arabic{equation}} 51 | \renewcommand{\thefigure}{\thelecnum.\arabic{figure}} 52 | \renewcommand{\thetable}{\thelecnum.\arabic{table}} 53 | 54 | \newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization} 55 | \newcommand{\wir}{1038137: Web Information Retrieval} 56 | \newcommand{\va}{1052057: Visual Analytics} 57 | \newcommand{\advprog}{1044416: Advanced Programming} 58 | \newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing} 59 | 60 | \newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}} 61 | 62 | 63 | % 64 | % The following macro is used to generate the header. 
65 | % 66 | \newcommand{\lecture}[4]{ 67 | \pagestyle{myheadings} 68 | \thispagestyle{plain} 69 | \newpage 70 | \setcounter{lecnum}{#4} 71 | \setcounter{page}{1} 72 | \noindent 73 | \begin{center} 74 | \framebox{ 75 | \vbox{\vspace{2mm} 76 | \hbox to 7.4in { {\bf #1 77 | \hfill Spring 2018} } 78 | \vspace{4mm} 79 | \hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} } 80 | \vspace{2mm} 81 | \hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj} } 82 | \vspace{2mm}} 83 | } 84 | \end{center} 85 | \markboth{Lecture #4: #2}{Lecture #4: #2} 86 | 87 | \iffalse 88 | {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.} 89 | 90 | {\bf Disclaimer}: {\it These notes have not been subjected to the 91 | usual scrutiny reserved for formal publications. They may be distributed 92 | outside this class only with the permission of the Instructor.} 93 | \vspace*{4mm} 94 | \fi 95 | } 96 | % 97 | % Convention for citations is authors' initials followed by the year. 98 | % For example, to cite a paper by Leighton and Maggs you would type 99 | % \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}. 100 | % (To avoid bibliography problems, for now we redefine the \cite command.) 101 | % Also commands that create a suitable format for the reference list. 102 | \iffalse 103 | \renewcommand{\cite}[1]{[#1]} 104 | \def\beginrefs{\begin{list}% 105 | {[\arabic{equation}]}{\usecounter{equation} 106 | \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}% 107 | \setlength{\labelwidth}{1.6truecm}}} 108 | \def\endrefs{\end{list}} 109 | \def\bibentry#1{\item[\hbox{[#1]}]} 110 | \fi 111 | 112 | %Use this command for a figure; it puts a figure in wherever you want it. 113 | %usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION} 114 | \newcommand{\fig}[3]{ 115 | \vspace{#2} 116 | \begin{center} 117 | Figure \thelecnum.#1:~#3 118 | \end{center} 119 | } 120 | % Use these for theorems, lemmas, proofs, etc. 
121 | \newtheorem{theorem}{Theorem}[lecnum] 122 | \newtheorem{lemma}[theorem]{Lemma} 123 | \newtheorem{proposition}[theorem]{Proposition} 124 | \newtheorem{claim}[theorem]{Claim} 125 | \newtheorem{corollary}[theorem]{Corollary} 126 | \newtheorem{definition}[theorem]{Definition} 127 | \newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}} 128 | 129 | % **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE: 130 | 131 | \newcommand\E{\mathbb{E}} 132 | 133 | \begin{document} 134 | 135 | \nocite{*} 136 | 137 | %FILL IN THE RIGHT INFO. 138 | %\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**} 139 | 140 | \lecture{\aosv}{March 6}{Alessandro Pellegrini}{2} 141 | 142 | %\footnotetext{These notes are partially based on those of Nigel Mansell.} 143 | 144 | % **** YOUR NOTES GO HERE: 145 | 146 | \iffalse 147 | 148 | \section{Stage 1 Bootloader} 149 | After disabling the interrupts and having initialized all the segments registers to 0 the A20 line is enabled and the switch to 32 bit protected mode is performed. Finally a stack is setup and the Stage 2 Bootloader is loaded (if BIOS mode and not UEFI). 150 | 151 | \fi 152 | 153 | \section{A20 Line} 154 | 155 | \subsection{Story} 156 | The \marginnote{\textsc{Memory Wrap-Around}} original 8086 processor featured a 20-bit address bus capable of addressing 1MB of memory. However, the odd segmentation mechanism, theoretically, allowed addressing slightly more than just 1MB. Indeed all the \textit{logical} addresses ranging from \texttt{F800:8000} (or equivalently \texttt{FFFF:0010}) to \texttt{FFFF:FFFF} and their \textit{linear} equivalent \texttt{0x100000} and \texttt{0x10FFEF} did not have a corresponding \textit{physical} address. 157 | 158 | The workaround for this issue was Memory Wrap-Around: "wrap" that portion of memory making those addresses point to the lower portion of memory by dropping bit 20 (indexing bits from 0 from the least significant). 
This means that the linear addresses shown above would be mapped to the physical addresses \texttt{0x00000} through \texttt{0x0FFEF}. 159 | 160 | This workaround was then used heavily by many programmers to improve the performance of their programs. For example it was used by DOS programmers in order to use the same segment for accessing both program data placed in the higher part of the memory and I/O data that was placed in the lower end. 161 | 162 | With the introduction of \marginnote{\textsc{A20 Line Fix}} the Intel 80286 the addressable memory was increased to 16 MB in protected mode; however, the new CPU was supposed to be able to run in real mode all programs developed for the 8086 so that it could run OSes and programs that were not written for protected mode. This produced a bug in the IBM PC AT computer since logical address \texttt{F800:8000} accessed the physical address \texttt{0x100000} instead of wrapping around and accessing \texttt{0x00000}. The issue was solved by ANDing the CPU A20 (address line 20) with an output of the keyboard controller (KBC). 163 | 164 | Disconnecting just A20 and not also A21, A22, A23 was enough since real mode software cared only about the area slightly above 1MB. 165 | 166 | \subsection{Enabling A20 Line via keyboard controller} 167 | The A20 Line can be enabled through various methods but the classical one is through the keyboard controller, which at the time was just a single chip (8042) and is today known as the PS/2 Controller. 168 | 169 | The PS/2 Controller uses 2 I/O ports: port \texttt{0x60} for reading/writing data from/to the controller and port \texttt{0x64} for reading/writing the status/command register of the controller. Bit 1 of the status register tells the input buffer status of the controller: 0 if empty, 1 if full. It must be clear before attempting to write some data to port \texttt{0x60} or \texttt{0x64}, therefore we must busy-wait on it to communicate with the controller.
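The address arithmetic behind the wrap-around and the A20 gate described in this section can be checked with a few lines of C. This is only a model under stated assumptions: the function names \texttt{linear_address} and \texttt{gate_a20} are hypothetical, and the gate is reduced to masking bit 20, which is the effect of ANDing the A20 line with a controller output that is low.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Real-mode linear address: segment * 16 + offset (needs up to 21 bits). */
static uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Model of the A20 gate: while the gate is disabled, address bit 20 is
 * forced to 0, so addresses just above 1MB wrap to the bottom of memory. */
static uint32_t gate_a20(uint32_t linear, bool a20_enabled)
{
    const uint32_t bit20 = UINT32_C(1) << 20;

    /* FFFF:FFFF is the highest address real-mode segmentation can form. */
    assert(linear <= UINT32_C(0x10FFEF));
    return a20_enabled ? linear : (linear & ~bit20);
}
```

With these definitions both \texttt{F800:8000} and \texttt{FFFF:0010} give the linear address \texttt{0x100000}; with the gate disabled it becomes \texttt{0x00000}, and \texttt{FFFF:FFFF} (\texttt{0x10FFEF}) becomes \texttt{0x0FFEF}, matching the ranges quoted above.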
170 | 171 | By writing the value \texttt{0xD1} in the command register we ask the controller to write the next byte (that will be sent to it in the data register) to the Controller Output Port. \texttt{0xDF} is the code for enabling the A20 line and \texttt{0xDD} for disabling it. 172 | 173 | The code for enabling A20 follows. 174 | 175 | \begin{verbatim} 176 | ... 177 | call wait_for_8042 ; busywait to be able to write the command 178 | movb $0xd1, %al 179 | outb %al, $0x64 ; ask to write 180 | call wait_for_8042 181 | movb $0xdf, %al 182 | outb %al, $0x60 ; enable A20 183 | call wait_for_8042 184 | ... 185 | 186 | wait_for_8042: ; busy waiting routine 187 | inb $0x64, %al 188 | testb $2, %al ; check if can write (Bit 1 has value 1 if full) 189 | jnz wait_for_8042 190 | ret 191 | \end{verbatim} 192 | 193 | \section{Protected Mode Overview} 194 | 195 | Protected mode gets its name from the privilege protection \marginnote{ 196 | \includegraphics[width= 0.35 \textwidth]{privlevel.png} 197 | \fig{1}{0 pt}{Intel CPU privilege levels} 198 | }[-1 cm] 199 | mechanisms introduced with the Intel 80286 processor to restrict what user-level programs can do by employing \textbf{Privilege Levels}. There are four privilege levels, numbered 0 (most privileged) to 3 (least privileged), and three main resources being protected: memory, I/O ports, and the ability to execute certain machine instructions. 200 | Protected mode was then extended by adding paging. Both paging and protected mode must be enabled on startup. 201 | At any given time, an x86 CPU is running in a specific privilege level, which determines what code can and cannot do. These privilege levels are often described as \textbf{Protection Rings}, with the innermost ring corresponding to highest privilege. Most modern x86 kernels use only two privilege levels, 0 and 3. Ring 1 has been used in hypervisor systems.
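The ring a program is running in can be observed directly: the low two bits of the selector held in \texttt{\%cs} give the current privilege level (the selector format is covered later in these notes), and reading \texttt{\%cs} is allowed at any privilege level. The sketch below assumes GCC or Clang on an x86 target and returns \texttt{-1} elsewhere; \texttt{current_privilege_level} is a hypothetical helper name, not something defined by the lecture or by the kernel.

```c
#include <assert.h>   /* assert() is used in the usage check described below */
#include <stdint.h>

/* Hypothetical helper: returns the current privilege level (0..3),
 * or -1 when the target is not x86 or the compiler is not GCC/Clang. */
static int current_privilege_level(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    uint16_t cs;

    /* Copy the code segment selector into a general purpose register. */
    __asm__ __volatile__("movw %%cs, %0" : "=r"(cs));
    return cs & 3;    /* low two bits of the selector */
#else
    return -1;
#endif
}
```

Called from an ordinary user-space process on x86, the function is expected to report ring 3, so a check such as \texttt{assert(current\_privilege\_level() == 3)} should hold there; the same code executed inside the kernel would report ring 0.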
202 | 203 | To manage these kind of security mechanisms a new set of registers, namely Control Registers, were introduced. About 15 machine instructions, out of dozens, are restricted by the CPU to ring zero. In protected mode \textbf{only privilege level 0} code can read or load the Control Registers. In user space CS is not writable. 204 | 205 | The \marginnote{\textsc{Control Register 0}} most important Control Register is CR0 which contains system control flags that control operating mode and states of the processor: 206 | \begin{description} 207 | \itemsep0em 208 | \item [PG Paging (bit 31):] Enables paging when set; disables paging when clear. When paging is 209 | disabled, all linear addresses are treated as physical addresses. The PG flag has no effect if the PE flag (bit 210 | 0 of register CR0) is not also set; setting the PG flag when the PE flag is clear causes a general-protection 211 | exception (\#GP) 212 | \item [CD Cache Disable (bit 30):] together with NW, when set, fully disable caches 213 | \item [NW Not Write-through (bit 29)] 214 | \item [NE Numeric Error (bit 5):] enables the reporting of x87 FPU errors 215 | \item [PE Protection Enable (bit 0):] Enables protected mode when set; enables real-address mode when clear. This flag does not enable paging directly. It only enables segment-level protection. To enable paging, both the PE and PG flags must be set \marginnote{The OS may sometimes set PE to 0 to write big chunks of memory to improve performance and then reset it}[-36 pt] 216 | \end{description} 217 | 218 | By switching to protected mode we get that all the instructions that the processor will start executing will be interpreted as 32-bit from now on instead of 16-bit. 
Two problems arise from this. First, the \texttt{CS} register contains information that had some meaning in real mode but does not have the same meaning in protected mode. Second, to be more performant the CPU prefetches the instructions it has to execute into an \textbf{Instruction Queue} and interprets them as an \textit{instruction stream}, but the prefetched instructions were decoded as 16-bit while we are switching to 32-bit. In order to make the CPU work properly we must perform a far \texttt{jmp} or \texttt{call} instruction to \textit{serialize} the CPU, i.e. flush the Instruction Queue, start reading 32-bit instructions and "re-initialise" the code segment. Afterwards we must also reload the Segment Registers since they still hold the content used by the 16-bit code. \marginnote{\cite{intel} Ch. 9, Sec. 9.9}[-36 pt] 219 | 220 | \section{Memory Translation in Protected Mode} 221 | 222 | Before entering protected mode, auxiliary data structures must be set up in order for the system to work properly. 223 | 224 | One of the main differences between the two modes of operation (real and protected) is the translation from logical addresses to physical ones. In 32-bit protected mode, a segment selector is no longer a raw number, but instead it contains an index pointing to a 64 bit entry in a table of \textbf{segment descriptors}. 225 | 226 | The segment descriptors are stored in two tables: the \textbf{Global Descriptor Table} (GDT) and the \textbf{Local Descriptor Table} (LDT). Each CPU (or core) in a computer contains a register called \textbf{GDTR} which stores the linear memory address of the first byte in the GDT (analogously there is a register for the LDT). \marginnote{There is one GDT per CPU in multi core systems while there can be many LDTs.
LDTs are not really used anymore}[-36 pt] 227 | 228 | To choose a segment, you must load a segment register with a \textbf{segment selector} in the following format: \marginnote{RPL field is called CPL in the case of Code Segment Selector and Stack Segment Selector meaning Current Privilege Level}[2 cm] 229 | 230 | \begin{center} 231 | \includegraphics[width=0.5\textwidth]{segsel.png} 232 | \fig{2}{0 pt}{Segment Selector Format \cite{intel}} 233 | \end{center} 234 | 235 | The \textbf{Index} part of the selector identifies which Segment Descriptor (and therefore Segment) we are interested in\marginnote{The size of the table can be at most $2^{13}$ descriptor entries}[-12 pt]. The \textbf{Requested Privilege Level} will be covered later. 236 | 237 | Each Segment Descriptor has the following format 238 | 239 | \begin{center} 240 | \includegraphics[width=0.8\textwidth]{segdesc.png} 241 | \fig{3}{0 pt}{Segment Descriptor Entry \cite{intel}} 242 | \end{center} 243 | 244 | \begin{description} 245 | \item [S (descriptor type flag)] Specifies \marginnote{The first descriptor entry in the GDT is not used and its value is all 0s (\cite{intel} pp. 103)} whether the segment descriptor is for a \textbf{system} segment (S flag is clear) or a \textbf{code} or \textbf{data} segment (S flag is set) 246 | \item [Base] Defines the location of byte 0 of the segment within the 4-GByte linear address space 247 | \item [Segment Limit] Specifies the size of the segment. The processor puts together the two segment limit fields to form a 20-bit value. The processor interprets the segment limit in one of two ways, depending on the setting of the G (granularity) flag 248 | \item [G] Determines the scaling of the segment limit field. When the granularity flag is clear, the segment limit is interpreted in byte units; when flag is set, the segment limit is interpreted in 4-KByte units. 
249 | \item [DPL] Is the \textbf{Descriptor Privilege Level}; it is a number from 0 (most privileged, kernel mode) to 3 (least privileged, user mode) that controls access to the segment as will be shown. 250 | \end{description} 251 | 252 | \begin{center} 253 | \includegraphics[width=0.8\textwidth]{logadd.png} 254 | \fig{4}{0 pt}{Logical to Linear Address \cite{intel}} 255 | \end{center} 256 | 257 | So how is addressing resolved now? Let's look at an example in case of the instruction \\ \texttt{jmp 0xDEADBEEF} 258 | 259 | \begin{enumerate} 260 | \item The processor looks at the value in the Code Segment (\texttt{CS}) Register and checks \texttt{TI} to know whether to look in the Global or Local table (Suppose Global in the next steps). 261 | \item Computes the \textit{linear} address \texttt{Index * 8 + GDTR} (8 Bytes is the size of a Segment Descriptor Entry) and fetches the \texttt{Base} of that Segment Descriptor 262 | \item Computes the linear address that is \texttt{Base + 0xDEADBEEF} 263 | \end{enumerate} 264 | 265 | Are all those steps \marginnote{\textsc{Flat Model}} really necessary? No. Segmentation has no real advantage in modern x86 kernels due to the fact that all the memory available is addressable through regular registers but since it is not possible to disable the Segmentation Unit, Intel solved the issue by introducing the \textbf{Flat Model} which means to create just two descriptors for each of the two privilege levels 0 and 3, one for the code segment and one for the data segment both having \texttt{Base} 0 and having as \texttt{Limit} the maximum memory available in 32-bit (4GB). This produces a memory access model as if it was linear. 266 | 267 | Since \marginnote{The hidden part is the one used by the processor for fetching the first instruction after a reset} a full communication to memory would be too costly in terms of performance a segment register cache has been introduced also known as "descriptor cache" or "shadow register". 
The hidden part of the register \textit{is not} invalidated when the entry pointed to by that selector changes, producing some interesting results. Such an "unintended" feature allowed programmers to address more than 1MB in the so-called "Unreal Mode". 268 | 269 | \section{x86 Protection} 270 | 271 | Due to restricted access to memory and I/O ports, user mode can do almost nothing to the outside world without calling on the kernel. It can't open files, send network packets, print to the screen, or allocate memory. User processes run in a severely limited sandbox set up by the gods of ring zero. That's why it's impossible, by design, for a process to leak memory beyond its existence or leave open files after it exits. All of the data structures that control such things - memory, open files, etc - cannot be touched directly by user code; once a process finishes, the sandbox is torn down by the kernel. 272 | 273 | Keep in mind that the CPU privilege level has nothing to do with operating system users. Whether you're root, Administrator, guest, or a regular user, it does not matter. All user code runs in ring 3 and all kernel code runs in ring 0, regardless of the OS user on whose behalf the code operates. Sometimes certain kernel tasks can be pushed to user mode, for example user-mode device drivers. 274 | 275 | The whole protection of the system is based on DPL (Descriptor Privilege Level), RPL (Requested Privilege Level) and CPL (Current Privilege Level). The first two are stored respectively in the Segment Descriptor and in the Data Segment Selector while the third one is present in the Code Segment Selector stored in the Code Segment Register. 276 | 277 | The processor uses privilege levels to prevent a program or task operating at a lesser privilege level from accessing a segment with a greater privilege, except under controlled situations. 278 | 279 | Normally, the CPL is equal to the privilege level of the code segment from which instructions are being fetched.
The processor changes the CPL when program control is transferred to a code segment with a different privilege level. 280 | 281 | The CPU protects memory at two crucial points: when a segment selector is loaded and when a page of memory is accessed with a linear address. Protection thus mirrors memory address translation where both segmentation and paging are involved. 282 | 283 | \begin{center} 284 | \includegraphics[width=0.8\textwidth]{segproc.png} 285 | \fig{5}{0 pt}{Segment Privilege Check \cite{duarte_2017}} 286 | \end{center} 287 | 288 | In truth, segment protection scarcely matters because modern kernels use a flat address space where the user-mode segments can reach the entire linear address space. Useful memory protection is done in the paging unit when a linear address is converted into a physical address. 289 | 290 | \section{Switching between privilege levels} 291 | 292 | Another data structure must be set up before entering Protected Mode: the \textbf{Interrupt Descriptor Table}. This data structure is then used by the OS to handle interrupts and switch between privilege levels. 293 | 294 | \subsection{Interrupt Descriptor Table} 295 | 296 | In real mode interrupts are handled through the \textbf{Interrupt Vector Table (IVT)} which is a table, typically located at \texttt{0000:0000h}, that specifies the addresses of all the 256 \textit{Interrupt Service Routines}. In the case of protected mode the table can be placed anywhere as long as the IDTR is set to the right address pointing to its start. The \textbf{IDT} stores a collection of system type segment descriptors (S field cleared in the segment descriptor entry) called \textbf{Gate Descriptors}. 297 | 298 | System type descriptors fall in two categories: system-segment descriptors and gate descriptors. System-segment descriptors point to system segments (LDT and \textbf{Task-state segment} segments).
Gate descriptors are in themselves "gates", which hold pointers to procedure entry points in code segments (\textbf{call}, \textbf{interrupt}, and \textbf{trap} gates) or which hold segment selectors for TSSs (\textbf{task} gates). \marginnote{Interrupts are asynchronous events not related to the CPU execution flow while Traps (or exceptions) are synchronous. Traps are the historical way to demand access to kernel mode}[-36 pt] Call gates provide a kernel entry point that can be used with ordinary call and jmp instructions, but they aren't used much so we'll ignore them. Task gates aren't so hot either (in Linux, they are only used in double faults, which are caused by either kernel or hardware problems). 299 | 300 | Gate descriptors in the IDT can be \textbf{interrupt}, \textbf{trap}, or \textbf{task gate} descriptors. 301 | 302 | Each interrupt is assigned a number \textbf{between 0 and 255} called a vector, which the processor uses as an index into the IDT when figuring out which gate descriptor to use when handling the interrupt. 303 | 304 | \subsection{Privilege level switch} 305 | 306 | To access an interrupt or exception handler, the processor first receives an interrupt vector from internal hardware, an external interrupt controller, or 307 | from software. The interrupt vector provides an index into the IDT. If the selected gate descriptor is an interrupt gate or a trap gate, the associated handler procedure is accessed in a manner similar to calling a procedure through a call gate. If the descriptor is a task gate, the handler is accessed through a task switch. 308 | 309 | \begin{center} 310 | \includegraphics[width=0.8\textwidth]{intdescprv.png} 311 | \fig{6}{0 pt}{Privilege Management on Interrupt/Exception \cite{duarte_2017}} 312 | \end{center} 313 | 314 | An interrupt can never transfer control from a more-privileged to a less-privileged ring.
Privilege must either stay the same (when the kernel itself is interrupted) or be elevated (when user-mode code is interrupted) to make sure that the interrupts are handled only by Privilege Level 0 code. In either case, the resulting CPL will be equal to the DPL of the destination code segment; if the CPL changes, a stack switch also occurs. If an interrupt is triggered by code via an instruction like \texttt{int n}, one more check takes place: the gate DPL must be at the same or lower privilege as the CPL. This prevents user code from triggering random interrupts placed in more privileged code segments. If these checks fail a general-protection exception happens. All Linux interrupt handlers end up running in ring zero. 315 | 316 | \textcolor{red}{comm} \iffalse TLS thread local storage, used to set the fs and gs segments, used to describe in a multithreaded application where each thread finds its local copy of ?.} \fi 317 | 318 | 319 | \subsection{TSS Task State Segment} 320 | The Task State Segment (TSS) is a special data structure for x86 processors which holds information about a task. The TSS is primarily suited for hardware multitasking, where each individual process has its own TSS. 321 | 322 | The task register holds the 16-bit segment selector, base address, segment limit, and descriptor attributes for the TSS of the current task which can be placed anywhere in memory. 323 | 324 | \marginpar{ 325 | \includegraphics[width=0.4\textwidth]{taskstut.png} 326 | \fig{7}{0 pt}{Task Management \cite{intel}} 327 | } 328 | 329 | The task's current execution space is defined by the segment selectors in the segment 330 | registers (CS, DS, SS, ES, FS, and GS). 331 | 332 | Although a TSS could be created for each task running on the computer, Linux kernel only creates one TSS for each CPU as required by the processor architecture and uses them for all tasks. 
This approach was selected as it provides easier portability to other architectures (for example, the AMD64 architecture does not support hardware task switches), and improved performance and flexibility. Linux only uses the I/O port permission bitmap and inner stack features of the TSS; the other features are only needed for hardware task switches, which the Linux kernel does not use. 333 | 334 | \newpage 335 | 336 | \marginnote{ 337 | \includegraphics[width=0.4\textwidth]{tss.png} 338 | \fig{8}{0 pt}{TSS \cite{intel}} 339 | }[2cm] 340 | 341 | \begin{center} 342 | \includegraphics[width=\textwidth]{sysds.png} 343 | \fig{9}{0 pt}{Data Structures Overview \cite{intel}} 344 | \end{center} 345 | 346 | 347 | \newpage 348 | \bibliography{Lec2} 349 | \bibliographystyle{plainnat} 350 | \end{document} -------------------------------------------------------------------------------- /Lec6/Lec6.tex: -------------------------------------------------------------------------------- 1 | \documentclass[twoside]{article} 2 | \setlength{\oddsidemargin}{-0.5 in} 3 | \setlength{\evensidemargin}{1.5 in} 4 | \setlength{\topmargin}{-0.6 in} 5 | \setlength{\textwidth}{5.5 in} 6 | \setlength{\textheight}{8.5 in} 7 | \setlength{\headsep}{0.5 in} 8 | \setlength{\parindent}{0 in} 9 | \setlength{\parskip}{0.07 in} 10 | \setlength{\marginparwidth}{145pt} 11 | 12 | % 13 | % ADD PACKAGES here: 14 | % 12 15 | 16 | \usepackage{amsmath, 17 | amsfonts, 18 | amssymb, 19 | graphicx, 20 | mathtools, 21 | flexisym, 22 | marginnote, 23 | hyperref, 24 | titlesec} 25 | 26 | \usepackage[english]{babel} 27 | \usepackage[utf8]{inputenc} 28 | \usepackage[shortlabels]{enumitem} 29 | 30 | \graphicspath{ {images/} } 31 | 32 | \hypersetup{ 33 | colorlinks=true, 34 | linkcolor=blue, 35 | filecolor=magenta, 36 | urlcolor=blue, 37 | } 38 | 39 | \titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 40 | \titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 41 | 42 | % 43 | 
% The following commands set up the lecnum (lecture number) 44 | % counter and make various numbering schemes work relative 45 | % to the lecture number. 46 | % 47 | \newcounter{lecnum} 48 | \renewcommand{\thepage}{\thelecnum-\arabic{page}} 49 | \renewcommand{\thesection}{\thelecnum.\arabic{section}} 50 | \renewcommand{\theequation}{\thelecnum.\arabic{equation}} 51 | \renewcommand{\thefigure}{\thelecnum.\arabic{figure}} 52 | \renewcommand{\thetable}{\thelecnum.\arabic{table}} 53 | 54 | \newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization} 55 | \newcommand{\wir}{1038137: Web Information Retrieval} 56 | \newcommand{\va}{1052057: Visual Analytics} 57 | \newcommand{\advprog}{1044416: Advanced Programming} 58 | \newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing} 59 | 60 | \newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}} 61 | 62 | 63 | % 64 | % The following macro is used to generate the header. 65 | % 66 | \newcommand{\lecture}[4]{ 67 | \pagestyle{myheadings} 68 | \thispagestyle{plain} 69 | \newpage 70 | \setcounter{lecnum}{#4} 71 | \setcounter{page}{1} 72 | \noindent 73 | \begin{center} 74 | \framebox{ 75 | \vbox{\vspace{2mm} 76 | \hbox to 7.4in { {\bf #1 77 | \hfill Spring 2018} } 78 | \vspace{4mm} 79 | \hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} } 80 | \vspace{2mm} 81 | \hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj, 82 | Beatrice Bevilacqua} } 83 | \vspace{2mm}} 84 | } 85 | \end{center} 86 | \markboth{Lecture #4: #2}{Lecture #4: #2} 87 | 88 | \iffalse 89 | {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.} 90 | 91 | {\bf Disclaimer}: {\it These notes have not been subjected to the 92 | usual scrutiny reserved for formal publications. They may be distributed 93 | outside this class only with the permission of the Instructor.} 94 | \vspace*{4mm} 95 | \fi 96 | } 97 | % 98 | % Convention for citations is authors' initials followed by the year. 
99 | % For example, to cite a paper by Leighton and Maggs you would type 100 | % \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}. 101 | % (To avoid bibliography problems, for now we redefine the \cite command.) 102 | % Also commands that create a suitable format for the reference list. 103 | \iffalse 104 | \renewcommand{\cite}[1]{[#1]} 105 | \def\beginrefs{\begin{list}% 106 | {[\arabic{equation}]}{\usecounter{equation} 107 | \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}% 108 | \setlength{\labelwidth}{1.6truecm}}} 109 | \def\endrefs{\end{list}} 110 | \def\bibentry#1{\item[\hbox{[#1]}]} 111 | \fi 112 | 113 | %Use this command for a figure; it puts a figure in wherever you want it. 114 | %usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION} 115 | \newcommand{\fig}[3]{ 116 | \vspace{#2} 117 | \begin{center} 118 | Figure \thelecnum.#1:~#3 119 | \end{center} 120 | } 121 | % Use these for theorems, lemmas, proofs, etc. 122 | \newtheorem{theorem}{Theorem}[lecnum] 123 | \newtheorem{lemma}[theorem]{Lemma} 124 | \newtheorem{proposition}[theorem]{Proposition} 125 | \newtheorem{claim}[theorem]{Claim} 126 | \newtheorem{corollary}[theorem]{Corollary} 127 | \newtheorem{definition}[theorem]{Definition} 128 | \newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}} 129 | 130 | % **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE: 131 | 132 | \newcommand\E{\mathbb{E}} 133 | 134 | \begin{document} 135 | 136 | \nocite{*} 137 | 138 | %FILL IN THE RIGHT INFO. 
139 | %\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**} 140 | 141 | \lecture{\aosv}{March 20}{Alessandro Pellegrini}{6} 142 | 143 | %\footnotetext{These notes are partially based on those of Nigel Mansell.} 144 | 145 | % **** YOUR NOTES GO HERE: 146 | 147 | \section{Memory Management} 148 | \label{sec:Memory Management} 149 | 150 | 151 | As seen in the previous lecture, some systems provide Non-Uniform Memory Access (NUMA) 152 | rather than Uniform Memory Access (UMA). Linux handles both cases similarly 153 | since UMA can be seen as a degenerate NUMA system having just one node. 154 | 155 | Each node is described by \texttt{struct pglist_data} (typedef'd to 156 | \texttt{pg_data_t}) even if the architecture is UMA. All the node structs are 157 | linked together forming a linked list called \texttt{pgdat_list}. In the UMA 158 | case only one \texttt{pg_data_t} structure called \texttt{contig_page_data} is 159 | used. From kernel version 2.6.16 the \texttt{pgdat_list} has been replaced 160 | by a global array called \texttt{node_data[]} and the iteration over it 161 | is done through macros defined in \texttt{include/linux/mmzone.h}. 162 | 163 | Each node is divided into a number of blocks called \textit{zones} which 164 | represent ranges in physical memory. A zone is described by \texttt{struct 165 | zone_struct} (typedef'd to \texttt{zone_t}) and is of one of three types: \texttt{ZONE_DMA, 166 | ZONE_NORMAL, ZONE_HIGHMEM}. 167 | 168 | \begin{description} 169 | \item \texttt{ZONE_DMA (0:16MB) } on x86 32 bit is associated with the first physical 16MB and 170 | is used for Direct Memory Access. This is done to remain compatible with 171 | constrained devices that are not capable of addressing more than 16MB. In Linux it 172 | is also used for disk access (?). 173 | \item \texttt{ZONE_NORMAL (16MB:896MB) } is the range of memory that is always mapped to 174 | the kernel's virtual memory.
175 | \item \texttt{ZONE_HIGHMEM (896MB:End) } is present if there is more 176 | physical RAM than can be mapped into the kernel address space. Thus it 177 | is not directly mapped to the kernel's virtual address space, instead it is remapped whenever it is 178 | needed. In x86 64 bit this zone is not present since the kernel virtual 179 | address space is not confined to 1GB therefore all physical memory can 180 | be directly addressed by the kernel virtual space. 181 | \end{description} 182 | 183 | \begin{center} 184 | \includegraphics[width=0.8\textwidth]{structrel.png} 185 | \marginnote{\fig{1}{0 pt}{Relationships between structs. Note that there are multiple 186 | \texttt{pg_data_t} linked together. (\cite{gorman_2004} pp. 16)}}[-6cm] 187 | \end{center} 188 | 189 | To each page frame in the system there is associated a \texttt{struct page} 190 | element that holds all the information needed by the kernel to manage that 191 | frame. All the structs are kept together in a global array called 192 | \texttt{mem_map}. 193 | 194 | We are going to describe all the structs represented in Figure 1 starting from 195 | the top of the image. 196 | 197 | \marginnote{ 198 | \texttt{pgdat_list} is created incrementally at each 199 | \texttt{init_bootmem_core} call by prepending each \texttt{pg_data_t} 200 | }[-2cm] 201 | 202 | 203 | \begin{verbatim} 204 | > include/linux/mmzone.h 205 | typedef struct pglist_data { 206 | zone_t node_zones[MAX_NR_ZONES]; 207 | zonelist_t node_zonelists[GFP_ZONEMASK+1]; 208 | int nr_zones; 209 | struct page *node_mem_map; 210 | unsigned long *valid_addr_bitmap; 211 | struct bootmem_data *bdata; 212 | unsigned long node_start_paddr; 213 | unsigned long node_start_mapnr; 214 | unsigned long node_size; 215 | int node_id; 216 | struct pglist_data *node_next; 217 | } pg_data_t; 218 | \end{verbatim} 219 | 220 | 221 | \texttt{node_zones[]} holds the \texttt{zone_t} structs for each zone. 
222 | 223 | \texttt{node_mem_map} is the pointer to the first \texttt{struct page} within 224 | \texttt{mem_map} that belongs to this node (all the other \texttt{struct page}s of the node follow 225 | contiguously in \texttt{mem_map}). 226 | 227 | \texttt{node_size} is the total number of page frames belonging to the node. 228 | 229 | \texttt{node_next} is the pointer to the next node in the list 230 | \texttt{pgdat_list}. 231 | 232 | In the i386 architecture (in which just UMA is supported) the only node 233 | \texttt{contig_page_data} is initialized in \texttt{free_area_init()} 234 | (\texttt{mm/page_alloc.c}) and the \texttt{zone_t} fields are filled using the 235 | parameters, discovered beforehand through the E820 facility, that are passed to this 236 | function. In the NUMA case (64 bit) node initialization is done in \texttt{setup_arch()} which 237 | indirectly will call \texttt{free_area_init_node()} (\texttt{mm/numa.c}). Both 238 | functions (\texttt{free_area_init*}) will eventually call 239 | \texttt{free_area_init_core()} (\texttt{mm/page_alloc.c}) that performs the 240 | setup of the data structures described below. 241 | \marginnote{\texttt{free_area_init*} is indirectly called by 242 | \texttt{paging_init}}[-36pt] 243 | 244 | During the POST phase the BIOS discovers how much physical memory is 245 | available and sets up a table called E820, which describes how 246 | much physical memory is available and which regions are usable 247 | (for example, when shadow RAM is initialized the BIOS must inform the 248 | kernel that some portion of memory is not available since it stores BIOS 249 | routines).
250 | 251 | \newpage 252 | 253 | \begin{verbatim} 254 | > include/linux/mmzone.h 255 | typedef struct zone_struct { 256 | spinlock_t lock; 257 | unsigned long free_pages; 258 | zone_watermarks_t watermarks[MAX_NR_ZONES]; 259 | unsigned long need_balance; 260 | unsigned long nr_active_pages,nr_inactive_pages; 261 | unsigned long nr_cache_pages; 262 | free_area_t free_area[MAX_ORDER]; 263 | wait_queue_head_t *wait_table; 264 | unsigned long wait_table_size; 265 | unsigned long wait_table_shift; 266 | struct pglist_data *zone_pgdat; 267 | struct page *zone_mem_map; 268 | unsigned long zone_start_paddr; 269 | unsigned long zone_start_mapnr; 270 | char *name; 271 | unsigned long size; 272 | unsigned long realsize; 273 | } zone_t; 274 | \end{verbatim} 275 | 276 | \texttt{lock} is a spinlock that protects the zone, and in particular the data 277 | structures of its buddy system, from concurrent accesses. 278 | 279 | \texttt{free_area[]} is an array of structs used for memory allocation by the 280 | Buddy System. 281 | 282 | \texttt{zone_mem_map} similarly to \texttt{node_mem_map} points to the first 283 | \texttt{struct page} of \texttt{mem_map} that belongs to this zone. 284 | 285 | \begin{verbatim} 286 | > include/linux/mm.h 287 | typedef struct page { 288 | struct list_head list; 289 | struct address_space *mapping; 290 | unsigned long index; 291 | struct page *next_hash; 292 | atomic_t count; 293 | unsigned long flags; 294 | struct list_head lru; 295 | struct page **prev_hash; 296 | 297 | #if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL) 298 | void *virtual; 299 | #endif 300 | } mem_map_t; 301 | \end{verbatim} 302 | 303 | \texttt{list} is a field used to link this \texttt{struct page} to other 304 | \texttt{struct page}s forming a list of pages satisfying a certain property 305 | (e.g. free pages, dirty pages, locked pages etc.). 306 | 307 | \texttt{count} holds the number of users (e.g. processes) of that page frame.
If 308 | it is 0 then it can be put into the list of free pages (usable). 309 | 310 | \texttt{flags} describes the status of the frame. 311 | 312 | \texttt{virtual} is used for pages in the highmem area. 313 | It holds the virtual address of the page if it is mapped to the 314 | kernel address space. 315 | 316 | When pages are initialized in \texttt{free_area_init()} the \texttt{count} field is set 317 | to 0 and the \texttt{PG_reserved} bit within the \texttt{flags} field is set to 1 318 | so that no memory allocator except for bootmem (which doesn't rely on these data 319 | structures) can allocate that frame. This is done because in its steady state the memory 320 | subsystem of the kernel no longer uses the bootmem allocator 321 | but other kinds of allocators, and therefore it must ensure that there are no 322 | conflicts between the two. 323 | 324 | Frame un-reserving is performed in \texttt{mem_init()} 325 | (\texttt{arch/i386/mm/init.c}) and it will allow the steady state allocator to 326 | start working. 327 | 328 | \section{Buddy System} 329 | The kernel subsystem that handles the memory allocation requests for groups of 330 | contiguous page frames is called the \textit{zoned page frame allocator}. \marginnote{(\cite{bovet_cesati_2006} pp. 302), 331 | (\cite{mauerer_2010} pp. 14, Sec. 3.5.1) (\cite{gorman_2004} Chap. 6) } 332 | 333 | The component named ``zoned allocator'' receives the requests for allocation and 334 | deallocation of dynamic memory. In the case of allocation requests, the 335 | component searches for a memory zone that includes a group of contiguous page frames 336 | that can satisfy the request. Inside each zone, page frames are handled by a 337 | component named ``buddy system''. The buddy systems of all zones and nodes are 338 | linked together via the allocation fallback list (\texttt{node_zonelists}).
339 | When a request for memory 340 | cannot be satisfied in the preferred zone or node, first another zone in the 341 | same node, and then another node is picked to fulfill the request. 342 | 343 | Free memory blocks in the system are always grouped as two buddies. The buddies can be 344 | allocated independently of each other; if, however, both remain unused at the 345 | same time, the kernel merges them into a larger pair that serves as a buddy on 346 | the next level. 347 | 348 | The buddy system uses the \texttt{free_area[]} field of \texttt{zone_t} to perform 349 | memory allocation and deallocation. \texttt{free_area[]} is an array of structs 350 | defined as follows. 351 | 352 | \begin{verbatim} 353 | > include/linux/mmzone.h (v. < 2.6.10) 354 | typedef struct free_area_struct { 355 | struct list_head free_list; 356 | unsigned long *map; 357 | } free_area_t; 358 | \end{verbatim} 359 | 360 | \begin{center} 361 | \includegraphics[width=0.8\textwidth]{freearea.png} 362 | \marginnote{\fig{2}{0 pt}{\texttt{free_area[]} array data structure 363 | representation within a zone}}[-3cm] 364 | \end{center} 365 | 366 | Each entry of the array is associated to an \textit{order} and holds in a list 367 | pointed by \texttt{free_list} memory blocks 368 | of size $2^{order}$ composed of contiguous \textbf{free} \texttt{struct page}s. Therefore 369 | order 0 contains blocks of single pages, order 1 contains blocks of $2^1$ pages, 370 | order 2 blocks of $2^2$ pages, up to \texttt{MAX_ORDER - 1} (\texttt{MAX_ORDER} 371 | usually set to 11). The list is formed by linking the blocks of the same size together through the 372 | \texttt{list} field of the first \texttt{struct page} belonging to the block. 373 | 374 | The \texttt{map} field of the struct points to a bitmap used for allocation and 375 | deallocation. Each bit represents a pair of buddies in the current order. It is 376 | set to 0 if both are allocated or both are free and it is set to 1 otherwise. 
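As a minimal sketch of the arithmetic described above, the following C fragment computes the size of a block of a given order and the index of the bit shared by a pair of buddies. It is not kernel code: the helper names (\texttt{block_pages}, \texttt{block_bytes}, \texttt{pair_bit_index}, \texttt{pair_bit_value}) are hypothetical, \texttt{PAGE_SHIFT} is assumed to be 12 (4KB page frames), and the bit index \texttt{page_idx >> (1 + order)} assumes that page indexes are counted from the start of the zone.

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assumption: 4KB page frames */
#define MAX_ORDER  11 /* usual value quoted in the text */

/* Number of contiguous page frames in a block of the given order. */
unsigned long block_pages(unsigned int order)
{
    return 1UL << order;
}

/* Size in bytes of a block of the given order. */
unsigned long block_bytes(unsigned int order)
{
    return block_pages(order) << PAGE_SHIFT;
}

/*
 * Index, inside the bitmap of the given order, of the bit shared by the
 * two buddies that contain page_idx (page index relative to the zone).
 */
unsigned long pair_bit_index(unsigned long page_idx, unsigned int order)
{
    return page_idx >> (1 + order);
}

/*
 * Value the shared bit is expected to have: 0 when both buddies are free
 * or both are allocated, 1 when exactly one of them is allocated.
 */
int pair_bit_value(int first_allocated, int second_allocated)
{
    return (first_allocated != 0) ^ (second_allocated != 0);
}
```

Under these assumptions a block of order \texttt{MAX_ORDER - 1} spans 1024 page frames (4MB), and the order-2 blocks starting at page indexes 8 and 12 are buddies because both map to bit 1 of the order-2 bitmap.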
377 | 378 | \marginnote{\textsc{Allocation}}[6pt] 379 | When a request for allocating an area of a given order \texttt{ord} is issued, 380 | the buddy system tries to find a free block inside of 381 | \texttt{free_area[ord].free_list}. If the request can be fulfilled in the 382 | requested order \texttt{ord} then the block is returned and the bit in the 383 | bitmap is toggled. Otherwise the buddy system recursively searches for free 384 | blocks in higher orders, say it is found in \texttt{high_ord > ord}, 385 | and splits the block into two blocks (called buddies), one put in the list of 386 | \texttt{high_ord - 1} (also toggling the bits in the bitmap) and the other 387 | recursively split into smaller blocks (buddies) until \texttt{ord} is reached. 388 | 389 | 390 | \begin{center} 391 | \includegraphics[width=0.8\textwidth]{evol.png} 392 | \marginnote{\fig{3}{0 pt}{Evolution of \texttt{free_area[]} when first 393 | requesting a block of order 3 and then of order 1.}}[-6cm] 394 | \end{center} 395 | 396 | \marginnote{\textsc{Deallocation}}[6pt] 397 | 398 | When the block is later freed, the buddy will be checked. If both are free, the 399 | kernel will try to \textit{coalesce} the buddies together immediately. In the 400 | worst scenario the operation requires \texttt{MAX_ORDER} steps. To detect if the 401 | buddies can be merged or not, the bit corresponding to the affected pair of 402 | buddies in the bitmap is checked. As one buddy has just been freed by this 403 | function it is known that at least one buddy is free. If the bit in the map is 0 404 | after toggling, we know that the other buddy must also be free and thus they can 405 | be merged. Otherwise if the bit is 1, the other buddy is in use therefore the 406 | buddies cannot be merged and the block goes into its order list. 
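The coalescing step can be illustrated with a toy model. The sketch below is a simplification written for these notes, not the kernel implementation: it keeps only the per-order bitmaps (no free lists, no locking), uses a hypothetical \texttt{TOY_MAX_ORDER} of 4 (a zone of 8 page frames), and all the \texttt{toy_*} names are made up for illustration. It assumes that the buddy of a block is found by flipping bit \texttt{order} of its page index and that the shared bit is toggled on every allocation and release.

```c
#include <assert.h>

#define TOY_MAX_ORDER 4 /* toy value: orders 0..3, a zone of 8 page frames */

/* One bitmap word per order is enough for a zone of 8 page frames. */
struct toy_zone {
    unsigned long map[TOY_MAX_ORDER];
};

/* Page index of the buddy of the block starting at page_idx. */
unsigned long toy_buddy_of(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/* Toggle the bit shared by a pair of buddies and return its new value. */
int toy_toggle_pair_bit(struct toy_zone *z, unsigned long page_idx,
                        unsigned int order)
{
    unsigned long mask = 1UL << (page_idx >> (1 + order));

    z->map[order] ^= mask;
    return (z->map[order] & mask) != 0;
}

/*
 * Release the block of the given order starting at *page_idx.
 * Returns the order of the list in which the (possibly merged) block
 * ends up; *page_idx is updated to the first page of that block.
 */
unsigned int toy_free_block(struct toy_zone *z, unsigned long *page_idx,
                            unsigned int order)
{
    while (order < TOY_MAX_ORDER - 1) {
        if (toy_toggle_pair_bit(z, *page_idx, order))
            break;                    /* bit is 1: the buddy is still in use */
        *page_idx &= ~(1UL << order); /* bit is 0: merge with the buddy */
        order++;                      /* and retry one level up */
    }
    return order;
}
```

For example, after simulating the allocation of page 0 out of a free order-3 block (one toggle at orders 2, 1 and 0) and then of page 1, releasing page 1 leaves it in the order-0 list because its buddy is still in use, while releasing page 0 afterwards merges the blocks back up to order 3.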
407 | 408 | All the memory operations performed by the buddy system require the acquisition 409 | of the \texttt{lock} found in \texttt{zone_t}; therefore it can be a bottleneck 410 | for the responsiveness of the system since there can be contention on the 411 | spinlock. 412 | 413 | Even though the algorithm is very fast, the frequent allocation and release of 414 | page frames and blocks may lead to a situation in which several page frames are 415 | free in the system but they are scattered (not contiguous) throughout the 416 | physical address space. 417 | Single reserved pages that sit in the middle of an otherwise large contiguous free range can 418 | prevent coalescing of this range very effectively. 419 | 420 | The finalization of memory management initialization is done in 421 | \texttt{mem_init()}. 422 | 423 | \begin{verbatim} 424 | > arch/i386/mm/init.c 425 | mem_init() 426 | ... 427 | free_pages_init() 428 | ... 429 | > mm/bootmem.c 430 | free_all_bootmem() 431 | free_all_bootmem_core() { 432 | ... 433 | for (i = 0; i < idx; i++, page++) { 434 | if (!test_bit(i, bdata->node_bootmem_map)) { 435 | count++; 436 | ClearPageReserved(page); 437 | set_page_count(page, 1); 438 | __free_page(page); //decrements page count and if 0 439 | //adds page into the free list 440 | } 441 | } 442 | ... // free also bootmem map 443 | } 444 | \end{verbatim} 445 | 446 | For each unallocated page in bootmem, the page flag \texttt{PG_reserved} is 447 | cleared (it was set during initialization by 448 | \texttt{free_area_init()}, as described above) and the \texttt{count} is set to 1 449 | in order to trick \texttt{__free_page()} into treating this as an ordinary 450 | deallocation for the buddy system, which on decrementing \texttt{count} and 451 | \marginnote{(\cite{gorman_2004} Chap. 6), (\cite{mauerer_2010} Sec 3.5.4)}[2cm] 452 | checking that it is 0 will put the page into the free list.
453 | 454 | \subsection{Zoned Allocator APIs} 455 | 456 | 457 | \begin{verbatim} 458 | > mm/page_alloc.c 459 | // Allocation 460 | alloc_pages(gfp_mask, order) 461 | Used to request 2^order contiguous page frames. It returns struct 462 | page * of the first page of the block 463 | 464 | get_zeroed_page(gfp_mask) 465 | it expands to alloc_pages(gfp_mask | __GFP_ZERO, 0) but instead of 466 | returning a struct page * it returns the linear address of the page 467 | frame allocated. The contents of the page are set to 0 468 | 469 | __get_free_page(gfp_mask) 470 | Allocates one page frame and returns its linear address 471 | 472 | __get_free_pages(gfp_mask, order) 473 | similar to alloc_pages(gfp_mask, order) but returns the linear 474 | address of the first page frame of the block 475 | 476 | // Deallocation 477 | free_page(addr) 478 | frees a page given its linear address 479 | 480 | free_pages(addr, order) 481 | frees a block given its linear address and its order 482 | 483 | __free_pages(page, order) 484 | given a struct page * if its PG_reserved is set to 0 (page not 485 | reserved) decreases the count field and if it becomes 0 it assumes 486 | that 2^order contiguous page frames starting from the one corresponding 487 | to page are no longer used 488 | 489 | __free_page(page) 490 | expands to __free_pages(page, 0) 491 | \end{verbatim} 492 | 493 | \begin{center} 494 | \includegraphics[width=0.8\textwidth]{allrel.png} 495 | \marginnote{\fig{4}{0 pt}{Relationships between allocating functions}}[-3cm] 496 | \end{center} 497 | 498 | 499 | [prolinkern pp.
220 for flags] 500 | 501 | The \texttt{gfp_mask} parameter is a mask consisting of a set of flags that 502 | specify to the zoned allocator the allocation context, such as \textit{zone 503 | modifiers} of the allocation (\texttt{__GFP_DMA, __GFP_HIGHMEM}) or others such 504 | as 505 | 506 | \begin{verbatim} 507 | GFP_ATOMIC 508 | the allocation cannot be interrupted (the process cannot be put to sleep) 509 | since this is an urgent allocation. 510 | 511 | GFP_KERNEL and GFP_USER 512 | respectively used to allocate kernel memory and userspace memory 513 | 514 | GFP_NOFS (previously GFP_BUFFER) 515 | generic page, usually used to allocate a buffer 516 | 517 | \end{verbatim} 518 | 519 | 520 | \section{High Memory} 521 | 522 | High memory allocations can be performed only through the subset of APIs 523 | described above which return \texttt{struct page *}. This is done because page 524 | frames in the \texttt{ZONE_HIGHMEM} area (above 896MB) do not have a permanent 525 | mapping in the linear address space of the kernel. 526 | 527 | The kernel uses three different mechanisms to map page frames in high memory; 528 | they are called \textit{permanent kernel mapping, temporary kernel mapping} and 529 | \textit{noncontiguous memory allocation} (the last one is described later). Memory 530 | allocation for high memory is performed through the \texttt{alloc_pages()} and \texttt{alloc_page()} APIs described above. These APIs return the 531 | linear address of the page descriptor (\texttt{struct page *}) of the first 532 | allocated page frame, which always exists because all page descriptors are 533 | allocated in \textit{low memory} (\texttt{0:896MB}). The kernel then uses part 534 | of the last 128MB of its linear address space to map high-memory page frames. 535 | This kind of mapping is necessarily temporary, otherwise only 128MB of high memory 536 | would be accessible.
537 | 538 | \begin{description} 539 | \item \texttt{vmap(struct page **pages, int count, unsigned long flags, 540 | pgprot_t prot)} \marginnote{(\cite{mauerer_2010} pp. 250)} 541 | 542 | Given an array of \texttt{struct page *} (not necessarily physically contiguous), creates 543 | the entries in the page tables to access the pages through a 544 | virtually contiguous space. When freeing a \texttt{vmap}ped area through 545 | \texttt{vunmap}, eventually \texttt{vmfree_area_pages} 546 | (\texttt{mm/vmalloc.c}) will be called, which will also flush the TLB on 547 | all cores. 548 | 549 | \item \texttt{kmap(struct page *)} \marginnote{(\cite{mauerer_2010} Sec. 550 | 3.5.8)} 551 | 552 | Given a page descriptor, maps the physical addresses of that page to 553 | the virtual address space of the kernel, in the area starting at virtual 554 | address \texttt{PKMAP_BASE}. These mappings are also 555 | known as ``Permanent Kernel Mappings''. There is a dedicated Page Table 556 | (last level) in the kernel page tables, whose address is stored in 557 | \texttt{pkmap_page_table}, to handle this kind of mapping. 558 | 559 | An array of 560 | counters \texttt{pkmap_count[LAST_PKMAP]}, each referring to an entry of 561 | \\ \texttt{pkmap_page_table}, is used for management. If a counter is 0 it 562 | means that the corresponding PTE is not used for mapping any high-memory 563 | page frame and is usable. If it is 1 it means that the PTE does not map any 564 | high-memory page frame but it cannot be used since the corresponding TLB 565 | entry has not been flushed since its last usage. If the counter is $n > 1$ 566 | the corresponding PTE maps a high-memory page frame which is used by 567 | exactly $n-1$ kernel components ($n-1$ components have called 568 | \texttt{kmap} on the same \texttt{struct page *}. Note that if a page 569 | frame was already \texttt{kmap}ped and \texttt{kmap} is called again on 570 | the very same page frame, then the second call will return the same 571 | virtual address as the first).
572 | 573 | When mapping a page frame with no corresponding virtual address the 574 | function tries to find an empty PTE in \texttt{pkmap_page_table} (i.e. a 575 | PTE with counter 0), creates the entry, sets its counter to 2, 576 | and returns the virtual address assigned. When unmapping through 577 | \texttt{kunmap} the counter is decremented (therefore it will be 1 when 578 | no kernel thread references that page anymore). The TLB is flushed on all 579 | cores through \texttt{flush_tlb_all()} only when all the PTEs have 580 | counter 1. No errors caused by TLB caching can arise: if over time 581 | two threads perform \texttt{kmap$_1$ -> kunmap$_1$ -> kmap$_2$} (where 582 | the subscript tells the id of the thread performing the call and arrows 583 | indicate the temporal sequence of events), with any 584 | kind of operations in between, and both \texttt{kmap$_1$} and \texttt{kmap$_2$} 585 | return the same virtual address, there must have been a 586 | \texttt{flush_tlb_all()} in between the first unmapping and the second 587 | mapping according to the algorithm. 588 | 589 | Note that when no PTE is available the kernel thread issuing a 590 | \texttt{kmap} might be put to sleep until one becomes available 591 | (\texttt{kunmap} issued). Also, 592 | since the available PTEs are only those fitting in one page frame, 593 | the function should be used only for short-lived mappings. 594 | 595 | \newpage 596 | 597 | \item \texttt{kmap_atomic(struct page *page, enum km_type type)} 598 | 599 | The \texttt{kmap} function described above must not be used in interrupt 600 | handlers because it may sleep. The kernel therefore provides this 601 | alternative mapping function that executes atomically. When a thread 602 | calls this function it becomes non-preemptible until it issues 603 | \texttt{kunmap_atomic}. To make a thread non-preemptible the function 604 | increases its \texttt{preempt_count} field.
Before returning, the 605 | function ensures that the TLB of the processor on which the thread runs 606 | has nothing cached for the virtual address it is going to 607 | return by calling \texttt{__flush_tlb_single(vaddr)}. As a consequence, the 608 | mapping is not visible to other processors. The function uses a portion 609 | of virtual memory called fixmap. 610 | 611 | \end{description} 612 | 613 | 614 | \section{NUMA Allocation Policies} 615 | \label{sec:NUMA Allocation Policies} 616 | 617 | Starting from kernel v. 2.6.18, system calls were added that allow userspace 618 | to specify a policy to honor during allocations. Note that such policies are 619 | followed also by the functions described above. The implementation of these 620 | system calls can be found in \texttt{mm/mempolicy.c} through the macro 621 | \texttt{SYSCALL_DEFINE*} which will be explained in future lectures. 622 | A policy specifies on which 623 | node the kernel should allocate memory in a NUMA system. 624 | One of the main functions is 625 | \texttt{set_mempolicy(int mode, unsigned long *nodemask, unsigned long maxnode)} 626 | which sets the memory policy of the calling thread. The \texttt{mode} parameter 627 | can be one of 628 | 629 | \begin{description} 630 | \item \texttt{MPOL_DEFAULT} 631 | 632 | Sets the policy for the calling thread to the system's default policy 633 | which tries to allocate memory from the node requesting it and then from 634 | nearby nodes. When specified in the \texttt{mbind} function (see below) 635 | it follows the thread policy. 636 | 637 | \item \texttt{MPOL_BIND} 638 | 639 | This mode specifies that memory must come from the 640 | set of nodes specified by the policy through \texttt{nodemask} and 641 | \texttt{maxnode}. Memory will be allocated from 642 | the node in the set with sufficient free memory that is closest to 643 | the node where the allocation takes place.
644 | 645 | \item \texttt{MPOL_INTERLEAVE} 646 | 647 | This mode specifies that page allocations be 648 | interleaved, on a page granularity, across the nodes specified in 649 | the policy. 650 | 651 | \item \texttt{MPOL_PREFERRED} 652 | 653 | This mode specifies that the allocation should be 654 | attempted from the single node specified in the policy. If that 655 | allocation fails, the kernel will search other nodes, in order of 656 | increasing distance from the preferred node based on information 657 | provided by the platform firmware. 658 | \end{description} 659 | 660 | \texttt{mbind(void* addr, unsigned long len, int mode, unsigned long *nodemask, 661 | \\ unsigned long maxnode, unsigned flags)} sets a NUMA policy for the range of 662 | addresses from \texttt{addr} to \texttt{addr + len}. 663 | 664 | The system call \texttt{move_pages(pid_t pid, unsigned long nr_pages, void **pages, 665 | \\ const int *nodes, int *status, int flags)} is defined in 666 | \texttt{mm/migrate.c} and allows moving pages from one NUMA node to another. Note 667 | that this is an expensive operation that requires the cache controllers of the 668 | various cores to perform the migration of pages working at 64 bytes at a time. For 669 | more information about NUMA policies refer to \\ 670 | \texttt{Documentation/vm/numa_memory_policy.txt} and manpages. 671 | 672 | \section{Conclusions} 673 | \label{sec:Conclusions} 674 | 675 | In this lecture various parts of the memory management of the Linux kernel were 676 | explored. While the code seen in the previous lectures was about specific, 677 | older versions of the kernel, the concepts shown in this lecture apply also to 678 | modern versions of it. One main observation has to be made: as anticipated, the 679 | concept of high memory thankfully is not present in 64 bit systems since the virtual 680 | address space of the kernel is more than enough to address all the possible 681 | physical memory.
682 | 683 | The concept of NUMA system is introduced in the kernel, which brings in new 684 | problems and 685 | interfaces in the system programming world such as the concept of NUMA policies. 686 | In Linux, UMA systems are treated as NUMA systems having just one node. In this 687 | context allocations are driven by NUMA policies combined with the Buddy Systems. 688 | 689 | When an allocation needs to be performed, the current NUMA policy for the thread 690 | issuing the request is queried and combined with any zone modifier defined 691 | by the \texttt{gfp_mask} of the request to choose the Buddy System that has to 692 | fulfill the request. There is one Buddy System per zone within a node and the 693 | requests to one buddy system are serialized, i.e. managed through the lock in 694 | \texttt{zone_t}. 695 | 696 | 697 | 698 | \newpage 699 | \bibliography{Lec6} 700 | \bibliographystyle{plainnat} 701 | \end{document} 702 | -------------------------------------------------------------------------------- /Lec3/Lec3.tex: -------------------------------------------------------------------------------- 1 | \documentclass[twoside]{article} 2 | \setlength{\oddsidemargin}{-0.5 in} 3 | \setlength{\evensidemargin}{1.5 in} 4 | \setlength{\topmargin}{-0.6 in} 5 | \setlength{\textwidth}{5.5 in} 6 | \setlength{\textheight}{8.5 in} 7 | \setlength{\headsep}{0.5 in} 8 | \setlength{\parindent}{0 in} 9 | \setlength{\parskip}{0.07 in} 10 | \setlength{\marginparwidth}{145pt} 11 | 12 | % 13 | % ADD PACKAGES here: 14 | % 12 15 | 16 | \usepackage{amsmath, 17 | amsfonts, 18 | amssymb, 19 | graphicx, 20 | mathtools, 21 | flexisym, 22 | marginnote, 23 | hyperref, 24 | titlesec} 25 | 26 | \usepackage[english]{babel} 27 | \usepackage[utf8]{inputenc} 28 | \usepackage[shortlabels]{enumitem} 29 | 30 | \graphicspath{ {images/} } 31 | 32 | \hypersetup{ 33 | colorlinks=true, 34 | linkcolor=blue, 35 | filecolor=magenta, 36 | urlcolor=blue, 37 | } 38 | 39 | \titlespacing\section{0pt}{12pt plus 4pt minus
2pt}{0pt plus 2pt minus 2pt} 40 | \titlespacing\subsection{0pt}{12pt plus 4pt minus 2pt}{0pt plus 2pt minus 2pt} 41 | 42 | % 43 | % The following commands set up the lecnum (lecture number) 44 | % counter and make various numbering schemes work relative 45 | % to the lecture number. 46 | % 47 | \newcounter{lecnum} 48 | \renewcommand{\thepage}{\thelecnum-\arabic{page}} 49 | \renewcommand{\thesection}{\thelecnum.\arabic{section}} 50 | \renewcommand{\theequation}{\thelecnum.\arabic{equation}} 51 | \renewcommand{\thefigure}{\thelecnum.\arabic{figure}} 52 | \renewcommand{\thetable}{\thelecnum.\arabic{table}} 53 | 54 | \newcommand{\aosv}{1044414: Advanced Operating Systems and Virtualization} 55 | \newcommand{\wir}{1038137: Web Information Retrieval} 56 | \newcommand{\va}{1052057: Visual Analytics} 57 | \newcommand{\advprog}{1044416: Advanced Programming} 58 | \newcommand{\dchpc}{1044399: Data Centers and High Perf. Computing} 59 | 60 | \newcommand{\qu}[1]{\marginnote{\textcolor{cyan}{#1}}} 61 | 62 | 63 | % 64 | % The following macro is used to generate the header. 65 | % 66 | \newcommand{\lecture}[4]{ 67 | \pagestyle{myheadings} 68 | \thispagestyle{plain} 69 | \newpage 70 | \setcounter{lecnum}{#4} 71 | \setcounter{page}{1} 72 | \noindent 73 | \begin{center} 74 | \framebox{ 75 | \vbox{\vspace{2mm} 76 | \hbox to 7.4in { {\bf #1 77 | \hfill Spring 2018} } 78 | \vspace{4mm} 79 | \hbox to 7.4in { {\Large \hfill Lecture #4: #2 \hfill} } 80 | \vspace{2mm} 81 | \hbox to 7.4in { {\it Lecturer: #3 \hfill Scribe: Anxhelo Xhebraj} } 82 | \vspace{2mm}} 83 | } 84 | \end{center} 85 | \markboth{Lecture #4: #2}{Lecture #4: #2} 86 | 87 | \iffalse 88 | {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.} 89 | 90 | {\bf Disclaimer}: {\it These notes have not been subjected to the 91 | usual scrutiny reserved for formal publications. 
They may be distributed 92 | outside this class only with the permission of the Instructor.} 93 | \vspace*{4mm} 94 | \fi 95 | } 96 | % 97 | % Convention for citations is authors' initials followed by the year. 98 | % For example, to cite a paper by Leighton and Maggs you would type 99 | % \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}. 100 | % (To avoid bibliography problems, for now we redefine the \cite command.) 101 | % Also commands that create a suitable format for the reference list. 102 | \iffalse 103 | \renewcommand{\cite}[1]{[#1]} 104 | \def\beginrefs{\begin{list}% 105 | {[\arabic{equation}]}{\usecounter{equation} 106 | \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}% 107 | \setlength{\labelwidth}{1.6truecm}}} 108 | \def\endrefs{\end{list}} 109 | \def\bibentry#1{\item[\hbox{[#1]}]} 110 | \fi 111 | 112 | %Use this command for a figure; it puts a figure in wherever you want it. 113 | %usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION} 114 | \newcommand{\fig}[3]{ 115 | \vspace{#2} 116 | \begin{center} 117 | Figure \thelecnum.#1:~#3 118 | \end{center} 119 | } 120 | % Use these for theorems, lemmas, proofs, etc. 121 | \newtheorem{theorem}{Theorem}[lecnum] 122 | \newtheorem{lemma}[theorem]{Lemma} 123 | \newtheorem{proposition}[theorem]{Proposition} 124 | \newtheorem{claim}[theorem]{Claim} 125 | \newtheorem{corollary}[theorem]{Corollary} 126 | \newtheorem{definition}[theorem]{Definition} 127 | \newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}} 128 | 129 | % **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE: 130 | 131 | \newcommand\E{\mathbb{E}} 132 | 133 | \begin{document} 134 | 135 | \nocite{*} 136 | 137 | %FILL IN THE RIGHT INFO. 
138 | %\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**} 139 | 140 | \lecture{\aosv}{March 9}{Alessandro Pellegrini}{3} 141 | 142 | %\footnotetext{These notes are partially based on those of Nigel Mansell.} 143 | 144 | % **** YOUR NOTES GO HERE: 145 | 146 | \iffalse 147 | 148 | \subsection{Task State Segment (continued...)} 149 | 150 | The interesting part of TSS is the fact that there is one stack segment and pointer for each privilege level. TSS is not accessible in user mode (DPL 0). 151 | 152 | The code executes at either privilege level 0 or 3 (kernel mode or user mode). On memory access we trigger a trap that indicates an Interrupt-gate descriptor in the IDT that selects a code segment with DPL 0 and the Gate has DPL 3. The example shown is about software interrupt. The DPL gate should be 3 in order to access code in privilege level 0. The destination code segment is then loaded in the code segment register. 153 | 154 | \fi 155 | 156 | In the previous lectures we described some processor architecture details that must be set up in order for a kernel to boot and work properly, such as protected mode, IDT, GDT etc. In the next lecture we will see the code that initializes and configures such state in the Linux kernel, but first we must introduce hardware paging and x64 long mode. \textit{Hardware Paging} is the mechanism through which Linear Addresses are mapped to Physical Addresses by the CPU and should not be confused with paging and memory management at higher abstraction levels (e.g. kernel). In this section Linear Address and Logical Address will be used interchangeably since most OSes use the Flat Model described in the previous lecture, making the two coincide.
157 | 158 | \iffalse 159 | The difference between Logical, Virtual, Linear, Physical memory: 160 | 161 | Logical Address Space: a process memory address before passing through the segmentation unit 162 | Linear Address Space: a process memory address after passing through the segmentation unit but before passing through the paging unit. Coincides with logical address space if flat model 163 | Virtual Memory: a Linear address that is currently not mapped to main memory but maybe stored in disk pp. 151 Vol 3A Intel manual. 164 | Physical Memory: the "hardware" address that passes in the address bus after segmentation and paging. 165 | \fi 166 | 167 | \section{Protected Mode Paging} 168 | 169 | Segmented memory in processors' architecture was later coupled with a finer grained memory access by introducing the Paging Unit. This new layer in address translation brings in new memory protection mechanisms which check the requested access type against the access rights of the Linear Address and, if the memory access is not valid, generates a Page Fault exception. 170 | 171 | When a Logical Address, the one displayed in debuggers and the one used as displacement in programs, is fed to the processor using some instruction (\texttt{jmp}, \texttt{mov} etc.) it gets mapped to a physical address in RAM through the following steps: it is first translated into a Linear Address by the Segmentation Unit (but as we saw in the previous lectures, by the means of the Flat Model, Logical and Linear Addresses coincide) and then mapped to a Physical Address by the Paging Unit. 172 | 173 | \begin{center} 174 | \includegraphics[width=0.6\textwidth]{segpag.png} 175 | \fig{1}{0 pt}{Logical to Physical Address (\cite{intel} pp. 90) } 176 | \end{center} 177 | 178 | The paging unit thinks of all RAM as partitioned into fixed-length \textit{page frames}. Each page frame contains a page which is the data contained in a fixed-length interval of Linear Addresses. 
Page frames and pages coincide in size. 179 | 180 | The CPU needs some data structures stored in main memory called \textit{page tables} to be set up in order to enable paging. Page tables have a hierarchical structure constituted of multiple levels depending on the size of page frames and addressable memory. \marginnote{Page tables are set up by the kernel for each process in the system and managed in order to not overlap on Physical Addresses. This is why multiple instances of the same program refer to the same Logical Addresses but do not interfere with each other}[-12pt] 181 | 182 | The Physical Address of the first level of the page tables in the hierarchy is stored in the \texttt{CR3} register introduced specifically for paging. \texttt{CR3} must keep the physical address instead of the Logical one because otherwise we would end up in a recursive problem: in order to translate a Logical Address we would have to translate a Logical Address. Whenever the OS needs to change the contents of \texttt{CR3} it has to do so in a careful manner by translating Logical Addresses to Physical Addresses \textit{software side}, through their known relative position, since once paging is enabled every address passes through the translation in the MMU. 183 | 184 | \subsection{i386 Paging} 185 | 186 | The first paging scheme was introduced in the Intel 80386 32-bit processor that handled 4KB pages. The first level page table is called \textit{Page Directory} and the address of the one in use is stored in the \texttt{CR3} register. This table holds Page Directory Entries (PDE). The second level page table is called \textit{Page Table} and holds the Page Table Entries (PTE). 187 | 188 | The aim of this two-level scheme is to reduce the amount of RAM required for per-process 189 | Page Tables.
\marginnote{With one level we would have to initialise a table of $2^{20}$ entries which, at 4 Bytes per entry, would take 4MB of RAM just for the table} 190 | 191 | The 32 bits of a Linear Address are divided into three fields: 192 | \begin{description} 193 | \itemsep-2pt 194 | \item [Directory] \texttt{bit[31:22]} determines the entry in the Page Directory that points to the proper Page Table 195 | \item [Table] \texttt{bit[21:12]} determines the entry in the Page Table that contains the physical address of the page frame containing the page 196 | \item [Offset] \texttt{bit[11:0]} determines the relative position within the page frame 197 | \end{description} 198 | 199 | \marginnote{ 200 | \includegraphics[width=0.5\textwidth]{lini386.png} 201 | \fig{2}{0 pt}{i386 Linux Process Logical Address Map} 202 | } 203 | 204 | \begin{center} 205 | \includegraphics[width=0.6\textwidth]{basicpag.png} 206 | \fig{3}{0 pt}{i386 Paging Scheme (\cite{intel} pp. 113) } 207 | \end{center} 208 | 209 | Since the Offset field is 12 bits long we are able to address each byte of the 4KB page ($2^{12}$B = 4KB). Also, since each PTE and PDE is 4 bytes in size, 4KB/4B = 1024 PTEs or PDEs fit in one page; therefore the 10 bits of the Table or Directory fields are sufficient to address each PTE or PDE within a Page Directory or Page Table page. From the observations above it follows that this scheme with one Page Directory (4KB in size, only first level) can address up to 1K $\times$ 1K $\times$ 4KB = 4GB. 210 | 211 | \marginnote{ 212 | \includegraphics[width=0.4\textwidth]{4MB32.png} 213 | \fig{5}{0 pt}{4MB pages Physical Address Resolution (\cite{intel} pp. 113) } 214 | }[-2.5cm] 215 | 216 | \subsection{PDE and PTE fields} 217 | 218 | \iffalse 219 | In this subsection both PDE and PTE are described together. Be careful while reading 220 | \fi 221 | 222 | \begin{center} 223 | \includegraphics[width=0.8\textwidth]{i386pag.png} 224 | \fig{4}{0 pt}{PDE and PTE interpretations (\cite{intel} pp.
114) } 225 | \end{center} 226 | 227 | \begin{description} 228 | \itemsep-3pt 229 | \item [P] Present \marginnote{(\cite{bovet_cesati_2006} pp. 48, \cite{intel} pp. 114-115)}[-24pt] flag: in Page Table must be 1 to map a 4KB page, in Page Directory must be 1 to reference a page table 230 | \item [R/W] \marginnote{R/W is used for Copy On Write (COW), a mechanism that avoids copying process memory on a \texttt{fork()} until a write is actually performed} Read/Write flag: if 0 writes may not be allowed in the 4KB referenced by a PTE or in all the pages pointed to by the Page Table pointed to by the PDE (1024 $\times$ 4KB = 4MB region controlled by the PDE) 231 | \item [U/S] User/Supervisor flag: if 0 user-mode accesses are not allowed to the 4MB region controlled by the PDE or the 4KB referenced by the PTE 232 | \item [PCD and PWT] Page-level Cache Disable and Page-level Write Through flags: determine the memory type used to access the Page Table referenced by the PDE or the 4KB page referenced by the PTE 233 | \item[A] \marginnote{ P was used especially on systems having less than 4GB of RAM: processes could Logically Address up to 4GB of memory and this was managed through swapping } Accessed flag: indicates whether the PDE was used for Linear Address Translation or software has accessed the 4KB page referenced by the PTE 234 | \item[D] Dirty flag: indicates whether the 4KB referenced by the PTE was written by software 235 | \item[PS] Page Size: in PDE indicates whether the address points to a Page Table (PS=0) or to a page of 4MB (PS=1). In the latter case the entry directly maps a 4MB page 236 | \item[G] Global flag: prevents frequently used pages from being flushed from the TLB cache 237 | \item[Address]: contains the 20 most significant bits of the Physical Address of the page (PTE or PDE with PS=1) or Page Table (in case of PDE with PS=0).
Since pages are of 4KB starting from address 0 we need only 20 bits to address all of the 4GB (4KB $\times 2^{20}$ = 4GB) 238 | \end{description} 239 | 240 | For further information about the flags and a more detailed explanation we refer the reader to the Intel Manual \cite{intel}. 241 | 242 | \subsection{Physical Address Extension (PAE)} 243 | 244 | \begin{center} 245 | \includegraphics[width=0.8\textwidth]{pae.png} 246 | \fig{6}{0 pt}{PAE Paging (\cite{intel} pp. 114) } 247 | \end{center} 248 | 249 | Since the amount of RAM \marginnote{In order to enable PAE bit 5 of \texttt{CR4} must be asserted} supported by a processor is limited by the number of address pins connected to the address bus this means that 32-bit processors can physically address only up to 4GB of RAM in theory. The increasing need of memory in big servers created a pressure on Intel to expand the amount of RAM supported by this architecture. Intel satisfied these requests by increasing the number of address pins on its processors from 32 to 36 bits. This allowed up to $2^{36}$ = 64GB of RAM and reduced the amount of swap. 250 | 251 | To \marginnote{(\cite{bovet_cesati_2006} pp. 52)} support PAE the paging mechanism was changed with the one shown in Figure 3.6 moving to a three level paging: 252 | 253 | \begin{itemize} 254 | \itemsep-3pt 255 | \item PTE Address field was extended from 20 bits to 24 meaning that the 64GB of memory are split into $2^{24}$ distinct page frames ($2^{24} \times$ 4KB = 64GB). Since the Address field has increased in size and the 12 flag bits described above are still included the PTE size is doubled to 8 bytes (24 + 12 = 36 bits are needed) therefore 4KB/8B = 512 entries fit in one PT/PD page (9 bits used to index the PTE/PDE in the Linear Address, $2^9 = 512$). 
256 | \item A new first level of page table called Page Directory Pointer Table (PDPT) consisting of four 8 byte entries has been introduced (2 most significant bits of the Linear Address for indexing the PDPTE, $2^2$ = 4 entries) 257 | \item PDPTs are required to be stored in the first 4GB of RAM and aligned to a multiple of $2^5 = 32$ bytes (4 entries $\times$ 8 bytes). Through this scheme only 27 bits of \texttt{CR3} are used to store the physical address of the PDPT ($2^{27} \times 2^5 = 2^{32} = $ 4GB) 258 | \item Once \texttt{CR3} is set still only 4GB of RAM can be addressed ($2^2 \times 2^9 \times 2^9 \times 2^{12} = 2^{32}$) and changing \texttt{CR3} for some process can be difficult since the same Linear Addresses must be used in different pieces of code. 259 | \end{itemize} 260 | 261 | Similarly to the previous addressing scheme PAE allows big pages to be allocated at PD level. 262 | 263 | \subsection{x64, long addressing scheme (IA-32e)} 264 | 265 | Moving from 32-bit to 64-bit processors the Logical Address Space increases exponentially and new mechanisms must be introduced to improve paging. Considering that $2^{64}$ Bytes of RAM are unimaginable, Intel decided to tie bits \texttt{[63:48]} to the value of bit 47 (sign extension). This splits the Logical Address space into three parts: 266 | \begin{itemize} 267 | \itemsep-3pt 268 | \item Canonical ``lower half'': addressable addresses from \texttt{0x0} to \texttt{0x00007FFF FFFFFFFF} 269 | \item Noncanonical addresses: unaddressable memory from \texttt{0x00008000 00000000} to \texttt{0xFFFF7FFF FFFFFFFF} 270 | \item Canonical ``higher half'': addressable addresses from \texttt{0xFFFF8000 00000000} to \texttt{0xFFFFFFFF FFFFFFFF} 271 | \end{itemize} 272 | 273 | Theoretically a total of $2^{48}$B = 256TB of memory is addressable. 274 | 275 | In terms of page table organization we move from a three level page table in PAE to a 4 level page table in long mode introducing the Page Map Level 4 (PML4, also called Page Global Directory (PGD)) table.
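
As a worked check of the figures above (a sketch that assumes 4KB pages and 9 index bits per level, which follows from 512 eight-byte entries fitting in one 4KB table), the 48 translated bits of a canonical address decompose as
\[
\underbrace{9}_{\mbox{\scriptsize PML4}} + \underbrace{9}_{\mbox{\scriptsize PDPT}} + \underbrace{9}_{\mbox{\scriptsize PD}} + \underbrace{9}_{\mbox{\scriptsize PT}} + \underbrace{12}_{\mbox{\scriptsize Offset}} = 48,
\qquad 2^{48}\mbox{B} = 256\mbox{TB}.
\]
For instance, the last page of the canonical lower half, \texttt{0x00007FFF FFFFF000}, has bit 47 clear and bits \texttt{[46:12]} set, so its table indices are
\[
\mbox{PML4} = 255, \qquad \mbox{PDPT} = \mbox{PD} = \mbox{PT} = 511, \qquad \mbox{Offset} = 0.
\]
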
PTE size is still 8 Bytes but 36 bits are used for the Address field. The scheme is similar to PAE but in this case also 1GB pages (Huge Pages) can be directly mapped through a PDPTE. 276 | 277 | \marginnote{In Linux the canonical higher half is reserved for the kernel while the canonical lower half for user-space as shown in the \href{https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt}{documentation}.} 278 | 279 | \begin{center} 280 | \includegraphics[width=0.8\textwidth]{x64pag.png} 281 | \fig{7}{0 pt}{IA-32e Paging (\cite{intel} pp. 124) } 282 | \end{center} 283 | 284 | \section{Translation Lookaside Buffer} 285 | 286 | Since the translation of addresses requires multiple accesses to memory for reading the page tables, 80x86 processors include, besides the general-purpose hardware caches, a dedicated cache for speeding up address translation called Translation Lookaside Buffer (TLB). 287 | 288 | As we saw previously a Linear Address is split into multiple parts. The upper bits of a Linear Address (called the \textbf{page number}) determine the upper bits of the Physical Address (called \textbf{page frame}) through a ``walk'' in the page tables; the lower bits (called \textbf{page offset}) correspond to the lower bits of the Physical Address. 289 | 290 | The \marginnote{(\cite{intel} pp. 140)} TLB caches the mapping of Linear Addresses to Physical Addresses by keeping several entries corresponding to individual translations. The TLB's entries are referenced through the page number of the address that needs to be translated and contain the Physical Address of the page frame corresponding to that Linear Address. 291 | 292 | In \marginnote{There is one TLB per core} this scenario the MMU need not consult the page tables for translating a Linear Address for which a TLB entry is present, speeding up the translation process.
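
To make the page number and page offset split concrete, consider a 32-bit Linear Address with 4KB pages (the address and the frame below are hypothetical values chosen only for illustration):
\[
\texttt{0x12345678} = \underbrace{\texttt{0x12345}}_{\mbox{\scriptsize page number}} \times 2^{12} + \underbrace{\texttt{0x678}}_{\mbox{\scriptsize page offset}}.
\]
If the TLB holds an entry mapping page number \texttt{0x12345} to the page frame starting at \texttt{0x0ABCD000}, the Physical Address is obtained without touching the page tables:
\[
\texttt{0x0ABCD000} + \texttt{0x678} = \texttt{0x0ABCD678}.
\]
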
In case of a TLB miss (no entry being present for a given Linear Address) the MMU consults the paging structures to obtain the Physical Address which is then cached in the TLB for future translations. 293 | 294 | In \marginnote{(\cite{intel} Sec. 4.12)} the case of the requested Linear Address not having a Physical Page mapped to it (Present bit clear) a page fault exception is generated. The handler for page-fault exceptions typically directs the operating system or executive to load data for the unmapped page from external storage into physical memory (perhaps writing a different page from physical memory out to external storage in the process) and to map it using paging (by updating the paging structures). When the page has been loaded into physical memory, a return from the exception handler causes the instruction that generated the exception to be restarted. \marginnote{\textcolor{red}{(\cite{bovet_cesati_2006} pp. 142) For hardware handling to be added on Lec2 when checking privileges}} 295 | 296 | \iffalse 297 | What we would like to end up in is having the kernel on the upper side of memory (1GB) and the remainder memory used for User Space. We're talking about virtual memory. The kernel knows the first address of its virtual memory it is using and where it is loaded in physical memory and with this information performs the translation. It maps itself contiguously in memory. 298 | 299 | \subsection{Linux memory layout} 300 | 301 | A bunch of addresses that cannot be used and the kernel puts itself below the non canonical part. Software is put in the higher part of the memory. The lower portion is used for stack, heap, shared objects etc. The kernel is not put in the lower part since in this way you can always increase the size of the kernel without remapping anything about the internal organization of the kernel. 302 | 303 | \section{Page table in x64} 304 | 305 | In the linear address there are 48 bits.
If 64 bits were used instead of that the page table would have to grow too much. The 4th level introduced is PML4 Page Map Level 4. PML4 is also called Page Global Directory (PGD). Each table fits within a page. 306 | 307 | CR3 didn't change much except for the size. The presence bit set to 0 ignores everything of a portion in the entry. Linux uses this part to store where the page is stored in secondary storage. Huge pages are also introduced to have 1GB pages. Motivation: used for hypervisors and virtualization to map large files. qemu uses huge pages. 308 | 309 | These new facilities need to be enabled as well. 310 | \fi 311 | 312 | \section{Enabling x64 longmode} 313 | 314 | The code for enabling x64 longmode is the following: 315 | 316 | \begin{verbatim} 317 | movl $MSR_EFER, %ecx # select EFER (Model Specific Register 0xC0000080) 318 | rdmsr # content is written in edx:eax 319 | btsl $_EFER_LME, %eax # set bit 8 to enable long mode 320 | wrmsr # write MSR 321 | 322 | pushl $__KERNEL_CS # push Code Segment Selector for lret 323 | leal startup_64(%ebp), %eax # dyn calculate jump address 324 | pushl %eax # push return address for lret 325 | 326 | movl $(X86_CR0_PG | X86_CR0_PE), %eax 327 | movl %eax, %cr0 # enable paging, set PE and PG in CR0 328 | lret 329 | \end{verbatim} 330 | 331 | Before doing so, various data structures and configurations must be set up as described in the Intel manual (\cite{intel} pp. 323). 332 | 333 | \marginnote{Model Specific Registers (MSRs) are processor-specific control registers used, among other things, for debugging and for enabling features. The EFER MSR, introduced by AMD, is the one used to enable longmode}[-2cm] 334 | 335 | \iffalse 336 | \section{Huge Pages} 337 | 338 | Linux doesn't have software facilities for mapping real huge pages. Memory advise. 339 | https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt 340 | \fi 341 | 342 | 343 | \section{Linux Boot in i386 ($<$ v.
2.6)} 344 | 345 | By Stage 1 Bootloader we mean the code held in the MBR (the boot sector present in the first sector of the disk). In the early i386 version of Linux a \textit{boot sector} that ran in 16-bit real mode was implemented to perform BIOS boot which is available at \texttt{arch/i386/boot/bootsect.S} acting as a Stage 1 Bootloader. This piece of code moved itself from \texttt{0x7C00} (which is the address where the BIOS had loaded it) to \texttt{0x90000}, read the disk and loaded into memory \texttt{arch/i386/boot/setup.S}. 346 | 347 | The \marginnote{ (\cite{bovet_cesati_2006} pp. 838, referring to \texttt{setup.S})} latter code was responsible for getting the system data from the BIOS, and putting them into the appropriate places in system memory (\texttt{start\_of\_setup}) plus other system configuration. It read the physical memory map through the E820 BIOS facility, enabled the A20 line, set up a temporary IDT and GDT, performed the switch to 32-bit \textbf{protected mode} and finally loaded the compressed image of the kernel and gave control to the function \texttt{startup\_32} in \texttt{arch/i386/\textcolor{red}{boot/compressed}/head.S}. 348 | 349 | The \texttt{startup\_32} function decompressed the kernel image and jumped to another function having the same name (\texttt{startup\_32}) in \texttt{arch/i386/\textcolor{red}{kernel}/head.S} which enabled paging and set up the GDT and IDT again (temporarily ignoring all interrupts). 350 | 351 | Later new bootloaders came along such as LILO and GRUB acting as Stage 1 and Stage 2 Bootloaders able to navigate the file system, start up a boot selection menu by reading a configuration file and load the selected OS image. 352 | 353 | \section{UEFI} 354 | 355 | The primary goal of the UEFI specification is to define an alternative boot environment that can relieve firmware developers from crafting complex solutions for being compatible with new hardware specifications and platform capabilities.
356 | 357 | UEFI firmware is capable of executing UEFI executables which are regular Portable Executable (PE) images runnable under various platforms containing multiple services such as bus, block, file services, graphical consoles and booting over network. Various libraries exist to write UEFI applications which run during the pre-boot phase. A virtual machine specification based on a byte code format called EFI Byte Code (EBC) which can be used to write platform-independent device drivers is included in UEFI. 358 | 359 | A legacy BIOS loads a 512 byte flat binary blob from the MBR of the boot device into memory at physical address \texttt{0x7C00} and jumps to it. The bootloader cannot return back to BIOS. UEFI firmware loads an arbitrary sized UEFI application from a FAT partition called EFI System Partition (ESP) on a GPT-partitioned boot device to some address selected at run-time. Then it calls that application's main entry point. The application can return control to the firmware, which will continue searching for another boot device or bring up a diagnostic menu. The boot configuration is defined by variables stored in NVRAM, including variables that indicate the file system paths to OS loaders and OS kernels. UEFI boot targets can be found under \texttt{/efi/boot/}. 360 | 361 | \subsection{GUID Partition Table} 362 | 363 | One of the main drawbacks of BIOS was the 32-bit addressing scheme of partitions that allowed only to point to at most 512B $\times 2^{32} =$ 2TB (512B is the size of a sector) within a disk. Clearly this became a problem with Hard Disk sizes growing over-time constraining the partition scheme of large disks. 364 | 365 | In UEFI the problem is addressed by using GUID Partition Table scheme instead of MBR. For limited backward compatibility, the space of the legacy MBR is still reserved in the GPT specification, but it is now used in a way that prevents MBR-based disk utilities from misrecognizing and possibly overwriting GPT disks. 
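
The 2TB limit of the 32-bit scheme, and the headroom gained by the 64-bit block addresses stored in GPT partition entries, can be checked directly (a sketch assuming 512B sectors in both cases, as in the text):
\[
2^{32} \times 2^{9}\mbox{B} = 2^{41}\mbox{B} = 2\mbox{TB},
\qquad
2^{64} \times 2^{9}\mbox{B} = 2^{73}\mbox{B} = 8\mbox{ZB}.
\]
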
366 | 367 | The partition table header defines the usable blocks on the disk. It also defines the number and size of the partition entries that make up the partition table. 368 | 369 | \marginnote{ 370 | \includegraphics[width=0.4\textwidth]{gpt.png} 371 | \fig{8}{0 pt}{GUID Partition Table Scheme \cite{gpt}} 372 | }[-7cm] 373 | 374 | After the header, the Partition Entry Array describes partitions, using a minimum size of 128 bytes for each entry block. The starting location of the array on disk, and the size of each entry, are given in the GPT header. The first 16 bytes of each entry designate the partition \textit{type}'s globally unique identifier (GUID). For example, the GUID for an EFI System Partition is \texttt{C12A7328-F81F-11D2-BA4B-00A0C93EC93B}. The second 16 bytes are a GUID unique to the partition. Then follow the starting and ending 64-bit addresses, partition attributes, and the Unicode partition name. 375 | 376 | Finally a copy of the Primary GPT, called the Secondary GPT, is kept at the end of the disk for backup purposes. 377 | 378 | \subsection{Secure Boot} 379 | 380 | Another \marginnote{\cite{kumar_kumar_2007}} issue with the BIOS interface was the spread of malware that infected the MBR (Bootkits). One example of such malware is Vboot Kit. A hard disk is infected by overwriting the content of its MBR with the code of the rootkit and relocating the old bootloader in another portion of the hard disk. 381 | 382 | On system startup the BIOS loads the MBR (and therefore the rootkit's code) into memory and jumps to it. Vboot Kit hijacks \texttt{int 0x13} in the IVT by replacing the BIOS services for reading and writing the hard disk with its own hooks and then loads the bootloader. 383 | 384 | When the bootloader triggers \texttt{int 0x13} to ask the BIOS to read some data from the hard disk, the Vboot Kit hook executes and patches the code segments that were asked to be loaded, thereby gaining control of the system.
385 | 386 | Secure \marginnote{(\cite{secb} pp. 7)} Boot mitigates such attacks by ensuring that UEFI firmware will load only signed executables. A series of asymmetric keys and databases are used to manage and protect the signatures needed to verify code before it is executed. 387 | 388 | First, there is the \textbf{Platform Key (PK)}. This key is typically set by the platform manufacturer when a system is built in the factory. While it may be replaceable by an end user or enterprise IT services, its purpose is to protect the next key from uncontrolled modification \cite{PK}. 389 | 390 | The second key is the \textbf{Key Exchange Key (KEK)}, which protects the \textbf{Signature Database} from unauthorized modifications. No changes can be made to the signature database without the private portion of this key. There can be multiple KEKs provided by the operating system and other trusted third party application vendors. A holder of a valid KEK can insert or delete signatures in a signature database. The database maintains two lists of signatures: signatures of code that is authorized to run on the platform and signatures of code that is forbidden. 391 | 392 | When loading the operating system bootloader the firmware confirms that its signature matches one in its database of authorized signatures, and also that the signature is not in the forbidden database. 393 | 394 | \section{Multicore Booting} 395 | 396 | As \marginnote{(\cite{intel} Sec. 8.4)} anticipated in Lecture 1, following a power-up or RESET of a Multi Processor (MP) system a hardware MP initialization protocol is run to select the Bootstrap Processor (BSP) and the Application Processors (APs). The BSP is the one executing BIOS and all the other code of system initialization while APs wait for a specific signal from the BSP. 397 | 398 | To perform \marginnote{(\cite{intel} Chap.
10)} sophisticated interrupt sending and redirection Intel introduced the Advanced Programmable Interrupt Controller (APIC). The APIC interface is composed of two parts: a local APIC for each logical processor and an external I/O APIC part of the system's chipset. 399 | 400 | The local APIC has multiple registers for various kinds of operations all memory mapped to a 4KB page frame with initial starting address \texttt{0xFEE00000}. Among them is the Interrupt Command Register (ICR) which allows software running on some processor to send Interprocessor Interrupts (IPIs) to other processors. Two pins are connected to the processor from the local APIC: \texttt{LINT0} and \texttt{LINT1}, the former being normal interrupts and the latter non-maskable interrupts. 401 | 402 | In order for the APs to start executing code the BSP must send the INIT-SIPI signal to them through the ICR and this is done by writing the lower 32 bits of the 64-bit ICR, which are mapped at address \texttt{0xFEE00300}. 403 | 404 | \marginnote{(\cite{intel} pp. 278)} 405 | 406 | \begin{verbatim} 407 | mov $sel_fs, %ax 408 | mov %ax, %fs 409 | 410 | # send INIT to all-except-self 411 | mov $0x000C4500, %eax 412 | mov %eax, %fs:(0xFEE00300) # 11 00 0 1 0 0 0 101 00000000 413 | 414 | .B0: btl $12, %fs:(0xFEE00300) # poll Delivery Status (bit 12) 415 | jc .B0 # until the IPI has been sent 416 | 417 | # send SIPI to all-except-self 418 | mov $0x000C4611, %eax 419 | mov %eax, %fs:(0xFEE00300) # 11 00 0 1 0 0 0 110 00010001 420 | 421 | .B1: btl $12, %fs:(0xFEE00300) # poll Delivery Status (bit 12) 422 | jc .B1 423 | \end{verbatim} 424 | 425 | \begin{center} 426 | \includegraphics[width=0.6\textwidth]{icr.png} 427 | \fig{9}{0 pt}{ICR bit interpretation (\cite{intel} pp. 381) } 428 | \end{center} 429 | 430 | The Vector field determines the real-mode base address of the 4-KByte page for the APs' boot code which must reside in the first megabyte of memory ($2^{8} \times 4$KB = 1MB). Therefore the address for the example shown above is \texttt{0x11000}.
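
The two values written to the ICR in the listing can be rebuilt from their fields. This is a sketch that reads the annotated bit patterns with the field positions of the Intel manual's ICR layout: Destination Shorthand in bits 19:18, Level in bit 14, Delivery Mode in bits 10:8 and Vector in bits 7:0.
\[
\mbox{INIT:} \quad 3 \cdot 2^{18} + 1 \cdot 2^{14} + 5 \cdot 2^{8} + \texttt{0x00} = \texttt{0x000C4500}
\]
\[
\mbox{SIPI:} \quad 3 \cdot 2^{18} + 1 \cdot 2^{14} + 6 \cdot 2^{8} + \texttt{0x11} = \texttt{0x000C4611}
\]
The start address of the APs' boot code then follows from the SIPI Vector: $\texttt{0x11} \times 2^{12} = \texttt{0x11000}$.
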
Interrupts can be sent either by specifying a CPU's local APIC ID through the Destination field or through the Destination Shorthand (bits 19 and 18). In both INIT and SIPI the destination shorthand is used since the interrupt must be sent to all 431 | the APs. 432 | 433 | \textcolor{red}{FS register usage for indicating right register (?)} 434 | 435 | \newpage 436 | \bibliography{Lec3} 437 | \bibliographystyle{plainnat} 438 | \end{document} 439 | --------------------------------------------------------------------------------